More

xoranth · 2025-03-06T07:56:52 1741247812

> Crappy Pixel Fold 2022 mid-range Android CPU

Can you share what LLMs do you run on such small devices/what user case they address?

(Not a rhetorical question, it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's user cases.)

refulgentis · 2025-03-06T20:42:47 1741293767

Excellent and accurate q. You sound like the first person I've talked to who might appreciate full exposition here, apologies if this is too much info. TL;DR is you're def not missing anything, and we're just beginning to turn a corner and see some rays of light of hope, where it's a genuine substitute for remote models in consumer applications.

#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.

#2) The "this" in "I put a lot of effort into this", means, I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic in service of coalescing a lot of nextgen thinking re: computing I saw that's A) now possible due to LLMs B) was shitcanned in 2020, because Android won politically, because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.

This app is Telosnex (telosnex.com).

I have a couple stringent requirements I enforce on myself, it has to run on every platform, and it has to support local LLMs just as well as paid ones.

I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and believe on a long enough timeline, all the rushed hacking people have done to llama.cpp to get model after model supported will give away to UX improvements.

You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.

However, things did pay off recently. How?

#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implemented what the models need individually for formatting

#2) Phi-4 mini came out. Long story, but tl;dr: till now there's been various gaping flaws with each Phi release. This one looked absent of any issues. So I hack support for its tool vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model sub-Mixtral 8x7B that's reliably handling RAG flows (i.e. generate search query, then, accept 2K tokens of parsed web pages and answer a q following directions I give you) and tool calls (i.e. generate search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)

xoranth · on Sept 16, 2024

Are there any good how-tos for how to set up a non-trivial container with s6 and s6-rc? Last time I looked at this the documentation was pretty sparse, and more of a reference and design document than a set of how-tos.

jitl · on Sept 16, 2024

This is well thought-out and quite well documented: https://github.com/just-containers/s6-overlay

There's a huge community https://www.linuxserver.io/ for people building "home server" containers that all use the s6-overlay, hundreds of examples there. They have a lot of tutorials and a very busy Discord, Reddit, etc with all levels of experience from container developers to people who don't program and are just getting into Docker. I run a bunch of these containers myself and am pretty happy with how adaptable they are.

xoranth · on Aug 24, 2024

I believe they mean that since it bypasses the (Tokio) scheduler, so if you use it in async code you lose the main benefit of async code (namely, the scheduler is able to switch to some other task while waiting for IO to complete.). Basically the same behavior you'd get if you called a blocking syscall directly.

xoranth · on Aug 24, 2024

On Linux, you might be able to use userfaultfd to make it async...

dividuum · on Aug 24, 2024

I don't see how that would work. The memory access causing the page fault still blocks, but now another thread handles paging in the requested data. So without coordination between those two, nothing really changes. Sounds easier to just use nonblocking reads directly.

Thanks for the pointer to userfaultfd. Didn't know that existed.

cmrdporcupine · on Aug 24, 2024

Yeah. Part of the problem is that userfaultfd isn't itself quite flexible enough. What you might want to do is release the faulted thread to do some other work, letting it "know" what it can come back later when the data is available, but there's no mechanism to make that happen. Instead it's going to be entirely blocked until the fault can be resolved.

xoranth · on Aug 22, 2024

Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?

Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.

neonsunset · on Aug 22, 2024

They are pretty good: https://dougallj.github.io/applecpu/measurements/firestorm/L...

Gathers on AVX2 used to be problematic, but assume it shouldn't be the case today especially if the lane-crossing is minimal? (if you do know, please share!)

TinkersW · on Aug 22, 2024

Gather is still terrible, the only core that handles it well is the Intel's P core. AMD issues 40+ micro ops in AVX2(80 in AVX512), and the Intel E core is much worse.

When using SIMD you must either use SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand coded swizzle loader for the format.

LegionMammal978 · on Aug 23, 2024

Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.

TinkersW · on Aug 23, 2024

You can often find them by search stack overflow(try AVX2 + (deinterleave/AoS to SoA/transpose etc), especially any answers by Peter Cordes.

You can write them yourself also but I'd add a verifier that checks the output with scalar code as it can be tricky to get correct.

Intel had an article up for the 3x8 transpose, but it seems to no longer exist so i'll just post the psuedo code

   //xyz -> xxx 
 void swizzle3_AoS_to_SoA(v8float &x, v8float &y, v8float &z) {
  v8float m14 = interleave_low_high<1, 2>(x, z); //swap low/high 128 bits
  v8float m03 = blend<0, 0, 0, 0, 1, 1, 1, 1>(x, y); //_mm256_blend_ps 1 cycle
  v8float m25 = blend<0, 0, 0, 0, 1, 1, 1, 1>(y, z);

  //shuffles are all 1 cycle
  __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's 
  __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
  v8float xo = _mm256_shuffle_ps(m03, xy, _MM_SHUFFLE(2, 0, 3, 0));
  v8float yo = _mm256_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0));
  v8float zo = _mm256_shuffle_ps(yz, m25, _MM_SHUFFLE(3, 0, 3, 1));
  x = xo;
  y = yo;
  z = zo;

 }

xoranth · on Aug 22, 2024

General questions for gamedevs here. How useful is SIMD given that now we have compute shaders on the GPU? If so, what workloads still require SIMD/why would you choose one over the other?

h0l0cube · on Aug 22, 2024

Specifically physics benefits from CPU processing. Efficient rendering pipelines are typically one-way (CPU -> GPU), whereas the results of physics calculations are depended on both by the game logic and rendering, and it's much simpler (and probably more efficient) to keep that computation on the CPU. The exception to this is could be on UMA architectures like the Apple M-series and the PS4, where memory transport isn't a limiting factor – though memory/cache invalidation might be an issue?

eigenspace · on Aug 22, 2024

Even with UMA architectures where you eliminate the memory transport costs, it still costs a ton of time to actually launch a GPU kernel from the CPU.

h0l0cube · on Aug 22, 2024

Yeah, that's why I qualified with 'could'. Really depends on what facilities the hardware and driver provide. If the GPU is on the same die, perhaps the latency isn't great, but I really don't have the data on that. But I'd really like to see something like voxel deformable/destructible environments leveraging UMA on the Apple M. Seems like that something that would be groundbreaking, if only Apple really cared about gaming at all.

dxuh · on Aug 22, 2024

With graphics you mostly prepare everything you want to render and then transfer all of it to the GPU. Physics still lends itself fairly well to GPU acceleration as well (compared to other things), but simply preparing something, transferring it to the GPU and being done is not enough. You need to at least get it back, even just to render it, but likely also to have gameplay depend on it. And with graphics programming the expensive part is often the communication between the CPU and the GPU and trying to avoid synchronization (especially with the old graphics APIs), so transferring there and back is expensive. Also physics code is full of branches, while graphics usually is not. GPUs (or rather really wide vectorization generally) don't like branches much and if you do only certain parts of the physics simulation on the GPU, then you need to transfer there and back (and synchronize) even more. I'm just a hobby gamedev and I know that people have done physics on the GPU (PhysX), but to me the things I mentioned sound like big hurdles.

EDIT: one more big thing is also that at least for AAA games you want to keep the GPU doing graphics so it looks good. You usually never have GPU cycles to spare.

eigenspace · on Aug 22, 2024

I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD.

For example, I just made a little example on my desktop where I summed up 256 random Float32 numbers, and doing it in serial takes around 152 nanoseconds, whereas doing it with SIMD took just 10 nanoseconds. Doing the exact same thing with my GPU took 20 microseconds, so 2000x slower:

    julia> using CUDA, SIMD, BenchmarkTools

    julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
               s = Vec{N, T}(0)
               lane = VecRange{N}(0)
               for i ∈ 1:N:length(v)
                   s += v[lane + i]
               end
               sum(s)
           end;

    julia> let L = 256
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    152.239 ns (0 allocations: 0 bytes)
    SIMD benchmark:      10.359 ns (0 allocations: 0 bytes)
    GPU benchmark:       19.917 μs (56 allocations: 1.47 KiB)

The reason for that is simply that it just takes that long to send data back and forth to the GPU and launch a kernel. Almost none of that time was actually spent doing the computation. E.g. here's what that benchmark looks like if instead I have 256^2 numbers:

    julia> let L = 256^2
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    42.370 μs (0 allocations: 0 bytes)
    SIMD benchmark:      2.669 μs (0 allocations: 0 bytes)
    GPU benchmark:       27.592 μs (112 allocations: 2.97 KiB)

so we're now at the point where the GPU is faster than serial, but still slower than SIMD. If we go up to 256^3 numbers, now we're able to see a convincing advantage for the GPU:

    julia> let L = 256^3
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    11.024 ms (0 allocations: 0 bytes)
    SIMD benchmark:      2.061 ms (0 allocations: 0 bytes)
    GPU benchmark:       353.119 μs (113 allocations: 2.98 KiB)

So the lesson here is that GPUs are only worth it if you actually have enough data to saturate the GPU, but otherwise you're way better off using SIMD.

GPUs are also just generally a lot more limiting than SIMD in many other ways.

xoranth · on Aug 22, 2024

Thank you for your reply!

> GPUs are also just generally a lot more limiting than SIMD in many other ways.

What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)

eigenspace · on Aug 22, 2024

Here's a few random limitations I can think of other than those already mentioned:

* Float64 math is typically around 30x slower than Float32 math on "consumer-grade" GPUs due to an arbitrary limitation to stop people from using consumer grade chips for "workstation" purposes. This turns out to not be a big deal for things like machine learning, but lots of computational processes actually are rather sensitive to rounding errors and benefit a lot from using 64 bit numbers, which is very slow on GPUs.

* Writing GPU specific functions can be quite labour intensive compared to writing CPU code. Julia's CUDA.jl and KernelAbstractions.jl packages does make a lot of things quite a bit nicer than in most languages, but it's still a lot of work to write good GPU code.

* Profiling and understanding the performance of GPU programs is typically a lot more complicated than CPU programs (even if there are some great tools for it!) because the performance model is just fundamentally more complex with more stuff going on and more random pitfalls and gotchas.

xoranth · on Aug 22, 2024

On x86-64, compilers use SIMD instructions and registers to implement floating point math, they just use the single lane instructions. E.g. (https://godbolt.org/z/94b3r8dMn):

    float my_func(float lhs, float rhs) {
        return 2.0f * lhs - 3.0f * rhs;
    }

Becomes:

    my_func(float, float):
        addss   xmm0, xmm0
        mulss   xmm1, DWORD PTR .LC0[rip]
        subss   xmm0, xmm1
        ret

(addss, mulss and subss are SSE2 instructions.)

xoranth · on July 24, 2024

Sounds a bit like Google's proposal for a `switchto_switch` syscall [1] that would allow for cooperative multithreading bypassing the scheduler.

(the descendants of that proposal is `sched_ext`, so maybe it is possible to implement doors in eBPF + sched_ext?)

[1]: https://youtu.be/KXuZi9aeGTw?t=900

xoranth · on July 17, 2024

Is the extension you wrote public?

thot_experiment · on July 17, 2024

no but it's super super janky and simple hodgepode of stack overflow and gemma:27b generated code, i'll just put it in the comment here, you just need CURL on your path and vim that's compiled with some specific flag

    function! GetSurroundingLines(n)
        let l:current_line = line('.')
        let l:start_line = max([1, l:current_line - a:n])
        let l:end_line = min([line('$'), l:current_line + a:n])
        
        let l:lines_before = getline(l:start_line, l:current_line - 1)
        let l:lines_after = getline(l:current_line + 1, l:end_line)
        
        return [l:lines_before, l:lines_after]
    endfunction
    
    function! AIComplete()
        let l:n = 256
        let [l:lines_before, l:lines_after] = GetSurroundingLines(l:n)
        
        let l:prompt = '<PRE>' . join(l:lines_before, "\n") . ' <SUF>' . join(l:lines_after, "\n") . ' <MID>'
        
        let l:json_data = json_encode({
            \ 'model': 'codellama:13b-code-q6_K',
            \ 'keep_alive': '30m',
            \ 'stream': v:false,
            \ 'prompt': l:prompt
        \ })
        
        let l:response = system('curl -s -X POST -H "Content-Type: application/json" -d ' . shellescape(l:json_data) . ' http://localhost:11434/api/generate')
    
        let l:completion = json_decode(l:response)['response']
        let l:paste_mode = &paste
        set paste
        execute "normal! a" . l:completion
        let &paste = l:paste_mode
    endfunction
    
    nnoremap <leader>c :call AIComplete()<CR>

xoranth · on July 17, 2024

Thanks!

xoranth · on July 16, 2024

How would this interact with `io_uring`, especially the polling modes (IO_SETUP_SQPOLL, IO_SETUP_IOPOLL)?

rwmj · on July 16, 2024

You still have to wait for your cache to reload from main memory, or for disk or network I/O, or for processes to be scheduled to run, so while it's likely more efficient than epoll approaches, I doubt there's any really fundamental difference in the performance problems you would find.