Can you share what LLMs you run on such small devices / what use cases they address?
(Not a rhetorical question; it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)
Excellent and accurate question. You sound like the first person I've talked to who might appreciate full exposition here, so apologies if this is too much info. TL;DR: you're definitely not missing anything, and we're just beginning to turn a corner and see some rays of hope, where local models are a genuine substitute for remote ones in consumer applications.
#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.
#2) The "this" in "I put a lot of effort into this", means, I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic in service of coalescing a lot of nextgen thinking re: computing I saw that's A) now possible due to LLMs B) was shitcanned in 2020, because Android won politically, because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.
This app is Telosnex (telosnex.com).
I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.
I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and I believe that on a long enough timeline, all the rushed hacking people have done on llama.cpp to get model after model supported will give way to UX improvements.
You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.
However, things did pay off recently. How?
#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models and implemented the per-model formatting each one needs.
#2) Phi-4 mini came out. Long story, but tl;dr: until now there have been various gaping flaws in each Phi release, and this one looked free of them. So I hacked support for its tool-calling vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model below Mixtral 8x7B that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions I give it) and tool calls (e.g. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109).
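For concreteness, here's a minimal sketch (not Telosnex's actual code) of what that search-query/tool-call step looks like against a local model, assuming llama-server's OpenAI-compatible /v1/chat/completions endpoint with tool support; the crates (tokio, reqwest with the json feature, serde_json), the port, the model name, and the web_search schema are all placeholders of mine:

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    // Step 1: ask the local model to emit a search query as a tool call.
    let body = json!({
        "model": "phi-4-mini",
        "messages": [
            {"role": "system", "content": "Use the web_search tool when you need fresh information."},
            {"role": "user", "content": "What changed recently in llama.cpp's tool handling?"}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web and return parsed page text",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"]
                }
            }
        }]
    });
    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // Step 2 (not shown): run the search, feed ~2K tokens of parsed pages back
    // as a tool-result message, and have the model answer following instructions.
    println!("{}", resp["choices"][0]["message"]["tool_calls"]);
    Ok(())
}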
Are there any good how-tos for how to set up a non-trivial container with s6 and s6-rc? Last time I looked at this the documentation was pretty sparse, and more of a reference and design document than a set of how-tos.
There's a huge community at https://www.linuxserver.io/ of people building "home server" containers that all use the s6-overlay; there are hundreds of examples there. They have a lot of tutorials and a very busy Discord, Reddit, etc., with all levels of experience, from container developers to people who don't program and are just getting into Docker. I run a bunch of these containers myself and am pretty happy with how adaptable they are.
I believe they mean that since it bypasses the (Tokio) scheduler, if you use it in async code you lose the main benefit of async code (namely, that the scheduler can switch to some other task while waiting for IO to complete).
Basically the same behavior you'd get if you called a blocking syscall directly.
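A minimal sketch of the difference (the file name and function names are made up): a blocking call inside an async fn stalls the Tokio worker thread it runs on, whereas tokio::task::spawn_blocking hands the work to the blocking thread pool so the scheduler can keep driving other tasks:

use tokio::task;

// Blocks the worker thread: every other task scheduled on it waits too.
async fn read_config_blocking_badly() -> std::io::Result<String> {
    std::fs::read_to_string("config.toml")
}

// Moves the blocking call onto Tokio's dedicated blocking pool instead.
async fn read_config_properly() -> std::io::Result<String> {
    task::spawn_blocking(|| std::fs::read_to_string("config.toml"))
        .await
        .expect("blocking task panicked")
}

#[tokio::main]
async fn main() {
    let _ = read_config_blocking_badly().await;
    let _ = read_config_properly().await;
}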
I don't see how that would work. The memory access causing the page fault still blocks, but now another thread handles paging in the requested data. So without coordination between those two, nothing really changes. Sounds easier to just use nonblocking reads directly.
Thanks for the pointer to userfaultfd. Didn't know that existed.
Yeah. Part of the problem is that userfaultfd isn't itself quite flexible enough. What you might want is to release the faulted thread to do some other work, letting it know that it can come back later when the data is available, but there's no mechanism to make that happen. Instead it's going to be entirely blocked until the fault can be resolved.
Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?
Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.
Gathers on AVX2 used to be problematic, but I assume that shouldn't be the case today, especially if the lane-crossing is minimal? (If you do know, please share!)
Gather is still terrible; the only core that handles it well is Intel's P core. AMD issues 40+ micro-ops for an AVX2 gather (80 in AVX-512), and Intel's E core is much worse.
When using SIMD you must use either SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand-coded swizzle loader for the format.
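For anyone unfamiliar with the acronyms, here's a minimal sketch of AoS vs SoA (the types and fields are invented for illustration): with SoA each field is contiguous, so a loop over one field turns into plain unit-stride SIMD loads, while the AoS version strides through memory and needs gathers or shuffles.

// Array of Structs: x0 y0 z0 x1 y1 z1 ... interleaved in memory.
struct ParticleAos { x: f32, y: f32, z: f32 }

// Struct of Arrays: all x's contiguous, all y's contiguous, all z's contiguous.
struct ParticlesSoa { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }

fn sum_x_aos(ps: &[ParticleAos]) -> f32 {
    // Stride-3 access: the vectorizer needs gathers/shuffles to pack lanes.
    ps.iter().map(|p| p.x).sum()
}

fn sum_x_soa(ps: &ParticlesSoa) -> f32 {
    // Unit-stride access: this auto-vectorizes into straight SIMD loads.
    ps.x.iter().sum()
}

fn main() {
    let aos: Vec<ParticleAos> =
        (0..1024).map(|i| ParticleAos { x: i as f32, y: 1.0, z: 2.0 }).collect();
    let soa = ParticlesSoa {
        x: aos.iter().map(|p| p.x).collect(),
        y: aos.iter().map(|p| p.y).collect(),
        z: aos.iter().map(|p| p.z).collect(),
    };
    assert_eq!(sum_x_aos(&aos), sum_x_soa(&soa));
}

AoSoA is the middle ground: small fixed-size blocks of SoA (say, 8 elements per block), so each SIMD load still hits contiguous lanes while the data for one object stays reasonably cache-local.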
Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.
General questions for gamedevs here.
How useful is SIMD now that we have compute shaders on the GPU? What workloads still require SIMD, and why would you choose one over the other?
Specifically, physics benefits from CPU processing. Efficient rendering pipelines are typically one-way (CPU -> GPU), whereas the results of physics calculations are depended on by both the game logic and the renderer, and it's much simpler (and probably more efficient) to keep that computation on the CPU. The exception could be UMA architectures like the Apple M-series and the PS4, where memory transport isn't a limiting factor, though memory/cache invalidation might be an issue?
Yeah, that's why I qualified with 'could'. Really depends on what facilities the hardware and driver provide. If the GPU is on the same die, perhaps the latency isn't great, but I really don't have the data on that. But I'd really like to see something like voxel deformable/destructible environments leveraging UMA on the Apple M. Seems like something that would be groundbreaking, if only Apple really cared about gaming at all.
With graphics you mostly prepare everything you want to render and then transfer all of it to the GPU. Physics lends itself fairly well to GPU acceleration too (compared to other things), but simply preparing something, transferring it to the GPU, and being done is not enough: you need to at least get the results back, even just to render them, and likely also to have gameplay depend on them. And in graphics programming the expensive part is often the communication between the CPU and the GPU and trying to avoid synchronization (especially with the old graphics APIs), so transferring there and back is expensive.

Physics code is also full of branches, while graphics code usually is not. GPUs (or rather, really wide vectorization in general) don't like branches much, and if you only do certain parts of the physics simulation on the GPU, you need to transfer there and back (and synchronize) even more. I'm just a hobby gamedev, and I know people have done physics on the GPU (PhysX), but to me the things I mentioned sound like big hurdles.
EDIT: one more big thing is that, at least for AAA games, you want to keep the GPU busy doing graphics so the game looks good. You usually don't have GPU cycles to spare.
I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD.
For example, I just made a little example on my desktop where I summed 256 random Float32 numbers: doing it serially takes around 152 nanoseconds, whereas doing it with SIMD takes just 10 nanoseconds. Doing the exact same thing on my GPU took 20 microseconds, roughly 2000x slower than the SIMD version:
julia> using CUDA, SIMD, BenchmarkTools
julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
           s = Vec{N, T}(0)        # N-lane vector accumulator
           lane = VecRange{N}(0)   # index range that loads N contiguous elements
           for i ∈ 1:N:length(v)
               s += v[lane + i]    # SIMD load and add
           end
           sum(s)                  # horizontal sum across the lanes
       end;
julia> let L = 256
           print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
           print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
           print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
       end;
Serial benchmark: 152.239 ns (0 allocations: 0 bytes)
SIMD benchmark: 10.359 ns (0 allocations: 0 bytes)
GPU benchmark: 19.917 μs (56 allocations: 1.47 KiB)
The reason for that is simply that it just takes that long to send data back and forth to the GPU and launch a kernel; almost none of that time was actually spent doing the computation. If I run the same benchmark with 256^2 numbers instead, the GPU becomes faster than serial but is still slower than SIMD, and only once we go up to 256^3 numbers does the GPU show a convincing advantage.
So the lesson here is that GPUs are only worth it if you actually have enough data to saturate the GPU, but otherwise you're way better off using SIMD.
GPUs are also just generally a lot more limiting than SIMD in many other ways.
Here are a few random limitations I can think of, other than those already mentioned:
* Float64 math is typically around 30x slower than Float32 math on "consumer-grade" GPUs due to an arbitrary limitation to stop people from using consumer grade chips for "workstation" purposes. This turns out to not be a big deal for things like machine learning, but lots of computational processes actually are rather sensitive to rounding errors and benefit a lot from using 64 bit numbers, which is very slow on GPUs.
* Writing GPU-specific functions can be quite labour-intensive compared to writing CPU code. Julia's CUDA.jl and KernelAbstractions.jl packages do make a lot of things quite a bit nicer than in most languages, but it's still a lot of work to write good GPU code.
* Profiling and understanding the performance of GPU programs is typically a lot more complicated than CPU programs (even if there are some great tools for it!) because the performance model is just fundamentally more complex with more stuff going on and more random pitfalls and gotchas.
On x86-64, compilers use SIMD instructions and registers to implement floating-point math; they just use the single-lane instructions. E.g. (https://godbolt.org/z/94b3r8dMn):
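I can't paste the godbolt output here, but the gist, sketched in Rust rather than whatever the link uses (exact assembly depends on compiler and flags), is that scalar f32 math compiles to single-lane SSE instructions like mulss and addss, operating on the same xmm registers the packed instructions use:

// With optimizations on x86-64, this body compiles to roughly:
//   mulss xmm0, xmm1
//   addss xmm0, xmm2
//   ret
pub fn axpy(a: f32, x: f32, y: f32) -> f32 {
    a * x + y
}

fn main() {
    println!("{}", axpy(2.0, 3.0, 1.0)); // 7
}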
No, but it's a super super janky and simple hodgepodge of Stack Overflow and gemma:27b-generated code. I'll just put it in the comment here; you just need curl on your PATH and a Vim that's compiled with some specific flag.
" Collect up to a:n lines of context above and below the cursor.
function! GetSurroundingLines(n)
  let l:current_line = line('.')
  let l:start_line = max([1, l:current_line - a:n])
  let l:end_line = min([line('$'), l:current_line + a:n])
  let l:lines_before = getline(l:start_line, l:current_line - 1)
  let l:lines_after = getline(l:current_line + 1, l:end_line)
  return [l:lines_before, l:lines_after]
endfunction
" Ask the local Ollama server for a fill-in-the-middle completion at the cursor.
function! AIComplete()
  let l:n = 256
  let [l:lines_before, l:lines_after] = GetSurroundingLines(l:n)
  " Build a CodeLlama-style fill-in-the-middle prompt from the surrounding lines.
  let l:prompt = '<PRE>' . join(l:lines_before, "\n") . ' <SUF>' . join(l:lines_after, "\n") . ' <MID>'
  let l:json_data = json_encode({
        \ 'model': 'codellama:13b-code-q6_K',
        \ 'keep_alive': '30m',
        \ 'stream': v:false,
        \ 'prompt': l:prompt
        \ })
  " POST to the local Ollama API and pull the completion text out of the JSON response.
  let l:response = system('curl -s -X POST -H "Content-Type: application/json" -d ' . shellescape(l:json_data) . ' http://localhost:11434/api/generate')
  let l:completion = json_decode(l:response)['response']
  " Temporarily enable 'paste' so the inserted text isn't mangled by autoindent.
  let l:paste_mode = &paste
  set paste
  execute "normal! a" . l:completion
  let &paste = l:paste_mode
endfunction
nnoremap <leader>c :call AIComplete()<CR>
You still have to wait for your cache to reload from main memory, or for disk or network I/O, or for processes to be scheduled to run, so while it's likely more efficient than epoll approaches, I doubt there's any really fundamental difference in the performance problems you would find.