Not super knowledgeable about the specs of the different Orange Pi and Raspberry Pi models. I'm looking for something relatively cheap that can connect to WiFi and USB. I want to be able to run at least 13B models at a decent tok/s.
Also open to other solutions. I have a Mac M1 (8 GB RAM), and upgrading the computer itself would be cost-prohibitive for me.
I was getting 2.2 tokens/s with llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With the Q4_K_M versions of Mistral and Zephyr, I was getting 4.4 tokens/s.
A few days ago I bought another 16 GB stick of RAM ($30) and, for some reason that escapes me, the inference speed doubled. So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).
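For what it's worth, the speedup lines up with CPU inference being mostly memory-bandwidth bound: the second stick presumably enabled dual-channel mode, which roughly doubles bandwidth. Here's a quick back-of-the-envelope sanity check; the DDR4-3200 figure (~25.6 GB/s per channel) and the GGUF file sizes are assumptions on my part, not measurements from this machine:

```
# Rough ceiling on CPU tokens/s: every generated token has to stream the whole
# model through RAM once, so tok/s <= (memory bandwidth) / (model file size).
GB = 1e9

models = {
    "llama-2-13b Q4_K_M": 7.9 * GB,  # approx. GGUF file size (assumed)
    "llama-2-13b Q3_K_S": 5.7 * GB,
    "mistral-7b Q4_K_M":  4.4 * GB,
}

for channels in (1, 2):
    bandwidth = channels * 25.6 * GB  # DDR4-3200: ~25.6 GB/s per channel (assumed)
    print(f"\n{channels} memory channel(s), ~{bandwidth / GB:.0f} GB/s:")
    for name, size in models.items():
        print(f"  {name}: <= {bandwidth / size:.1f} tok/s (theoretical ceiling)")
```

The jumps I measured (3.3 → 6.5 tok/s for 13B Q3_K_S, 4.4 → 9.1 tok/s for Mistral/Zephyr) sit a bit under those ceilings, which is about what you'd expect once compute overhead is factored in.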
If I weren't considering getting an Nvidia 4060 Ti for Stable Diffusion, I would seriously consider a used RX 580 8GB ($75) and run Llama Q4_K_M entirely on the GPU, or offload some layers when using a 30B model.
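If anyone wants to try the offload route, here's a minimal sketch using the llama-cpp-python bindings; the model path, layer count, and prompt are placeholders, and in practice you'd tune n_gpu_layers to whatever fits in the card's 8 GB of VRAM:

```
# Minimal GPU-offload sketch with llama-cpp-python (built with GPU support).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # path is illustrative
    n_gpu_layers=-1,  # -1 offloads all layers; use a smaller number if VRAM runs out
    n_ctx=2048,
)

out = llm("Explain dual-channel memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```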