Is there a guide out there for dummies on how to try a ChatGPT-like instance of this on a VM cheaply? E.g. pay $1 or $2 an hour for a point-and-click experience with the instruct version of this. A Docker image, perhaps.

Reading posts on r/LocalLLaMA mostly turns up people's trial-and-error experiences, which are quite random.



For Falcon specifically, this is easy: a demo is embedded here: https://huggingface.co/blog/falcon#demo or you can access it directly here: https://huggingface.co/spaces/HuggingFaceH4/falcon-chat

I just tested both and they're pretty zippy (faster than AMD's recent live MI300 demo).
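
If you'd rather hit the model from code than from the hosted demo, a minimal sketch along the lines of the HF blog post looks roughly like this (7B instruct shown; the generation parameters are just illustrative, and the 40B variant needs far more memory):

    # pip install transformers accelerate
    import torch
    from transformers import AutoTokenizer, pipeline

    model = "tiiuae/falcon-7b-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model)
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Falcon ships custom modelling code on the Hub
        device_map="auto",
    )
    out = pipe("Write a limerick about GPUs.", max_new_tokens=64, do_sample=True, top_k=10)
    print(out[0]["generated_text"])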

For llama-based models, I've recently been using https://github.com/turboderp/exllama a lot. It has a Dockerfile/docker-compose.yml, so it should be pretty easy to get going. llama.cpp is the other easy one: the most recent updates put its CUDA support only about 25% slower, it's generally a simple `make` with a flag depending on which GPU you want to support, and it has basically no dependencies.
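
If you go the llama.cpp route but want to drive it from Python rather than the CLI, the llama-cpp-python bindings (a separate project, not something from this thread) wrap the same ggml code. A rough sketch, where the model filename is just a placeholder for whatever quantized llama-family file you've downloaded:

    # pip install llama-cpp-python   (compile with the matching GPU flag if you want offloading)
    from llama_cpp import Llama

    # placeholder path: any ggml-quantized llama-family model you've downloaded
    llm = Llama(model_path="./models/7b-chat.ggmlv3.q4_0.bin", n_ctx=2048)

    out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])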

Also, here's a Colab notebook that should let you run up to 13B quantized models (12G RAM, 80G disk, Tesla T4 16G) for free: https://colab.research.google.com/drive/1QzFsWru1YLnTVK77itW... (for Falcon, replace with Koboldcpp or ctransformers, as sketched below)
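
For the ctransformers route with Falcon, something like this should be in the right ballpark (the Hub repo name and model_type here are my assumptions; check which GGML Falcon conversions are actually published):

    # pip install ctransformers
    from ctransformers import AutoModelForCausalLM

    # assumed GGML conversion of Falcon-7B-Instruct; swap in whichever repo you find
    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/falcon-7b-instruct-GGML",
        model_type="falcon",
    )
    print(llm("What is the tallest mountain on Earth?"))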


Take a look at YouTube videos for this, mainly because you'll see people show all the steps when presenting instead of skipping them when just talking about what they did. E.g. https://www.youtube.com/watch?v=KenORQDCXV0


I have a Dockerfile here https://github.com/purton-tech/rust-llm-guide/tree/main/llm-... for running MPT-7B:

docker run -it --rm ghcr.io/purton-tech/mpt-7b-chat

It's a big download due to the model size (roughly 5GB). The model is quantized and runs via the ggml tensor library: https://ggml.ai/


A small cheap VPS won't have the compute or RAM to run these. The best way (and the intent) is to run them locally. A fast box with at least 32GiB of RAM (or VRAM for a GPU) can run many of the models that work with llama.cpp. For this 40B model you will need more like 48GiB of RAM.
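
As a rough back-of-envelope (my own numbers, not measurements): the weights alone take roughly parameter-count × bits-per-weight / 8 bytes, plus a few GiB for the KV cache and runtime on top.

    # rough weight-memory estimate for a 40B-parameter model at various bit widths
    def approx_weight_gib(n_params_billion, bits_per_weight):
        return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for bits in (16, 8, 4):
        print(f"40B @ {bits}-bit: ~{approx_weight_gib(40, bits):.0f} GiB of weights")
    # ~75 GiB at fp16, ~37 GiB at 8-bit, ~19 GiB at 4-bit -- which is why
    # 48GiB is comfortable for a quantized 40B model but fp16 won't fit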

Apple Silicon is pretty good for local models due to its unified CPU/GPU memory, but a gaming PC is probably the most cost-effective option.

If you just want to play around and don't have a big enough box, temporarily renting one from Hetzner or OVH is pretty cost-effective.


Falcon doesn't work in llama.cpp yet: https://github.com/ggerganov/llama.cpp/issues/1602


They said 1 or 2 bucks an hour. You can get an A100 for that.


Try $100/hour for the big LLMs... and you're probably going to need a fleet of 16 machines unless you want to quantize the model and do inference only.


Where can you even find a machine for $100/hour? The most expensive one on this list is just over $5/hour and is definitely overkill for running a 40B model: https://www.paperspace.com/gpu-cloud-comparison


The prices there are totally unrealistic. The A100 x 8 is listed at $5/hour, whereas Amazon's price calculator puts it at $32 per hour...


It might have been a spot price. Spot prices now are higher than $5 but still under $10.



