Is there a guide out there for dummies on how to try a ChatGPT-like instance of this on a VM cheaply? E.g. pay $1 or $2 an hour for a point-and-click experience with the instruct version of this. A Docker image, perhaps.

Reading posts on r/LocalLLaMA mostly turns up people's trial-and-error experiences, which are quite random.



For Falcon specifically, this is easy: a demo is embedded here: https://huggingface.co/blog/falcon#demo or you can access it directly here: https://huggingface.co/spaces/HuggingFaceH4/falcon-chat

I just tested both and they're pretty zippy (faster than AMD's recent live MI300 demo).
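
If you'd rather hit the model from code than from the hosted demo, a minimal sketch along the lines of the HF blog post looks roughly like this (7B instruct shown; the generation parameters are just illustrative, and the 40B variant needs far more memory):

    # pip install transformers accelerate
    import torch
    from transformers import AutoTokenizer, pipeline

    model = "tiiuae/falcon-7b-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model)
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Falcon ships custom modelling code on the Hub
        device_map="auto",
    )
    out = pipe("Write a limerick about GPUs.", max_new_tokens=64, do_sample=True, top_k=10)
    print(out[0]["generated_text"])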

For llama-based models, I've recently been using https://github.com/turboderp/exllama a lot. It has a Dockerfile/docker-compose.yml, so it should be pretty easy to get going. llama.cpp is the other easy one: the most recent updates put its CUDA support only about 25% slower, it's generally a simple `make` with a flag depending on which GPU you want to support, and it has basically no dependencies.
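
If you go the llama.cpp route but want to drive it from Python rather than the CLI, the llama-cpp-python bindings (a separate project, not something from this thread) wrap the same ggml code. A rough sketch, where the model filename is just a placeholder for whatever quantized llama-family file you've downloaded:

    # pip install llama-cpp-python   (compile with the matching GPU flag if you want offloading)
    from llama_cpp import Llama

    # placeholder path: any ggml-quantized llama-family model you've downloaded
    llm = Llama(model_path="./models/7b-chat.ggmlv3.q4_0.bin", n_ctx=2048)

    out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])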

Also, here's a Colab notebook that should let you run up to 13B quantized models (12G RAM, 80G disk, Tesla T4 16G) for free: https://colab.research.google.com/drive/1QzFsWru1YLnTVK77itW... (for Falcon, replace with Koboldcpp or ctransformers, as sketched below)
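
For the ctransformers route with Falcon, something like this should be in the right ballpark (the Hub repo name and model_type here are my assumptions; check which GGML Falcon conversions are actually published):

    # pip install ctransformers
    from ctransformers import AutoModelForCausalLM

    # assumed GGML conversion of Falcon-7B-Instruct; swap in whichever repo you find
    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/falcon-7b-instruct-GGML",
        model_type="falcon",
    )
    print(llm("What is the tallest mountain on Earth?"))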


Take a look at YouTube videos for this, mainly because you'll see people show all the steps when presenting instead of skipping them when just talking about what they did. E.g. https://www.youtube.com/watch?v=KenORQDCXV0


I have a Dockerfile here https://github.com/purton-tech/rust-llm-guide/tree/main/llm-... for running MPT-7B:

docker run -it --rm ghcr.io/purton-tech/mpt-7b-chat

It's a big download due to the model size (roughly 5GB). The model is quantized and runs via the ggml tensor library: https://ggml.ai/


A small cheap VPS won't have the compute or RAM to run these. The best way (and the intent) is to run them locally. A fast box with at least 32GiB of RAM (or VRAM for a GPU) can run many of the models that work with llama.cpp. For this 40B model you will need more like 48GiB of RAM.
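
As a rough back-of-envelope (my own numbers, not measurements): the weights alone take roughly parameter-count × bits-per-weight / 8 bytes, plus a few GiB for the KV cache and runtime on top.

    # rough weight-memory estimate for a 40B-parameter model at various bit widths
    def approx_weight_gib(n_params_billion, bits_per_weight):
        return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for bits in (16, 8, 4):
        print(f"40B @ {bits}-bit: ~{approx_weight_gib(40, bits):.0f} GiB of weights")
    # ~75 GiB at fp16, ~37 GiB at 8-bit, ~19 GiB at 4-bit -- which is why
    # 48GiB is comfortable for a quantized 40B model but fp16 won't fit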

Apple Silicon is pretty good for local models due to its unified CPU/GPU memory, but a gaming PC is probably the most cost-effective option.

If you just want to play around and don't have a big enough box, temporarily renting one from Hetzner or OVH is pretty cost-effective.


Falcon doesn't work in llama.cpp yet: https://github.com/ggerganov/llama.cpp/issues/1602


They said 1 or 2 bucks an hour. You can get an A100 for that.


Try $100/hour for the big LLMs... and you're probably going to need a fleet of 16 machines unless you want to quantize the model and do inference only.


Where can you even find a machine for $100/hour? The most expensive one on this list is just over $5/hour and is definitely overkill for running a 40B model: https://www.paperspace.com/gpu-cloud-comparison


The prices there are totally unrealistic. The A100 x 8 is listed at $5/hour, whereas Amazon's price calculator puts it at $32 per hour...


It might have been a spot price. Spot prices now are higher than $5 but still under $10.



