Is there a guide out there for dummies on how to try a ChatGPT-like instance of this on a VM cheaply? E.g. pay $1 or $2 an hour for a point-and-click experience with the instruct version of this. A docker image perhaps.
Reading posts on r/LocalLLAMA mostly gets you people's trial-and-error experiences, which is pretty hit or miss.
I just tested both and they're pretty zippy (faster than AMD's recent live MI300 demo).
For llama-based models, recently I've been using https://github.com/turboderp/exllama a lot. It has a Dockerfile/docker-compose.yml so it should be pretty easy to get going. llama.cpp is the other easy one: the most recent updates put its CUDA support only about 25% slower, it's generally a simple `make` with a flag depending on which GPU backend you want, and it has basically no dependencies.
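For reference, a rough sketch of both routes. Build flags and repo layout change over time, and the model filename below is just a placeholder, so treat this as an outline rather than exact commands:

```sh
# llama.cpp: clone, build with the GPU backend you want, run a quantized model.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1                 # CUDA build (flag name at time of writing); plain `make` for CPU-only
./main -m ./models/model-q4_K_M.bin -ngl 40 -p "Hello"   # -ngl = layers to offload to the GPU

# exllama: the repo's own Dockerfile/docker-compose.yml handles setup,
# assuming the NVIDIA Container Toolkit is installed on the host.
git clone https://github.com/turboderp/exllama
cd exllama
docker compose up --build
```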
Take a look at YouTube videos for this, mainly because presenters tend to show all the steps on screen instead of skipping them when just describing what they did. E.g. https://www.youtube.com/watch?v=KenORQDCXV0
A small cheap VPS won’t have the compute or RAM to run these. The best way (and the intent) is to run it locally. A fast box with at least 32GiB of RAM (or VRAM for a GPU) can run many of the models that work with llama.cpp. For this 40B model you will need more like 48GiB of RAM.
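If you're unsure whether a machine clears that bar, a couple of quick checks (assuming Linux, and `nvidia-smi` only if there's an NVIDIA GPU to offload to):

```sh
free -g                                             # total/available system RAM in GiB
nvidia-smi --query-gpu=memory.total --format=csv    # VRAM, if you plan to offload layers to a GPU
ls -lh models/                                      # quantized weights must fit with headroom for the KV cache
```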
Apple Silicon is pretty good for local models due to the unified CPU/GPU memory but a gaming PC is probably the most cost effective option.
If you want to just play around and don’t have a big enough box, then temporarily renting one at Hetzner or OVH is pretty cost effective.
Where can you even find a machine for $100/hour? The most expensive one on this list is just over $5/hour and is definitely overkill for running a 40B model. https://www.paperspace.com/gpu-cloud-comparison