Not super knowledgeable about the specs of the different Orange Pi and Raspberry Pi models. I'm looking for something relatively cheap that can connect to WiFi and USB. I want to be able to run at least 13B models at a decent tok/s.
Also open to other solutions. I have a Mac M1 (8 GB RAM), and upgrading the computer itself would be cost-prohibitive for me.
I was getting 2.2 tokens/s with llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With the Q4_K_M versions of Mistral and Zephyr, I was getting 4.4 tokens/s.
A few days ago I bought another 16 GB stick of RAM ($30) and, for some reason that escapes me, the inference speed doubled. So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).
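For what it's worth, the speedup lines up with CPU inference being mostly memory-bandwidth bound: the second stick presumably enabled dual-channel mode, which roughly doubles bandwidth. Here's a quick back-of-the-envelope sanity check; the DDR4-3200 figure (~25.6 GB/s per channel) and the GGUF file sizes are assumptions on my part, not measurements from this machine:

```
# Rough ceiling on CPU tokens/s: every generated token has to stream the whole
# model through RAM once, so tok/s <= (memory bandwidth) / (model file size).
GB = 1e9

models = {
    "llama-2-13b Q4_K_M": 7.9 * GB,  # approx. GGUF file size (assumed)
    "llama-2-13b Q3_K_S": 5.7 * GB,
    "mistral-7b Q4_K_M":  4.4 * GB,
}

for channels in (1, 2):
    bandwidth = channels * 25.6 * GB  # DDR4-3200: ~25.6 GB/s per channel (assumed)
    print(f"\n{channels} memory channel(s), ~{bandwidth / GB:.0f} GB/s:")
    for name, size in models.items():
        print(f"  {name}: <= {bandwidth / size:.1f} tok/s (theoretical ceiling)")
```

The jumps I measured (3.3 → 6.5 tok/s for 13B Q3_K_S, 4.4 → 9.1 tok/s for Mistral/Zephyr) sit a bit under those ceilings, which is about what you'd expect once compute overhead is factored in.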
If I weren't considering getting an Nvidia 4060 Ti for Stable Diffusion, I would seriously consider a used RX 580 8GB ($75) and run Llama Q4_K_M entirely on the GPU, or offload some layers when using a 30B model.
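If anyone wants to try the offload route, here's a minimal sketch using the llama-cpp-python bindings; the model path, layer count, and prompt are placeholders, and in practice you'd tune n_gpu_layers to whatever fits in the card's 8 GB of VRAM:

```
# Minimal GPU-offload sketch with llama-cpp-python (built with GPU support).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # path is illustrative
    n_gpu_layers=-1,  # -1 offloads all layers; use a smaller number if VRAM runs out
    n_ctx=2048,
)

out = llm("Explain dual-channel memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```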