I have a machine with 3x 1080 Tis that I use for batch runs: I send the question first to an LLM and then to an LRM, come back to review the faster results, and kill the job if they are acceptable. Ollama, or just llama.cpp on podman, makes this trivial.
But knowing in advance which model will be better is impossible; only broad heuristics, which may or may not be correct for any individual prompt, can be used.
While there are better options if you were buying them today, an old out of date system with out of date GPUs works well in this batch model.
gemma-3-27b-it-Q6_K_L works fine with these, and that mixed with an additional submit to DeepSeek-R1-Distill-Qwen-32B is absolutely fine on that system that would just be shut down otherwise.
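The fast-then-slow flow described above can be sketched roughly like this. It is a minimal sketch, not my actual setup: `fast`, `slow`, and `acceptable` are hypothetical callables, the review step is a function here rather than a human coming back later, and how you actually call Ollama or llama.cpp is left out entirely.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Hypothetical type: any async function that takes a prompt and returns text.
Model = Callable[[str], Awaitable[str]]


async def fast_then_slow(
    prompt: str,
    fast: Model,
    slow: Model,
    acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Start both models, review the fast answer, kill the slow job if happy."""
    slow_task = asyncio.create_task(slow(prompt))  # reasoning model runs in background
    fast_answer = await fast(prompt)
    if acceptable(fast_answer):
        slow_task.cancel()  # the "kill the job" step
        try:
            await slow_task
        except asyncio.CancelledError:
            pass
        return "fast", fast_answer
    # Fast answer rejected: wait for the slower model instead.
    return "slow", await slow_task


if __name__ == "__main__":
    # Stub models standing in for real local inference calls.
    async def stub_fast(prompt: str) -> str:
        await asyncio.sleep(0.01)
        return "quick answer"

    async def stub_slow(prompt: str) -> str:
        await asyncio.sleep(0.05)
        return "slow, reasoned answer"

    print(asyncio.run(fast_then_slow("q", stub_fast, stub_slow, lambda a: False)))
```

Cancelling the task only stops the client side waiting; whether the backend actually stops generating depends on the server you run.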
I have a very bright line about preventing inter-customer leakage risk that may be irrational, but with that mixture I find that I am better off looking at scholarly papers than trying the commercial models.
My primary task is FP64 throughput limited, and thus I am stuck on the Titan V: it is ~6 times faster than the 4090 and 5 times faster than the 5090, which is the only reason I don't have newer GPUs.
You can add 4x 1080 Ti at a 200 W limit with common PSUs and get the memory, but performance is limited by the PCI bus at 3x 1080 Ti.
As they seem to sell for the same price, I would probably buy the Titan V today, but the point is that if you are fine with even smaller models, you can run queries in parallel or even cross-verify them, which dramatically helps with planning tasks, even with the foundational models.
But series/parallel runs do a lot, and if you are using them for code, running a linter etc. on the structured output saves a lot of time when evaluating multiple responses.
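As a rough illustration of that triage step, here is a small sketch that sorts several code responses into "parses" and "doesn't parse" before anyone reads them. It assumes the responses are Python source and uses the standard library's `ast.parse` as a stand-in for a real linter; `syntax_error` and `triage` are made-up helper names.

```python
import ast
from typing import List, Optional, Tuple


def syntax_error(source: str) -> Optional[str]:
    """Return a short error message, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def triage(responses: List[str]) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Split response indices into those that parse and those that do not."""
    passed: List[int] = []
    failed: List[Tuple[int, str]] = []
    for index, source in enumerate(responses):
        error = syntax_error(source)
        if error is None:
            passed.append(index)
        else:
            failed.append((index, error))
    return passed, failed


if __name__ == "__main__":
    candidates = ["x = 1\nprint(x)\n", "def f(:\n    pass\n"]
    ok, bad = triage(candidates)
    print("worth reading:", ok)
    print("rejected:", bad)
```

A syntax check only weeds out the obviously broken responses; a real linter or test run would catch far more.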
No connection to them at all, but bartowski on Hugging Face puts a massive amount of time and effort into re-quantizing models.
If you don't have a restriction like my FP64 need, you can get 70B models running on two 24 GB GPUs without much 'cost' to accuracy.
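The back-of-envelope arithmetic behind that is just parameters times bits per weight. A rough sketch follows; the bits-per-weight values and the 4 GiB headroom for KV cache and buffers are illustrative assumptions, not measured numbers for any particular quant.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(
    params_billion: float,
    bits_per_weight: float,
    gpus: int = 2,
    gib_per_gpu: float = 24.0,
    headroom_gib: float = 4.0,  # assumed allowance for KV cache and buffers
) -> bool:
    """True if weights plus the assumed headroom fit in total GPU memory."""
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= gpus * gib_per_gpu


if __name__ == "__main__":
    for bits in (16.0, 8.0, 5.0, 4.0):
        size = weight_gib(70, bits)
        print(f"70B at {bits:>4} bits/weight: {size:6.1f} GiB, fits on 2x24: {fits(70, bits)}")
```

Under these assumptions a 70B model only fits once you get down to roughly 4 to 5 bits per weight, which is why the re-quantized files matter.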
> My primary task is FP64 throughput limited, and thus I am stuck on the Titan V: it is ~6 times faster than the 4090 and 5 times faster than the 5090, which is the only reason I don't have newer GPUs.
interesting. Very interesting. Why fp64 as opposed to BF16? different sort of model? i don't even know where to find fp64 models (not that i've looked).
also Bartowski may be on huggingface but they're also part of the LM Studio group, and frequently chat on that discord. actually, at least 3 of the main model converter / quant people are on that discord.
I haven't got two 24GB cards, yet, but maybe soon, with the way people are hogging the 5000 series.
edit: i realize that they're increasing the marketing FLOPS by halving the resolution, the current gen stuff is all "fast" at FP16 (or BF16 - brainfloat 16 bit). So when nvidia finishes and releases a card with double the FLOPS at 8 bit, will that card be 8 times slower at fp64?
while researching this i discovered another fast fp64 card is the R9 280x by amd/ati. although the memory is weak, only 3GB! But i suppose if you need the numerical accuracy, there's always that, and those cards are like $40 (in the us, on ebay, sold listings), compared to $400 for the titan. if you need 4x the ram though i guess you're stuck paying 10x the price!
> There's no equivalent to "does everything kinda well" like chatgpt or Gemini on local, except maybe the 70B and larger, but those are slow
Is there something like a “prompt router”, that can automatically decide what model to use based on the type of prompt/task?
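For what it's worth, the crudest version of such a router is just a keyword heuristic, something like the sketch below. Everything in it is hypothetical: the model labels, the keyword lists, and the `route` function are made up for illustration, and as noted upthread, heuristics like this may or may not be right for any individual prompt.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical routing table: (pattern, model label). Labels are placeholders,
# not real model names; the first matching rule wins.
ROUTES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\b(?:prove|derive|plan|step by step)\b", re.I), "reasoning-model"),
    (re.compile(r"\b(?:traceback|refactor|compile|function)\b", re.I), "code-model"),
]

DEFAULT_MODEL = "general-model"  # fallback when no rule matches


def route(prompt: str) -> str:
    """Pick a model label for a prompt using simple keyword rules."""
    for pattern, model in ROUTES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL


if __name__ == "__main__":
    for text in ("Derive the gradient of this loss", "Refactor this function", "hello"):
        print(f"{text!r} -> {route(text)}")
```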