RAG mainly, plus feature extraction, tagging, and document/e-mail classification.
You don't need a 24B-parameter model to know whether an e-mail should go to accounting or customer support.
There's no alternative to testing with your own data. The majority of our data is in French, and our results differ greatly from public benchmarks, which are generally based on English documents.
Historically, the problem with using LLMs for the super simple conventional NLP stuff was that they were hard to control in terms of output. If you wanted a one-word answer for a classification task, you'd often get a paragraph back instead, which obviously hurts precision and accuracy quite a bit. There were tricks to constrain the output (few-shot examples, GBNF grammars, low-rank adapters, or simply re-asking the model), but they weren't perfect.
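To make the grammar trick concrete, here's a minimal sketch using llama-cpp-python, where a GBNF grammar restricts the output to exactly one of two labels; the model path, labels, and prompt are made-up placeholders:

```python
# Minimal sketch: force a one-word classification with a GBNF grammar.
# The GGUF file name and the label set are placeholders.
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="mistral-small-24b-q4.gguf", n_ctx=4096)

# Grammar that only allows one of the two labels as output.
grammar = LlamaGrammar.from_string('root ::= "accounting" | "customer_support"')

prompt = (
    "Classify the e-mail below as either 'accounting' or 'customer_support'.\n"
    "E-mail: My invoice from January lists the wrong VAT number.\n"
    "Label:"
)

out = llm(prompt, max_tokens=8, temperature=0, grammar=grammar)
print(out["choices"][0]["text"].strip())  # -> "accounting"
```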
Over the last 12-18 months, though, the instruction-following capabilities of the models have improved substantially. This new Mistral model in particular is fantastic at doing what you ask.
My approach to this, personally and professionally, is to just benchmark. If I have a classification task, I try a tiny model first, eval it against an LLM, and see how much improvement the LLM actually buys. Generally speaking, though, the VRAM costs are so high for the latter that it's often not worth it. It really is a case-by-case decision: sometimes you want one generic model to handle a bunch of tasks rather than training/fine-tuning a dozen small models and managing them all in production.
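Roughly, the benchmarking step I mean looks like the sketch below; both predict_* functions are placeholders standing in for whatever you actually run (a fine-tuned small model, a prompted LLM):

```python
# Minimal benchmarking harness: score a small classifier and an LLM-based
# classifier on the same held-out set and compare accuracy.

def predict_small(text: str) -> str:
    # stand-in for a fine-tuned small model
    return "accounting" if "invoice" in text.lower() else "customer_support"

def predict_llm(text: str) -> str:
    # stand-in for a prompted LLM constrained to one label
    return "accounting" if "refund" in text.lower() else "customer_support"

def accuracy(predict, examples):
    return sum(predict(text) == gold for text, gold in examples) / len(examples)

eval_set = [
    ("My invoice lists the wrong VAT number.", "accounting"),
    ("The app crashes when I open settings.", "customer_support"),
    # ... ideally a few hundred labeled examples from your own data
]

print("small model:", accuracy(predict_small, eval_set))
print("LLM:        ", accuracy(predict_llm, eval_set))
```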
Super easy to get started, but lacking for larger datasets where you want to understand a bit more about the predictions. You generally lose things like prediction probabilities (though these can be recovered if you chop the head off and just assign output logits to classes instead of tokens), repeatability across experiments, and the ability to tune the model by changing the data. You can still do fine-tuning, though it'll be more expensive and painful than with a BERT model.
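A rough sketch of that "chop the head off" idea with plain transformers: score only the first token of each class label at the final position and renormalize, which gets you class probabilities back. The model name and labels here are placeholders:

```python
# Recover class probabilities from a causal LM by scoring label tokens directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

labels = ["accounting", "support"]
# First token id of each label (crude, but fine when the first tokens differ).
label_ids = [tok(" " + l, add_special_tokens=False).input_ids[0] for l in labels]

prompt = (
    "Classify this e-mail as accounting or support.\n"
    "E-mail: Where is my refund?\nLabel:"
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # next-token logits
probs = torch.softmax(logits[label_ids], dim=-1)  # renormalize over the classes
print(dict(zip(labels, probs.tolist())))
```

If a label spans several tokens you'd sum log-probs over its tokens instead, but the first-token shortcut is usually fine when the labels are short and distinct.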
Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs potentially a few hours with a fine-tuning pipeline for BERT. The two can actually work well in tandem: use the LLM to bootstrap some training data, then use both together to refine your classification.
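That tandem loop looks roughly like this; llm_label() is a stand-in for a constrained LLM call, and the model name and labels are placeholders:

```python
# Sketch: let an LLM label raw texts, then fine-tune a small BERT-style model
# on those pseudo-labels.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["accounting", "customer_support"]
raw_texts = ["My invoice lists the wrong VAT number.", "The app crashes on launch."]

def llm_label(text: str) -> str:
    # stand-in: in practice, prompt the LLM with a constrained one-word answer
    return "accounting" if "invoice" in text.lower() else "customer_support"

data = Dataset.from_dict({
    "text": raw_texts,
    "label": [labels.index(llm_label(t)) for t in raw_texts],
})

base = "distilbert-base-uncased"  # placeholder; pick something suited to your language
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=len(labels))

data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=128))
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```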
After prompt optimization with something like DSPy and a good eval set, it's significantly faster and just about as good. Occasionally it even gets higher accuracy on held-out data than human labelers when given a policy/documentation, e.g. for customer support cases.
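For reference, the DSPy pattern here looks roughly like this; the model string, label set, and examples are placeholders:

```python
# Sketch: declare the task as a signature, define a metric against labeled
# examples, and let a DSPy optimizer tune the prompt/few-shot demos.
import dspy

dspy.configure(lm=dspy.LM("mistral/mistral-small-latest"))  # or any local endpoint

class TicketLabel(dspy.Signature):
    """Classify a customer support case according to the policy."""
    case_text: str = dspy.InputField()
    label: str = dspy.OutputField(desc="one of: billing, technical, account")

classify = dspy.Predict(TicketLabel)

trainset = [
    dspy.Example(case_text="I was charged twice this month.", label="billing").with_inputs("case_text"),
    # ... more human-labeled examples
]

def exact_match(example, pred, trace=None):
    return example.label == pred.label

optimizer = dspy.BootstrapFewShot(metric=exact_match)
tuned = optimizer.compile(classify, trainset=trainset)
print(tuned(case_text="My card keeps getting declined.").label)
```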
Mistral repeatedly emphasizes "accuracy" and "latency" for this Small (24B) model, which to me means (and as they also point out):
- Local virtual assistants.
- Local automated workflows.
Also from TFA:
Our customers are evaluating Mistral Small 3 across multiple industries, including:
- Financial services customers for fraud detection
- Healthcare providers for customer triaging
- Robotics, automotive, and manufacturing companies for on-device command and control
- Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.
They're fast. I used 4o mini to run the final synthesis in a CoT app and to do initial entity/value extraction in an ETL. Mistral is pretty good for code completions too; if I were in the Cursor business I would consider a model like this for small code-block-level completions, and let the bigger models handle chat, large requests, etc.
Not spend $6000 on hardware because they run on computers we already have. But more seriously, they're fine and plenty fun for making recreational IRC bots.