RAG mainly, plus feature extraction, tagging, and document/e-mail classification.
You don't need a 24B-parameter model to know whether an e-mail should go to accounting or customer support.
There's no alternative to testing with your own data. The majority of our data is in French, and our results differ greatly from public benchmarks, which are generally based on English documents.
Historically, the problem with using LLMs for the super simple conventional NLP stuff was that they were hard to control in terms of output. If you wanted a one-word answer for a classification task, you'd often get a paragraph back instead, which obviously hurts precision and accuracy quite a bit. There were tricks to constrain the output (few-shot examples, GBNF grammars, low-rank adapters, or simply re-asking the model), but they weren't perfect.
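To make the grammar trick concrete, here's a minimal sketch using llama-cpp-python, where a GBNF grammar restricts the output to exactly one of two labels; the model path, labels, and prompt are made-up placeholders:

```python
# Minimal sketch: force a one-word classification with a GBNF grammar.
# The GGUF file name and the label set are placeholders.
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="mistral-small-24b-q4.gguf", n_ctx=4096)

# Grammar that only allows one of the two labels as output.
grammar = LlamaGrammar.from_string('root ::= "accounting" | "customer_support"')

prompt = (
    "Classify the e-mail below as either 'accounting' or 'customer_support'.\n"
    "E-mail: My invoice from January lists the wrong VAT number.\n"
    "Label:"
)

out = llm(prompt, max_tokens=8, temperature=0, grammar=grammar)
print(out["choices"][0]["text"].strip())  # -> "accounting"
```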
Over the last 12-18 months, though, the instruction-following capabilities of the models have improved substantially. This new Mistral model in particular is fantastic at doing what you ask.
My approach to this, personally and professionally, is to just benchmark. If I have a classification task, I try a tiny model first, eval it against an LLM, and see how much improvement the LLM actually buys. Generally speaking, though, the VRAM costs are so high for the latter that it's often not worth it. It really is a case-by-case decision: sometimes you want one generic model to handle a bunch of tasks rather than training/fine-tuning a dozen small models and managing them all in production.
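Roughly, the benchmarking step I mean looks like the sketch below; both predict_* functions are placeholders standing in for whatever you actually run (a fine-tuned small model, a prompted LLM):

```python
# Minimal benchmarking harness: score a small classifier and an LLM-based
# classifier on the same held-out set and compare accuracy.

def predict_small(text: str) -> str:
    # stand-in for a fine-tuned small model
    return "accounting" if "invoice" in text.lower() else "customer_support"

def predict_llm(text: str) -> str:
    # stand-in for a prompted LLM constrained to one label
    return "accounting" if "refund" in text.lower() else "customer_support"

def accuracy(predict, examples):
    return sum(predict(text) == gold for text, gold in examples) / len(examples)

eval_set = [
    ("My invoice lists the wrong VAT number.", "accounting"),
    ("The app crashes when I open settings.", "customer_support"),
    # ... ideally a few hundred labeled examples from your own data
]

print("small model:", accuracy(predict_small, eval_set))
print("LLM:        ", accuracy(predict_llm, eval_set))
```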
Super easy to get started, but lacking for larger datasets where you want to understand a bit more about the predictions. You generally lose things like prediction probabilities (though these can be recovered if you chop the head off and just assign output logits to classes instead of tokens), repeatability across experiments, and the ability to tune the model by changing the data. You can still do fine-tuning, though it'll be more expensive and painful than with a BERT model.
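A rough sketch of that "chop the head off" idea with plain transformers: score only the first token of each class label at the final position and renormalize, which gets you class probabilities back. The model name and labels here are placeholders:

```python
# Recover class probabilities from a causal LM by scoring label tokens directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

labels = ["accounting", "support"]
# First token id of each label (crude, but fine when the first tokens differ).
label_ids = [tok(" " + l, add_special_tokens=False).input_ids[0] for l in labels]

prompt = (
    "Classify this e-mail as accounting or support.\n"
    "E-mail: Where is my refund?\nLabel:"
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # next-token logits
probs = torch.softmax(logits[label_ids], dim=-1)  # renormalize over the classes
print(dict(zip(labels, probs.tolist())))
```

If a label spans several tokens you'd sum log-probs over its tokens instead, but the first-token shortcut is usually fine when the labels are short and distinct.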
Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs potentially a few hours with a fine-tuning pipeline for BERT. The two can actually work well in tandem: use the LLM to bootstrap some training data, then use both together to refine your classification.
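That tandem loop looks roughly like this; llm_label() is a stand-in for a constrained LLM call, and the model name and labels are placeholders:

```python
# Sketch: let an LLM label raw texts, then fine-tune a small BERT-style model
# on those pseudo-labels.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["accounting", "customer_support"]
raw_texts = ["My invoice lists the wrong VAT number.", "The app crashes on launch."]

def llm_label(text: str) -> str:
    # stand-in: in practice, prompt the LLM with a constrained one-word answer
    return "accounting" if "invoice" in text.lower() else "customer_support"

data = Dataset.from_dict({
    "text": raw_texts,
    "label": [labels.index(llm_label(t)) for t in raw_texts],
})

base = "distilbert-base-uncased"  # placeholder; pick something suited to your language
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=len(labels))

data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=128))
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```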
After prompt optimization with something like DSPy and a good eval set, it's significantly faster and just about as good. Occasionally it even gets higher accuracy on held-out data than human labelers when given a policy/documentation, e.g. for customer support cases.
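For reference, the DSPy pattern here looks roughly like this; the model string, label set, and examples are placeholders:

```python
# Sketch: declare the task as a signature, define a metric against labeled
# examples, and let a DSPy optimizer tune the prompt/few-shot demos.
import dspy

dspy.configure(lm=dspy.LM("mistral/mistral-small-latest"))  # or any local endpoint

class TicketLabel(dspy.Signature):
    """Classify a customer support case according to the policy."""
    case_text: str = dspy.InputField()
    label: str = dspy.OutputField(desc="one of: billing, technical, account")

classify = dspy.Predict(TicketLabel)

trainset = [
    dspy.Example(case_text="I was charged twice this month.", label="billing").with_inputs("case_text"),
    # ... more human-labeled examples
]

def exact_match(example, pred, trace=None):
    return example.label == pred.label

optimizer = dspy.BootstrapFewShot(metric=exact_match)
tuned = optimizer.compile(classify, trainset=trainset)
print(tuned(case_text="My card keeps getting declined.").label)
```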
Mistral repeatedly emphasizes "accuracy" and "latency" for this Small (24B) model, which to me means (and as they also point out):
- Local virtual assistants.
- Local automated workflows.
Also from TFA:
Our customers are evaluating Mistral Small 3 across multiple industries, including:
- Financial services customers for fraud detection
- Healthcare providers for customer triaging
- Robotics, automotive, and manufacturing companies for on-device command and control
- Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.
They're fast. I used 4o mini to run the final synthesis in a CoT app and to do initial entity/value extraction in an ETL. Mistral is pretty good for code completions too; if I were in the Cursor business I would consider a model like this for small code-block-level completions, and let the bigger models handle chat, large requests, etc.
Not spend $6000 on hardware because they run on computers we already have. But more seriously, they're fine and plenty fun for making recreational IRC bots.