There was a mini-uproar when GPT-4o-mini (an obviously "dumber" model) outscored Claude 3.5 Sonnet on Chatbot Arena, to the point that LMSYS released a subset of the battles: https://huggingface.co/spaces/lmsys/gpt-4o-mini_battles

You can review the battles yourself and decide whether the result was justified (you can compare by win/loss/tie outcome and by matchup). Generally, Claude still has more refusals (easy wins for the model that actually answers the request), often has worse formatting (it's arguable whether the alternative is actually better, but people like it more), and is less verbose (personally, I'd prefer the right answer in fewer words, but Chatbot Arena users generally disagree).

If you look at the questions (and at the Chatbot Arena and WildChat analyses), most people aren't using LLMs for math, reasoning, or even coding. If anything, arena usage is probably skewed too far toward reasoning and trick questions, because of the subset of people who go there to poke at the models.

Of course, different people value different things. I've used 3.5 Sonnet almost exclusively since it came out, because it's been the best code assistant and Artifacts are great. I only fall back to GPT-4o for occasional Code Interpreter work. For tricky problems, Mistral's Codestral actually seems to be a good fallback: it can often debug issues that neither of those models can, despite being a tiny model in comparison.