Honestly, the most astounding part of this announcement is their comparison to o3-mini with QA prompts.
EIGHTY PERCENT hallucination rate? Are you kidding me?
I get that the model is meant to be used for logic and reasoning, but nowhere does OpenAI make this explicitly clear. A majority of users are going to be thinking, "oh newer is better," and pick that.
Very nice catch. I was under the impression that o3-mini was "as good" as o1 on all dimensions. The takeaway seems to be that any form of quantization/distillation ends up hurting factual accuracy (but not reasoning performance), and that there are diminishing returns to reducing hallucinations via model-scaling or RLHF'ing. I guess other approaches are needed to reach single-digit "hallucination" rates. All of Wikipedia compresses down to < 50 GB, though, so it's not immediately clear that you can't have good factual accuracy with a small sparse model.
Yeah, it was an abysmal result (any 50%+ hallucination rate on that bench is pretty bad), and worse than o1-mini in the SimpleQA paper. On that topic, Sonnet 3.5 "Old" hallucinates less than GPT-4.5, just for a bit of added perspective here.