To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
ChatGPT pricing is the same. API pricing is +40% per token, though greater token efficiency means that cost per task is not always that much higher. On some agentic evals we actually saw costs per task go down with GPT-5.2. It really depends on the task though; your mileage may vary.
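To make that concrete, here's a purely hypothetical back-of-the-envelope sketch (made-up prices and token counts, not our real numbers):

```python
# Purely hypothetical numbers to illustrate the point, not real prices:
# +40% per token can still be cheaper per task if fewer tokens are needed.
old_price_per_million_tokens = 10.00
new_price_per_million_tokens = old_price_per_million_tokens * 1.40  # +40% per token

old_tokens_per_task = 50_000
new_tokens_per_task = 30_000  # assumed better token efficiency on this task

old_cost = old_price_per_million_tokens * old_tokens_per_task / 1_000_000
new_cost = new_price_per_million_tokens * new_tokens_per_task / 1_000_000
print(f"{new_cost / old_cost:.2f}x")  # 0.84x -> ~16% cheaper per task in this made-up example
```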
Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
That would be a laudable goal, but I feel like it's contradicted by the text:
> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
Edit: credit where credit is due, the recently-added disclaimer is nice:
> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIE slot is the interior side of the DB-9 connector. RAM slots and PCIE slots are not even visible in the image.
It just overlaid a typical ATX pattern across the motherboard-like parts of the image, even if that's not really what the image is showing. I don't think it's worthwhile to consider this a 'local recognition failure', as if it just happened to mistake CMOS for RAM slots.
Imagine it as a markdown response:
# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)
1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX
2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard
3. ... etc more stuff that is supported only by force of preconception
--
It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.
Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.
Eh, I'm no shill but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a manner that makes the statements more accurate without the same expectations of being objective journalism of record.
Look, just give the Qwen3-VL models a go. I've found them to be fantastic at this kind of thing so far, and what's on display here is laughable in comparison. A closed-source / closed-weight paid model with worse performance than open weights? C'mon. OpenAI really is a bubble.
I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. Might be worth mentioning this is better but still inaccurate. Just a bit of feedback, I appreciate you are willing to show non-cherry-picked examples and are engaging with this question here.
Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
When I saw that it labeled DP ports as HDMI, I immediately decided that I am not going to touch this until it is at least 5x better, with 95% accuracy on basic things.
That's far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors because it will have achieved a high enough trust level that it will actually get used.
Think Therac-25: it worked 99.5% of the time. In fact, it worked so well that reports of malfunctions were routinely discarded.
There was a low-level Google internal service that worked so well that other teams took a hard dependency on it (against advice). So the internal team added a cron job to drop it every once in a while to get people to trust it less :-)
Oh and you guys don't mislead people ever. Your management is just completely trustworthy, and I'm sure all you guys are too. Give me a break, man. If I were you, I would jump ship or you're going to be like a Theranos employee on LinkedIn.
I disagree. I think the whole organization is egregious and full of Sam Altman sycophants that are causing a real and serious harm to our society. Should we not personally attack the Nazis either? These people are literally pushing for a society where you're at a complete disadvantage. And they're betting on it. They're banking on it.
Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
Why have "light" or "low" thinking then? I've mentioned this before in other places, but there should only be "none," "standard," "extended," and maybe "heavy."
Extended and heavy are about raising the floor (~25% and ~45%, or some other ratio, respectively), not determining the ceiling.
Not sure what you mean, Altman does that fake-humility thing all the time.
It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).
I'm confident that GP is acting in good faith, though. Maybe I am falling for it. Who knows? It doesn't really matter, I just wanted to be nice to the guy. It takes some balls to post as an OpenAI employee here, and I wish we heard from them more often, as I am pretty sure all of them lurk around.
It's the only reasonable choice you can make. As an employee with stock options, you do not want to get trashed on Hacker News, because this affects your income directly if you try to conduct a secondary share sale or plan to hold until the IPO.
Once the IPO is done and the lockup period has expired, a lot of employees are planning to sell their shares. But until then, even if the product is behind competitors, there is no way you can admit it without putting your money at risk.
I know HN commenters like to see themselves as contrarians, as do I sometimes, but man… this seems like a serious stretch to assume such malicious intent that an employee of the world’s top AI name would astroturf a random HN thread about a picture on a blog.
I’m fairly comfortable taking this OpenAI employee’s comment at face value.
Frankly, I don’t think a HN thread will make a difference to his financial situation, anyway…
Malicious? No, and this is far from astroturfing; he even speaks as "we". It's just a logical move to defend your company when people claim your product is buggy.
There is no other logical move - that is what I am saying, contrary to the people above who say this requires a lot of courage. It's not about courage, it's just normal and logical (and yes, Hacker News matters a lot; this place is a very strong source of signal for investors).
- reasoning_effort "minimal" was not renamed to "none"; "none" is a new, faster level supported by GPT-5.1 but not GPT-5 (see the API sketch after this list)
- there's no good reason it's called "chat" instead of "Instant"
- gpt-5.1 and gpt-5.1-chat are different models, even though they both reason now. gpt-5.1 is more factual and can think for much longer. most people want gpt-5.1, unless the use case is ChatGPT-like or they prefer its personality.
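For anyone wiring this up, here is a minimal sketch of what that naming means at the API level. The model name and effort values are taken from the list above and should be treated as illustrative, not as documentation:

```python
# Minimal sketch, not official docs: effort values and model name follow the
# list above ("none" is new in gpt-5.1; "minimal" still exists separately).
from openai import OpenAI

client = OpenAI()

for effort in ["none", "minimal", "low", "medium", "high"]:
    resp = client.chat.completions.create(
        model="gpt-5.1",            # assumed model name, per the list above
        reasoning_effort=effort,    # "none" skips reasoning for faster replies
        messages=[{"role": "user", "content": "What is 17 * 23?"}],
    )
    print(effort, resp.choices[0].message.content)
```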
I double-checked when the model started getting worse, and realized I was exaggerating the timeframe a little: November 5th is when it got worse for me. (One week in AI feels like a month...)
Was there a (hidden) rollout for people using GPT-5-thinking? If not, I have been entirely mistaken.
You people need a PR guy, I'm serious. OpenAI is the first company I've ever seen that comes across as actively trying to be misanthropic in its messaging. I'm probably too old-fashioned, but this honestly sounds like Marlboro launching the slogan "lung cancer for the weak of mind".
Tractors automate parts of farmwork. Dishwashers automate parts of dishwashing. Computers automate parts of accounting. People seem to like paying for things that make their lives easier. If people don't like something, they are free to not buy it.
Here's my prediction: the rapid progress of AI will make money as an accounting practice irrelevant. Take the concept of "the future is already here, but unevenly distributed." When we have true abundance, what the elites will target is the convex hull of progress: they want to control the leading edge of the wavefront, its direction, and who has access to resources and decision making. In such a scenario of abundance, the populace will have access to the iPhone 50 while the elites will have access to the iPhone 500, i.e. uneven distribution. The elites would like to directly control which resources get allocated to which projects; Elon is already doing that with his immense clout. This implies we would have a sort of multidimensional resource-based economy.
> When we have true abundance, what the elites will target is the convex hull of progress
We have abundance. The elite took it all.
Every single dollar gained from increased productivity over the past 50 years has been given to them, by changes in tax structures primarily, and their current claim is that actually we need to give them more.
Because that's all they know. More. A bigger slice of the pie. They demonstrably do not want to make the pie bigger.
Making the whole pie bigger dilutes their control and power over resources. They'd rather just take more of the current pie.
It’s going to take money. What if your AGI has some tax policy ideas that are different from the inference owners'?
Why would they let that AGI out into the wild?
Let’s say you create AGI. How long will it take for society to recover? How long will it take for people of a certain tax ideology to finally say oh OK, UBI maybe?
The last part is my main question. How long do you think it would take our civilization to recover from the introduction of AGI?
Edit: sama gets a lot of shit, but I have to admit at least he used to work on the UBI problem, orb and all. However, those days seem very long gone from the outside, at least.
If you are genuine in your questions, I will give them a shot.
AGI applied to the inputs (or supply chain) of what is needed for inference (power, DC space, chips, network equipment, etc.) will dramatically reduce the cost of inference. Most of the cost of stuff today is driven by the scarcity of "smart people's time". The raw materials needed are dirt cheap (cheaper than water). Transforming raw resources into useful high tech is a function of applied intelligence. Replace the human intelligence with machine intelligence, and costs will keep dropping (faster than the curve they are already on). Economic history has already shown this effect to be true: as we develop better tools to assist human productivity, the unit cost per piece of tech drops dramatically (Moore's law is just one example; everything that tech touches experiences this effect).
If you look at almost any universal problem with the human condition, one important bottleneck to improving it is intelligence (or "smart people's time").
I am sorry to quote myself from another comment; however, this is partially the same response:
> This is a lovely idea, it truly is. However, if you don't somehow bootstrap AGI, won't "the money" just use (hijack) it to make more money, in the normal dark ways?
> I always figured that we will begin to see signs of AGI when a quant firm starts making really outsized returns.
> "Hey AGI, how do we avoid being exposed as having implemented AGI?" - Now there is the next trillion dollar and very boring question.
In this seemingly undeniable environment, with the media and inference owners having full control of "AGI", what will be the timescale for our civilization to recover - for the benefits to finally trickle down to general humanity? 1 year, 10 years, 5 generations, possibly never? Assuming you accept my priors, what is that timescale in your view?
>sama gets a lot of shit, but I have to admit at least he used to work on the UBI problem, orb and all.
The Orb was never ever ever meant to ever fix anything about UBI or even help it happen.
It was always about creating a hyped enough cryptocoin he could use as an ATM to fund himself and other things. That's what all these assholes got into crypto for, like, demonstrably. It was always about taking investment from fools who could not punish you for screwing them over, and then taking your bag and going home to play.
The orb was a sales and marketing gimmick. There's nothing it could do that couldn't be done by commodity fingerprint scanners.
I thought that the orb was an alternative to the iOS human verification system. I was initially disgusted by the cryptocurrency association, but in retrospect, this seems like one of the very few applications of an immutable ledger.
I am not someone working on AGI but I think a lot of people work backwards from the expected outcome.
The expected outcome is usually something like a post-scarcity society: a society where basic needs are all covered.
If we could all live in a future with a free house, a robot that does our chores, and food that is never scarce, we should work towards that, they believe.
The intermediate steps aren't thought out, in the same way that, for example, the Communist Manifesto does little to explain the transition from capitalism to communism. It simply says there will be a need for things like forcing the bourgeoisie to join the common workers, and that there will be a transition phase, but it gives no clear steps between either system.
Similarly, many AGI proponents think in terms of "wouldn't it be cool if there was an AI that did all the bits of life we don't like doing", without the systemic analysis that many people do those bits because, for example, they need money to eat.
Automating work and making life easier for people are two entirely different things. Automating work tends to lead to life becoming harder for people, mostly on account of who benefits from the automation; basically, that better life ain't gonna happen under capitalism.
Try the `opencode` CLI and see how much nicer it is to use. The visual differentiation between user and agent messages is essential, plus formatted output with colors preserved from console commands, LSP support, parallel tool calls, syntax highlighting for code snippets, proper display of all diffs, etc.
>GPT-5 showed significant improvement only in one benchmark domain - which is Telecom. The other ones have been somehow overlooked during model presentation - therefore we won’t bother about them either.
I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.
Telecom was made after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that a superintelligence would probably plateau around here too, as getting 100% requires perfect guessing of which valid solution is written as the reference solution.
In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.
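To make the grading difference concrete, here's a toy sketch - not the actual tau2-bench grader; the function names and the booking example are invented for illustration:

```python
# Toy illustration only -- not tau2-bench code. Contrast grading by matching a
# single reference action sequence vs. grading by the final outcome state.

def grade_by_reference(agent_actions, reference_actions):
    # Brittle: a valid alternative solution that differs from the reference scores 0.
    return 1.0 if agent_actions == reference_actions else 0.0

def grade_by_outcome(final_state, expected_state):
    # Less brittle: any action sequence that reaches the required outcome gets credit.
    return 1.0 if all(final_state.get(k) == v for k, v in expected_state.items()) else 0.0

# Hypothetical task: cancel a booking. Two different but equally valid approaches.
reference = ["lookup_booking", "cancel_booking", "send_confirmation"]
agent_run = ["lookup_booking", "check_cancellation_policy", "cancel_booking", "send_confirmation"]

print(grade_by_reference(agent_run, reference))   # 0.0, despite being a valid solution
print(grade_by_outcome({"booking_status": "cancelled"},
                       {"booking_status": "cancelled"}))  # 1.0
```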
Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.
Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).
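A toy example of how all-or-nothing scoring plays out (made-up numbers):

```python
# Made-up numbers: a task counts as a pass only if every required step is correct.
tasks = [
    {"steps_required": 10, "steps_correct": 10},
    {"steps_required": 10, "steps_correct": 9},   # one slip -> the whole task scores 0
    {"steps_required": 10, "steps_correct": 9},
]
pass_rate = sum(t["steps_correct"] == t["steps_required"] for t in tasks) / len(tasks)
step_accuracy = sum(t["steps_correct"] for t in tasks) / sum(t["steps_required"] for t in tasks)
print(f"pass rate: {pass_rate:.0%}")          # 33%, even though...
print(f"step accuracy: {step_accuracy:.0%}")  # ...~93% of individual steps were right
```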
Appreciated the response! I noticed the same when I ran tau2 myself on gpt-5 and 4.1: gpt-5 is really good at looking at tool results and interleaving those with thinking, while 4.1/o3 struggle to decide the proper next tool to use, even with thinking. To some extent, gpt-5 is too good at figuring out the right tool to use in one go. Amazing progress.
This sounds very vague; what does scoring well on Telecom mean?
Can we get some (hypothetical) examples of ground truths?
For example, for the Airline domain, what kind of facts are these ground truth facts? All the airports, the passenger lines between them, etc.? Or does it mean detailed knowledge of the airplane manuals for pilots, maintenance, ...?
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.