Seeing OpenAI and Anthropic go different routes here is interesting. It's worth moving past the initial knee-jerk reaction that this model is unimpressive, and past comments like "they spent a massive amount of money and had to ship something for it..."
* Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
* OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Based on all of the comments from OpenAI, GPT-4.5 is absolutely massive, and with that size comes the ability to store far more factual data. The scores on ability-oriented things, like coding, don't show the kind of gains you get from reasoning models, but the fact-based test, SimpleQA, shows a pretty large jump and a dramatic reduction in hallucinations. You can imagine a scenario where GPT-4.5 coordinates multiple, smaller reasoning agents and uses its factual accuracy to enhance their reasoning, kind of like how ruminating on an idea "feels" like a different process than having a chat with someone.
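To make that "coordinator plus reasoning agents" picture concrete, here's a minimal sketch of what the loop could look like, assuming the standard OpenAI Python client; the model names, the prompts, and the three-step split are my own illustration, not anything OpenAI has described.

```python
# Hypothetical sketch: a large, knowledge-heavy model acts as coordinator, farms the
# hard reasoning core out to a smaller reasoning model, then fact-checks the draft.
# Model names and prompts are illustrative assumptions, not a documented OpenAI pattern.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Single-turn helper around the Chat Completions API."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def coordinated_answer(question: str) -> str:
    # 1. The big model decomposes the question into a sub-problem for a reasoner.
    subtask = ask("gpt-4.5-preview",
                  f"Restate the hard reasoning core of this question:\n{question}")
    # 2. A smaller reasoning model grinds on the sub-problem.
    draft = ask("o3-mini", f"Think step by step and solve:\n{subtask}")
    # 3. The big model uses its broader factual knowledge to check and polish the draft.
    return ask("gpt-4.5-preview",
               f"Question: {question}\nDraft answer: {draft}\n"
               "Correct any factual errors and reply conversationally.")

print(coordinated_answer("Why did the Library of Alexandria decline?"))
```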
I'm really curious whether they're currently combining two things that could also be split: EQ/communication and factual knowledge storage. This could all be a bust, but it's an interesting difference in approaches nonetheless, and worth considering that OpenAI could be right.
> * OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Seems inaccurate, as the most recent claim I've seen is that they expect this to be their last non-reasoning model and are aiming to provide all capabilities together in future model releases (unifying the GPT-x and o-x lines).
See this claim in TFA:
> We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other.
> After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
> In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
You could read this as unifying the models, or as building a unified system which coordinates multiple models. The second sentence, to me, implies that o3 will still exist; it just won't be standalone, which matches the idea I shared above.
Ah, great point. Yes, the wording here would imply that they're basically planning on building scaffolding around multiple models instead of having one more capable Swiss Army Knife model.
I would feel a bit bummed if GPT-5 turned out not to be a model, but rather a "product".
> know when to think for a long time or not, and generally be useful for a very wide range of tasks.
I'm going to call it now: no customer is actually going to use this. It'll be a cute little bonus for their chatbot god-oracle, but virtually all of their B2B clients are going to demand "minimum latency at all times" or "maximum accuracy at all times."
> OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
The high-level block diagrams for tech always end up converging to those found in biological systems.
Yeah, I don't know enough real neuroscience to argue either side. What I can say is that this path feels more like the way I observe myself thinking: it feels like there are different modes of thinking and different processes in the brain, and it seems like transformers are able to emulate at least two versions of that.
Once we figure out the frontal cortex and corpus callosum part of this, where we aren't calling other models over APIs but instead have them all working in the same shared space, I have a feeling we'll be on to something pretty exciting.
> Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
I don't think that is their primary motivation. The announcement post for Claude 3.7 was all about code, which doesn't seem to imply "all use cases": code this, new code tool that, telling customers they look forward to what they build, etc. There is very little mention of other use cases in the announcement at all. The usage stats they published are telling: 80% or more of queries to Claude are about code. In other words, while they are thinking about other use cases, I think they see code specifically as the major thing to optimize for.
OpenAI, given its different customer base and reach, is probably aiming for something more general.
IMO they all think that you need an "ensemble" of models with different capabilities to optimise for different use cases. It's more about how much compute each company has and what they target with those resources. Anthropic, I'm assuming, has fewer compute resources and a narrower customer base, so it may make economic sense to optimise just for that.
That's possible. My counterpoint would be that if that were the case, Anthropic would have built a smaller reasoning model instead of doing a "full" Claude. Instead, they built something which seems to be flexible across different types of responses.
It can never be just reasoning, right? Reasoning is the multiplier on some base model, and surely no amount of reasoning on top of something like gpt-2 will get you o1.
This model is too expensive right now, but as compute gets cheaper (and we have to keep in mind that it will), having a better base to multiply with will enable things that just more thinking won't.
You can try this for yourself with the distilled R1s that DeepSeek released. The Qwen-7B-based model is quite impressive for its size, and it can do a lot with additional context provided. I imagine for some domains you can provide enough context and let inference-time compute eventually solve the problem; for others you can't.
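If you want to try that locally, a rough setup might look like the sketch below; the Hugging Face repo id and the raw-prompt format are assumptions on my part, and you may want to use the model's chat template instead of plain text completion.

```python
# Rough sketch: run a distilled R1 locally and stuff domain context into the prompt.
# Assumes the repo id "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" and enough GPU memory
# for a 7B model; "domain_notes.txt" is a hypothetical file of your own context.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    device_map="auto",
)

context = open("domain_notes.txt").read()  # whatever domain context you can supply
question = "Given the notes above, which configuration minimizes latency?"

out = generator(
    f"{context}\n\n{question}",
    max_new_tokens=1024,   # leave room for the model's long chain of thought
    do_sample=False,
)
print(out[0]["generated_text"])
```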
Ever since those kids demoed their fact-checking engine here, which was just Input -> LLM -> Fact Database -> LLM -> LLM -> Output, I have been betting that it will be advantageous to move in this general direction.
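For what it's worth, the shape of that pipeline would be roughly the sketch below; every function and name here is a placeholder of mine to show the data flow, since the demo's internals weren't published.

```python
# Sketch of the "Input -> LLM -> Fact Database -> LLM -> LLM -> Output" shape.
# `llm` is any callable that maps a prompt string to a completion string, and
# `fact_db` is any object with a .search(query) method; both are stand-ins.

def extract_claims(llm, text: str) -> list[str]:
    """First LLM pass: pull out checkable factual claims."""
    return llm(f"List the factual claims in:\n{text}").splitlines()

def lookup(fact_db, claim: str) -> str:
    """Retrieve supporting or contradicting evidence from a fact database."""
    return fact_db.search(claim)

def judge(llm, claim: str, evidence: str) -> str:
    """Second LLM pass: compare each claim against the retrieved evidence."""
    return llm(f"Claim: {claim}\nEvidence: {evidence}\nSupported, refuted, or unclear?")

def summarize(llm, verdicts: list[str]) -> str:
    """Third LLM pass: turn per-claim verdicts into a readable report."""
    return llm("Summarize these fact-check verdicts:\n" + "\n".join(verdicts))

def fact_check(llm, fact_db, text: str) -> str:
    claims = extract_claims(llm, text)
    verdicts = [judge(llm, c, lookup(fact_db, c)) for c in claims]
    return summarize(llm, verdicts)
```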
Maybe. I'm inclined to think OpenAI sees it the way I laid it out, though, specifically because of their focus on communication and EQ in 4.5. It seems like they believe the large, non-reasoning model will be "front of house."
Or they'll use some kind of trained router which sends each request to the model it thinks should handle it first.
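Something like the sketch below, where a cheap classifier-style call picks the route before the real request is sent; the model ids and the two routing labels are assumptions of mine, just to show the shape.

```python
# Minimal sketch of a router in front of two models: a fast conversational model and a
# slower reasoning model. Model ids and labels are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

ROUTES = {
    "chat": "gpt-4.5-preview",   # EQ / conversation / factual recall
    "reason": "o3-mini",         # multi-step reasoning, math, code
}

def route(prompt: str) -> str:
    """Stand-in for a trained router; here a small model just picks a label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer with exactly 'chat' or 'reason'. Which does this need?\n{prompt}",
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    return ROUTES.get(label, ROUTES["chat"])

def answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=route(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```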