The issue here is that people have different definitions of AGI. Going by the description, getting 100% on this benchmark would be more than AGI and would qualify as ASI (Artificial Superintelligence), not just AGI.
People are still debating whether these models exhibit any kind of intelligence or any kind of thinking. Setting the bar higher than necessary is welcome, but at this point I'm pretty sure everyone's opinions are set in stone.
If you only outdo humans 50% of the time you're never going to get consensus on whether you've qualified. Whereas outdoing 90% of humans on 90% of the most difficult tasks we can come up with is going to be hard to argue against.
This benchmark is only one such task. After this one there's still the rest of that 90% to go.
Beating humans isn't anywhere near sufficient to qualify as ASI. That's an entirely different league with criteria that are even more vague.
Even dumb humans are considered to have general intelligence. If the bar is having to outdo the median human, then 50% of humans don't have general intelligence.
Not true. We don't have a good definition of intelligence; it's very much an "I'll know it when I see it" sort of thing.
Frontier models are reliably providing high undergraduate to low graduate level customized explanations of highly technical topics at this point. Yet I regularly catch them making errors that a human never would and which betray a fatal lack of any sort of mental model. What are we supposed to make of that?
It's an exceedingly weird situation we find ourselves in. These models can provide useful assistance to literal mathematicians yet simultaneously show clear evidence of lacking some sort of reasoning the details of which I find difficult to articulate. They also can't learn on the job whatsoever. Is that intelligence? Probably. But is it general? I don't think so, at least not in the sense that "AGI" implies to me.
Once humanity runs out of examples that reliably trip them up, I'll agree that they're "general" to the same extent that humans are, regardless of whether we've figured out the secrets behind things such as cohesive world models, self-awareness, active learning during operation, and theory of mind.
> Yet I regularly catch them making errors that a human never would
I have yet to see an "error" that modern frontier models make that I could not imagine a human making. Average humans are far more error-prone than the kind of person who posts here thinks, because the social sorting effects of intelligence are so strong that you almost never actually interact with people more than a half standard deviation away. (The one exception is errors in spatial reasoning about things humans are intimately familiar with, for example clothing, because LLMs live in literary space, not physical space, and only know about these things secondhand.)
> and which betray a fatal lack of any sort of mental model.
This has not been a remotely credible claim for at least the past six months, and it seemed obviously untrue for probably a year before that. They clearly do have a mental model of things; it's just not one that maps cleanly to the model of a human who lives in 3D space. In fact, their model of how humans interact is so good that you forget you're talking to something that has to infer, rather than intuit, how the physical world works, and then you attribute failures of that model to not having one.
> you almost never actually interact with people more than a half standard deviation away
I wasn't talking about the average person there but rather those who could also craft the high undergrad to low grad level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me, yet it contained errors you simply could not make if you had a working mental model of the subject at the level of the explanation provided (IMO, obviously). Imagine a toddler reciting a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird and I'm not sure what to make of it nor how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.
How much of this is expectation-setting by the heights models reach? I.e., if we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?
Not sure how to answer because we were off on a tangent there about mental models.
I think AGI is two things. Intelligence at a given task, which can be scored versus humans or otherwise. And generalization which is entirely separate. We already have superhuman non-general models in a few domains.
So I don't think that "better at AGI than X% of humans" is a sensible statement, at least not initially.
Right now humans generalize to all integers, while AI companies keep manually adding additional integers to a finite list and bystanders make claims of generality. If you've still got a finite list, you aren't general, regardless of how long the list is.
If at some point a model shows up that works on all even integers but not odd ones then I guess you could reasonably claim you had AGI that was 50% of what humans achieve. If a model that generalizes to all the reals shows up then it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them - I don't think we're there yet.
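To make the analogy concrete, here's a toy sketch (in Python; everything in it is invented for illustration and doesn't describe any real system) of the difference between a hand-extended finite list and an actual general rule:

    # A "finite list" solver: covers only the cases someone added by hand.
    MEMORIZED = {0: "even", 1: "odd", 2: "even", 3: "odd"}

    def list_solver(n: int) -> str:
        # Fails on any integer outside the list, no matter how long
        # the list grows: KeyError on 4, -7, 10**9, ...
        return MEMORIZED[n]

    # A "general" solver: implements the rule itself, so it covers
    # all integers with no per-case additions.
    def general_solver(n: int) -> str:
        return "even" if n % 2 == 0 else "odd"

Extending the first one is busywork forever; the second was general from the start.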
Interestingly, I find that the models generalize decently well as long as the "training" (more analogous to that for humans) fits in a (small enough) context. That is to say, "in-context learning" seems good enough for real use.
Given that models don't currently learn as they go, isn't that exactly what this benchmark is testing? If the model needs either to have been explicitly trained in a similar environment or to have a human manually input a carefully crafted prompt, then it isn't general. The latter case is a human tuning a powerful tool.
If it can add the necessary bits to its own prompt while working on the benchmark then it's generalizing.
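A minimal sketch of what that self-prompting loop might look like (all the names here, call_llm, parse_reply, env, are hypothetical placeholders, not any real API):

    # Hypothetical agent loop: the model plays a game with unknown rules
    # and appends rules it infers to its own working prompt as it goes.
    def play_unknown_game(env, call_llm, parse_reply, max_steps=50):
        notes = []  # rules the model writes down for itself
        observation = env.reset()
        for _ in range(max_steps):
            prompt = (
                "Rules inferred so far:\n" + "\n".join(notes) + "\n"
                f"Current state: {observation}\n"
                "Choose an action, and state any newly inferred rule."
            )
            action, new_rule = parse_reply(call_llm(prompt))
            if new_rule:
                notes.append(new_rule)  # the "spec" grows as it plays
            observation, done = env.step(action)
            if done:
                break
        return notes

If the model can fill in `notes` itself while working, that's generalizing; if a human has to write them, it's a tool being tuned.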
> I have yet to see an "error" that modern frontier models make that I could not imagine a human making
I mostly agree if "a human" is just any person we pluck off the street. What I still see with some regularity is the models (right now, primarily Opus 4.6 through Claude Code) making mistakes that humans:
- working in the same field/area as me (nothing particularly exotic, subfield of CS, not theory)
- with even a fraction of the declarative knowledge about the field as the LLM
- with even a fraction of the frontier LLM abilities suggested by their performance in mathematical/informatics Olympiads
would never make. Basically, errors I'd never expect to see from a human coworker (or myself). I don't yet consider myself an expert in my subfield, and I'll almost certainly never be a top expert in it. Often the errors seem to present to me as just "really atrocious intuition." If the LLM ran with some of them they would cause huge problems.
In many regards the models are clearly superhuman already.
I think you are getting caught up on the intelligence part. That is the easy part, since AGI doesn't have to be intelligent; it just has to emulate intelligence. If you look at early chess AIs you will see that they were very weak compared to even a beginner human. The level of intelligence does not matter for a chess bot to be considered AI. It is the fact that it emulates intelligence that makes it AI.
>But is it general? I don't think so
I would consider it general, since I can take any problem I can think of and the AI will make an attempt to solve it. Actually solving it is not a requirement for AGI; being able to solve it just makes it smarter than an AGI that can't. You can trip up chess AIs, but that doesn't stop them from being AI. So why apply that standard to AGI?
How am I getting caught up on it? I acknowledged that I think frontier models qualify as intelligent but disputed the "general" part. In fact for quite a few years now there have been many non-frontier models that I also consider intelligent within a very narrow domain.
I think Stockfish reasonably qualifies as superhuman AI but not even remotely "general". Similarly AlphaFold.
> Actually solving it is not a requirement for AGI.
I think I see what you're trying to get at but taken as worded that can't possibly be right. Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.
It's certainly true. By definition. If the bar for general intelligence is being smarter than the median human, 50% of people won't reach the threshold for general intelligence. (And if the bar is beating the median in every cognitive test, then a much smaller fraction of people would qualify.)
People don't have a consistent definition of AGI, and the definitions have changed over the past couple of years, but I think most people have settled on it meaning at least as smart as humans in every cognitive area. But that has to be compared to dumb people, not the median. We don't want to say that regular people don't have general intelligence.
You are using terms like "smart" and "dumb" as if they have universally-accepted definitions. You can make up as many definitions of intelligence as you like (I would argue that is a sign of intelligence) but using those terms is certainly going to lead to circular reasoning.
It has nothing to do with circular reasoning or my personal opinions.
You can choose to define general intelligence in a way that excludes regular people if you like, but then you'd be using a weird definition that differs from how 99.9% of people define it. Humans have general intelligence by any common definition.
Defining it that way doesn't exclude ordinary people. That's an erroneous claim on your part.
Humans as a class exhibit certain capabilities. Thus we expect a class of algorithm to either roughly meet or exceed those capabilities across the board in order to be considered "general". It is clear that we have not yet achieved that.
First, what is your definition exactly? That it must be better than the median human intelligence?
You're trying to define a term in a way that's completely detached from how anyone uses it. If we discover an alien race with an IQ of 95, people aren't going to say they don't have general intelligence.
We haven't defined an exact cutoff for what counts as general intelligence, but it has to include regular people with an IQ in the 70s that don't have a serious mental disability. If an AI can do every single cognitive task as well as a stupid person, it would have to qualify as having general intelligence if the stupid person qualified. It doesn't matter if the AI beats the median person 0% of the time, as long as it beats someone who is considered to have general intelligence at the task.
Approximately that an unaided agent must, with no outside assistance, be able to solve ~90% of the most difficult tasks that we throw at it with a ~90% success rate. It's not a precise definition but that's approximately where I stand on the matter.
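A rough formalization of that bar in Python (my numbers, nothing official):

    # results maps each cherrypicked hard task to the unaided agent's
    # measured success rate on it; thresholds are the ~90%/~90% above.
    def meets_agi_bar(results: dict[str, float],
                      task_frac: float = 0.9,
                      success_rate: float = 0.9) -> bool:
        solved = sum(1 for r in results.values() if r >= success_rate)
        return solved >= task_frac * len(results)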
> You're trying to define a term in a way that's completely detached from how anyone uses it.
I disagree, and believe that it is you who is attempting to redefine it to mean something it doesn't. See this definition of AGI (link shamelessly stolen from someone else in this comment section) from before the latest AI hype cycle started warping things. https://web.archive.org/web/20150108000749/https://en.wikipe...
> If we discover an alien race with an IQ of 95, people aren't going to say they don't have general intelligence.
Said race as a class would presumably be capable of meeting or exceeding my above criteria with appropriate exceptions made for tasks that are fundamentally incompatible with their biology of course.
Your attempt to compare to individual humans is an error. AGI applies on the class level, not the individual level. Consider if a company built a humanoid robot with superhuman performance at shot put. They market it as being the equal of humans at athletics. But then it turns out that it barely plays volleyball at a novice level, with even fairly poor human opponents able to defeat it handily. That is not equal to humans as a whole at athletics even though it might potentially be the equal of any given human at any given task.
Alternatively, if you could purchase the robot in different configurations and combined the full set of configurations covered every sport then the situation would be murkier.
> See this definition of AGI (link shamelessly stolen from someone else in this comment section) from before the latest AI hype cycle started warping things.
Every definition on that page, both theoretical and operational, matches my definition and not yours. Notice that none of them would exclude an AGI with an IQ around 90, provided its intelligence is general.
> Approximately that an unaided agent must, with no outside assistance, be able to solve ~90% of the most difficult tasks that we throw at it with a ~90% success rate. It's not a precise definition but that's approximately where I stand on the matter.
This isn't your definition. How hard are these "most difficult tasks"? Can 50% of humans solve them? 10%? If it were literally the most difficult problems, they would be the ones 0% of humans have solved.
> Said race as a class would presumably be capable of meeting or exceeding my above criteria
Some might, but this hypothetical alien race does not. Do you still consider them to have general intelligence if they can merely do everything a 95 IQ person could?
> Your attempt to compare to individual humans is an error. AGI applies on the class level
According only to you. LLMs are benchmarked individually. No one runs a benchmark where Claude gets half the questions right, GPT gets the other half right, and it's reported as a combined perfect score for the class. Instead they each score 50%. (Not that I think current AIs can solve the harder benchmark problems. The point is that they are measured individually.)
No one else ascribes general intelligence only to a class. You can talk to one average person (or alien), give them some tests, and determine they have general intelligence. This is how everyone else uses the term.
In retrospect, it seems obvious that we hit AGI by a reasonable "at least as intelligent as some humans" definition when o3 came out, and everything since then has been goalpost moving by people who have higher and higher bars for which percentile human they would be willing to employ (or consider intellectually capable). People should really just use the term "ASI" when their definition of AGI excludes the majority of humans.
Edit: Here's the guy who coined the term saying we're already there. Everything else is arguing over definitions.
> Well, Lars, I INVENTED THE TERM and I say we have achieved AGI. Current models perform at roughly high-human level in command of language and general knowledge, but work thousands of times faster than us. Still some major deficiencies remain but they're falling fast.
It's not that simple, since each problem is supposed to be distinct and different enough that no single program can solve multiple of them properly. No problem spec is provided either, iiuc, so you can't simply ask an LLM to generate code without doing other things.
A human can sit down to play a game with unknown rules and write a spec as he goes. If a model can't even figure out to attempt that, let alone succeed at it, then it most certainly isn't an example of "general" intelligence.
> A human can sit down to play a game with unknown rules and write a spec as he goes.
Some humans can. Many, if not most humans cannot. A significant enough fraction of humans have trouble putting together Ikea furniture that there are memes about its difficulty. You're vastly overestimating the capabilities of the average human. Working in tech puts you in probably the top ~1-5% of capability to intuit and understand rules, but it distorts your intuition of what a "reasonable" baseline for that is.
Yes, I am aware. However, an idealized human can do so. Analogously, there are plenty of humans who can't run an 8-minute mile, but if your bipedal robot is physically incapable of ever doing that then it isn't reasonable to claim you've achieved human-level athletic performance. When it can compete in every Olympic event, you can claim human-level performance at athletics in general.
If the model can't generalize to arbitrary tasks on its own without any assistance then it doesn't qualify as a general intelligence. AGI to my mind means meeting or exceeding idealized human performance on the vast majority of arbitrary tasks that are cherrypicked to be particularly challenging.
It's not obvious at all, and I would say it's pretty much impossible without using machine learning. Even for ARC-AGI-1 there is no GOFAI program that scores well.
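For flavor, the classic GOFAI approach to ARC-style tasks is brute-force search over a hand-written DSL. Here's a minimal sketch (the three primitives are invented for illustration; real attempts use far larger DSLs and still blow up combinatorially):

    from itertools import product

    # Grid transforms that make up a tiny, illustrative DSL.
    PRIMITIVES = {
        "flip_h": lambda g: [row[::-1] for row in g],
        "flip_v": lambda g: g[::-1],
        "transpose": lambda g: [list(r) for r in zip(*g)],
    }

    def search_program(examples, max_depth=3):
        # Enumerate every composition of primitives up to max_depth and
        # return the first one consistent with all (input, output) pairs.
        for depth in range(1, max_depth + 1):
            for names in product(PRIMITIVES, repeat=depth):
                def run(grid, names=names):
                    for name in names:
                        grid = PRIMITIVES[name](grid)
                    return grid
                if all(run(i) == o for i, o in examples):
                    return names
        return None  # the search space explodes long before it gets clever

E.g. search_program([([[1, 0], [0, 0]], [[0, 1], [0, 0]])]) returns ('flip_h',). That combinatorial explosion is exactly why no program of this sort has scored well on ARC-AGI-1.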