I’m not sure the ARC-AGI benchmarks are interesting. For one, they are image-based, and for two, most people I show them to have trouble understanding them, and in fact I had trouble understanding them.
Given that the models don’t even see the versions we get to see, it doesn’t surprise me that they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.
"most people I show them too have issues understanding them, and in fact I had issues understanding them"
???
those benchmarks are so extremely simple that they have basically 100% human solve rates. Unless you are saying "I could not grasp it immediately, but later I was able to after understanding the point", I think you and your friends should see a neurologist. And I'm not mocking you, I mean it seriously: those tasks are extremely basic for any human brain, and even for some other mammals, to do.
No, I think I saw the graphs on someone's channel, but maybe I misinterpreted the results. To be fair, my point never depended on 100% of the participants getting 100% of the questions right; there are innumerable factors that could affect your performance on those tests, including the pressure. The AI also had access to lenient conventions, so it should be "fair" in this sense.
Either way, there's something fishy about this presentation. It says:
"ARC-AGI-1 WAS EASILY BRUTE-FORCIBLE", but when o3 initially "solved" most of it, the co-founder of ARC Prize said:
"Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." He was saying, confidently, that this was not the result of brute-forcing the problems.
And it was not the first time,
"ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training."
Now they are saying ARC-AGI-2 is not brute-forcible. What is happening there? They didn't provide any reasoning for why one was brute-forcible and the other isn't, nor for how they are so sure about that.
They "recognized" that it could be brute-forced before, but in a much weaker way, by explicitly stating it would need "unlimited resources and time" to solve. And they are using non-brute-forceability in this presentation as a point in its favour.
---
Also, I mentioned mammals because those problems are of an order that mammals, and even other animals, need to solve in reality in a diversity of cases. I'm not saying they would literally be able to take the test and solve it, nor understand that it is a test, but that they need to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily discarded as you tried to do.
> my point never depended on 100% of the participants being right 100% of the questions
You told someone that their reasoning is so bad they should get checked by a doctor, because they didn't find the test easy, even though it averages a 60% score per person. You've been a dick to them while significantly misrepresenting the numbers - just stop digging.
The second test scores 60%; the first was way higher.
And I specifically said "unless you are saying 'I could not grasp it immediately but later I was able to after understanding the point' I think you and your friends should see a neurologist", to which this person did not respond.
I saw the tests and solved some; I suspect the variability here is more a question of methodology than an inherent problem with those people. I also never said my point depended on those people scoring 100% on the tests. Even though they are in fact extremely easy (the stated objective of the benchmark is literally to make tasks that most humans can easily beat but that are hard for an AI), variability will still exist, and people with different perceptions will skew the results; this is expected. "Significantly misrepresenting the numbers" is also a stretch: I only mentioned the numbers ONE time, and most of my point was about the inherent nature (or at least the intended nature) of the tests.
So, at the extreme, if he was not able to understand them at all, and this was not just a matter of initially grasping the problem, my point was that this could indicate a neurological or developmental problem, given the nature of the tasks. It's not a question of "you need to get all of them right"; his point was that he was unable to understand them at all, that they confused him at the level of basic comprehension.
Also, mammals? What mammals could even understand we were giving them a test?
Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.
This is a classic case of some phd ai guys making a benchmark and not really considering what average people are capable of.
Look, these insanely capable ai systems can’t do these problems but the boys in the lab can do them, what a good benchmark.
quoting my own previous response:
> Also, I mentioned mammals because those problems are of an order that mammals, and even other animals, need to solve in reality in a diversity of cases. I'm not saying they would literally be able to take the test and solve it, nor understand that it is a test, but that they need to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily discarded as you tried to do.
---
> Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.
I can show them to people in my family; I'll do it today and come back with the answer. It's the best way of testing that out.
The ARC-AGI-2 paper https://arxiv.org/pdf/2505.11831#figure.4 uses a non-representative sample; the success rate differs widely across participants, and "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them. The average test-taker solved 66% of tasks they attempted. 100% of ARC-AGI-2 tasks were solved by at least two people (many were solved by more) in two attempts or less."
Certainly those non-representative humans are much better than current models, but they're also far from scoring 100%.
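For what it's worth, those two figures ("75% of people who attempted them" vs "66% of tasks they attempted") measure different things and can diverge quite a bit depending on who attempted what. A minimal sketch with invented numbers (not the paper's data) showing how a per-task average and a per-participant average come apart:

```python
# Hypothetical attempt matrix: rows = participants, columns = tasks.
# None means "did not attempt"; True/False means solved / failed.
# These values are invented for illustration, not taken from the paper.
attempts = [
    [True, False, False, False],  # participant A attempted every task
    [True, None,  None,  None ],  # participant B attempted only the first task
    [True, None,  None,  None ],  # participant C
    [True, None,  None,  None ],  # participant D
]

# Task-level average: of the people who attempted each task, what share solved it?
task_rates = []
for col in range(len(attempts[0])):
    tried = [row[col] for row in attempts if row[col] is not None]
    task_rates.append(sum(tried) / len(tried))

# Participant-level average: of the tasks each person attempted, what share did they solve?
person_rates = []
for row in attempts:
    tried = [cell for cell in row if cell is not None]
    person_rates.append(sum(tried) / len(tried))

print("mean per-task solve rate:       ", sum(task_rates) / len(task_rates))      # 0.25
print("mean per-participant solve rate:", sum(person_rates) / len(person_rates))  # 0.8125
```

Which figure you quote depends on whether you weight by task or by person, so neither number on its own says much about how a "typical" human does.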
arc agi is the closest any widely used benchmark comes to an IQ test; it's straight logic/reasoning. Looking at the problem set, it's hard for me to choose a better benchmark for "when this is better than humans we have agi"
There are humans who cannot do arc agi though so how does an LLM not doing it mean that LLMs don’t have general intelligence?
LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.
But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?
That must mean most humans on this planet aren’t generally intelligent too.
> LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.
I agree. The problem I have with the Chinese Room thought experiment is this: just as the human who mechanically follows the books to answer questions they don't understand does not themselves know Chinese, likewise no neuron in the human brain knows how the brain works.
The intelligence, such as it is, is found in the process that generated the structure — of the translation books in the Chinese room, of the connectome in our brains, and of the weights in an LLM.
What comes out of that process is an artefact of intelligence, and that artefact can translate Chinese or whatever.
Because all current AI take a huge number of examples to learn anything, I think it's fair to say they're not particularly intelligent — but likewise, they can to an extent make up for being stupid by being stupid very very quickly.
But: this definition of intelligence doesn't really fit "can solve novel puzzles", as there's a lot of room for getting good at that by memorising a lot of things that puzzle-creators tend to do.
And any mind (biological or synthetic) must learn patterns before getting started: the problem of induction* is that no finite number of examples is ever guaranteed to be sufficient to predict the next item in a sequence; there is always an infinite set of other possible solutions in general (though in reality bounded by 2^n, where n is the number of bits required to express the universe in any given state).
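To make that concrete, here's a small sketch of the standard example (my illustration, not part of the original argument): two different rules that agree on the first five items of a sequence and then diverge, so the examples alone can't decide between them.

```python
from math import comb

def rule_powers_of_two(n: int) -> int:
    # The "obvious" continuation: 1, 2, 4, 8, 16, 32, ...
    return 2 ** n

def rule_circle_regions(n: int) -> int:
    # Moser's circle-regions count (reindexed to start at n = 0): 1, 2, 4, 8, 16, 31, ...
    return 1 + comb(n + 1, 2) + comb(n + 1, 4)

# Both rules fit the same five examples exactly...
assert all(rule_powers_of_two(n) == rule_circle_regions(n) for n in range(5))

# ...and then predict different sixth items.
print([rule_powers_of_two(n) for n in range(6)])   # [1, 2, 4, 8, 16, 32]
print([rule_circle_regions(n) for n in range(6)])  # [1, 2, 4, 8, 16, 31]
```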
I suspect, but cannot prove, that biological intelligence learns from fewer examples for a related reason: our brains have been given a bias by evolution towards certain priors from which "common sense" answers tend to follow. And "common sense" is often wrong, cf. Aristotelian physics (never mind Newtonian) instead of QM/GR.
The LLMs are not just memorising stuff, though: they solve math and physics problems better than almost every person alive, problems they've never seen before. They write code which has never been seen before better than something like 95% of active software engineers.
I love how the bar for are LLMs smart just goes up every few months.
In a year it will be, well, LLMs didn't create totally breakthrough new Quantum Physics, it's still not as smart as us... lol
All code has been seen before; that's why LLMs are so good at writing it.
I agree things are looking up for LLMs, but the semantics do matter here. In my experience LLMs are still pretty bad at solving novel problems (like arc agi 2), which is why I do not believe they have much intelligence. They seem to have started doing it a little, but are still mostly regurgitating.
Well, if we get on to hearing, our ears do a lot better than our eyes. From the sound entering the hole on either side of our head, we can split it into a myriad of frequencies and gather a lot of information from it.
In a similar situation, from a single point of light, our eyes would say "sort of blue-ish". Most visible-frequency information is ignored.
It seems reasonable to me to say that soundwaves exist in the world, but music only exists in our brains. There is something added in our perception of the soundwaves that turns them into music.
Something exists in the world, it seems. Sound waves and music are merely our interpretations of it. Maybe what actually exists in the world is music, and sound waves are what our brain invents when the music is too complex for it to grasp?
I guess my point is that sound waves are not terribly controversial. A simple device can measure sound waves present in the air. It would be much more difficult to build a device that told you whether music was playing. Reasonable people could disagree about whether certain sound waves constitute music or noise.
I think there is still an issue: I think it is counting down to the next 7:39 local time. Right now it says "opens in 22h 50m" for me (I am in Mountain Time; this was at 8:49 MST / 10:49 EST).
Even if you already have TikTok on your phone, you can’t use it now and get the following message:
> Sorry, TikTok isn't available right now
> A law banning TikTok has been enacted in the U.S. Unfortunately, that means you can't use TikTok for now.
> We are fortunate that President Trump has indicated that he will work with us on a solution to reinstate TikTok once he takes office. Please stay tuned!
If only our climate emergency had such a hero to swoop in and save the planet. Pretty sure I saw a headline on HN last week that we already blew through the 1.5C limit; I doubt it got a fraction of the 2000+ hits this story got yesterday.
That’s dishonest of ByteDance. The legislated ban doesn’t mean the existing US users can’t use it right now. This message simply means that ByteDance made a business decision to shut it down for the existing US user base.
I understand several reasons they might be making that business decision, including supportability reasons. I also get why they might be choosing to explain the situation dishonestly. But understanding their potential motivations doesn’t make a dishonest explanation any less dishonest.