
> Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly do hint at?

Yes, precisely this. The question is really what is ARC-AGI evaluating for?

1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on them should not be trained on the tasks, especially IF the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data (I don't know whether they are). Further, the few-shot examples in the evals appear to have been used to construct more training data in the HRM case; TRM may do something similar through its training data by other means.

2. If the goal is that these evaluations should still be difficult even after _having seen_ the training examples and generating more of them (after having peeked at the test set), then the ablations show that you can get pretty far without universal/recurrent Transformers.

If 1, then I think the ARC-prize organisers should have better rules laid out for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."

If 2, the organisers of the challenge should have evaluated how much of a challenge it would actually be when extreme 'data augmentation' is allowed, and they might have realised it wasn't that much of a challenge to begin with.

I tend to agree that, given the outcome of both HRM and this paper, the ARC-AGI folks do seem to allow this setting, _and_ that the task isn't as "AGI complete" as it sets out to be.
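To make concrete what kind of 'data augmentation' is at stake here, a minimal sketch in Python (my own illustration, with made-up task structure and helper names, not the HRM or TRM code): take the handful of demonstration pairs attached to a task and multiply them with grid symmetries and colour permutations.

    import random

    import numpy as np

    # Illustrative sketch only: an ARC-style task is treated as a list of
    # (input_grid, output_grid) demonstration pairs, and "augmentation"
    # multiplies those pairs with grid symmetries and colour relabellings.

    def dihedral_transforms(grid):
        """Yield the 8 rotations/reflections of a grid."""
        g = grid
        for _ in range(4):
            yield g
            yield np.fliplr(g)
            g = np.rot90(g)

    def permute_colours(grid, perm):
        """Relabel cell colours according to a permutation of the palette 0..9."""
        out = grid.copy()
        for old, new in perm.items():
            out[grid == old] = new
        return out

    def augment_pairs(demo_pairs, n_colour_perms=4, seed=0):
        """Blow up a handful of demonstration pairs into many training examples."""
        rng = random.Random(seed)
        colours = list(range(10))
        augmented = []
        for inp, out in demo_pairs:
            for t_inp, t_out in zip(dihedral_transforms(inp),
                                    dihedral_transforms(out)):
                for _ in range(n_colour_perms):
                    shuffled = colours[:]
                    rng.shuffle(shuffled)
                    perm = dict(zip(colours, shuffled))
                    augmented.append((permute_colours(t_inp, perm),
                                      permute_colours(t_out, perm)))
        return augmented

Each demonstration pair becomes 8 x n_colour_perms examples, so even the few pairs attached to the evaluation tasks can be blown up into a sizeable training set, which is exactly the blurred line in question.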



I should probably also add: It's long been known that Universal / Recursive Transformers are able to solve _simple_ synthetic tasks that vanilla transformers cannot.

Just check out the original UT paper, or some of its follow-ups: Neural Data Router, https://arxiv.org/abs/2110.07732; Sparse Universal Transformers (SUT), https://arxiv.org/abs/2310.07096. There is even theoretical justification for why: https://arxiv.org/abs/2503.03961

The challenge is actually scaling them up to be useful as LLMs as well (I describe why it's a challenge in the SUT paper).
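For anyone unfamiliar with the distinction, here is a rough sketch of the architectural idea (my own simplification in PyTorch, not code from any of the papers above): a vanilla transformer stacks N layers with separate parameters, while a universal/recurrent transformer applies one weight-tied layer for N steps, so the effective depth can be varied with problem difficulty.

    import torch.nn as nn

    class VanillaEncoder(nn.Module):
        def __init__(self, d_model=128, nhead=4, num_layers=6):
            super().__init__()
            # N separate parameter sets, one per layer
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                for _ in range(num_layers))

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    class UniversalEncoder(nn.Module):
        def __init__(self, d_model=128, nhead=4, num_steps=6):
            super().__init__()
            # one shared parameter set, applied recurrently
            self.layer = nn.TransformerEncoderLayer(d_model, nhead,
                                                    batch_first=True)
            self.num_steps = num_steps

        def forward(self, x, num_steps=None):
            # The recurrence depth can be changed at inference time, or made
            # adaptive with a halting mechanism as in the original UT paper.
            for _ in range(num_steps or self.num_steps):
                x = self.layer(x)
            return x

The weight tying is what lets depth be scaled at test time on those simple synthetic tasks; making the same idea work at LLM scale is the separate difficulty described in the SUT paper.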

It's hard to say, given the way ARC-AGI is allowed to be evaluated, whether this is actually what is at play. My gut tells me, given the type of data that's been allowed into the training set, that some leakage of the evaluation set has happened in both HRM and TRM.

But because, as a field, we've given up on carefully ensuring that training and test sets don't contaminate each other, we just decide it's fine and that the effect is minimal. Especially with LLMs, a test-set example leaking into the training data is merely a drop in the bucket (I don't believe we should be dismissing it this way, but that's a whole 'nother conversation).

With challenge-targeted models like these, that leakage becomes a much larger proportion of what influences the model's behaviour, especially when the open evaluation sets are there for everyone to look at and simply generate more data from. Now we don't know if we're generalising or memorising.


I think the best way to address this potential ARC overfitting would be to create more benchmarks that are similar in concept, focusing on fluid intelligence, but that approach it from another angle than ARC.

Of course, this is quite costly and also requires some "marketing" to actually get a new benchmark established.


This would not help if no proper constraints are established on what data can and cannot be trained on. It would also help to first figure out what the goal of the benchmark is.

If it is to test generalisation capability, then what data the evaluated model is trained on is crucial to drawing any conclusions.

Look at the construction of this synthetic dataset for example: https://arxiv.org/pdf/1711.00350
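As a toy illustration of what a deliberately constructed OOD split looks like (a simplified sketch in the spirit of that paper's compositional splits, not its actual data generator):

    # Toy SCAN-style compositional split: the test set is *constructed* to
    # differ systematically from the training set, so doing well on it
    # requires generalisation rather than interpolation.

    PRIMITIVES = {"walk": "WALK", "run": "RUN", "look": "LOOK", "jump": "JUMP"}
    MODIFIERS = {"twice": 2, "thrice": 3}

    def interpret(command):
        """Map a command like 'jump twice' to its action sequence 'JUMP JUMP'."""
        words = command.split()
        action = PRIMITIVES[words[0]]
        count = MODIFIERS[words[1]] if len(words) > 1 else 1
        return " ".join([action] * count)

    def add_primitive_split(held_out="jump"):
        """Keep the held-out primitive in training, but move all of its
        compositions to the test set."""
        commands = list(PRIMITIVES) + [
            f"{p} {m}" for p in PRIMITIVES for m in MODIFIERS]
        train = [c for c in commands if held_out not in c or c == held_out]
        test = [c for c in commands if held_out in c and c != held_out]
        return ([(c, interpret(c)) for c in train],
                [(c, interpret(c)) for c in test])

    train_set, test_set = add_primitive_split()
    # test_set == [('jump twice', 'JUMP JUMP'), ('jump thrice', 'JUMP JUMP JUMP')]

The split only measures generalisation as long as nothing from the test side leaks back into training, e.g. through 'augmentation'.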


>> Now we don't know if we're generalising or memorising.

"Now" starts around 1980 I'd say. Everyone in the field tweaks their models until they perform well on the "held-out" test set, so any ability to estimate generalisation from test-set performance goes out the window. The standard 80/20 train/test split makes it even worse.

I personally find it kind of scandalous that nobody wants to admit this in the field, and yet many people are happy to make big claims about generalisation, such as the "mystery" of generalising overparameterised neural nets.


You can have benchmarks with specifically constructed train-test splits for task-specific models. Train only on the train split; your results on the test split are then what should be reported.

You can still game those benchmarks (tune your hyperparameters after looking at test results), but that setting measures generalisation on the test set _given_ the specified training set. Using any additional data should be against the benchmark rules, and results obtained that way should not be compared on the same footing.
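Concretely, the protocol being described is something like the sketch below (hypothetical train_model/evaluate stand-ins, not any particular benchmark's harness): hyperparameters are chosen on a validation split carved out of the training data, and the test set is touched exactly once, at the end.

    import numpy as np

    def split(X, y, frac=0.8, seed=0):
        """Random split of arrays into two parts."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        cut = int(frac * len(X))
        return (X[idx[:cut]], y[idx[:cut]]), (X[idx[cut:]], y[idx[cut:]])

    def select_and_report(X_train_full, y_train_full, X_test, y_test,
                          hyperparam_grid, train_model, evaluate):
        # Carve a validation set out of the *training* data only.
        (X_tr, y_tr), (X_val, y_val) = split(X_train_full, y_train_full)

        # Iterate freely here: validation scores can be inspected as often
        # as you like without compromising the final estimate.
        best_score, best_hp = -np.inf, None
        for hp in hyperparam_grid:
            model = train_model(X_tr, y_tr, **hp)
            score = evaluate(model, X_val, y_val)
            if score > best_score:
                best_score, best_hp = score, hp

        # Retrain on all training data with the chosen hyperparameters, then
        # touch the test set once. If the number disappoints, report it
        # anyway rather than going back to iterate against the test set.
        final_model = train_model(X_train_full, y_train_full, **best_hp)
        return evaluate(final_model, X_test, y_test)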


What I'm pointing out above is that everyone games the benchmarks in the way that you say, by tuning their models until they do well on the test set. They train, they test, and they iterate until they get it right. At that point any results are meaningless for the purpose of estimating generalisation because models are effectively overfit to the test set, without ever having to train on the test set directly.

And this is standard practice: everyone does it all the time, and I believe a sizeable majority of researchers don't even understand that what they're doing is pointless, because that's what they've been taught to do, by looking at each other's work, by following what their supervisors tell them to do, etc.

Btw, we don't really care about generalisation on the test set, per se. The point of testing on a held-out test set is that it's supposed to give you an estimate of a model's generalisation on truly unseen data, i.e. data that was not available to the researchers during training. That's the generalisation we're really interested in. And the reason we're interested in that is that if we deploy a model in a real-world situation (rare as that may be) it will have to deal with unseen data, not with the training data, nor with the test data.


> Now we don't know if we're generalising or memorising.

The Arc HRM blog post says:

> [...] we set out to verify HRM performance against the ARC-AGI-1 Semi-Private dataset - a hidden, hold-out set of ARC tasks used to verify that solutions are not overfit [...] 32% on ARC-AGI-1 is an impressive score with such a small model. A small drop from HRM's claimed Public Evaluation score (41%) to Semi-Private is expected. ARC-AGI-1's Public and Semi-Private sets have not been difficulty calibrated. The observed drop (-9pp) is on the high side of normal variation. If the model had been overfit to the Public set, Semi-Private performance could have collapsed (e.g., ~10% or less). This was not observed.


The question I keep coming back to is whether ARC-AGI is intended to evaluate generalisation to the task at hand. This would then mean that the test data has a meaningful distribution shift from the training data, and only a model that can perform said generalisation can do well.

This would all go out the window if the model being evaluated can _see_ the type of distribution shift it would encounter during test time. And it's unclear whether the shift is the same in the hidden set.

There are also questions about the evaluations that arise from comparing the large models' performance against the smaller models, especially given the ablation studies. Are the large models trained on the same data as these tiny models? Should they be? If they shouldn't be, then why are we allowing these small models access to that data during training?


ARC-AGI 1 and 2 are spatial reasoning benchmarks. ARC-AGI 3 is advanced spatial reasoning with an agentic flavor.

They're adversarial benchmarks - they intentionally hit the weak point of existing LLMs. Not "AGI complete" by any means. But not useless either.


This is a point I wish more people would recognise.



