> you almost never actually interact with people more than a half standard deviation away
I wasn't talking about the average person there, but rather about people who could also craft the high-undergrad to low-grad level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (you can't prove a negative anyhow). There are cases where they rendered a detailed explanation to me, yet it contained errors that someone with a working mental model of the subject, at the level of the explanation provided, simply could not make (IMO, obviously). Imagine a toddler reciting a quantum mechanics textbook at you, then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird, and I'm not sure what to make of it nor how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.
How much of this is expectation setting by the heights models reach? I.e. if we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?
Not sure how to answer because we were off on a tangent there about mental models.
I think AGI is two things: intelligence at a given task, which can be scored against humans or otherwise, and generalization, which is entirely separate. We already have superhuman non-general models in a few domains.
So I don't think that "better at 'AGI' than X% of humans" is a sensible statement, at least not initially.
Right now humans generalize to all integers, while AI companies keep manually adding more integers to a finite list and bystanders make claims of generality. If you've still got a finite list, you aren't general, no matter how long the list is.
If at some point a model shows up that works on all even integers but not odd ones, then I guess you could reasonably claim AGI at 50% of what humans achieve. If a model shows up that generalizes to all the reals, it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them; I don't think we're there yet.
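To make the analogy concrete, here's a toy sketch; the dict is just a stand-in for capabilities patched in one at a time, not a claim about how any real system works:

```python
# A memorized finite list vs. a general rule, as a toy contrast.
# The dict stands in for capabilities added one by one; the modulo
# rule stands in for a mechanism that actually generalizes.

cases = {0: True, 1: False, 2: True, 3: False}  # however long, still finite

def finite_list_is_even(n: int) -> bool:
    return cases[n]  # fails the moment n falls outside the list

def general_is_even(n: int) -> bool:
    return n % 2 == 0  # one rule covering every integer

print(general_is_even(10**100))  # True: the rule generalizes
try:
    finite_list_is_even(10**100)
except KeyError:
    print("the finite list never covered this input")
```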
Interestingly, I find that the models generalize decently well as long as the "training" (in a sense closer to how humans are trained) fits in a small enough context. That is to say, "in-context learning" seems good enough for real use.
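Concretely, I mean something like this toy sketch of few-shot prompt assembly (the task and examples are made up for illustration):

```python
# Toy few-shot ("in-context learning") prompt assembly: the "training"
# is just worked examples small enough to fit in the context window.

examples = [
    ("rot13 of 'abc'", "nop"),
    ("rot13 of 'xyz'", "klm"),
]

def build_icl_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {new_input}\nA:"

print(build_icl_prompt(examples, "rot13 of 'hello'"))
```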
Given that models don't currently learn as they go, isn't that exactly what this benchmark is testing? If the model needs either to have been explicitly trained in a similar environment or to have a human manually input a carefully crafted prompt, then it isn't general. The latter case is a human tuning a powerful tool.
If it can add the necessary bits to its own prompt while working on the benchmark, then it's generalizing.
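Roughly the distinction I mean, as a sketch; `query_model` is a made-up placeholder for any LLM call, not a real API:

```python
# Hypothetical sketch of the distinction. query_model() is a made-up
# placeholder for any LLM call, not a real API.

def query_model(prompt: str) -> str:
    return f"<model output for a {len(prompt)}-char prompt>"  # dummy stub

def solve_with_human_tuning(task: str, handcrafted_hints: str) -> str:
    # A human curates the hints up front: a powerful tool being tuned,
    # not the model generalizing.
    return query_model(handcrafted_hints + "\n" + task)

def solve_with_self_augmentation(task: str, rounds: int = 3) -> str:
    # The model appends its own notes to its prompt as it works on the
    # benchmark: the generalization described above.
    notes = ""
    answer = ""
    for _ in range(rounds):
        answer = query_model(notes + "\n" + task)
        notes += "\n" + query_model(
            "What from that attempt would help on the next one?\n" + answer
        )
    return answer

print(solve_with_self_augmentation("some benchmark task"))
```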