> you almost never actually interact with people more than a half standard deviation away
I wasn't talking about the average person there, but rather about people who could also craft the high-undergrad to low-grad level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (you can't prove a negative anyhow). There are cases where they rendered a detailed explanation to me, yet it contained errors that someone with a working mental model of the subject, at the level of the explanation provided, simply could not make (IMO, obviously). Imagine a toddler reciting a quantum mechanics textbook at you, then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird, and I'm not sure what to make of it nor how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.
How much of this is expectation setting by the heights models reach? I.e. if we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?
Not sure how to answer because we were off on a tangent there about mental models.
I think AGI is two things: intelligence at a given task, which can be scored against humans or otherwise, and generalization, which is entirely separate. We already have superhuman non-general models in a few domains.
So I don't think that "better at 'AGI' than X% of humans" is a sensible statement, at least not initially.
Right now humans generalize to all integers, while AI companies keep manually adding more integers to a finite list and bystanders make claims of generality. If you've still got a finite list, you aren't general, no matter how long the list is.
If at some point a model shows up that works on all even integers but not odd ones, then I guess you could reasonably claim AGI at 50% of what humans achieve. If a model shows up that generalizes to all the reals, it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them; I don't think we're there yet.
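To make the analogy concrete, here's a toy sketch; the dict is just a stand-in for capabilities patched in one at a time, not a claim about how any real system works:

```python
# A memorized finite list vs. a general rule, as a toy contrast.
# The dict stands in for capabilities added one by one; the modulo
# rule stands in for a mechanism that actually generalizes.

cases = {0: True, 1: False, 2: True, 3: False}  # however long, still finite

def finite_list_is_even(n: int) -> bool:
    return cases[n]  # fails the moment n falls outside the list

def general_is_even(n: int) -> bool:
    return n % 2 == 0  # one rule covering every integer

print(general_is_even(10**100))  # True: the rule generalizes
try:
    finite_list_is_even(10**100)
except KeyError:
    print("the finite list never covered this input")
```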
Interestingly, I find that the models generalize decently well as long as the "training" (in a sense closer to how humans are trained) fits in a small enough context. That is to say, "in-context learning" seems good enough for real use.
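Concretely, I mean something like this toy sketch of few-shot prompt assembly (the task and examples are made up for illustration):

```python
# Toy few-shot ("in-context learning") prompt assembly: the "training"
# is just worked examples small enough to fit in the context window.

examples = [
    ("rot13 of 'abc'", "nop"),
    ("rot13 of 'xyz'", "klm"),
]

def build_icl_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {new_input}\nA:"

print(build_icl_prompt(examples, "rot13 of 'hello'"))
```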
Given that models don't currently learn as they go, isn't that exactly what this benchmark is testing? If the model needs either to have been explicitly trained in a similar environment or to have a human manually input a carefully crafted prompt, then it isn't general. The latter case is a human tuning a powerful tool.
If it can add the necessary bits to its own prompt while working on the benchmark, then it's generalizing.
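Roughly the distinction I mean, as a sketch; `query_model` is a made-up placeholder for any LLM call, not a real API:

```python
# Hypothetical sketch of the distinction. query_model() is a made-up
# placeholder for any LLM call, not a real API.

def query_model(prompt: str) -> str:
    return f"<model output for a {len(prompt)}-char prompt>"  # dummy stub

def solve_with_human_tuning(task: str, handcrafted_hints: str) -> str:
    # A human curates the hints up front: a powerful tool being tuned,
    # not the model generalizing.
    return query_model(handcrafted_hints + "\n" + task)

def solve_with_self_augmentation(task: str, rounds: int = 3) -> str:
    # The model appends its own notes to its prompt as it works on the
    # benchmark: the generalization described above.
    notes = ""
    answer = ""
    for _ in range(rounds):
        answer = query_model(notes + "\n" + task)
        notes += "\n" + query_model(
            "What from that attempt would help on the next one?\n" + answer
        )
    return answer

print(solve_with_self_augmentation("some benchmark task"))
```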