I believe you are correct here, yet at the same time I think we're about 2 orders of magnitude off on the amount of compute power needed to do it effectively.
I think the first order of mag will be in tree of thought processing. The number of additional queries we need to run to get this to work is at least 10x, but I don't believe 100x.
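A rough sketch of where that multiplier comes from, assuming a beam-style tree-of-thought loop (the `call_model` function here is a hypothetical stand-in for an LLM API call that just counts invocations):

```python
# Minimal sketch of why tree-of-thought multiplies query counts.
# `call_model` is a hypothetical stand-in for an LLM call; here it
# only counts invocations so we can tally the overhead.

query_count = 0

def call_model(prompt: str) -> list[str]:
    """Hypothetical LLM call: returns candidate next 'thoughts'."""
    global query_count
    query_count += 1
    return [f"{prompt}->t{i}" for i in range(3)]  # branching factor 3

def tree_of_thought(root: str, depth: int, beam: int = 2) -> list[str]:
    """Beam-style ToT: expand each frontier node, keep the top `beam`."""
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for node in frontier:
            candidates.extend(call_model(node))
        frontier = candidates[:beam]  # stand-in for a scoring/pruning step
    return frontier

tree_of_thought("question", depth=4)
# 1 + 2 + 2 + 2 = 7 model calls versus 1 for a single-shot answer;
# widen the beam or add a scoring pass per candidate and you land
# in the ~10x regime quickly.
```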
I think the second order of mag will be in multimodal inference, so the models can ground themselves in 'reality' data. Statements like "the brick lay on the ground and did not move" and "the brick floated away" are only decidable based on the truthfulness of the rest of the text corpus the model has seen. At least to me, it gets even more interesting when you tie it into environmental data that is more likely to be factual, such as massive amounts of video.