All of these emergent properties of image and video models leads me to believe that evolution of animal intelligence around motility and visually understanding the physical environment might be "easy" relative to other "hard steps".
The more complex that an eye gets, the more the brain evolves not just the physics and chemistry of optics, but also rich feature sets about predator/prey labels, tracking, movement, self-localization, distance, etc.
These might not be separate things. These things might just come "for free".
So the brain does not necessarily receive 'raw' images to process to begin with, there is already a lot of high level data extracted at that point such as optical flow to detect moving objects.
And the occipital is developed around extraordinary levels of image separation, broken down into tiny areas of the input, scattered and woven for details of motion, gradient, contrast, etc.
If you train a system to memorize A-B pairs and then you normally use it to find B when given A, then it's not surprising that finding A when given B also works, because you trained it in an almost symmetrical fashion on A-B pairs, which are, obviously, also B-A pairs.
The more complex that an eye gets, the more the brain evolves not just the physics and chemistry of optics, but also rich feature sets about predator/prey labels, tracking, movement, self-localization, distance, etc.
These might not be separate things. These things might just come "for free".