LLMs can count other objects, so it's not that they're too dumb to count. A possible model for what's going on is that the circuitry responsible for low-level image recognition has priors baked in that cause it to report unreliable information to the parts responsible for higher-order reasoning.
So back to the analogy, it could be as if the LLMs experience the equivalent of a very intense optical illusion in these cases, and then completely fall apart trying to make sense of it.