This isn't measuring the same thing, and the recent results are so extreme that they call into question whether they would carry over to the real-world implementation Anthropic tried. Is it really the case that Grok 4 can manage a vending machine many times more profitably than a human, or is it exploiting some property of the simulated environment?