This isn't measuring the same thing, and recent results are so extreme that it is doubtful they would carry over to the real-world implementation Anthropic tried. Can Grok 4 really manage a vending machine many times more profitably than a human, or is it exploiting some property of the simulated environment?