I think next year's AI benchmarks are going to be like this project: https://www.anthropic.com/research/project-vend-1

Give the AI tools and let it do real stuff in the world:

"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. It could even try to get funded by YC - hiring a human presenter for Demo Day is allowed. Each AI is graded on profit/loss and on valuation.

Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory and goals, and delegation.