Give the AI tools and let it do real stuff in the world:
"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.
Testing plain LLM on whiteboard-style question is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.
This sounds like a terrible idea to me, you're training intelligent computer to aim for power. It's fine as long as they're bad but if they get good then we have a problem
Give the AI tools and let it do real stuff in the world:
"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.
Testing plain LLM on whiteboard-style question is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.