
I think next year's AI benchmarks are going to be like this project: https://www.anthropic.com/research/project-vend-1

Give the AI tools and let it do real stuff in the world:

"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.

Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.



This sounds like a terrible idea to me: you're training intelligent computers to aim for power. It's fine as long as they're bad at it, but if they get good, then we have a problem.



