
Perhaps I've missed something, but where will the infinite amounts of training data come from, for future improvements?

If these models are trained on their own outputs (and the outputs of other models), then it's not so much a "flywheel" as it is a Perpetual Motion Machine.
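One way to see the worry, as a hedged toy sketch (not a claim about any real training pipeline): treat a model as a probability distribution over outputs, and treat "training on your own outputs" as refitting to samples drawn at a temperature below 1, which tilts the distribution toward its own modes. The entropy of this toy distribution then falls every generation, i.e. the outputs become less diverse:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def self_train_step(p, temperature=0.7):
    """One toy 'train on your own outputs' round: sharpen the
    distribution as sampling at T<1 and refitting would, i.e.
    raise each probability to 1/T and renormalize."""
    q = [x ** (1 / temperature) for x in p]
    z = sum(q)
    return [x / z for x in q]

p = [0.4, 0.3, 0.2, 0.1]  # toy "model" over four possible outputs
entropies = [entropy(p)]
for _ in range(10):
    p = self_train_step(p)
    entropies.append(entropy(p))

# Diversity shrinks monotonically toward a single dominant output.
print(entropies[0], entropies[-1])
```

This is only an illustration of the narrowing dynamic, not a model of real LLM training; the `temperature=0.7` sharpening step is an assumption standing in for "learn from your own most likely generations."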



Perplexity has a dubious idea built around a loop of harvesting user chats -> making the service better -> getting more user prompts. I am quite unconvinced that user prompts and stored chats will materially improve an LLM already trained on a trillion high-quality tokens.

The second idea being kicked around is that synthetic data will be a new fountain of youth: a fresh supply of training data that will also fix models' reasoning abilities.


There's pretraining, which is just raw text from the internet, but there's also supervised preference data sourced from users.

Right now the edge is in acquiring the latter, and OpenAI has a slight lead there.
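For concreteness, a minimal sketch of what "supervised preference data sourced from users" can look like. This assumes a hypothetical product log where users vote on alternative responses (e.g. thumbs up/down, or keeping one of several regenerations); each upvoted response is paired against each downvoted one to form the (prompt, chosen, rejected) triples that reward-model or DPO-style training consumes. The record shape and `pairs_from_votes` helper are illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the user preferred
    rejected: str  # response the user passed over

def pairs_from_votes(prompt, responses, votes):
    """Turn per-response user votes (+1 / -1) into preference pairs:
    every upvoted response is paired against every downvoted one."""
    liked = [r for r, v in zip(responses, votes) if v > 0]
    disliked = [r for r, v in zip(responses, votes) if v < 0]
    return [PreferencePair(prompt, c, r) for c in liked for r in disliked]

pairs = pairs_from_votes(
    "Explain what a flywheel is",
    ["answer A", "answer B", "answer C"],
    [+1, -1, +1],
)
# Yields (A over B) and (C over B), ready for preference-based fine-tuning.
```

The point of the pairwise shape is that raw thumbs-up counts are noisy across prompts, while "chosen vs. rejected for the same prompt" is the comparison a reward model can actually learn from.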



