Hacker News

Have people on HN never heard of public ChatGPT conversation datasets? They've been mentioned multiple times in past HN threads, and I thought they'd be common knowledge here by now. Pretty much all open source models have been training on them for the past 2 years; it's common practice. And haven't people been talking about "synthetic data" for a long time already? Why is all of this suddenly an issue in the context of DeepSeek? Nobody made a fuss about it before.

And just because a model trains on some ChatGPT data doesn't mean that data makes up the majority of its training set. It's just another dataset.


