Have people on HN never heard of public ChatGPT conversation datasets? They've been mentioned multiple times in past HN threads, and I thought they'd be common knowledge here by now. Pretty much all open source models have been training on them for the past 2 years; it's common practice. And haven't people been talking about "synthetic data" for a pretty long time now? Why is all of this suddenly an issue in the context of DeepSeek? Nobody made a fuss about this before.

And just because a model trains on some ChatGPT data doesn't mean that data makes up the majority of its training set. It's just another dataset.