> Currently Wikipedia mods/admins are dealing with AI generated articles being uploaded.
And they've always dealt with spam and low-quality submissions before. The system is working.
> As for NYT - I am assuming that lots of those stories are already available in some blog or the other.
I don't know what relevance that has to what we're talking about. The point is, train on the NYT. Blogs don't change what's on the NYT.
> The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove.
They've always been polluted with low-quality content. So yes, either don't train on them, or only train on highly upvoted solutions, etc.
AI pollution isn't fundamentally any different from previous low-quality content and spam. It's not terribly difficult to determine which parts of the internet are known to be high-quality and train only on those. LLMs can't spam the NY Times.
> The counter point is that NYT content is already in the training data
That's not a counter point. My point is, train on things like the NYT, not random blogs. You can also whitelist the blogs you know are written by people, rather than randomly spidering the whole internet.
Also, no -- most of the NYT hasn't been copied into blogs. A small proportion of top articles, maybe.
> Highly upvoted messages on reddit are very regular bots copying older top comments.
What does that matter if the older top comment was written by a person? Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.
> Verification does not scale, while generation scales.
You don't need to verify everything -- you just need to verify enough stuff to train a model on. We're always going to have plenty of stuff that's sufficiently verified, whether from newspapers or Wikipedia or whitelisted blogs or books from verified publishers or whatever. It's not a problem.
You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.
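The whitelisting idea above can be sketched in a few lines. This is a hypothetical illustration, not any real pipeline: the domain list and document fields are made up for the example.

```python
# Hypothetical sketch: filtering crawled documents down to a whitelist of
# trusted domains before training. Domains and fields are illustrative only.
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"nytimes.com", "en.wikipedia.org"}  # example whitelist

def is_trusted(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    # Accept the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

docs = [
    {"url": "https://www.nytimes.com/2023/01/01/tech/example.html", "text": "..."},
    {"url": "https://random-blogspam.example/post", "text": "..."},
]
trusted = [d for d in docs if is_trusted(d["url"])]
```

The point is that provenance filtering keys on the source, so it doesn't matter whether the untrusted junk is AI-generated or human-written spam.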
>What does that matter if the older top comment was written by a person?
That is the entire issue: LLMs fail when they are trained on GenAI-generated content.
> Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.
There is no model that can create factual accuracy. That would basically contravene the laws of physics; LLMs predict the next token.
>You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not
Afaik, all the current models are trained on this corpus. That is how they work.
> There is no model that can create factual accuracy.
Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens. This is a pretty fundamental aspect of LLMs.
> Afaik, all the current models are trained on this corpus.
Then apologies for being so blunt, but you know wrong. There is a tremendous amount of work done by the LLM companies to verify, sanitize, and structure their training corpora, using a wide array of techniques. They are absolutely not just throwing in blogspam and hoping for the best.
Thank you for being blunt. Let me attempt to speak in the same earnest tone.
You are contradicting the papers and statements from the very people who make the models. Alternatively, you are looking at the dataset curation process through rose-tinted glasses.
> There is a tremendous amount of work done by the LLM companies to verify, sanitize, and structure their training corpora, using a wide array of techniques.
Common Crawl is instrumental in building our models: 60% of GPT-3's training data was Common Crawl (https://arxiv.org/pdf/2005.14165, pg. 9).
CC, in turn, was never intended for LLM training; this misalignment in goals results in downstream issues like hate speech, NYT content, and other copyrighted material getting used to train models.
TLDR: 'NYT' and other high quality content has largely been ingested by models. Reddit and other sources play a large part in training current models.
While I appreciate your being blunt, bluntness is not the same as being sharp and incisive. Some precision here would help clarify your point.
Finally -
>Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens
What? Come on, I think you wouldn't agree with your own statement after reading it once more. Factual correctness is not a matter of degrees.
Furthermore, facts don't automatically create facts. Calculation, processing, testing, and verification create more facts. Just putting facts together creates content.
Re: corpus content, I think we're talking past each other. I'm saying that current models aren't being blindly trained on untrusted blogspam, and that a lot of work goes into verifying, structuring, and transforming the data. Earlier models were trained with lower-quality content, as companies were trying to figure out how much scale mattered; now they're paying huge amounts of money to improve the quality of what they ingest, to better shape the quality of output. As for what they take from Reddit, they're not blindly ingesting every comment from every user. My overall main point stands: we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.
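To make the curation claim concrete, here is a minimal sketch of the kind of step described above: keeping only highly upvoted, unflagged comments and dropping exact duplicates (e.g., bot-copied comments). The field names (`score`, `author_flagged`, `body`) and the threshold are assumptions for illustration, not any real company's schema or pipeline.

```python
# Illustrative curation sketch: upvote threshold, bot-flag filter, and exact
# deduplication via content hashing. All field names are assumed, not real.
import hashlib

def curate(comments, min_score=100):
    seen = set()
    kept = []
    for c in comments:
        # Drop low-scoring or bot-flagged comments.
        if c["score"] < min_score or c.get("author_flagged"):
            continue
        # Drop exact duplicates (e.g., a bot reposting an older top comment).
        digest = hashlib.sha256(c["body"].encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(c)
    return kept
```

Real pipelines do far more (fuzzy dedup, classifier-based quality scoring, PII scrubbing), but even this toy version shows why "they just ingest everything" isn't accurate.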
> What? Come on, I think you wouldn't agree with your own statement after reading it once more. Factual correctness is not a matter of degrees.
Of course it is. An LLM can be correct 30% of the time, 80% of the time, 95% of the time, 99% of the time. If that's not a matter of degrees, I don't know what is. If you're looking for 100% perfection, I think you'll find that not even humans can do that. ;)