
> Currently Wikipedia mods/admins are dealing with AI generated articles being uploaded.

And they've dealt with spam and low-quality submissions before. The system is working.

> As for NYT - I am assuming that lots of those stories are already available in some blog or the other.

I don't know what relevance that has to what we're talking about. The point is, train on the NYT. Blogs don't change what's on the NYT.

> The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove.

They've always been polluted with low-quality content. So yes, either don't train on them, or only train on highly upvoted solutions, etc.

AI pollution isn't fundamentally any different from previous low-quality content and spam. It's not terribly difficult to determine which parts of the internet are known to be high-quality and train only on those. LLMs can't spam the NY Times.



> The system is working.

Given that this is an issue I've heard from Wikipedia admins themselves, I am impressed by your confidence.

> The point is, train on the NYT. Blogs don't change what's on the NYT.

The counterpoint is that NYT content is already in the training data, because it's already replicated or copied into random blogs.

> So yes, either don't train on them, or only train on highly upvoted solutions, etc.

Highly upvoted messages on Reddit are quite often bots copying older top comments. Mods already have problems with AI comments.

----

TLDR: Pollution is already happening. Verification does not scale, while generation scales.


> The counterpoint is that NYT content is already in the training data

That's not a counterpoint. My point is, train on things like the NYT, not random blogs. You can also whitelist the blogs you know are written by people, rather than randomly spidering the whole internet.
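
To be concrete about what "whitelist" means here: it's just a set-membership check on the domain before a page can enter the corpus. A minimal sketch in Python (the domain list and example URLs are made up for illustration):

    from urllib.parse import urlparse

    # Hypothetical allowlist of sources vetted as human-written.
    TRUSTED_DOMAINS = {"nytimes.com", "en.wikipedia.org"}

    def is_trusted(url: str) -> bool:
        # Keep a page only if its host is an allowlisted domain or a subdomain of one.
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

    pages = [
        "https://www.nytimes.com/2024/01/01/technology/some-story.html",
        "https://random-spam-blog.biz/post",
    ]
    corpus = [p for p in pages if is_trusted(p)]  # only the NYT URL survives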

Also, no -- most of the NYT hasn't been copied into blogs. A small proportion of top articles, maybe.

> Highly upvoted messages on reddit are very regular bots copying older top comments.

What does that matter if the older top comment was written by a person? Also, Reddit is not somewhere you want to train on in the first place if you're trying to build a model where factual accuracy matters.

> Verification does not scale, while generation scales.

You don't need to verify everything -- you just need to verify enough stuff to train a model on. We're always going to have plenty of stuff that's sufficiently verified, whether from newspapers or Wikipedia or whitelisted blogs or books from verified publishers or whatever. It's not a problem.

You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.


> What does that matter if the older top comment was written by a person?

That is the entire issue? LLMs fail when they are trained on GenAI-based content?

> Also, Reddit is not somewhere you want to train on in the first place if you're trying to build a model where factual accuracy matters.

There is no model that can create factual accuracy. This would basically contravene the laws of physics. LLMs predict the next token.

> You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.

Afaik, all the current models are trained on this corpus. That is how they work.


> There is no model that can create factual accuracy.

Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens. This is a pretty fundamental aspect of LLMs.

> Afaik, all the current models are trained on this corpus.

Then apologies for being so blunt, but you know wrong. There is a tremendous amount of work done by the LLM companies in verifying, sanitizing, and structuring the training corpora, using a wide array of techniques. They are absolutely not just throwing in blogspam and hoping for the best.
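
For a flavor of what "sanitizing" means in practice: published pipelines (e.g. the MassiveText filtering described in the Gopher paper) start with simple document-level heuristics before anything model-based. A rough sketch of that kind of rule, with illustrative thresholds rather than anyone's published ones:

    import re

    def passes_quality_heuristics(doc: str) -> bool:
        # Rule-based pre-filter, loosely in the spirit of MassiveText-style
        # cleaning. All thresholds below are illustrative.
        words = doc.split()
        if not 50 <= len(words) <= 100_000:    # too short, or absurdly long
            return False
        mean_len = sum(len(w) for w in words) / len(words)
        if not 3 <= mean_len <= 10:            # gibberish and markup skew this
            return False
        alphabetic = sum(bool(re.search(r"[A-Za-z]", w)) for w in words)
        if alphabetic / len(words) < 0.8:      # mostly symbols -> drop
            return False
        return True

    # usage: clean = [d for d in raw_docs if passes_quality_heuristics(d)]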


Thank you for being blunt. Let me attempt to speak in the same earnest tone.

You are contradicting the papers and statements published by the people who make these models. Alternatively, you are looking at the dataset curation process through rose-tinted glasses.

> There is a tremendous amount of work done by the LLM companies in verifying, sanitizing, and structuring the training corpora, using a wide array of techniques.

Common Crawl is instrumental in building current models: 60% of GPT-3's training data was Common Crawl (https://arxiv.org/pdf/2005.14165, pg 9).
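
For reference, Table 2.2 of that paper gives the full mix, and those percentages are sampling weights during training -- conceptually, each training document is drawn from a weighted lottery over the datasets. A toy sketch (the weights are the published ones, which round to a total of 101%; the sampling code itself is just illustrative):

    import random

    # Weights in the GPT-3 training mix, per Table 2.2 of the paper.
    MIX = {
        "Common Crawl (filtered)": 60,
        "WebText2": 22,
        "Books1": 8,
        "Books2": 8,
        "Wikipedia": 3,
    }

    def sample_source() -> str:
        # Pick which dataset the next training document is drawn from.
        names, weights = zip(*MIX.items())
        return random.choices(names, weights=weights, k=1)[0]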

CC, in turn, was never intended for LLM training; this misalignment in goals results in downstream issues like hate speech, NYT content, copyrighted content, and more getting used to train models.

https://foundation.mozilla.org/en/research/library/generativ... (This article establishes the issues with CC as a source of LLM training data.)

https://facctconference.org/static/papers24/facct24-148.pdf (This paper details those issues.)

Firms such as the NYT are now stopping Common Crawl from archiving their pages. https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

-----

TLDR: NYT and other high-quality content has largely been ingested by models already. Reddit and other sources play a large part in training current models.

While I appreciate your being blunt, blunt also means not being sharp and incisive. Perhaps some precision is required here to clarify your point.

Finally -

> Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens.

What? Come on, I think you wouldn't agree with your own statement after reading it once more. Factual correctness is not a matter of degrees.

Furthermore, facts don't automatically create facts. Calculation, processing, testing, and verification create more facts. Just putting facts together creates content.


Re: corpus content, I think we're talking past each other. I'm saying that current models aren't being blindly trained on untrusted blogspam, and that there's a lot of work done to verify, structure, and transform what goes in. Earlier models were trained with lower-quality content, as companies were trying to figure out how much scale mattered; now they're paying huge amounts of money to improve the quality of what they ingest, to better shape the quality of the output. As for what they take from Reddit, they're not blindly ingesting every comment from every user. My overall main point stands: we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.
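
On the Reddit piece specifically, even a crude score cutoff changes what gets ingested. A sketch of the shape of that filter (field names and the cutoff are hypothetical, not anyone's published pipeline):

    MIN_SCORE = 50  # illustrative cutoff

    def keep(comment: dict) -> bool:
        # Drop deleted and low-scoring comments before they reach training.
        return (
            comment["author"] is not None
            and comment["body"] != "[deleted]"
            and comment["score"] >= MIN_SCORE
        )

    comments = [
        {"body": "Long answer with cited sources", "score": 412, "author": "a"},
        {"body": "lol", "score": 1, "author": "b"},
        {"body": "[deleted]", "score": 88, "author": None},
    ]
    curated = [c for c in comments if keep(c)]  # only the first comment survives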

> What? Come on, I think you wouldn't agree with your own statement after reading it once more. Factual correctness is not a matter of degrees.

Of course it is. An LLM can be correct 30% of the time, 80% of the time, 95% of the time, 99% of the time. If that's not a matter of degrees, I don't know what is. If you're looking for 100% perfection, I think you'll find that not even humans can do that. ;)


> I think we're talking past each other

Likely.

> we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

Do note -- it's the scalable mechanisms that I am looking at. I don't think the state of the art has shifted much since the last paper by OpenAI.

Can you link me to some new information or sources that lend credence to your claim?

> An LLM can be correct 30% of the time, 80% of the time, 95%…

That would be the accuracy rate, which can be a matter of degrees.

However, factual correctness largely cannot: the capital of Sweden today is Stockholm, with 0% variation in that answer.



