
If you were wondering what they trained it on:

> The AFM pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.

The fact that you can opt-out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.



That's why it's possible to have a default deny rule in robots.txt

    User-agent: *
    Disallow: /
And possibly allow-list the ones you accept. This probably won't change the fact that you may allow a vendor at one point in time, only to realise they changed their crawling use case and have been scraping data for AI training for the past 6 months (before they go public about it).
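A sketch of that default-deny-plus-allow-list pattern (the crawler names are just examples; per RFC 9309 each crawler follows the most specific group matching its name, and an empty Disallow means nothing is disallowed for that agent):

    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    User-agent: *
    Disallow: /

The wildcard group only applies to crawlers that don't match one of the named groups, so new or unknown bots are denied by default.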

It can be argued that if you are a server operator, you always know which User-agents are making requests to your resources.


I thought robots.txt provided an opt-out for all web crawlers, not just vendor-specific ones? What would be the use case for not using "*" if you didn't want something to be crawled?


Not exactly. It opts you out of well-behaved crawlers under any name, or, if you specify it, of some crawler in particular. Misbehaved ones require a bit more effort.


robots.txt has always had mechanisms for allowing or denying specific crawlers. Most of the AI labs that crawl the web support something like this now; here are a few relevant examples from the NY Times robots.txt file: https://www.nytimes.com/robots.txt

    User-agent: anthropic-ai
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: GPTBot
    Disallow: /


If Apple hired one billion employees to look at websites and manually write down the data they find, they would naturally not respect robots.txt.


It’s very reasonable to want Google and Bing to index your page for search but not have your data collected for training models, I think. I’m not familiar enough with robots.txt to know if it has a whitelisting mechanism.
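For what it's worth, the major vendors now publish separate user-agent tokens for training data collection versus search indexing, so you can keep the search crawlers while opting out of training. A sketch (the tokens shown are the vendor-documented ones, Google-Extended and Applebot-Extended; everything else is left allowed):

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: *
    Disallow:

Blocking Google-Extended doesn't affect Googlebot's search indexing, and likewise Applebot-Extended is separate from the Applebot crawler that feeds Siri and Spotlight.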


I see a lot of robots.txt files that use non-wildcards for Allow, but what would be the use case for using a non-wildcard for Disallow?


> The fact that you can opt-out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.

I think this goes wider than Apple. Many site administrators thought robots.txt was only for (dis)allowing crawlers that created search indexes that could give them page hits in exchange, and saw nothing wrong with allowing that.

Now, many companies crawl with another goal in mind. They don’t do that in secret, but announce it in their headers.

Now the question is how much you can blame those new crawlers for doing that. Should they have made an effort to ask site administrators, “hey, you allow us to crawl your site; are you sure you mean that?”, or should site administrators have noticed the new crawlers and started thinking about whether they meant to allow them to get their content? I think it’s a bit of both, and find it hard to judge who’s more to blame. But the newer crawlers are certainly less to blame for this, as sites should by now know that this problem exists.



