
If you were wondering what they trained it on:

> The AFM pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.

The fact that you can opt-out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.



That's why it's possible to have a default deny rule in robots.txt

    User-agent: *
    Disallow: /
And possibly allow-list the ones you accept. This probably won't change the fact that you may allow a vendor at one point in time, only to realise they changed their crawling use case and have been scraping data for AI training for the past 6 months (before they go public about it).
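A sketch of that default-deny-plus-allow-list pattern (the crawler names are just examples; per RFC 9309 each crawler follows the most specific group matching its name, and an empty Disallow means nothing is disallowed for that agent):

    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    User-agent: *
    Disallow: /

The wildcard group only applies to crawlers that don't match one of the named groups, so new or unknown bots are denied by default.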

It can be argued that if you are a server operator, you always know which User-agents are making requests to your resources.


I thought robots.txt provided an opt-out for all web crawlers, not just vendor-specific ones? What would be the use case for not using "*" if you didn't want something to be crawled?


Not exactly. It opts you out of well-behaved crawlers under any name, or, if you specify it, of some crawler in particular. Misbehaved ones require a bit more effort.


robots.txt has always had mechanisms for allowing or denying specific crawlers. Most of the AI labs that crawl the web support something like this now; here are a few relevant examples from the NY Times robots.txt file: https://www.nytimes.com/robots.txt

    User-agent: anthropic-ai
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: GPTBot
    Disallow: /


If Apple hired one billion employees to look at websites and manually write down the data they find, they would naturally not respect robots.txt.


It’s very reasonable to want Google and Bing to index your page for search but not have your data collected for training models, I think. I’m not familiar enough with robots.txt to know if it has a whitelisting mechanism.
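For what it's worth, the major vendors now publish separate user-agent tokens for training data collection versus search indexing, so you can keep the search crawlers while opting out of training. A sketch (the tokens shown are the vendor-documented ones, Google-Extended and Applebot-Extended; everything else is left allowed):

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: *
    Disallow:

Blocking Google-Extended doesn't affect Googlebot's search indexing, and likewise Applebot-Extended is separate from the Applebot crawler that feeds Siri and Spotlight.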


I see a lot of robots.txt files that use non-wildcards for Allow, but what would be the use case for using a non-wildcard for Disallow?


> The fact that you can opt-out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.

I think this goes wider than Apple. Many site administrators thought robots.txt was only for (dis)allowing crawlers that created search indexes that could give them page hits in exchange, and saw nothing wrong with allowing that.

Now, many companies crawl with another goal in mind. They don’t do that in secret, but announce it in their headers.

Now the question is how much you can blame those new crawlers for doing that. Should they have made an effort to ask site administrators, “hey, you allow us to crawl your site; are you sure you mean that?”, or should site administrators have noticed the new crawlers and started thinking about whether they meant to allow them to get their content? I think it’s a bit of both, and find it hard to judge who’s more to blame. But the newer crawlers are certainly less to blame for this, as sites should by now know that this problem exists.



