
"archive.org" obeys "robots.txt". Over-obeys, in fact. If you add a robots.txt file that locks them out, old archives disappear, too. Even if the domain name has changed hands.


> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

It might not be true forever. Unfortunately.

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


This is an important and useful improvement. Many domain parkers/squatters/etc who snap up dead domains have robots.txt files that block everything or almost everything, breaking the ability to see the previous site via archive.org.

(Side note: domain name expiration was a mistake.)


A fun thing to do before your website expires is to set up HSTS with "includeSubDomains" and enroll the domain in HSTS preload. Many of the bots that backorder domains in order to put ad pages on them don't use SSL at all (not even LetsEncrypt), so the domain ends up useless to them.
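
Concretely, that means serving a header roughly like the one below over HTTPS and then submitting the domain at hstspreload.org (the preload list requires a max-age of at least a year, includeSubDomains, and the preload directive):

    Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

Once browsers ship the domain as preloaded, they refuse plain-HTTP connections to it and to every subdomain, so a new owner who won't bother with certificates just serves errors.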


If they ignore robots.txt, then what else gives them the right to copy and host content from other sites? As much as I value the Wayback Machine and archive.org, I think putting this into the realm of bilateral negotiation and a DMCA-like model outside the courts is a slippery slope. It's a non-solution that risks breeding new monopolies, much as Google's exclusive relationships with news publishers are doing. Is there nothing in HTML metadata (schema.org etc.) informing crawlers and users about usage rights that could be lifted or extended for this purpose, especially now that the EU copyright reform has established a legal framework and recognition of principles in the attention economy?


> If they ignore robots.txt, then what else gives them the right to copy and host content from other sites?

The same thing that gives them the right otherwise: fair use, and the explicit archiving exceptions written into copyright law. robots.txt carries no additional legal force.


Fair use does not give you the right to wholesale scrape content that is otherwise under copyright with a non-CC/open license, which is effectively what the Internet Archive does. (To be clear, I approve of IA's mission but it's in a legal grey area.)

robots.txt has never had much of a legal meaning. Respecting it was mostly a defense along the lines of "You only have to ask, even retrospectively, and we won't copy your content." As a practical matter, very few are going to sue a non-profit to take down content when they pretty much only have to send an email, (almost) no questions asked.


> Fair use does not give you the right to wholesale scrape content

Yes, it potentially does. There are court cases establishing precedent that copying something in its entirety can still be fair use, as well as law and court cases establishing specific allowances for archives/libraries/etc.


There's probably an argument that archiving a particular site as a whole serves some compelling public interest--say, a politician's campaign site. But it seems unlikely that would extend to randomly archiving (and making available to the public) web sites in general.

I've always been told that fair use--as a defense against a copyright infringement claim--is very fact dependent.


IANAL, but I fail to see how fair use can be leveraged to give archive sites a right to host other sites' content when that content is already available publicly and on non-discriminatory terms, and when there are e.g. Creative Commons license metadata tags for giving other sites explicit and specific permission to re-host content. There are also concerns to be addressed under the EU copyright reform (e.g. previewing large portions of text from other sites without sending those sites any clicks). If your point is that content creators can't technically or "jurisdictionally" stop archival sites from re-hosting, then the logical consequence is that content creators need to look at DRM and similarly draconian measures, which I hope they aren't forced into.
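
For illustration, the license metadata I mean already has established forms; a sketch of what a page that explicitly grants re-use might carry (rel="license" and the schema.org license property are real conventions, though nothing obliges a crawler to honor them):

    <!-- CC REL style: a visible link declaring the page's license -->
    <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>

    <!-- schema.org equivalent as JSON-LD -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "WebPage",
      "license": "https://creativecommons.org/licenses/by/4.0/"
    }
    </script>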


How does it work for all the content that isn't covered by fair use because the author comes from a country with different laws?


The author's jurisdiction is irrelevant. The only question is what jurisdiction's laws apply to the Internet Archive (or in general whatever party does the copying).


Then that country can try to enforce its laws on the Internet Archive. Won't be easy, though.


Besides the aforementioned gov/mil sites that archive.org announced a changed policy on, I noticed during the TurboTax controversy [0] that archive.org seems to have disobeyed robots.txt (2008) [1a, 1b] and NOARCHIVE (2006) [2] for TurboTax as well. It wouldn't surprise me if archive.org retroactively changed this because of the controversy, since they manually changed display settings during the Joy Reid controversy [3]. But the latter case involved a webmaster who set robots.txt to wipe out old pages, whereas TurboTax has had robots.txt + NOARCHIVE as a policy for years. I had assumed archive.org wouldn't even store such pages during the original crawl, but that seems not to be the case?

[0] https://news.ycombinator.com/item?id=19758126

[1a] http://web.archive.org/web/20080512041438/http://turbotax.in...

[1b] http://web.archive.org/web/20080512004753/http://turbotax.in...

[2] https://twitter.com/dancow/status/1123762475244163078

[3] https://blog.archive.org/2018/04/24/addressing-recent-claims...
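
For context, the NOARCHIVE directive referenced in [2] is a per-page meta tag rather than a robots.txt rule; roughly:

    <!-- Asks compliant crawlers not to serve a cached or archived copy of this page -->
    <meta name="robots" content="noarchive">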



