Hacker News

What astounds me is that there are no readily available libraries crawler authors can reach for to parse robots.txt and meta robots tags, decide what is allowed, and work through the arcane, poorly documented precedence rules between the two robots lists, including what to do when they disagree, which they often do.

Yes, there's Google's ancient reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check whether a path is OK or not.

Even if bot writers WANT to be good, it's much harder than it should be, particularly since much of the robots information isn't in the robots.txt files at all; it's in the pages' meta tags.
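To illustrate the meta-tag side: there's no stdlib helper for `<meta name="robots">`, but a minimal stdlib-only sketch might look like this (the class name and sample page are illustrative, not from any library):

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives |= {d.strip().lower() for d in content.split(",")}

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = MetaRobotsParser()
parser.feed(page)
print("noindex" in parser.directives)  # True
```

Even this toy version has to make judgment calls (case sensitivity, whitespace, per-bot variants like `name="googlebot"`), which is exactly the kind of thing a shared library should settle once.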



robots.txt support is built into the Python stdlib as urllib.robotparser: https://docs.python.org/3/library/urllib.robotparser.html
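For example, a minimal allowed/disallowed check with that stdlib parser (the robots.txt body and user-agent name here are made up):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

# parse() takes an iterable of lines; alternatively set_url() + read()
# fetches and parses the live file.
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public"))        # True
print(rp.crawl_delay("MyCrawler"))                                    # 10
```

It only covers robots.txt, though, not meta robots tags, so the precedence question between the two is still left to the crawler author.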

rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.

The problem it was trying to solve was that spammers would add links to their site anywhere they could, and Google would treat this as the page the links were on endorsing the linked page as relevant content. rel=nofollow basically means "we do not endorse this link". The specification makes this clearer:

> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).

> nofollow is a bad name […] does not mean the same as robots exclusion standards

https://microformats.org/wiki/rel-nofollow
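Concretely, a link-analysis pass that honors the attribute still sees the link; it just records the nofollow flag and withholds ranking weight. A stdlib-only Python sketch (class name and sample HTML are illustrative):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects (href, is_nofollow) pairs from <a> tags.

    A ranking pass would skip the nofollow links when assigning
    link weight; a crawler may still fetch them.
    """
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = attrs.get("rel", "").lower().split()
        self.links.append((attrs.get("href"), "nofollow" in rel))

p = LinkParser()
p.feed('<a href="/about">About</a> <a rel="nofollow" href="http://spam.example">x</a>')
print(p.links)  # [('/about', False), ('http://spam.example', True)]
```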


Thanks for this!


The "good" bot writers rarely have enough resources to demolish servers blindly, and they're generally careful whether or not you make it easier, so there's not much incentive.


I don't see a reason why a good bot operator couldn't build a parser lib in a different language and put it on a public repo.

Shouldn't be that hard if someone WANTS to be good.


Sure, but it's always easier to use a tool that's been tried and tested.



