
One easy way to bypass this is to have an external service (archive.org, GitHub Actions, etc.) fetch robots.txt and cache it, then expose the cached copy to the actual scrape server through a separate API, a webhook, or a manual download.

The robots.txt file is usually small, so fetching it wouldn't raise any flags with those external services.
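
A minimal sketch of the idea, using the Wayback Machine's availability API (which is real) to retrieve a cached robots.txt instead of hitting the target server directly; the target URL and bot name here are hypothetical placeholders:

    # Fetch a site's robots.txt via the Wayback Machine instead of
    # requesting it from the target server, then parse the rules locally.
    import json
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    TARGET = "https://example.com/robots.txt"  # hypothetical target

    # Ask archive.org for its most recent snapshot of the robots.txt URL.
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(TARGET)
    with urllib.request.urlopen(api) as resp:
        snapshot = json.load(resp)["archived_snapshots"].get("closest")

    if snapshot and snapshot.get("available"):
        # Download the cached copy from archive.org, not from the target.
        with urllib.request.urlopen(snapshot["url"]) as resp:
            body = resp.read().decode("utf-8", errors="replace")

        # Parse the cached rules and check a path against them, so the
        # scrape server never touches the target's robots.txt itself.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(body.splitlines())
        print(rp.can_fetch("MyBot", "https://example.com/some/page"))

The tradeoff is staleness: the cached copy may lag behind the live file, so the scraper is obeying (or evading detection on) a possibly outdated set of rules.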


