
Google has been crawling the web for a long time, and I think the key is that it limits itself to roughly 1 request per second (1 Hz) per IP. At that rate, a site with 100,000 pages takes more than a day to crawl from a single IP. A crawler could instead send requests as fast as possible, just distributed across many IPs.
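
To make the back-of-the-envelope math concrete, here is a tiny Python sketch; the 1 Hz figure and the IP counts are just the assumptions above, not anything Google has published:

    # Hypothetical crawl-time arithmetic; the rate limit and IP counts are assumptions.
    def crawl_hours(pages: int, requests_per_second: float, ips: int = 1) -> float:
        """Hours to fetch `pages` at `requests_per_second` per IP, spread over `ips` IPs."""
        return pages / (requests_per_second * ips) / 3600.0

    print(crawl_hours(100_000, 1.0))           # ~27.8 hours from a single IP at 1 Hz
    print(crawl_hours(100_000, 1.0, ips=100))  # ~0.28 hours (about 17 minutes) across 100 IPs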

If I put a website up, anything goes as long as people don't DDoS me. A human browses at roughly 1 Hz; if a bot hits the site at 1000 Hz, that is effectively a denial-of-service attack. It's hard to block, because you can't simply rate-limit by IP when many people share the same IP, so you end up needing heuristics, cookies, and so on.
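
Here is a minimal sketch of what naive per-IP rate limiting looks like and why it breaks down; the names (LIMIT, WINDOW_SECONDS, allow) are made up for illustration, not any particular framework's API:

    # Naive sliding-window rate limiter keyed by IP; illustrative only.
    import time
    from collections import defaultdict, deque

    LIMIT = 2            # max requests per window per IP (~ the "human 1 Hz" budget)
    WINDOW_SECONDS = 1.0

    _history = defaultdict(deque)

    def allow(ip, now=None):
        """Return True if this IP is still under LIMIT requests in the current window."""
        now = time.monotonic() if now is None else now
        window = _history[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()   # drop requests that fell out of the window
        if len(window) >= LIMIT:
            return False       # over budget: a 1000 Hz bot trips this immediately
        window.append(now)
        return True

The catch is the key: behind carrier-grade NAT, thousands of humans can share one IP, so a limit tight enough to stop bots also blocks legitimate users. That's why real deployments fall back on cookies, challenges, fingerprinting, and other heuristics.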

Putting paywalled content (such as books) into AI training data is not cool, though. Nobody was anticipating this, and people got effed unexpectedly. This is piracy at hyperscale. Not fair.


