I wrote some scrapers for a customer a few years back. After about a month I got a call that my scraper flagged a certain URL as unreachable, but the URL did work in a browser.
As it turned out, the webserver returned a 500 for every file but still served the content, so the website rendered flawlessly in a browser.
I still wonder if it was just a badly configured webserver or if the owner did this on purpose to deter scrapers and search engines.
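For context, the gap is roughly this (a minimal sketch with assumed names, not my original scraper): a status-only check reports the site as unreachable, while a check that also looks at the body sees the content is there.

```typescript
// Status-only checks call this "down"; the body tells a different story.
async function probe(url: string): Promise<"ok" | "served-despite-error" | "unreachable"> {
  try {
    const res = await fetch(url);
    const body = await res.text();
    if (res.ok) return "ok";
    // The server answered 500, but it still sent a full page,
    // which is exactly what a browser happily renders.
    return body.trim().length > 0 ? "served-despite-error" : "unreachable";
  } catch {
    // Network error, DNS failure, timeout, etc.
    return "unreachable";
  }
}
```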
I wrote a simple Firefox extension mostly for personal use that finds bookmarks that point to unreachable pages.
After publishing it on addons.mozilla.org, I almost immediately got messages that it marked reachable websites as expired.
So instead of just checking the HTTP status code for 404, I now also check whether the page content contains strings like "404" or "not found". If it doesn't contain those, I only mark the bookmark as maybe expired.
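Roughly like this (a simplified TypeScript sketch of the heuristic, not the extension's actual code; the marker list is just the idea, not an exhaustive set):

```typescript
type BookmarkStatus = "expired" | "maybe expired" | "ok";

// Illustrative markers only; a real check would use a longer list.
const NOT_FOUND_MARKERS = ["404", "not found"];

async function checkBookmark(url: string): Promise<BookmarkStatus> {
  let res: Response;
  try {
    res = await fetch(url);
  } catch {
    // Unreachable at the network level; a real check might retry later.
    return "expired";
  }
  if (res.status !== 404) return "ok";

  const body = (await res.text()).toLowerCase();
  const looksMissing = NOT_FOUND_MARKERS.some((m) => body.includes(m));
  // 404 plus a "not found"-looking page: confidently expired.
  // 404 without any marker: only flag it as maybe expired.
  return looksMissing ? "expired" : "maybe expired";
}
```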
How do you deal with languages other than English? I guess one could generate a very unlikely URL and treat whatever response that gets as the site's "not found" page. Still wouldn't work in some cases.
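Something along these lines, maybe (only a sketch of that idea; the probe path and the 10% body-size threshold are arbitrary assumptions, and it breaks on sites that redirect every unknown path to a landing page):

```typescript
// Request a path that almost certainly doesn't exist on the same site,
// then compare the bookmarked page against that guaranteed-miss response.
// Language-independent, since it never looks for specific words.
async function looksLikeSoft404(url: string): Promise<boolean> {
  const probeUrl = new URL(`/surely-missing-${crypto.randomUUID()}`, url).href;

  const [page, probe] = await Promise.all([fetch(url), fetch(probeUrl)]);
  const [pageBody, probeBody] = await Promise.all([page.text(), probe.text()]);

  const sameStatus = page.status === probe.status;
  const similarSize =
    Math.abs(pageBody.length - probeBody.length) < probeBody.length * 0.1;
  return sameStatus && similarSize;
}
```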
I guess it works like this:
- 404 status with "not found" or "404" - surely not found
- 404 status without above - "maybe" not found
- not-404 status - bookmark is fine :)
Reminds me of this DefCon talk which discussed the effects of returning different HTTP status codes on various vulnerability scanners (without affecting the web browser).
https://www.youtube.com/watch?v=4OztMJ4EL1s
Differentials in how browsers handled weird status codes allowed for fingerprinting.
Differentials in how different automated tools/scanners handled weird status codes allowed for defensive tactics.