
I wrote some scrapers for a customer a few years back. After about a month I got a call that my scraper flagged a certain URL as unreachable, but the URL did work in a browser.

As it turned out, the webserver returned a 500 for every file but still served the content. So the website rendered flawlessly in a browser.

I still wonder whether it was just a badly configured webserver or whether the owner did this on purpose to deter scrapers and search engines.



I wrote a simple Firefox extension mostly for personal use that finds bookmarks that point to unreachable pages.

After publishing it on addons.mozilla.org, I almost immediately got messages that it marked reachable websites as expired.

So instead of just checking the HTTP status code for 404, I now also check whether the page content contains strings like "404" or "not found". If it doesn't contain those, I mark the bookmark as "maybe expired".


How do you deal with languages other than English? I guess one might generate a very unlikely URL and then take that response as 'not found'. That still wouldn't work sometimes.
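A rough sketch of that "very unlikely URL" idea, in case it helps. Everything here is an assumption for illustration: `fetch` is a hypothetical callable that returns `(status, body)` for a URL (plug in whatever HTTP client you use), and the names `probe_url`, `looks_not_found` and the 0.9 similarity threshold are made up, not anything the extension actually does.

```python
import uuid
from difflib import SequenceMatcher
from typing import Callable, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher type: takes a URL, returns (status code, body text).
Fetch = Callable[[str], Tuple[int, str]]


def probe_url(url: str) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    return urljoin(url, "/" + uuid.uuid4().hex)


def looks_not_found(url: str, fetch: Fetch, threshold: float = 0.9) -> bool:
    """Guess whether `url` is a 'not found' page, independent of language.

    The page counts as not found when it has the same status code as the
    random probe URL and its body is nearly identical to the probe's body.
    """
    status, body = fetch(url)
    probe_status, probe_body = fetch(probe_url(url))
    if status != probe_status:
        return False
    similarity = SequenceMatcher(None, body, probe_body).ratio()
    return similarity >= threshold
```

As noted, this can still fail: error pages that embed the requested path, timestamps, or other dynamic content may fall below the threshold, and sites that redirect every unknown path to the home page will make the home page itself look "not found".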


I guess it works like this:

- 404 status with "not found" or "404" in the content: surely not found
- 404 status without either string: "maybe" not found
- non-404 status: bookmark is fine :)
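Written out as code, that guess would look roughly like this. It is only a sketch of the three cases as described in this thread, not the extension's actual source; the function name `classify_bookmark` and the returned labels are made up.

```python
def classify_bookmark(status: int, body: str) -> str:
    """Classify a bookmark from its HTTP status code and page content."""
    if status != 404:
        # Any non-404 status: treat the bookmark as fine.
        return "ok"
    text = body.lower()
    if "404" in text or "not found" in text:
        # 404 status and the page itself says so: surely not found.
        return "expired"
    # 404 status but no tell-tale string in the content.
    return "maybe expired"
```

For example, `classify_bookmark(404, "<h1>Not Found</h1>")` gives `"expired"`, while a 404 with an ordinary-looking page gives `"maybe expired"`.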


Reminds me of this DefCon talk which discussed the effects of returning different HTTP status codes on various vulnerability scanners (without affecting the web browser). https://www.youtube.com/watch?v=4OztMJ4EL1s

Differentials in how browsers handled weird status codes allowed for fingerprinting. Differentials in how different automated tools/scanners handled weird status codes allowed for defensive tactics.


I have seen HTTP 500 return codes on webservers where error output was suppressed, but some error occurred anyway while processing the page.

Looked fine in a browser, but not an ideal case for SEO...


I have seen that too on a web API for an enterprise system. 500 meant OK. Sometimes it threw other codes which then sometimes meant error.



