I wrote some scrapers for a customer a few years back. After about a month I got a call that my scraper flagged a certain URL as unreachable, but the URL did work in a browser.
As it turned out, the webserver returned a 500 for every file but still served the content, so the website rendered flawlessly in a browser.
I still wonder if it was just a badly configured webserver or if the owner did this on purpose to deter scrapers and search engines.
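For context, the gap is roughly this (a minimal sketch with assumed names, not my original scraper): a status-only check reports the site as unreachable, while a check that also looks at the body sees the content is there.

```typescript
// Status-only checks call this "down"; the body tells a different story.
async function probe(url: string): Promise<"ok" | "served-despite-error" | "unreachable"> {
  try {
    const res = await fetch(url);
    const body = await res.text();
    if (res.ok) return "ok";
    // The server answered 500, but it still sent a full page,
    // which is exactly what a browser happily renders.
    return body.trim().length > 0 ? "served-despite-error" : "unreachable";
  } catch {
    // Network error, DNS failure, timeout, etc.
    return "unreachable";
  }
}
```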
I wrote a simple Firefox extension mostly for personal use that finds bookmarks that point to unreachable pages.
After publishing it on addons.mozilla.org, I almost immediately got messages that it marked reachable websites as expired.
So instead of just checking the HTTP status code for 404, I now also check whether the page content contains strings like "404" or "not found". If it doesn't contain those, I only mark the bookmark as maybe expired.
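Roughly like this (a simplified TypeScript sketch of the heuristic, not the extension's actual code; the marker list is just the idea, not an exhaustive set):

```typescript
type BookmarkStatus = "expired" | "maybe expired" | "ok";

// Illustrative markers only; a real check would use a longer list.
const NOT_FOUND_MARKERS = ["404", "not found"];

async function checkBookmark(url: string): Promise<BookmarkStatus> {
  let res: Response;
  try {
    res = await fetch(url);
  } catch {
    // Unreachable at the network level; a real check might retry later.
    return "expired";
  }
  if (res.status !== 404) return "ok";

  const body = (await res.text()).toLowerCase();
  const looksMissing = NOT_FOUND_MARKERS.some((m) => body.includes(m));
  // 404 plus a "not found"-looking page: confidently expired.
  // 404 without any marker: only flag it as maybe expired.
  return looksMissing ? "expired" : "maybe expired";
}
```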
How do you deal with languages other than English? I guess one could generate a very unlikely URL and treat whatever response that gets as the site's "not found" page. Still wouldn't work in some cases.
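Something along these lines, maybe (only a sketch of that idea; the probe path and the 10% body-size threshold are arbitrary assumptions, and it breaks on sites that redirect every unknown path to a landing page):

```typescript
// Request a path that almost certainly doesn't exist on the same site,
// then compare the bookmarked page against that guaranteed-miss response.
// Language-independent, since it never looks for specific words.
async function looksLikeSoft404(url: string): Promise<boolean> {
  const probeUrl = new URL(`/surely-missing-${crypto.randomUUID()}`, url).href;

  const [page, probe] = await Promise.all([fetch(url), fetch(probeUrl)]);
  const [pageBody, probeBody] = await Promise.all([page.text(), probe.text()]);

  const sameStatus = page.status === probe.status;
  const similarSize =
    Math.abs(pageBody.length - probeBody.length) < probeBody.length * 0.1;
  return sameStatus && similarSize;
}
```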
I guess it works like this:
- 404 status with "not found" or "404" - surely not found
- 404 status without above - "maybe" not found
- not-404 status - bookmark is fine :)
Reminds me of this DefCon talk which discussed the effects of returning different HTTP status codes on various vulnerability scanners (without affecting the web browser).
https://www.youtube.com/watch?v=4OztMJ4EL1s
Differentials in how browsers handled weird status codes allowed for fingerprinting.
Differentials in how different automated tools/scanners handled weird status codes allowed for defensive tactics.