Hacker News
404 Found (tedunangst.com)
80 points by luu on Aug 22, 2019 | hide | past | favorite | 32 comments


I wrote some scrapers for a customer a few years back. After about a month I got a call that my scraper flagged a certain URL as unreachable, but the URL did work in a browser.

As it turned out: the webserver returned a 500 for every file, but still served the content anyway. So the website rendered flawlessly in a browser.

I still wonder if it was just a badly configured webserver or if the owner did this on purpose to prevent scrapers and search engines.
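A defensive scraper can handle that case explicitly. Here is a minimal Python sketch (the function name and the 512-byte threshold are made up for illustration) that separates a real failure from a server that reports an error yet still delivers content:

```python
def classify_response(status: int, body: str) -> str:
    """Classify an HTTP response for scraping purposes.

    'ok'         - success status
    'soft-error' - error status, but a substantial body came back
                   (like the server above that sent 500 for every
                   file yet still served the content)
    'error'      - error status with no usable body
    """
    if 200 <= status < 300:
        return "ok"
    if status >= 400 and len(body.strip()) > 512:
        # The server claims failure but delivered real content;
        # flag for manual review instead of marking unreachable.
        return "soft-error"
    return "error"
```

A scraper could then queue `soft-error` URLs for a human to check instead of marking them dead outright.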


I wrote a simple Firefox extension mostly for personal use that finds bookmarks that point to unreachable pages.

After publishing it on addons.mozilla.org, I almost immediately got messages that it marked reachable websites as expired.

So instead of just checking the HTTP status code for 404, I now also check whether the page content contains strings like "404" or "not found". If it doesn't contain those, I mark the bookmark as maybe expired.


How do you deal with languages other than English? I guess one might generate a very unlikely URL and then treat that response as the 'not found' baseline. It still wouldn't work in some cases.
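The "very unlikely URL" trick is language-independent: fetch a random path that almost certainly doesn't exist, remember what the server returns for it, and compare other pages against that baseline. A hedged sketch in Python (function names and the similarity threshold are invented):

```python
import difflib
import secrets

def probe_path() -> str:
    # A random path that almost certainly doesn't exist on any site;
    # fetching it should yield the server's real not-found page.
    return "/" + secrets.token_hex(16)

def is_soft_404(candidate_body: str, probe_body: str,
                threshold: float = 0.9) -> bool:
    """True if the page looks like the server's not-found page,
    regardless of what language the error text is written in."""
    ratio = difflib.SequenceMatcher(None, candidate_body, probe_body).ratio()
    return ratio >= threshold
```

This is roughly how search engines detect "soft 404s": pages that return 200 but are really error pages. It still fails when the not-found page varies per URL, as the comment notes.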


I guess it works like this:

- 404 status with "not found" or "404": surely not found

- 404 status without the above: "maybe" not found

- non-404 status: bookmark is fine :)
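That three-way decision can be written down directly. A Python sketch of the logic as described (the function name is hypothetical; the actual extension's code may differ):

```python
def check_bookmark(status: int, body: str) -> str:
    """Three-way bookmark check described above."""
    text = body.lower()
    if status == 404:
        if "not found" in text or "404" in text:
            return "not found"        # surely gone
        return "maybe not found"      # 404 status, but page doesn't say so
    return "fine"                     # non-404: bookmark is fine
```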


Reminds me of this DefCon talk which discussed the effects of returning different HTTP status codes on various vulnerability scanners (without affecting the web browser). https://www.youtube.com/watch?v=4OztMJ4EL1s

Differentials in how browsers handled weird status codes allowed for fingerprinting. Differentials in how different automated tools/scanners handled weird status codes allowed for defensive tactics.


I have seen HTTP 500 status codes from webservers where error output was suppressed, but some error occurred anyway while processing the page.

Looked fine in a browser, but not an ideal case for SEO...


I have seen that too on a web API for an enterprise system. 500 meant OK. Sometimes it threw other codes which then sometimes meant error.


With respect to Ted, there’s virtually nothing of content in this post... I’m confused. 404 SEO issues are nothing new, and “friendly” 404 browser intercepts have been discussed much more coherently elsewhere. I don’t at all mind discussions bringing up something other people might already know, but this doesn’t really delve into anything and merely mentions the existence of this issue. A comment from OP (or anyone upvoting this) explaining what they found interesting here would be helpful.


It seems like Firefox is trying to be helpful. They show the content for a dead project ("Support the sites you love, avoid the ads you hate ..." ) with a header explaining that "This study is no longer active. Thank you for your participation." That's arguably much better than an enigmatic 404 error.


> with a header explaining that "This study is no longer active. Thank you for your participation."

I think that header is way too easy to miss, because it occupies the same space as the usual pointless cookie or "sign up"-type banners that many websites show. I certainly didn't see it at first.

I suspect that the author of the linked blog post did not see that message either, since they describe that page being a 404 as "probably a bug".


I saw the banner and was equally confused - a 404 seems like the wrong abstraction layer for that content, just intuitively.


I found the author's thoughts on the coexistence of machine- vs human-readability interesting, and hadn't encountered a case like that Firefox page before.


This is a potential use case for 410 Gone: there used to be content, but not anymore, and it is unlikely to reappear in the future, so you can cache this response and not bother trying to fetch it again. Of course, 410 is only appropriate if you can be fairly sure that you don't want to reuse the URI in the future.
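A minimal sketch of that distinction in a WSGI-style Python handler (the paths and messages are invented for illustration):

```python
def app(environ, start_response):
    """Tiny WSGI app: permanently-retired paths get 410, not 404."""
    RETIRED = {"/old-study", "/sunset-feature"}   # URIs we will never reuse
    path = environ.get("PATH_INFO", "/")
    if path in RETIRED:
        # 410 tells clients the resource is gone for good, so they
        # may cache the absence and stop retrying.
        start_response("410 Gone", [("Content-Type", "text/plain"),
                                    ("Cache-Control", "max-age=86400")])
        return [b"This resource has been permanently removed.\n"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found (but it might exist later).\n"]
```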


Interesting idea. It would also need to be the case that the content has not moved either, so that a 302 is not appropriate.

I don't necessarily agree that you should be certain not to reuse the URI. Why do you think that should be the case?


It depends, really, but I seem to recall that at least some browsers ignore cache headers on 410 responses and always cache them "forever", which is arguably allowed by the spec.

As a cautionary example, I once was trying to be fancy and used 410 to denote the expiration of a session-bound resource (actually an API endpoint). That would have been fine had the resource URI been unique across sessions… but it wasn't, so after one session expired, some browsers naturally assumed that the endpoint URI wasn't going to come back, even after starting a new session. I should have used 404 or 403 instead.


Because the response will be cached by the browser for a period of time.



Years ago I had a problem serving downloads with PHP. IIRC it worked just fine in Opera (Presto-based) but showed a blank page in Firefox.

The problem was serving a 404 Not Found status with `Content-Disposition: attachment` and the actual contents of the file. Opera had no problem with that; Firefox was confused.
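The original was PHP, but the confusing combination is easy to reproduce in any language: an error status plus a download header plus a real body. A Python sketch that just builds the raw response bytes (the helper name is made up):

```python
def build_confusing_response(filename: str, content: bytes) -> bytes:
    """Raw HTTP response mixing a 404 status with a file download,
    like the PHP script described above. Browsers disagree on
    whether to honor the status or the Content-Disposition header."""
    headers = (
        "HTTP/1.1 404 Not Found\r\n"
        "Content-Type: application/binary\r\n"
        f'Content-Disposition: attachment; filename="{filename}"\r\n'
        f"Content-Length: {len(content)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + content
```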

If you would like to test your browser's behavior, here is a replica of what I did back then: http://o7o.pl/down404.php

My results:

- Chromium-based browsers (tested on Chrome, Vivaldi, Opera Developer) show a generic Chromium error with the `ERR_INVALID_RESPONSE` code.

- Firefox displays its own page about the resource not being found

- Edge (not Chromium-based) removes the URL (if opened in a new tab) and shows an infinite loader in the tab favicon, or just restores the previous URL (if navigating to the URL from another page)

- Old Presto-based Opera (the newest 12.x) just downloads the file

- wget just returns 404 error

- Internet Explorer 11 shows its own "page cannot be found" page

Can anyone test it on Safari on Mac?

I wonder which should be the correct behavior? Personally I am satisfied with just downloading the file, ignoring the 404 status.


> Personally I am satisfied with just downloading the file ignoring 404 status.

That would be convenient, and follows the "deal with it or ignore it and render what you can" pattern followed by browsers in the face of malformed HTML, but...

> I wonder which should be the correct behavior?

I'd argue that displaying the "not found" message, and aborting any other transfer, would be the more correct thing to do, despite being less convenient. Something thinks there is an error condition while responding to the request, so can you really trust any content returned in the response to be as it should be?


- Safari shows a blank page, downloads it as a file, and auto-displays the content in TextEdit


> auto-displays the content in TextEdit

Wow, that's kind of a "brave" thing to do for a file marked with the `application/binary` MIME type. The filename does have a `.txt` extension, however.

Of course I have no idea how TextEdit behaves with big binary files, but such apps on Windows/Linux can't handle them well (they usually hang forever, or for a long time).


MacOS/Safari has an option to automatically open "safe" files, like text files and other media.


PDFs and ZIPs are also considered safe, despite those formats being very complex, with a huge chance there's an exploitable bug in there.

That's one of the very first things I disable on any Mac.


Safari just downloaded a text file for me when I clicked the link. Nothing else happened.


Would be neat if there were a 404-blocker extension to take over the 404 "bling space."

For example, in Firefox the extension could replace 404 bling content with randomly chosen little nostalgic animations and concept art based on the history of the web. It could just ship with a small collection of such stuff without much of a footprint so there's no network hit.

Then


Replace all 404 errors with 404 Party

https://www.youtube.com/watch?v=qvwwzV6ruGc


and then????????


You missed the chance to set the response code of the post to 404 instead of 200, I was very disappointed.


the website is down :) here's a copy: https://webcache.googleusercontent.com/search?q=cache:yli5sn... (click on text version link if it cannot load anyway)


Maybe it's 404, because: "This study is no longer active. Thank you for your participation."


As the article stated, there is a difference between opening that link with and without the trailing slash.

Without it, the server serves exactly the same page with 200 OK (and redirects you to the trailing-slash URL with JavaScript, not with a Location header).

So I guess there is just a little server misconfiguration mixed with a JavaScript application that thinks about URLs the opposite way from the server.
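The more robust fix is to redirect server-side with a Location header, so browsers, crawlers, and command-line clients all see the same canonical URL. A WSGI-style Python sketch (handler name and page content invented):

```python
def redirect_to_slash(environ, start_response):
    """Server-side trailing-slash redirect: one canonical URL,
    same result for browsers, crawlers, and curl alike."""
    path = environ.get("PATH_INFO", "/")
    if not path.endswith("/"):
        # Redirect via the Location header, not client-side JavaScript.
        start_response("301 Moved Permanently",
                       [("Location", path + "/")])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>canonical page</body></html>"]
```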


Ah, ok. Trailing slash can be a source of "fun" in webserver configurations, yes.



