Hacker News
The Ultimate Guide To Duplicate Content (david-whitehouse.org)
21 points by davidwhitehouse on April 6, 2011 | hide | past | favorite | 10 comments


Interesting that you mention it's 8 words in a row. The usual way duplicate content is checked for is with a moving (rolling) hash. I put one up online a few years ago.
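A minimal sketch of the moving-hash idea described above: hash every run of 8 consecutive words (the figure from the article) in each document and count the hashes they share. Python's built-in `hash` is used here purely for illustration; a true rolling hash such as Rabin-Karp would avoid rehashing each window from scratch.

```python
def shingles(text, n=8):
    """Return the set of hashes of every n-word window in the text."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

def shared_shingles(a, b, n=8):
    """Count n-word windows that appear in both texts -- a simple
    duplicate-content signal."""
    return len(shingles(a, n) & shingles(b, n))
```

Any nonzero count means the two texts share at least one 8-word run; in practice you'd threshold on the ratio of shared windows rather than the raw count.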

Other things to check: printable versions of pages should not be indexed. WordPress also generates crawlable URLs that return the same content, e.g. archives and categories.

Presence of session ids in URLs used to be a big problem, but search engines seem to have matured and become a bit cleverer. However, if you aren't using a well-known CMS, it's better to make sure you link to a canonical version and don't let the one with the session id into the search engine index.
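One way to produce that canonical version is to strip session parameters from the query string before emitting links (and point rel=canonical at the result). A sketch using only the standard library; the set of "session-ish" parameter names below is an assumption for illustration, and a real site should match whatever its own framework emits:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of session-id parameter names -- adjust to your framework.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

def canonical_url(url):
    """Drop session-id query parameters so crawlers see one canonical URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```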


Agree with @bauchidgw - http://news.ycombinator.com/item?id=2417177 - there are a few missing (camel-case URLs, capitalized URLs, folder re-ordering, rewritten-but-not-redirected URLs, root vs index.xxx, etc.). Also missing (considering this is meant to be the 'Ultimate Guide') is a comprehensive definition of duplicate content, something like "appreciably similar content between one uniquely accessible URI and another", along with canonicalization and more.

But a good start nonetheless for those who keep tripping up on this simple-to-fix issue.


"www vs non-www"

I've heard this before, but I have a hard time believing it. Google really can't figure out on its own that for some sites the www is optional?


I don't think it can, unless you submit both to Google Webmaster Tools and then select a default domain to show.

Besides, why bother risking it when it's a 5 minute job?
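That "5 minute job" is usually a server-level 301 from the non-canonical host to the canonical one. The logic can be sketched in a few lines (the host names here are hypothetical examples; in practice this lives in your web server or framework config):

```python
# Canonical host for the site -- hypothetical example.
CANONICAL_HOST = "www.example.com"

def redirect_target(host, path):
    """Return the 301 Location for a request on a non-canonical host,
    or None if the request is already on the canonical host."""
    if host == CANONICAL_HOST:
        return None
    return f"http://{CANONICAL_HOST}{path}"
```

Every request to the bare domain gets a permanent redirect to the www version (or vice versa, whichever you pick), so search engines only ever see one host.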


Because the internet at large is not going to do that 5 minutes of work. Just like with web standards, it's up to browsers to support the decade of HTML that was produced before anyone gave a crap about the W3C.


Googlebot is not a browser. It's a distinct "user" of your site with special needs.

Think of Google as a retarded 30-year-old user of your site. If 67% of your users relied on the opinion of that single user, you'd damn sure do 5 minutes' worth of work to help him along.

And if you didn't it would be your own fault for failing to get his reference (read: ranking).


Google's indexing problems remain Google's problems, though. The internet at large has demonstrated again and again that large portions of it are not going to update, whether it's Flash video, malformed HTML, canonical issues, duplicate-content issues, dependencies on JavaScript, or worse, on specific versions of specific browsers, etc.

Just like browser vendors fixing bad HTML themselves in their rendering engines, this is something Google has to fix itself, because it's one fix for the whole internet instead of 11 billion fixes delegated to people who just aren't going to do them.

It's a shitty job, but they volunteered for it. And they make billions doing it.

I am sure there are massive sections of the internet that are never going to get more than incidental traffic from Google: all the sites that just don't know or care about SEO, not to mention all the ones that just can't outdo the AdSense and affiliate spammers and content farms.


"a retarded 30 year old user of site" Whoa there. I hope you meant 3.


missed a few:

mixed-case URLs (on a Windows server), trailing slash vs no trailing slash, double slashes in the URL, too much border (nav, footer, right-hand side) HTML on pages with otherwise minor content, indexed IP-address URLs, additional URL parameters (i.e. tracking parameters) ... and these are just off the top of my head
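Several of the variants listed above (mixed case, double slashes, trailing slash) can be collapsed by a small normalization step before redirecting or emitting canonical links. A sketch using only the standard library; note that lower-casing the path is only safe on case-insensitive servers (e.g. IIS on Windows), which is why it's a flag here:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def normalize_path(url, lowercase_path=True):
    """Collapse duplicate slashes, drop the trailing slash, and optionally
    lower-case the path (only safe on case-insensitive servers)."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)   # //Blog//Post -> /Blog/Post
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]                        # /Blog/Post/ -> /Blog/Post
    if lowercase_path:
        path = path.lower()
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       parts.query, parts.fragment))
```

The idea is that every spelling of the same resource maps to one string, which then becomes the 301 target or the rel=canonical value.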


I got the IP-address URLs; those are usually staging servers. I was also going to mention tracking parameters, with Google Analytics as an example, but it really isn't that common.

Definitely should have added the slash vs no slash one, though; will add it later. Feel free to add them yourself if you like ;)



