
I have no association with this person and just heard about this. But obviously those are search-term result pages, not "autogenerated pages" in the usual sense. If Google is going to apply such a standard uniformly, they'd have to ban all sites that have search functions.

Autogenerated in the bad sense is link farms like fakesite.com/buy_drugwiththisname_now.html, with drugwiththisname replaced with 500,000 different possibilities, all the generated pages linked to each other, and incoming links from ten billion forum comments across the web where bots have signed up en masse and posted spam.
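
To make that concrete, here is a minimal sketch of how cheaply such a farm can be mass-produced (the site name, term list, and page template are all made up for illustration):

    import random

    # Hypothetical: pretend there are 500,000 spam terms.
    TERMS = [f"drug{i}" for i in range(500_000)]

    def page_html(term):
        # Link each page to a few random siblings so crawlers see a dense graph.
        links = "\n".join(
            f'<a href="http://fakesite.com/buy_{t}_now.html">buy {t} now</a>'
            for t in random.sample(TERMS, 5)
        )
        return f"<h1>Buy {term} now!</h1>\n{links}"

    # Writing one file per term yields 500,000 interlinked "pages":
    # for term in TERMS:
    #     open(f"buy_{term}_now.html", "w").write(page_html(term))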

Now, perhaps this guy is spamming links to his site around the web, which would make it valid to declare it a link farm and kill it.

Looking at the site now, being able to search Amazon by exact price is a pretty neat function and is totally different from a link farm.

I think killing his site for having a search function is pretty unreasonable.

However, you're a private, for-profit company, so obviously you can do as you please.



> Autogenerated in the bad sense is link farms like fakesite.com/buy_drugwiththisname_now.html, with drugwiththisname replaced with 500,000 different possibilities

Actually, there is no difference between search-result sites like the OP's and what you described. Having a URL that ends in .html does not mean there is a static HTML file on the server. For example, nextag and pricegrabber have URLs like blah.com/digital/Canon-EOS-7D-SLR-Digital-Body/m739295014.html. Scroll around those sites and you'll see they're anything but static.

Each page is simply the result of a query. Whether the database is local (like nextag's) or remote (like the OP's Amazon API queries) is inconsequential. Personally, I would be VERY happy if Google hid/ignored these kinds of search-query sites. I have not once found any of them useful. The problem is the large gray area in which these sites operate: what I find useless, someone else might find useful.
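
As a sketch of that point, a pricegrabber-style URL can be served with no file on disk at all. Here's a hypothetical Flask route (the catalog, price, and template are invented, not any real site's code):

    from flask import Flask, abort

    app = Flask(__name__)

    # Hypothetical in-memory "catalog"; a real site would query a local
    # database (nextag-style) or a remote API (Amazon-style).
    PRODUCTS = {739295014: ("Canon EOS 7D SLR Digital Body", "$1,499.00")}

    # A URL like /digital/Canon-EOS-7D-SLR-Digital-Body/m739295014.html is
    # just a route pattern; no .html file exists on the server.
    @app.route("/<category>/<slug>/m<int:product_id>.html")
    def product_page(category, slug, product_id):
        if product_id not in PRODUCTS:
            abort(404)
        name, price = PRODUCTS[product_id]
        return f"<h1>{name}</h1><p>{price}</p>"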


It would be easy for the Googlebot to work this out too: replace the part of the URL that looks like a dynamic variable with different values and see if it returns a page for all of them, like a search engine would. I'm surprised it doesn't appear to do this.
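
Something like this heuristic, sketched hypothetically in Python (the function and threshold are my invention, not anything Googlebot is known to do):

    import random
    import string
    import urllib.request

    # Substitute random tokens into the variable-looking part of a URL and
    # see whether the server returns a page for every made-up value.
    def looks_like_search_endpoint(url_template, probes=5):
        hits = 0
        for _ in range(probes):
            token = "".join(random.choices(string.ascii_lowercase, k=12))
            url = url_template.format(term=token)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    if resp.status == 200:
                        hits += 1
            except Exception:
                pass  # 404s/errors suggest the URL space is NOT auto-generated
        # A page for every nonsense value means these are query results,
        # not real static content.
        return hits == probes

    # looks_like_search_endpoint("http://fakesite.com/buy_{term}_now.html")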


> Actually, there is no difference between search-result sites like the OP's and what you described

Uh, yes there is. http://www.somesite.com/index.html?q=foo is obviously a search (right down to the choice of q for query), but http://www.somesite.com/buy_foo_now.html does not look like one.

You might say that technically they could have the same back-end, and they could: the web is flexible, and just about any URL scheme can resolve to the same back-end. The difference is that the first is honest and the second goes out of its way to hide what it does.
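
A hypothetical sketch of that flexibility (invented routes; any framework could do the same): both the honest URL and the disguised one resolve to one handler.

    from flask import Flask, request

    app = Flask(__name__)

    # Hypothetical shared back-end; both URL styles below resolve to it.
    def run_search(term):
        return f"Results for {term!r}"

    # The "honest" form: /index.html?q=foo
    @app.route("/index.html")
    def honest_search():
        return run_search(request.args.get("q", ""))

    # The disguised form: /buy_foo_now.html runs the very same query.
    @app.route("/buy_<term>_now.html")
    def disguised_search(term):
        return run_search(term)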


> Uh, yes there is. http://www.somesite.com/index.html?q=foo is obviously a search

Obviously? Drupal's default non-rewritten URLs are index.php?q=foo, where foo is the page path. `q` can and does stand for more than one thing.


Search results are also not supposed to be indexed. I believe this is actually in the guidelines.


Yup. Search for [quality guidelines] and they're at http://www.google.com/support/webmasters/bin/answer.py?answe... . The relevant part is "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
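
For example, a site whose search results live under a hypothetical /search path would satisfy that guideline with a robots.txt like:

    User-agent: *
    Disallow: /search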


Matt, I tried that recently with one of my own sites, and after a few days I got a warning in my Google Webmaster Tools dashboard that the bot could not access my /search URL because it was blocked in my robots.txt, and that I should take action to correct it. So I then unblocked my /search URL, which violates the above guideline, but it made the error in Google Webmaster Tools go away.

These conflicting messages from Google are very confusing!


Hi j_col-

In that case, Google Webmaster Tools is not actually reporting an error. That report shows you which URLs Google tried to crawl but couldn't (due to being blocked), so you can review it and ensure that you are not accidentally blocking URLs that you want indexed.

I agree that it's confusing in that the report is in the "crawl errors" section.

(I built Google Webmaster Tools, so this confusion is entirely my fault; but I don't work at Google anymore, so sadly I can't fix it.)


Thanks for the response, I will block my /search URL once more via robots.txt and will ignore the warnings in the Webmaster Tools.


The Google Webmaster Guidelines literally say: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."

So a couple of questions:

- What is of value for a user?
- Who determines that value in these cases?

As always, it's not quite clear what treatment you should use for search pages!


I'm pretty sure that having different search engines index each other would be bad. Don't cross the streams.


If I recall correctly, crossing the streams did kill the marshmallow man!

In any case, I agree with you: a search engine indexing search results would be bad. But the line is not always that clear!

Some vertical search engines' result pages are great, relevant results from the perspective of a user trying to answer a question.


It's hard to draw the line between what is search and what isn't. What if you use a system where the last part of the URL is treated as a search query, but only certain URLs are ever linked to, so in practice they act as product pages?

And what if you have some kind of recent-searches list, and those searches get linked?



