Google Webmaster Tools: I don't understand them.

I’ve seen a few hits on my site to http://zhasper.com/user/, or pages underneath. This seems to be because there used to be content there, and Google’s cache hasn’t caught up (or hadn’t at the time, anyway; it seems to have mostly caught up now).

I don’t want this, so I went to the “Remove URLs” tool, under Tools in the Webmaster Console.

The page says:

Before you begin, you must make sure that Google and other search engines will not crawl the content you want to remove from our search results.

To do this, ensure that each page returns an HTTP status code of either 404 or 410, or use a robots.txt file or meta noindex tag to block crawlers from accessing your content.

Okay, so it needs to return a 404. Easy – there’s no content there anyway, it’s already returning a 404. Double-check:

zhasper@bridgitte:~$ wget http://zhasper.com/user/

--2008-12-25 17:03:31--  http://zhasper.com/user/

Resolving zhasper.com... 88.198.1.123

Connecting to zhasper.com|88.198.1.123|:80... connected.

HTTP request sent, awaiting response... 404 Not Found

2008-12-25 17:03:31 ERROR 404: Not Found.

Excellent. So, I request the whole directory to be removed from the index.

Some days later, I come back and check, and my request for removal has been denied. There’s a little question mark beside the word “denied”, obviously a link to further details, so I click on it:

Your request has been denied because the webmaster of the site hasn’t applied the appropriate robots.txt file or meta tags to block us from indexing or archiving this page.

No shit – I didn’t put anything in robots.txt because it’s returning a 404, and your instructions say that’s all that’s needed.

Grrr.

I *think* that everything under /user/ has been removed (there’s certainly nothing in the index any more); it’s just /user itself that hasn’t been removed. I don’t understand this: /user gives a 404 as well, and the content shown in the snippet is the old Drupal content.

(obdisc: this is a private blog, all opinions are my own and not those of my employer, who happens to be Google. There’s probably something obvious that I’m overlooking – hopefully I’ll have another blog post soon with an update on what that is)

Update, 5 minutes later: Duh. Read the next paragraph, idiot:

If you’re requesting removal of a full site or directory, you must use a robots.txt file to block crawlers from accessing this content.

I’m requesting removal of a full directory. So….
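For the record, here is a rough sketch of what I think that robots.txt rule would look like, plus a quick sanity check using Python’s standard urllib.robotparser. The rule itself (Disallow: /user for all user agents) and the is_blocked helper are my own assumptions, not something the Remove URLs page spells out. Note that the standard-library parser does plain prefix matching, so this rule would also catch anything else whose path starts with /user.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: block every crawler from /user and
# everything beneath it. This is a guess at a suitable rule, not text
# taken from Google's instructions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /user
"""


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if robots_txt forbids `agent` from fetching `url`.

    Hypothetical helper for illustration; it only wraps RobotFileParser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Check the bare directory, a page underneath it, and the front page.
for url in (
    "http://zhasper.com/user",
    "http://zhasper.com/user/1",
    "http://zhasper.com/",
):
    print(url, "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed")
```

This only tells me what a well-behaved parser makes of the file; whether the removal tool is then satisfied is something I’d still have to find out by resubmitting the request.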
