Posted By: ColinCochrane 201 days ago
Topic Type: News Story (Jump to http://www.colincochrane.com)
Category: Yahoo Searching
This isn't a simple robots.txt exclusion either, but rather a 404 response that is now being served based on the requesting User-Agent.
This raises some questions as to the intentions of del.icio.us, and perhaps Yahoo! With Yahoo! recently integrating del.icio.us bookmarks into its search results, this could be an attempt to enhance the effectiveness of that new feature by preventing competing search engines from indexing content from del.icio.us.
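For anyone who wants to repeat the check, here is a minimal sketch of requesting the same URL under different User-Agent headers and comparing the status codes. The User-Agent strings and the function names `build_request` and `fetch_status` are illustrative assumptions, not anything del.icio.us or the search engines publish, and a spoofed header only shows what the site serves to that header from your own IP address.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent strings, for illustration only.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (compatible; ExampleBrowser/1.0)",
    "crawler": "Googlebot/2.1 (+http://www.google.com/bot.html)",
}


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling `fetch_status("http://del.icio.us/", ua)` for each entry in `USER_AGENTS` and comparing the results would show whether the response differs by header alone.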
14 Comments


Comments
hmm.. del.icio.us already nofollows the external links...
shows that Yahoo is seriously considering Microsoft's bid. To keep the buyer happy, Yahoo needs to prove that they can also be evil!
That's not new. Surfing del.icio.us with a crawler user name was a PITA for ages. Did you check it from a SE's IP? I found a few pretty recent cached page copies at Google.
Sebastian,
I didn't check from a SE IP, no. I did hunt through a few hundred pages from a site command on del.icio.us and didn't find any pages with a cache date newer than February 13th.
I don't see why they would disallow the User-Agents via robots.txt, and send 404 responses to requests from that User-Agent, but allow those spiders based on IP, though.
I wondered why my recently created profile was not showing up in the engines, they used to rank v-high.
I'm pretty sure Sebastian is right. If it were blocking them, then wouldn't that also disallow the Google translation bot?
http://del.icio.us/mvandemar (in Russian)
The robots.txt isn't actually blocking the spiders. It is in the same format it has been in since at least December:
robots.txt (from December 24th)
"I don't see why they would disallow the User-Agents via robots.txt, and send 404 responses to requests from that User-Agent, but allow those spiders based on IP, though."
If you look, there is an "Allow: /" after each of the bot identifiers, which would override the Disallow at the top of the file. The stuff that they are disallowing is stuff they have always disallowed, things that don't need to get indexed, like inboxes and searches.
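To illustrate how an "Allow: /" under a named bot can coexist with a blanket Disallow at the top, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt text below is a hypothetical reconstruction of the layout described above, not the actual del.icio.us file, and the bot name `SomeOtherBot` is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modelled on the description above:
# a catch-all Disallow first, then a named bot with its own rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /inbox
Disallow: /search
Allow: /
"""


def make_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)

# The named bot uses its own group, so ordinary pages are allowed...
googlebot_profile = parser.can_fetch("Googlebot", "http://del.icio.us/mvandemar")
# ...while the paths listed in that group stay off limits.
googlebot_inbox = parser.can_fetch("Googlebot", "http://del.icio.us/inbox/someone")
# A bot with no group of its own falls back to the catch-all Disallow.
other_profile = parser.can_fetch("SomeOtherBot", "http://del.icio.us/mvandemar")
```

Python's parser applies the first matching rule within a group, which is why the Disallow lines sit before "Allow: /" here. Other parsers may resolve conflicts differently (for example by longest match), so treat this as one interpretation.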
As far as I know the Google translation bot isn't used for indexing. If the goal here is to simply prevent competitors from indexing content on del.icio.us then blocking the translation bot wouldn't seem to be a priority.
They still didn't block anything important with robots.txt... all of the stuff that matters is available. They aren't blocking anything new.
The issue here isn't the robots.txt. To quote my initial post:
"This isn't a simple robots.txt exclusion either, but rather a 404 response that is now being served based on the requesting User-Agent."
Colin, Sebastian suggested that they were only blocking people from non-search engine IP blocks who spoofed the user agents, you replied that it made no sense to allow from those IPs if they were blocked by robots.txt. I point out that the robots.txt isn't blocking the pages you were concerned with, and you say it's not about the robots.txt.
You're going in circles now. I mean, why would they allow them by robots.txt, but then block them just by user agent?
Did you see http://del.icio.us/robots.txt?
Just trying to access their site with user agents for those top 4 bots is MEANINGLESS because people serious about their web crawling security, which they appear to be, block based on IP range and/or full round-trip DNS validation (a reverse lookup followed by a forward lookup).
Therefore, you can never tell with any test, no matter how hard you try, what that site is actually telling the real Googlebot unless you have an actual Googlebot IP address in the range of an actual Googlebot crawler that has a reverse DNS that responds with "crawl-xx-xxx-xx-xxx.googlebot.com".
However, you are correct that the latest cache date is a few days old, but it's way too early to tell what's really going on at this point.
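The DNS validation mentioned above can be sketched roughly as follows: reverse-resolve the visitor's IP, check that the hostname ends in the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. This is only a sketch of the general technique, not how del.icio.us implements it. The function names and the `CRAWLER_SUFFIXES` tuple are assumptions for illustration, with ".googlebot.com" taken from the hostname pattern quoted in the comment.

```python
import socket

# Assumed suffix list; only the googlebot.com pattern comes from the thread.
CRAWLER_SUFFIXES = (".googlebot.com",)


def hostname_is_crawler(hostname, suffixes=CRAWLER_SUFFIXES):
    """True if the hostname ends with one of the trusted crawler domains."""
    host = hostname.rstrip(".").lower()
    return host.endswith(suffixes)


def verify_crawler_ip(ip, suffixes=CRAWLER_SUFFIXES):
    """Reverse lookup, suffix check, then forward lookup back to the same IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    if not hostname_is_crawler(hostname, suffixes):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The forward lookup must include the original IP, otherwise the
    # reverse record could have been set by anyone who controls that IP.
    return ip in forward_ips
```

The suffix check matters on its own: a hostname such as "googlebot.com.evil.example" contains the crawler's name but does not end with it, so `hostname_is_crawler` rejects it.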
My mistake Michael, I got a bit mixed up. The discrepancy between the robots.txt and the User-Agent block is what I was trying to highlight.
Sebastian made a comment on my blog that bears repeating: "If in a week or so we can't find crawler fetches after Feb/13 that's worth further investigation."
It will be interesting to see if there are any updated caches over the next few days.
They've been doing some different stuff to deal with bad bots for quite a while:
http://seoblog.intrapromote.com/2006/08/delicious_cloak.html
I doubt that this is anything but a different approach to what is essentially the same set of problems.
A social bookmarks manager. Using bookmarklets, you can add bookmarks to your list and categorize them.
del.icio.us/ - 22k - 8 hours ago - Cached - Similar pages
Too bad if so, because it really helped ranking sites.