Published: Feb 18, 2008 - 07:40 am
Story Found By: ColinCochrane 1922 Days ago
Category: Searching
This isn't a simple robots.txt exclusion either, but rather a 404 response that is now being served based on the requesting User-Agent.
This raises some questions as to the intentions of del.icio.us, and perhaps Yahoo! With Yahoo! recently integrating del.icio.us bookmarks into its search results, this could be an attempt to enhance the effectiveness of that new feature by preventing competing search engines from indexing content from del.icio.us.
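A minimal sketch of how a User-Agent-dependent response like this could be checked: request the same URL with different User-Agent headers and compare the status codes. The User-Agent strings and the function names below are illustrative assumptions, not the exact values used in the original test.

```python
import urllib.error
import urllib.request

# Illustrative User-Agent strings (assumptions); real strings vary by version.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/2.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def status_for(url, user_agent, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises 4xx/5xx responses as exceptions; the code is what we want.
        return error.code


def compare_user_agents(url, user_agents=USER_AGENTS, opener=urllib.request.urlopen):
    """Map each label to the status code returned for its User-Agent."""
    return {
        label: status_for(url, agent, opener=opener)
        for label, agent in user_agents.items()
    }
```

Calling `compare_user_agents("http://del.icio.us/")` would show whether the two User-Agents receive different status codes. The result only describes what the site serves to that User-Agent from the requesting machine's IP address.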
15 Comments




Comments
hmm.. del.icio.us already nofollows the external links... shows that Yahoo is seriously considering Microsoft's bid. To keep the buyer happy, Yahoo needs to prove that they can also be evil!
That's not new. Surfing del.icio.us with a crawler user agent name has been a PITA for ages. Did you check it from an SE's IP? I found a few pretty recent cached page copies at Google.
Sebastian, I didn't check from an SE IP, no. I did hunt through a few hundred pages from a site: command on del.icio.us and didn't find any pages with a cache date newer than February 13th. I don't see why they would disallow the User-Agents via robots.txt, and send 404 responses to requests from that User-Agent, but allow those spiders based on IP, though.
I wondered why my recently created profile was not showing up in the engines, they used to rank v-high.
I'm pretty sure Sebastian is right. If it were blocking them, then wouldn't that also disallow the Google translation bot? http://del.icio.us/mvandemar (in Russian) The robots.txt isn't actually blocking the spiders. It is in the same format it has been in since at least December: robots.txt (from December 24th). "I don't see why they would disallow the User-Agents via robots.txt, and send 404 responses to requests from that User-Agent, but allow those spiders based on IP, though." If you look, there is an "Allow: /" after each of the bot identifiers, which would override the Disallow at the top of the file. The stuff that they are disallowing is stuff they have always disallowed, things that don't need to get indexed, like inboxes and searches.
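To illustrate the point about "Allow: /" overriding the blanket Disallow, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content below is a hypothetical file written for this example in the shape described above (a default Disallow plus a per-bot group ending in "Allow: /"); it is not a copy of the actual del.icio.us file, and the /inbox/ and /search/ paths are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption, not the real del.icio.us file):
# everyone is disallowed by default, but Googlebot has its own group that
# only excludes inbox and search pages and allows everything else.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /inbox/
Disallow: /search/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A bot with its own group is judged by that group, not by the "*" group.
googlebot_profile = parser.can_fetch("Googlebot", "http://del.icio.us/mvandemar")
googlebot_inbox = parser.can_fetch("Googlebot", "http://del.icio.us/inbox/mvandemar")
# A bot without its own group falls back to the blanket Disallow.
other_bot_profile = parser.can_fetch("SomeOtherBot", "http://del.icio.us/mvandemar")
```

In this sketch the named bot may fetch the profile page but not the inbox page, while an unnamed bot is refused everything. Python's parser applies the first matching rule within a group, so the specific Disallow lines are listed before "Allow: /".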
As far as I know the Google translation bot isn't used for indexing. If the goal here is to simply prevent competitors from indexing content on del.icio.us then blocking the translation bot wouldn't seem to be a priority.
They still didn't block anything important with robots.txt... all of the stuff that matters is available. They aren't blocking anything new.
The issue here isn't the robots.txt. To quote my initial post: "This isn't a simple robots.txt exclusion either, but rather a 404 response that is now being served based on the requesting User-Agent."
Colin, Sebastian suggested that they were only blocking people from non-search engine IP blocks who spoofed the user agents; you replied that it made no sense to allow from those IPs if they were blocked by robots.txt. I point out that the robots.txt isn't blocking the pages you were concerned with, and you say it's not about the robots.txt. You're going in circles now. I mean, why would they allow them by robots.txt, but then block them just by user agent?
Did you see http://del.icio.us/robots.txt? Just trying to access their site with user agents for those top 4 bots is MEANINGLESS because people serious about their web crawling security, which they appear to be, block based on IP range and/or full trip DNS validation. Therefore, you can never tell with any test, no matter how hard you try, what that site is actually telling the real Googlebot unless you have an actual Googlebot IP address in the range of an actual Googlebot crawler that has a reverse DNS that responds with "crawl-xx-xxx-xx-xxx.googlebot.com". However, you are correct that the latest cache date is a few days old, but it's way too early to tell what's really going on at this point.
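The DNS validation mentioned above is commonly done as a reverse lookup followed by a confirming forward lookup. The following is a minimal sketch of that idea, not a description of what del.icio.us actually runs. The function names are hypothetical, the `.googlebot.com` suffix is taken from the hostname pattern quoted above, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket


def _reverse_lookup(ip):
    """Return the PTR hostname for an IP address, or None if lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, suffix=".googlebot.com",
                        reverse_lookup=_reverse_lookup,
                        forward_lookup=_forward_lookup):
    """Check that the IP's PTR name ends in `suffix` and resolves back to the IP."""
    hostname = reverse_lookup(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(suffix):
        return False
    # The forward lookup stops someone who controls their own PTR record
    # from simply claiming a crawler hostname.
    return ip in forward_lookup(hostname)
```

A request that merely sends a crawler User-Agent from an ordinary IP fails this check, which is why a spoofed test from outside a search engine's network may not reveal what the real crawler is served.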
My mistake Michael, I got a bit mixed up. The discrepancy between the robots.txt and the User-Agent block is what I was trying to highlight. Sebastian made a comment on my blog that bears repeating: "If in a week or so we can't find crawler fetches after Feb/13 that's worth further investigation." It will be interesting to see if there are any updated caches over the next few days.
They've been doing some different stuff to deal with bad bots for quite a while: http://seoblog.intrapromote.com/2006/08/delicious_cloak.html I doubt that this is anything but a different approach to what is essentially the same set of problems.
del.icio.us: "A social bookmarks manager. Using bookmarklets, you can add bookmarks to your list and categorize them." del.icio.us/ - 22k - 8 hours ago - Cached - Similar pages
Too bad if so, because it really helped ranking sites.
As already pointed out, del.icio.us is NOT blocking the Googlebot from indexing any user pages / pages worth being indexed. They're only blocking spider spoofers and admin type pages.