
A couple of days ago Michael posted his assertion that Rand Fishkin had lied about the details of the new Linkscape tool on SEOmoz. This caused Sphinn comments to be closed and a lot of fireworks. Now, read more about the bots that never were on SEO Smackdown!
17 Comments

from IncrediBILL 1315 Days ago #
Votes: -1

I hate to debunk part of your post, but when I busted the story that it was DotBot behind Linkscape, I also published that DotBot had been seeded with data from another search engine, possibly Amazon’s AWS.

How I know this: DotBot has never been allowed to crawl my site, yet it knew in advance of internal pages that it had never seen. That leaves two options: a) DotBot is clairvoyant, or b) DotBot was pre-loaded with data from another source to seed the crawl.

Lastly, don’t think linearly, as if a server only gets 1 page per second. Most search engines are multi-threaded and pull down 100s of pages per second, which makes this more feasible with a small bank of servers. Starting with a seeded database and running multi-threaded, you could pull off a pretty good sized index in a couple of months. It’s a simple matter of how much bandwidth you throw at the problem.

However, I’ll back up your math for the moment, based on DotBot’s stated stats. It claims to have come online June 10th 2008 and appears to crawl about a page a second based on the displayed stats, so it’s been running approximately 130 days. At that crawl rate it’s crawling roughly 86,400 pages per day, at least on that server, for a total of 11,232,000 pages since it was "born". DotBot currently claims 7,011,392,378, which is impossible based on their refresh rate, but remove the 7,000,000,000 fluff factor and you’re left with 11,392,378, which is entirely possible and almost exactly what I calculated.

I’m estimating that building that index from scratch in 130 days at DotBot’s crawl rate would require 623 concurrent threads, perhaps possible with somewhere from 10-20 servers and a buttload of bandwidth.
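The arithmetic in the comment above can be checked with a short sketch. The function names are illustrative, and the inputs (one page per second, about 130 days online, the 7,011,392,378 counter) are simply the figures quoted in the comment, not independently verified values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def pages_crawled(pages_per_second: float, days: int) -> int:
    """Pages fetched by one crawler running continuously at a fixed rate."""
    return int(pages_per_second * SECONDS_PER_DAY * days)


def crawlers_needed(target_pages: int, pages_per_second: float, days: int) -> float:
    """How many such crawlers, running in parallel, reach target_pages in `days`."""
    return target_pages / pages_crawled(pages_per_second, days)


# One crawler at one page per second for 130 days.
print(pages_crawled(1, 130))  # 11232000

# Parallel crawlers needed for the round 7 billion and for the full counter value.
print(round(crawlers_needed(7_000_000_000, 1, 130)))  # 623
print(round(crawlers_needed(7_011_392_378, 1, 130)))  # 624
```

The 623 figure in the comment matches the round 7,000,000,000 target; using the full counter value gives about 624, so the estimate holds either way.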

from mvandemar 1315 Days ago #
Votes: 0

Bill, I never said that a major search engine couldn’t do it, so I have no idea why you think that pointing that out is "debunking". You are absolutely correct, the 11 million is entirely possible, and what I would have expected as well. Even multi-threaded you are still going to run into lag time on the servers that are being spidered, due to connection latency and other factors. Besides, for months now they have been showing that counter based on the reasonable assumption of a page a second... why would that suddenly be off by about 63,600%?

Even if they seeded the initial data, it does not make sense.

from IncrediBILL 1315 Days ago #
Votes: 0

OK, maybe debunk was a bad term, not what I intended.

My point was that even if they managed to purchase a large chunk of data, which appears to have seeded DotBot, technically something crawled the web to get it, even if it was their wallet ;)

from Skitzzo 1315 Days ago #
Votes: 1

So what, if anything, does dotbot do outside of apparently putting up extreme claims on the number of pages they’ve spidered?

Is there any reason all of us shouldn’t block dotbot from accessing anything on our sites?

from neyne 1315 Days ago #
Votes: 1

Who is feeding whom with data here??? Did dotbot just receive an anonymous donation of 7 billion pages? Either they belong to Rand, or sharing data that Rand has bought from other sources was part of the contract between them.

So who do we block now?

It is amazing. And to think I never would have thought of blocking it before the whole story got complicated. How many like me are out there?

from Skitzzo 1315 Days ago #
Votes: 0

@neyne, I’m sure most webmasters out there are like you. That’s what SEOmoz is counting on to fuel their new product.

Also, as randfish said in another thread: "we’re unwilling to provide a clear concise way to keep data out of Linkscape"

So unless anyone here can think of a reason not to, I’ll be blocking dotbot, and many of the other bots on their list of "possible" sources. I’ll also be encouraging other webmasters to do the same.
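For readers who want to confirm that a robots.txt rule actually excludes a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent token `dotbot` and the `example.com` URL are assumptions for illustration; check the crawler's own documentation for the token it honors, and keep in mind that robots.txt only affects bots that choose to obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one crawler and leaves all others alone.
# The token "dotbot" is an assumption, not a confirmed value.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "dotbot", "http://example.com/page.html"))     # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/page.html"))  # True
```

Blocking crawlers that ignore robots.txt requires server-side rules instead, which this sketch does not cover.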

from neyne 1315 Days ago #
Votes: 2

@Skitzzo You mean that is what SEOMoz WAS counting on to fuel their new product. Kind of backfired on them.

Can you just imagine a prospective Linkscape customer doing a little homework before dishing out $230 per month for it, finding all the Sphinn / Michael’s posts / tweets / any other articles talking about this controversy, and wondering how many webmasters are not sharing their sites in this tool? And by the same token, how reliable is the information this tool provides if there is a significant number of savvy webmasters blocking it?

And to think it all could have been avoided by timely, honest, straightforward information.

from IncrediBILL 1314 Days ago #
Votes: 1

@ "significant number of savvy webmasters" Sorry, but the number expressing savvy webmasters is so small it’s statistically insignificant

from neyne 1314 Days ago #
Votes: 0

"Sorry, but the number expressing savvy webmasters is so small it’s statistically insignificant"

Not when compared to the general population. But say you want information about your competitor, who you know is hiring an SEO agency, which is probably aware of the controversy and has made sure both the competitor and all of their websites are excluded from Linkscape. Wouldn’t that diminish the value of the tool for your purposes?

from Skitzzo 1314 Days ago #
Votes: 0

So let me get this straight... either SEOmoz owns this bot and they are now likely lying about the number of pages they have indexed, or they don’t own dotbot and they lied in the promotion of their Linkscape tool?

from neyne 1314 Days ago #
Votes: 3

Either SEOMOZ owns the bot or there is something wrong with ARIN’s whois data :)

http://ws.arin.net/whois/?queryinput=seomoz.org

from Statsfreak96 1313 Days ago #
Votes: 0

I estimated the crawl speed and bandwidth used by 20 popular web crawlers by analyzing the log files of almost 50 websites. The list is sorted by the percentage of those websites crawled during the last 90 days; the second percentage shows the websites crawled during the last 7 days (freshness). The URL and bandwidth figures assume each crawler crawls 150 million other websites at the same crawl rate:

Googlebot - 100%/100% - 4,301 million URL’s a day - 12,104 Mbits/s
Slurp - 100%/88% - 7,286 million URL’s a day - 20,503 Mbits/s
MSNbot - 94%/76% - 1,728 million URL’s a day - 4,863 Mbits/s
Twiceler - 85%/85% - 895 million URL’s a day - 2,518 Mbits/s
Dotbot - 85%/20% - 52 million URL’s a day - 146 Mbits/s
Ask Jeeves - 81%/33% - 44 million URL’s a day - 123 Mbits/s
Exabot - 81%/76% - 20 million URL’s a day - 56 Mbits/s
Yanga - 79%/70% - 780 million URL’s a day - 2,196 Mbits/s
Ia_archiver - 69%/31% - 59 million URL’s a day - 166 Mbits/s
Scoutjet - 63%/63% - 56 million URL’s a day - 157 Mbits/s
Gigabot - 58%/41% - 62 million URL’s a day - 174 Mbits/s
Yandex - 58%/2% - 7 million URL’s a day - 19 Mbits/s
Mj12bot - 56%/10% - 98 million URL’s a day - 274 Mbits/s
Surveybot - 54%/51% - 25 million URL’s a day - 71 Mbits/s
Speedy spider - 52%/29% - 35 million URL’s a day - 99 Mbits/s
Baiduspider - 50%/41% - 893 million URL’s a day - 2,513 Mbits/s
Disco - 50%/12% - 7 million URL’s a day - 20 Mbits/s
Voilabot - 44%/44% - 89 million URL’s a day - 250 Mbits/s
Ng - 44%/8% - 10 million URL’s a day - 27 Mbits/s
Naverbot - 42%/41% - 29 million URL’s a day - 82 Mbits/s

I think there is some confusion about index sizes. When Google claimed they had found 1 trillion unique URL’s, most websites wrote that Google had indexed 1 trillion web pages, which is wrong. Maybe they crawled 100-300 billion unique URL’s and indexed only 50-100 billion as a "search engine index" after removing spam pages and pages with the same content.

SEOmoz can create a 30 billion unique URL "link index" by crawling only 3 billion unique URL’s, because you don’t need to crawl links to make a "link index" of them; to make a "search engine index" you need to crawl them all.

The DotBot crawl speed I estimated is 52 million URL’s a day, which is around 3 billion URL’s in two months. If every crawled URL contains 10-20 unique URL’s, this can be enough to create a 30 billion unique URL "link index".

SEOmoz may very well have used their other listed sources to feed DotBot with all domains, important hostnames (subdomains) and URL’s.
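The link-index estimate in the comment above can be worked through as a rough sketch. The function names are illustrative, and the inputs (52 million URL’s a day, two months taken as 60 days, 10-20 new unique URL’s per crawled page) are the commenter's own estimates rather than measured values.

```python
def crawl_volume(urls_per_day: int, days: int) -> int:
    """Total pages fetched at a constant daily crawl rate."""
    return urls_per_day * days


def link_index_size(crawled_pages: int, unique_urls_per_page: int) -> int:
    """URLs discovered if each crawled page contributes a fixed number of new URLs."""
    return crawled_pages * unique_urls_per_page


def days_to_crawl(total_pages: int, urls_per_day: int) -> float:
    """Days needed to fetch every one of total_pages at the given rate."""
    return total_pages / urls_per_day


crawled = crawl_volume(52_000_000, 60)
print(crawled)                       # 3120000000
print(link_index_size(crawled, 10))  # 31200000000
print(link_index_size(crawled, 20))  # 62400000000

# By contrast, fetching all 30 billion pages at the same rate:
print(int(days_to_crawl(30_000_000_000, 52_000_000)))  # 576
```

Under these assumptions, about 3 billion crawled pages could plausibly yield a 30 billion URL link index, while actually fetching 30 billion pages would take well over a year at the estimated rate.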

from mvandemar 1313 Days ago #
Votes: 0

Direct quote from Rand:

"Yes - We spidered all 30 billion pages."

They are not claiming to have 30 billion links, they are claiming to have the titles and all links, including anchor texts, that they found on 30 billion webpages, which by your estimate would have taken DotBot 576 days, give or take.

from Statsfreak96 1313 Days ago #
Votes: 0

Anchor text is found along with the links, so for that you don’t need to follow the links and crawl them all. If SEOmoz claims they have the title of every link in their "link index", then they need to have crawled them all and left out most (90%) of the links (URL’s) they found but have not followed yet.

Normally you advertise the biggest number, and that’s the number of links in a "link index"; leaving out 90% of the found links doesn’t make obvious sense to me.

from mvandemar 1313 Days ago #
Votes: 0

You’ve lost me. It’s not an ambiguous statement. He flat out said that they spidered 30 billion pages, not that they have a database that contains 30 billion urls.

from Statsfreak96 1313 Days ago #
Votes: 0

Some months back I estimated the index size of some popular search engines by counting the pages in their index for every top-level domain that exists:

Google 28 billion
Yahoo 25 billion
Live 18 billion
Exalead 7 billion
Abacho 2 billion
Gigablast 1 billion

Gigablast has only indexed English language pages, while they crawl pages in all languages.

When Cuil launched their “search engine index” I did some tests and they were close to Yahoo and Google in size. The next day I noticed they cheated on their keyword search result counts by multiplying them by exactly three (x3) for all results, probably to match Yahoo and Google more closely.

30 billion unique URL’s is really a lot to crawl, because you need to deep crawl big websites, which can take up to a year. Gigablast’s "search engine index" was much bigger in the past, when they had indexed all languages, so they probably still have that index for their paying customers, and SEOmoz may mean “our sources” crawled 30 billion pages when they say “we” crawled them.

from Statsfreak96 1313 Days ago #
Votes: 0

I just estimated SEOmoz’s database size by comparing 10 random websites with the Majestic-SEO database:

www.webmd.com - 129,710 Linkscape - 1,026,047 Majestic-SEO - 7.91x
www.cars.com - 376,408 Linkscape - 2,609,501 Majestic-SEO - 6.93x
www.realtor.com - 72,023 Linkscape - 1,504,587 Majestic-SEO - 20.89x
www.foodnetwork.com - 265,400 Linkscape - 2,729,137 Majestic-SEO - 10.28x
www.tv.com - 2,315,514 Linkscape - 6,981,854 Majestic-SEO - 3.02x
www.holidays.net - 5,401 Linkscape - 19,939 Majestic-SEO - 3.69x
www.entertainment.com - 238,670 Linkscape - 16,890,323 Majestic-SEO - 70.77x
www.hel-looks.com - 21,296 Linkscape - 28,363 Majestic-SEO - 1.33x
www.datehookup.com - 55,200 Linkscape - 51,286 Majestic-SEO - 0.93x
www.imdb.com - 436,710 Linkscape - 3,927,727 Majestic-SEO - 8.99x

When I leave out the two highest and lowest values I get 39.14 / 6 = 6.52x

32,690,802,864 / 6.52 = 5,010,912,118 crawled pages
211,051,271,656 / 6.52 = 32,350,364,079 unique URL’s (links)

The deeper the crawl, the fewer of the found URL’s will be unique, so the number of crawled pages will be even lower.
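The averaging step above is a trimmed mean, which can be sketched as follows. The per-site counts are copied from the comment; the function and variable names are illustrative. Note that dropping exactly the two highest and two lowest ratios from these figures gives roughly 6.80x rather than the 6.52x stated, so the comment's number appears to come from a slightly different selection and both should be read as rough estimates.

```python
# (Linkscape count, Majestic-SEO count) per site, as quoted in the comment.
COUNTS = {
    "www.webmd.com": (129_710, 1_026_047),
    "www.cars.com": (376_408, 2_609_501),
    "www.realtor.com": (72_023, 1_504_587),
    "www.foodnetwork.com": (265_400, 2_729_137),
    "www.tv.com": (2_315_514, 6_981_854),
    "www.holidays.net": (5_401, 19_939),
    "www.entertainment.com": (238_670, 16_890_323),
    "www.hel-looks.com": (21_296, 28_363),
    "www.datehookup.com": (55_200, 51_286),
    "www.imdb.com": (436_710, 3_927_727),
}


def trimmed_mean(values, trim=2):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Ratio of Majestic-SEO count to Linkscape count for each site.
ratios = [majestic / linkscape for linkscape, majestic in COUNTS.values()]
average_ratio = trimmed_mean(ratios, trim=2)
print(f"{average_ratio:.2f}x")  # 6.80x
```

Dividing the two database totals in the comment by this ratio instead of 6.52 lowers both estimates by about 4%, which does not change the overall picture.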
