Published: Oct 20, 2008 - 01:42 pm
Story Found By: firebreatherseo 1315 Days ago
Category: SEM
17 Comments
Comments
I hate to debunk part of your post, but when I broke the story that it was DotBot behind Linkscape I also published that DotBot had been seeded with data from another search engine, possibly Amazon's AWS. How I know this is that DotBot has never been allowed to crawl my site, yet DotBot knew in advance of internal pages that it had never seen, which leaves two options: a) DotBot is clairvoyant, or b) DotBot was pre-loaded with data from another source to seed the crawl. Lastly, don't think linearly, as if a server only gets 1 page per second. Most search engines are multi-threaded and pull down 100s of pages per second, which makes it more feasible with a small bank of servers, so starting with a seeded database and running multi-threaded you could pull off a pretty good sized index in a couple of months. It's a simple matter of how much bandwidth you throw at the problem. However, I'll back up your math at the moment based on DotBot's stated stats. It claims to have come online June 10th 2008 and appears to crawl about a page a second based on the displayed stats, so it's been running about 130 days. At that crawl rate it's crawling roughly 86,400 pages per day, at least on that server, which is a total of 11,232,000 pages since it was "born". DotBot currently claims 7,011,392,378, which is impossible based on their refresh rate, but remove the 7,000,000,000 fluff factor and you're left with 11,392,378, which is entirely possible and almost exactly what I calculated. I'm estimating that building that index from scratch in 130 days at DotBot's crawl rate would require 623 concurrent threads, perhaps possible with somewhere from 10-20 servers and a buttload of bandwidth.
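The arithmetic in the comment above can be checked with a short script. This is only a sketch: the one-page-per-second rate, the 130 days online, and the claimed counter value are taken from the comment, while the function names are made up for illustration, and it assumes every thread sustains the same rate with no overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_crawled(days, pages_per_second=1.0):
    """Pages fetched by one crawler running at a constant rate."""
    return int(days * SECONDS_PER_DAY * pages_per_second)


def threads_needed(target_pages, days, pages_per_second_per_thread=1.0):
    """Concurrent threads needed to reach target_pages in the given time,
    assuming every thread sustains the same rate (an idealization)."""
    return target_pages / pages_crawled(days, pages_per_second_per_thread)


days_online = 130                # figure from the comment
claimed_pages = 7_011_392_378    # counter value quoted in the comment

single_crawler_total = pages_crawled(days_online)
thread_estimate = threads_needed(claimed_pages, days_online)

print(f"One crawler, {days_online} days: {single_crawler_total:,} pages")
print(f"Threads needed for the claimed total: about {thread_estimate:.0f}")
```

Under these assumptions the single-crawler total matches the 11,232,000 pages worked out above, and the thread count lands in the same range as the 623 estimated in the comment.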
Bill, I never said that a major search engine couldn't do it, so I have no idea why you think that pointing that out is "debunking". You are absolutely correct, the 11 million is entirely possible, and what I would have expected as well. Even multi-threaded you are still going to run into lag time on the servers that are being spidered, due to connection latency and other factors. Besides, for months now they have been showing that counter based on the reasonable assumption of a page a second... why would that suddenly be off by about 63,600%? Even if they seeded the initial data, it does not make sense.
OK, maybe debunk was a bad term, not what I intended. My point was that even if they managed to purchase a large chunk of data, which appears to have seeded DotBot, technically something crawled the web to get it, even if it was their wallet ;)
So what, if anything, does dotbot do outside of apparently putting up extreme claims on the number of pages they've spidered? Is there any reason all of us shouldn't block dotbot from accessing anything on our sites?
Who is feeding whom with data here? Did dotbot just receive an anonymous donation of 7 billion pages? Either they belong to Rand, or sharing data that Rand has bought from other sources was part of the contract between them. So who do we block now? It is amazing. And to think I never would have thought of blocking it before the whole story got complicated. How many like me are out there?
@neyne, I'm sure most webmasters out there are like you. That's what SEOmoz is counting on to fuel their new product. Also, as randfish said in another thread: "we're unwilling to provide a clear concise way to keep data out of Linkscape". So unless anyone here can think of a reason not to, I'll be blocking dotbot, and many of the other bots on their list of "possible" sources. I'll also be encouraging other webmasters to do the same.
@Skitzzo You mean that is what SEOMoz WAS counting on to fuel their new product. Kind of backfired on them. Can you just imagine a prospective Linkscape customer doing a little homework before dishing out $230 per month for it, finding all the Sphinn / Michael's posts / twitters / any other articles talking about this controversy, and wondering how many webmasters are not sharing their sites in this tool? And by the same token, how reliable is the information this tool provides if there is a significant number of savvy webmasters blocking it? And to think it all could have been avoided by timely, honest, straightforward information.
@ "significant number of savvy webmasters" Sorry, but the number expressing savvy webmasters is so small it's statistically insignificant.
"Sorry, but the number expressing savvy webmasters is so small it's statistically insignificant" Not when compared to the general population, but if you want to get, say, information about your competitor, who you know is hiring an SEO agency, which is probably aware of the controversy and has made sure both the competitor and all of their websites are excluded from Linkscape, wouldn't that diminish the value of the tool for your purposes?
So let me get this straight... either SEOmoz owns this bot and they are now likely lying about the number of pages they have indexed, or they don't own dotbot and they lied in the promotion of their Linkscape tool?
Either SEOMOZ owns the bot or there is something wrong with ARIN's whois data :) http://ws.arin.net/whois/?queryinput=seomoz.org
I estimated the crawl speed and bandwidth used by 20 popular web crawlers by analyzing the log files of almost 50 websites. They are sorted by the percentage of those websites crawled during the last 90 days; the second percentage shows the websites crawled during the last 7 days (freshness). The URL and bandwidth figures assume they crawl 150 million other websites at the same crawl rate:
Googlebot - 100%/100% - 4,301 million URLs a day - 12,104 Mbits/s
Slurp - 100%/88% - 7,286 million URLs a day - 20,503 Mbits/s
MSNbot - 94%/76% - 1,728 million URLs a day - 4,863 Mbits/s
Twiceler - 85%/85% - 895 million URLs a day - 2,518 Mbits/s
Dotbot - 85%/20% - 52 million URLs a day - 146 Mbits/s
Ask Jeeves - 81%/33% - 44 million URLs a day - 123 Mbits/s
Exabot - 81%/76% - 20 million URLs a day - 56 Mbits/s
Yanga - 79%/70% - 780 million URLs a day - 2,196 Mbits/s
Ia_archiver - 69%/31% - 59 million URLs a day - 166 Mbits/s
Scoutjet - 63%/63% - 56 million URLs a day - 157 Mbits/s
Gigabot - 58%/41% - 62 million URLs a day - 174 Mbits/s
Yandex - 58%/2% - 7 million URLs a day - 19 Mbits/s
Mj12bot - 56%/10% - 98 million URLs a day - 274 Mbits/s
Surveybot - 54%/51% - 25 million URLs a day - 71 Mbits/s
Speedy spider - 52%/29% - 35 million URLs a day - 99 Mbits/s
Baiduspider - 50%/41% - 893 million URLs a day - 2,513 Mbits/s
Disco - 50%/12% - 7 million URLs a day - 20 Mbits/s
Voilabot - 44%/44% - 89 million URLs a day - 250 Mbits/s
Ng - 44%/8% - 10 million URLs a day - 27 Mbits/s
Naverbot - 42%/41% - 29 million URLs a day - 82 Mbits/s
I think there is some confusion about index sizes. When Google claimed they had found 1 trillion unique URLs, most websites wrote that Google had indexed 1 trillion web pages, which is wrong. Maybe they crawled 100-300 billion unique URLs and indexed only 50-100 billion as a "search engine index" after removing spam pages and pages with the same content. SEOmoz can create a "link index" of 30 billion unique URLs by crawling only 3 billion unique URLs, because you don't need to crawl links to make a "link index" of them; to make a "search engine index" you need to crawl them all. The DotBot crawl speed I estimated is 52 million URLs a day, which is around 3 billion URLs in two months. If every crawled URL contains 10-20 unique URLs, this can be enough to create a "link index" of 30 billion unique URLs. SEOmoz may very well have used their other listed sources to feed DotBot with all domains, important hostnames (subdomains) and URLs.
Direct quote from Rand: "Yes - We spidered all 30 billion pages." They are not claiming to have 30 billion links, they are claiming to have the titles and all links, including anchor texts, that they found on 30 billion webpages, which by your estimate would have taken DotBot 576 days, give or take.
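Both estimates in this exchange follow from the same crawl rate, and a few lines of arithmetic show how they relate. This is a rough sketch: the 52 million URLs a day, the 10-20 unique links per page, and the 30 billion page target come from the comments above, while treating "two months" as 60 days and the function names are assumptions made for illustration.

```python
URLS_PER_DAY = 52_000_000  # estimated DotBot crawl rate from the comments


def urls_crawled(days):
    """URLs fetched in the given number of days at a constant rate."""
    return URLS_PER_DAY * days


def days_to_crawl(target_urls):
    """Days needed to fetch target_urls at the same constant rate."""
    return target_urls / URLS_PER_DAY


# "Two months" is taken as 60 days here (an assumption).
crawled_in_two_months = urls_crawled(60)

# Link index size if each crawled page yields 10 to 20 unique URLs.
link_index_low = crawled_in_two_months * 10
link_index_high = crawled_in_two_months * 20

# Time to actually spider 30 billion pages at that rate.
days_for_30_billion = days_to_crawl(30_000_000_000)

print(f"Crawled in 60 days: {crawled_in_two_months:,}")
print(f"Link index range: {link_index_low:,} to {link_index_high:,}")
print(f"Days to spider 30 billion pages: {days_for_30_billion:.0f}")
```

The gap between the two readings is visible here: about 3 billion crawled pages could reference 30 billion or more URLs, but fetching 30 billion pages at the same rate would take well over a year.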
Anchor text is found with the links, so for that it's not necessary to follow the links and crawl them all. If SEOmoz claims they have the title of every link in their "link index", then they need to crawl them all and leave out most (90%) of the links (URLs) they found but have not followed yet. Normally you advertise the biggest numbers, and that's the number of links in a "link index"; leaving out 90% of the found links is also not obvious to me.
You've lost me. It's not an ambiguous statement. He flat out said that they spidered 30 billion pages, not that they have a database that contains 30 billion URLs.
Some months back I estimated the index size of some popular search engines by counting the pages in their index for every top-level domain that exists:
Google - 28 billion
Yahoo - 25 billion
Live - 18 billion
Exalead - 7 billion
Abacho - 2 billion
Gigablast - 1 billion
Gigablast has only indexed English language pages, while they crawl pages in all languages. When Cuil launched their “search engine index” I did some tests and they were close to Yahoo and Google in size. The next day I noticed they had cheated on their keyword search result counts by multiplying them by exactly three (x3) for all results, probably to match Yahoo and Google more closely. 30 billion unique URLs is really a lot to crawl, because you need to deep crawl big websites, which can take up to a year. Gigablast's "search engine index" was much bigger in the past, when they had indexed all languages, so they probably still have that index for their paying customers, and perhaps SEOmoz means by “we” crawled that “our sources” crawled 30 billion pages.
I just estimated SEOmoz's database size by comparing 10 random websites with the Majestic-SEO database:
www.webmd.com - 129,710 Linkscape - 1,026,047 Majestic-SEO - 7.91x
www.cars.com - 376,408 Linkscape - 2,609,501 Majestic-SEO - 6.93x
www.realtor.com - 72,023 Linkscape - 1,504,587 Majestic-SEO - 20.89x
www.foodnetwork.com - 265,400 Linkscape - 2,729,137 Majestic-SEO - 10.28x
www.tv.com - 2,315,514 Linkscape - 6,981,854 Majestic-SEO - 3.02x
www.holidays.net - 5,401 Linkscape - 19,939 Majestic-SEO - 3.69x
www.entertainment.com - 238,670 Linkscape - 16,890,323 Majestic-SEO - 70.77x
www.hel-looks.com - 21,296 Linkscape - 28,363 Majestic-SEO - 1.33x
www.datehookup.com - 55,200 Linkscape - 51,286 Majestic-SEO - 0.93x
www.imdb.com - 436,710 Linkscape - 3,927,727 Majestic-SEO - 8.99x
When I leave out the two highest and lowest values I get 39.14 / 6 = 6.52x
32,690,802,864 / 6.52 = 5,010,912,118 crawled pages
211,051,271,656 / 6.52 = 32,350,364,079 unique URLs (links)
The deeper the crawl, the fewer of the found URLs will be unique, so the number of crawled pages will be even lower.
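The trimmed average used in this estimate can be recomputed from the listed counts. The sketch below is an illustration only: the per-site counts and the two Majestic-SEO totals are copied from the comment, and the variable names are invented. It drops exactly the two highest and two lowest ratios, which gives an average of about 6.8x, a little above the 6.52x quoted, so the resulting page estimate comes out somewhat lower.

```python
# (site, Linkscape count, Majestic-SEO count), as listed in the comment
counts = [
    ("www.webmd.com", 129_710, 1_026_047),
    ("www.cars.com", 376_408, 2_609_501),
    ("www.realtor.com", 72_023, 1_504_587),
    ("www.foodnetwork.com", 265_400, 2_729_137),
    ("www.tv.com", 2_315_514, 6_981_854),
    ("www.holidays.net", 5_401, 19_939),
    ("www.entertainment.com", 238_670, 16_890_323),
    ("www.hel-looks.com", 21_296, 28_363),
    ("www.datehookup.com", 55_200, 51_286),
    ("www.imdb.com", 436_710, 3_927_727),
]

# Majestic-SEO to Linkscape ratio per site, sorted ascending
ratios = sorted(majestic / linkscape for _, linkscape, majestic in counts)

# Drop the two lowest and the two highest ratios, average the rest
trimmed = ratios[2:-2]
average_ratio = sum(trimmed) / len(trimmed)

# Totals quoted in the comment (taken as given)
majestic_crawled_pages = 32_690_802_864
majestic_unique_urls = 211_051_271_656

estimated_pages = majestic_crawled_pages / average_ratio
estimated_urls = majestic_unique_urls / average_ratio

print(f"Trimmed average ratio: {average_ratio:.2f}x")
print(f"Estimated crawled pages: {estimated_pages:,.0f}")
print(f"Estimated unique URLs: {estimated_urls:,.0f}")
```

With only ten sites and ratios ranging from under 1x to over 70x, the result is sensitive to which values are excluded, so either figure is best read as an order-of-magnitude estimate.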