
There is a way to build SEOmoz’s Linkscape tool without crawling, and I show you how. If I’m wrong, SEOmoz must disclose the Linkscape User Agent, and I explain why.
26 Comments

Avatar
from DazzlinDonna 2213 Days ago #
Votes: 0

Interesting theory. Either way, they do need to come forth with full disclosure.

Avatar
from IncrediBILL 2213 Days ago #
Votes: -1

I think you missed my post on WMW, where I clearly called out the user agent used to crawl my site, which neatly ties back to them. It certainly wasn’t Yahoo; it was DotBot, also based in Seattle. There could be more things in play, but I didn’t look any further.

Avatar
from eKstreme 2212 Days ago #
Votes: 0

IncrediBILL: I didn’t miss your post. DotBot could be the seed index. On the DotBot website ( http://www.dotnetdotcom.org/ ), they state their index has 10,355,148 pages so far, which is a far cry from the 30+ billion SEOmoz claims. The domain count is also much, much smaller. So DotBot could still be part of the story.

Avatar
from JohnHGohde 2212 Days ago #
Votes: 4

From Yahoo the concept is "potentially unreliable." From Google, it would be potentially reliable. Even if you were interested in the metrics of a particular website, for their proprietary information, it still is only going to give you a bunch of numbers.

In the stock market, this is called Technical Analysis. TA dazzles you with very impressive looking graphs and charts. The only problem is that TA doesn’t WORK. TA is not anymore effective than charting your horoscope is at helping you live your life better.

Just because you’ve got a bunch of charts and numbers documenting what your competition is doing doesn’t magically transform your website into a winner.

I would still say that this SEOmoz buzz is mostly about being able to snowjob your SEO clients with a bunch of impressive looking numbers, charts, and graphs. But, in the end it don’t mean a hoot of beans no matter how you spy on your competition.

Avatar Moderator
from hugoguzman 2212 Days ago #
Votes: 0

@ JohnHGohde - I usually give you a hard time, but the line "TA is not anymore effective than charting your horoscope is at helping you live your life better." is absolute gold!

Avatar
from LtDraper 2212 Days ago #
Votes: 2

@DazzlinDonna - They only need to come forward with full disclosure if they’re using a bot to do the crawling -- a bot is taking something from our pages so we have a right to know. If they’re using some other method, such as an API somewhere else, then they don’t need to disclose anything. Their secrets of how they build the index aren’t ours to demand.

What if SEOmoz was able to work a deal with Google to get access to the linking data? Doubtful, but it’s quite possible they’ve figured out some other way to go through an aggregator to get at the data that doesn’t require that they crawl the sites directly.

Here are some back-of-the-envelope calculations: to crawl 30 billion pages every 60 days, you’d need 1500 processors crawling 4 pages/second each. The problem with crawling isn’t the speed of your machine, it’s the performance over on the other end. You can increase the speed and decrease the processors, but I’m guessing that $1M in venture capital wouldn’t be enough to cover that and also do the software development and pay for the bandwidth. They must be doing something else -- perhaps their VC has a portfolio company that was already doing a crawl. Or perhaps the index isn’t going to get refreshed very often. Either way, they’ve done something pretty cool.
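Editor's sketch: the back-of-the-envelope figures in the comment above can be checked with a few lines of Python. The inputs (30 billion pages, a 60-day refresh, 1500 processors at 4 pages/second each) come from the comment; the function names are made up for illustration, and none of this reflects how Linkscape actually crawls.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages, refresh_days):
    """Sustained crawl rate needed to fetch total_pages once per refresh_days."""
    return total_pages / (refresh_days * SECONDS_PER_DAY)


def fleet_pages_per_second(processors, pages_per_second_each):
    """Aggregate crawl rate of a fleet of identical crawlers."""
    return processors * pages_per_second_each


# Figures quoted in the comment (not confirmed by SEOmoz).
needed = required_pages_per_second(30_000_000_000, 60)
available = fleet_pages_per_second(1500, 4)

print(f"needed:    {needed:,.0f} pages/second")
print(f"available: {available:,} pages/second")
```

Under these assumptions the fleet's 6,000 pages/second sits just above the roughly 5,800 pages/second that a 60-day refresh would need, so the comment's estimate is internally consistent.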

Avatar
from DazzlinDonna 2212 Days ago #
Votes: 1

@LtDraper, this quote is from the original blog post announcing their new tool: "Our crawl biases towards having pages and data from as many domains as possible"

Now "crawl" to me implies a bot, but if that’s not the case, fine - but in that case, we have the right to know that it’s NOT a bot, so that we don’t fret about it, I would think. Certainly, we don’t have to be given proprietary info that has no bearing on us, I agree.

Added: And here - http://www.seomoz.org/linkscape/help/crawler - they specifically say they have a bot. That first graphic says "Linkscape bot crawls sites and page".

Avatar
from DanThies 2212 Days ago #
Votes: 1

There’s a bot all right.

Avatar
from johnandrews 2212 Days ago #
Votes: 1

DomainTools (also here in Seattle, acquired last year for $16m) maintains a database of every domain registered since the mid 1990s. Inside their user interface to that database, they present a thumbnail of the site (scraped by a bot), and provide an "SEO score" (don’t bother). That’s on-demand scraping (with caching obviously) of every domain ever registered. No reason to mention DomainTools here... really... none at all. I’m just noting how "easy" it is to scrape a few billion domains at webmaster expense, and sell the info back to webmasters for a profit. And of course to find domain scraping technology and databases here in Seattle. SEOmoz LinkScrape - you left the door open, so we assumed we were welcome!

Avatar
from randfish 2212 Days ago #
Votes: 1

Sorry - commented on the other post, but not on this one. We do have bots and we are going to be providing ways to say "don’t show me in your list of links" to Linkscape. We’re a little overwhelmed at the moment (and much of the team is here with me in NYC for SMX East), but we’ll be back in Seattle next week and making this a high priority issue.

BTW - Besides DomainTools and SEOmoz, there are several dozen other companies that maintain large scale web crawls similar in size to the search engines for a wide variety of reasons. Most of these have monetization, so as John correctly noted, we’re certainly not alone in this, but I do understand the issue and want to address it ASAP.

Avatar
from JohnHGohde 2211 Days ago #
Votes: 1

Ah, yes, the magic word: Monetized. I guess that I’m just going to have to be more aggressive about blocking monetized bots from accessing my websites? Monetized bots need not apply.

Avatar
from LtDraper 2211 Days ago #
Votes: 1

@JohnHGohde - so I suppose you’ll be blocking the googlebot, the granddaddy of all monetized bots? These aggregators require serious horsepower to accomplish what they do, and the data is valuable. Are the vendors supposed to provide it for free?

Avatar
from IncrediBILL 2211 Days ago #
Votes: 0

@LtDraper - Google shares the wealth in terms of traffic that you can monetize and AdSense revenue sharing which is far from this parasitic behavior.

Avatar
from jimbeetle 2211 Days ago #
Votes: 0

@LtDraper - No, I don’t expect the vendors to supply it for free. Nor do I expect to supply it to the vendors for free. If they send me traffic I can convert, fine; if not, I have to make a decision whether to bar the door or not.

Avatar
from DarkMatter 2211 Days ago #
Votes: 1

Even if all the savvy SEOs block this tool from spidering their sites, the vast majority of sites won’t even notice it exists. In my niche, I’m betting most of my clueless competitors are not even aware of it, and that makes it a potentially valuable tool for me. By the time they wise up, I may have already taken advantage of their data and blocked my sites from showing up.

Avatar Administrator
from dannysullivan 2211 Days ago #
Votes: 9

I guess the idea that this is some great competitive advantage is lost on me. I suppose that’s because I still subscribe to the idea that if you want to know what the most powerful links are for driving traffic for a particular term, you search for that term on the search engines and see what comes up.

Of course, maybe I’ll change my mind after I play with it more. If it really does show you what are the most important links pointing back to a particular site, especially in terms of driving for particular anchor text, then I suppose that’s a nice link building tool to have. When I looked at the beta, I did think it would be nice if it could show me the most important links that my competitors have, so that I could understand what "link gap" might exist.

I suppose if you’re really, really worried, you’d want to block it. But then again, you have to trust that all those mozRank and other figures are really showing what Google’s doing -- and they obviously aren’t, right? I mean, people can only guess at what Google’s doing.

Personally, I’m not too worried. You want to compete with me and get links in places where I’m listed? We get listed in places where editorial rules. So just knowing where we’re at doesn’t get you in the door -- you have to be good enough to walk in. And if you are good enough, well, good I guess.

But some people don’t want to be crawled for various reasons, and I totally respect that. The tool should have announced a way for blocking to be enabled the minute they started spidering. Plenty of other "stealth" projects have done this. You just leave a user agent with a URL to more info that shows up in logs. Good web citizens that run spiders respect robots.txt and respect it at the start. I’ve seen Rand say elsewhere that this will now be enabled, so at least they’re getting into the right place.
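Editor's sketch: the robots.txt blocking described above can be illustrated with Python's standard `urllib.robotparser`. The `dotbot` token is an assumption taken from the DotBot sightings reported elsewhere in this thread; the actual Linkscape user agent had not been disclosed, and the rules and `example.com` URL are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
# "dotbot" is an assumed token, not a confirmed Linkscape user agent.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, agent_token, url):
    """Return True if a crawler identifying as agent_token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


dotbot_allowed = is_allowed(ROBOTS_TXT, "dotbot", "http://example.com/page.html")
googlebot_allowed = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/page.html")

print("dotbot allowed:   ", dotbot_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

This only helps against crawlers that announce a stable user agent and choose to honor robots.txt, which is why the thread keeps asking for the user agent to be disclosed.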

Avatar
from jmleray 2211 Days ago #
Votes: 0

As I commented on http://www.seomoz.org/blog/announcing-seomozs-index-of-the-web-and-the-launch-of-our-linkscape-tool : "Is Linkscape an ad hoc development from SEOmoz or is there a relationship - or even a kind of agreement - between Linkscape and Majestic SEO? On http://www.majesticseo.com/ I’ve got these figures: Index stats: 32,690,802,864 crawled pages (211,051,271,656 unique urls in total) and 81,502,004 unique domains (685,461,105 with subdomains), ~ 1.5 trillion (1.5*10^12) linking relationships. Quite the same figures as the ones quoted above."

I also wrote an email to SEOmoz asking for explanations, but here is the answer: "Unfortunately, I cannot give you a detailed response. At this time, SEOmoz is not revealing the source of our crawl or the operation of Linkscape. Thus, we are neither confirming or denying relationships with other companies."

Jean-Marie

Avatar
from Statsfreak96 2211 Days ago #
Votes: 0

MajesticSEO has had an API since August 4 this year ( http://www.majesticseo.com/api.php ), so every SEO company can resell their data, but I don’t think SEOmoz uses their data.

In my test I see 4-8 week old data in the LinkScape tool against 4 month old data in the MajesticSEO tool. I also get on average 10 times more links in the MajesticSEO tool. For example, if you try phpbb.com you get over 1 billion links at MajesticSEO ( http://www.majesticseo.com/research/top-world-domains-by-backlinks.php ) against 25 million at LinkScape.

Looking at http://www.majesticseo.com/research/anchor-index-quality.php I think Linkscape has at most 30 billion links in their database, and for that you only need to crawl between 1.5 and 3 billion pages, since on average a page contains 10-20 unique links. When you crawl deeper, as MajesticSEO does, this ratio goes down.
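Editor's sketch: the pages-versus-links estimate above is a single division. The 30 billion link total and the 10-20 unique links per page are the commenter's figures, not measured values, and the function name is hypothetical.

```python
def pages_needed(total_links, unique_links_per_page):
    """Pages to crawl to collect total_links at a given links-per-page average."""
    return total_links // unique_links_per_page


TOTAL_LINKS = 30_000_000_000  # commenter's guess at Linkscape's link count

low_estimate = pages_needed(TOTAL_LINKS, 20)   # link-rich pages
high_estimate = pages_needed(TOTAL_LINKS, 10)  # link-poor pages

print(f"{low_estimate:,} to {high_estimate:,} pages")
```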

Avatar
from IncrediBILL 2211 Days ago #
Votes: 1

I’m 100% positive that the attempts to access my site came from DotBot because I feed back tracking data.

Did all of Linkscape’s data come from DotBot? No clue, I can only comment on my tracking data from my own sites.

MajesticSEO on their site shows different tracking data, which indicates Majestic12 did the crawling for my data fed back to their site.

Avatar
from JohnHGohde 2210 Days ago #
Votes: 0

@LtDraper Google provides a FREE search engine, as well as other valuable services. Nor does Google sell my proprietary information for cold hard cash.

This is a wake up call to block most bots from accessing your website for security reasons. WordPress has plugins that block evil bots from accessing your blog. It will be interesting to see if SEOmoz is eventually added to their evil bot lists.

Avatar
from Statsfreak96 2210 Days ago #
Votes: 0

DotBot appeared in my log files on July 28, 2008 and was active for 2 months.

It looks like they use 10 servers to crawl:

208.115.111.242 crawl1.dotnetdotcom.org
208.115.111.243 crawl2.dotnetdotcom.org
208.115.111.244 crawl3.dotnetdotcom.org
208.115.111.245 crawl4.dotnetdotcom.org
208.115.111.246 crawl5.dotnetdotcom.org
208.115.111.247 crawl6.dotnetdotcom.org
208.115.111.248 crawl7.dotnetdotcom.org
208.115.111.249 crawl8.dotnetdotcom.org
208.115.111.250 crawl9.dotnetdotcom.org
208.115.111.251 crawl10.dotnetdotcom.org

Analyzing their 133GB of data, it contains 3,533,635 URLs crawled over 10 days. This suggests a speed of 355,363 URLs a day per server, and 355,363 x 10 servers x 60 days = 213,217,800 URLs in 2 months.

On average they stored 38,737 bytes of data for each URL; that’s about 1Mbit of bandwidth per server, or 10 x 1Mbit = 10Mbit of bandwidth for 10 servers.

If their crawlers were 10 times faster (like the MJ12 crawlers), then they could crawl 2 billion URLs in two months and would need 100Mbit of bandwidth. If they had 100 servers instead of 10 at that same high speed, they could crawl 20 billion URLs in two months but would need 1000Mbit (1Gbit) of bandwidth.

Maybe DotBot provided their data, but it’s more likely they provided only some of the data, because I have only talked about the crawl effort and there is also CPU needed to analyze the data.

Only a few search engines have reached the billion indexed URLs mark: Yahoo, Google, Cuil, Live, Yanga, Ask, Viola, Exalead, Abacho, Gigablast and some Chinese and Eastern European search engines. Some of them sell their crawler technology and do custom analyses of their data set for a lot of dollars.

They could have used Nutch, but the only search engine I know of that managed to scale Nutch above 1 billion pages was sproose.com.
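Editor's sketch: the throughput and bandwidth arithmetic in the comment above, redone in Python. It follows the comment in treating 3,533,635 URLs over 10 days as one server's output, which is an assumption rather than something the DotBot data confirms. Dividing 3,533,635 by 10 gives about 353,364 URLs a day, slightly below the 355,363 quoted, so the totals here differ a little from the comment's.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Inputs quoted in the comment.
URLS_IN_SAMPLE = 3_533_635
SAMPLE_DAYS = 10
BYTES_PER_URL = 38_737
SERVERS = 10
CRAWL_DAYS = 60

# Assumption (as in the comment): the sample is the output of one server.
urls_per_server_per_day = URLS_IN_SAMPLE / SAMPLE_DAYS


def urls_crawled(servers, days, speedup=1):
    """URLs fetched by a fleet of servers over a number of days."""
    return urls_per_server_per_day * speedup * servers * days


def bandwidth_mbit_per_server(speedup=1):
    """Average download bandwidth one server needs, in Mbit/s."""
    bytes_per_second = urls_per_server_per_day * speedup * BYTES_PER_URL / SECONDS_PER_DAY
    return bytes_per_second * 8 / 1_000_000


baseline_urls = urls_crawled(SERVERS, CRAWL_DAYS)
faster_urls = urls_crawled(SERVERS, CRAWL_DAYS, speedup=10)
big_fleet_urls = urls_crawled(100, CRAWL_DAYS, speedup=10)
per_server_mbit = bandwidth_mbit_per_server()

print(f"10 servers, 60 days:            {baseline_urls:,.0f} URLs")
print(f"10 servers, 10x faster:         {faster_urls:,.0f} URLs")
print(f"100 servers, 10x faster:        {big_fleet_urls:,.0f} URLs")
print(f"bandwidth per server (1x):      {per_server_mbit:.2f} Mbit/s")
```

The results land near the comment's round numbers: roughly 0.2 billion URLs at the observed speed, about 2 billion at ten times the speed, about 20 billion with 100 fast servers, and a little over 1 Mbit/s per server at the observed speed.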

Avatar
from johnandrews 2210 Days ago #
Votes: 1

It seems pretty clear to me that SEOmoz’s tools are aimed at a bulk of webmasters not as sophisticated as the ones engaged in these discussions. That’s how I always viewed SEOmoz anyway, and why I have never paid to be a "pro" member. So no big surprise there.

However, blocking the bot is not about blocking some competitive info from leaking out. That sort of thing requires cloaking and such. Blocking the bot is to save money and avoid funding future annoyances (like junior webmasters piggy-backing on your linking partners, or selling reports on my websites to their clients as "competitive intelligence"). I don’t need to dedicate a portion of MY resources to funding RAND’s commercial enterprise. Do the math... a group of us competitive webmasters control a large portion of the web’s traffic (and pay for the bandwidth). As Statsfreak so clearly demonstrated above, someone is consuming GBs of bandwidth to do this crawl, and that river of data flows bi-directionally through two accounts: the crawler’s ISP and my hosting pipe. For every dollar SEOmoz spends on crawling, we spend a dollar on serving pages to that crawler.

That can only mean that when Rand says he needs to charge $799/year to fund this expensive crawler project, we, too, are seeing proportionally expensive bandwidth bills as a consequence. If you ALSO pay Rand that $799/year, then you are paying TWICE for that portion of bandwidth you contributed to his crawl (while Rand is recovering his costs, and then making a profit).

For the SEOmoz pro members on $5 virtual hosting accounts it’s meaningless math - they are funders of everyone’s profit margins. But when I get my S3 bill I see a charge for every bit of data served by my sites, and it all goes against revenues to calculate ROI. Serving data to Google is part of the web game. Serving pages to SEOmoz is not.

Avatar
from scottholdren 2208 Days ago #
Votes: 2

There’s been a lot of confusion about Linkscape, and I’d like to clear up some of the vagaries. Unless you pay per page view, and you only get 5 page views/month, you won’t notice the SEOmoz bot any more than you’ll notice Googlebot, Slurp, or any other robot. I’m assuming SEOmoz’s bot is respecting everyone’s robots.txt; if not, then shame on them.

As for the reasoning behind SEOmoz’s crawling, there’s no other way to assess the relationship between web pages than to crawl the whole web. You cannot reproduce Linkscape using the Yahoo API. I don’t know if the Linkscape data will be useful, or even if it’s an accurate representation of Google’s data, but it is not something you could easily reproduce without the same crawl data that Google has. I’ve been playing with the Linkscape reports, and while the reports are new and exciting, I’m not completely sold yet. I do hope that they are eventually able to produce numbers that are useful for SEO and that it will become a useful tool for developing SEO strategy for our clients.

Avatar Moderator
from Jill 2206 Days ago #
Votes: 0

Any updates to this user agent thing?

Avatar
from linkmoses 2203 Days ago #
Votes: 2

I suppose I have to comment, as my silence regarding Linkscape has been mistaken by several people as meaning I don’t like it. Quite the contrary. I’ll post here what I’ve posted elsewhere, almost verbatim. Over the years I have lost count of the 3rd party link building and link analysis tools and software I’ve tried out, many of which are long gone. What is most telling to me is that I abandon them when it comes time for heavy lifting deep vertical link target ID and evaluation. I won’t go so far as to say “All you need is Google and your brain”, but it’s darn close to true, at least for the type of client content I work with. Linkscape is outstanding and useful for a very specific set of metrics and measures, and for a certain type of link builder will be quite helpful. I commend Rand for it and I will use it to augment my own personal approach to the link building research process. On the other hand, as much as I want and look forward to every new tool, I keep thinking about Rocky IV, where Ivan Drago was using every cutting edge tool and training method available, while Rocky ran around in the snow with a log on his back. The savviest link builders will use tools and logs.

Avatar Moderator
from incrediblehelp 2196 Days ago #
Votes: 0

Yes I would like to know what the user agent is.
