Published: Sep 24, 2007 - 11:48 pm
Story Found By: Eavesy 2582 Days ago
Category: Searching
61 Comments
Comments
I get 1.37 million results for site:dmoz.org.
Algo tweak? Mistake? On purpose? The best we can hope for is an end to the "I can't get in to DMOZ" threads.. Nah, that would be asking too much..
I get 2920 results for site:JohnChow.com
Maybe they forgot to use the noODP metatag ;)
... and the reason that it doesn't rank is that the ODP added a set of site-wide 301 redirects from dmoz.(org|com) and from www.dmoz.com all pointing to www.dmoz.org just a few weeks ago. I expect things to be in turmoil until Google completely recalculates PR and backlinks. I already see some interesting effects, and some patterns, here and there. I guess this will take several, to many, months to right itself: http://www.webmasterworld.com/google/3437548.htm
Too funny to see the amount of so-called SEOs that don't understand simple domain canonicalisation problems and mistake it for a ban. This is the real reason:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz
http://www.google.com/search?num=100&q=site%3Admoz.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Beware of rogue SPACES that Sphinn has inserted into some of those search URLs.
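For anyone who wants to regenerate search URLs like the ones above rather than copy them by hand, here is a minimal Python sketch. The function name `site_query_url` and its parameters are made up for illustration; only the URL shape (`num=100`, a `site:` query, an optional `-inurl:www` exclusion) comes from the links in this thread. Note that `urlencode` also percent-encodes the colon in `-inurl:www`, which should be equivalent to the hand-written form.

```python
from urllib.parse import urlencode


def site_query_url(host, exclude_www=False, num=100):
    """Build a Google search URL for a site: query on one hostname.

    exclude_www adds "-inurl:www" so that only non-www URLs are listed.
    (Hypothetical helper; the URL layout mirrors the links quoted above.)
    """
    query = f"site:{host}"
    if exclude_www:
        query += " -inurl:www"
    return "http://www.google.com/search?" + urlencode({"num": num, "q": query})


if __name__ == "__main__":
    # Hostnames taken from the thread; one query per candidate duplicate.
    for host, bare in [("newhoo.com", True), ("www.newhoo.com", False),
                       ("dmoz.com", True), ("www.dmoz.com", False),
                       ("dmoz.org", True), ("www.dmoz.org", False)]:
        print(site_query_url(host, exclude_www=bare))
```

Building the URLs in code also avoids the stray spaces that the comment form inserts into pasted links.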
G1smd, you do not know what you are talking about, the 301 from the dmoz homepage was recognized weeks ago, there is no reason why this would cause the homepage to get deindexed, something is not right. BTW I never said they were banned, at least not in this post, my blog post was hyped up for digg and I never submitted it here.
*** you do not know what you are talking about *** Hate to disagree. You obviously did not try the searches posted above, or you failed to understand them. The post at "Digg" includes the word "ban", and is wrong.
Didn't I just say the blog post that was not submitted here by myself was hyped up for digg? All of the facts that I have presented in this post are correct.
Still, please check the search strings posted above and all will become very clear. *** Didn't I just say the blog post that was not submitted here by myself was hyped up for digg? *** Ah, that's the other Sphinn thread.... http://sphinn.com/story/6606
I have checked your URLs, all I have said in this thread is that the homepage does not rank number 1. The blog post was digg bait, that despite being more popular than any other post this weekend got buried.
Totally off-topic, and I haven't looked at the Digg post (I hate Digg) but...... Forgive the stupidity on my part, but why on Earth would you post a fabrication on Digg? Is that the New Thing - posting falsehoods on Social Sites? I can see if it was a storytelling post - but if it's about a real, true event - why the lies?
"why the lies?" DLPerry, for exactly the same reason why every media organisation lies. If you don't get that, you shouldn't be in marketing. Or even waste your time at a site that deals with marketing. You didn't think the "Boy eats own head" headline was true did you? ;)
Even if I delete the blog post, I have had over 3500 uniques since last night according to AWStats, good for my Alexa! Also lie is a harsh word, I would say it was more of an exaggeration.
<sigh> I guess I was hoping for something a bit more positive. Oh well - it seems the familiar old "because everyone else does it" and "because that's the way it is" mentality is still the way to go.</sigh>
I must've missed the "Boy eats own head" headline - sounds like a tabloid piece - which I personally would never believe - mainly because of the sensationalist tactics they employ, and the "Liar Liar Pants on Fire" reputation they have earned.
And you know - I do keep seeing posts complaining about how SEO and online marketing have been getting a bad reputation. Maybe this lie tactic has something to do with that? just my old and idealistic .02 :)
Cross linking a few items. Also on Sphinn here: http://sphinn.com/story/6606 http://sphinn.com/story/6644 From what I can see, the home page is NOT in Google. Nor is Dmoz ranking for things it totally should be ranking for: http://www.google.com/search?q=open%20directory%20project http://www.google.com/search?hl=en&q=odp Like some others, I wondered if it had been hit by the Great Directory Ban Of Sept. 2007? http://searchengineland.com/070920-085657.php http://sphinn.com/story/4415 http://www.seomoz.org/blog/what-makes-a-good-web-directory-and-why-google-penalized-dozens-of-bad-ones
Nope, internal pages aren't missing. But the home page is -- and why seems a mystery to me. I don't see Dmoz itself having any blocks on it. Other thoughts?
I think it's just duped out... This one: http://core-n02.dmoz.aol.com:30080/ is in the top 10 for "dmoz".
Not sure why every single page *but* the home page is indexed. We're looking into it.
We can only hope that this is the start of a downward spiral.
A boy ate his own head?? I really think it's just a dupe content issue brought up by all the redirects they've set up. Further complicating the issue is the fact that while all of their internal home page links are pointing to http://dmoz.org, they're all being redirected to http://www.dmoz.org. They should update those links ASAP. You don't use a 301 to fix a link when you can just change the link without causing any harm, especially when it's on every page of your domain.
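As a rough illustration of "just change the link", here is a small Python sketch that rewrites a link on a non-canonical hostname so it points straight at the canonical one instead of relying on a 301. The host list is taken from the domains named in this thread; the function name `canonicalise` is hypothetical, and nothing here describes how the ODP actually edits its templates.

```python
from urllib.parse import urlsplit, urlunsplit

# Canonical host and non-canonical aliases, as named in this thread.
CANONICAL_HOST = "www.dmoz.org"
NON_CANONICAL_HOSTS = {
    "dmoz.org", "dmoz.com", "www.dmoz.com", "newhoo.com", "www.newhoo.com",
}


def canonicalise(url):
    """Return url with a known non-canonical host swapped for the canonical one.

    Path, query, and fragment are preserved. Any port on a rewritten URL is
    dropped, on the assumption that the canonical site uses the default port.
    URLs on other hosts are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname in NON_CANONICAL_HOSTS:
        parts = parts._replace(netloc=CANONICAL_HOST)
    return urlunsplit(parts)


if __name__ == "__main__":
    print(canonicalise("http://dmoz.org/Arts/"))  # rewritten to the www host
    print(canonicalise("http://www.dmoz.org/"))   # already canonical
```

Running something like this over the hard-coded internal links once means visitors and crawlers never hit the redirect at all.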
Why should Google rank the ODP? Google Directory is the same and DMOZ is duplicate content ;-)
The sad thing is that so many people feel that dmoz is legit and think there's a lot of value in being listed in it. It's been corrupt for a while now, it just needs to go away.
It is a very simple domain canonicalisation problem, just like the many hundreds that I have written about many times before. It shouldn't take you very long to discover which URL the Root Page, and all the Top Level categories, are indexed under. I'll also guess that it won't take Google's indexing system much more than a month to realise what is going on and fix the problem. Heck, in one of the posts above, I even listed some Google searches that totally show the reason for the problem.
Maybe I give people far too much credit in being able to understand site: searches and what they show you.
Hey all, I dug into this a little bit with the help of a couple crawl folks. It looks like when Googlebot tried to fetch http://www.dmoz.org/, we got a 301 redirect back to http://www.dmoz.org/ . It looks like that self-loop has been going on for several days. We were last able to fetch the root page successfully on Sept. 10th, but from that point on DMOZ was returning these 301-to-itself pages, and after a few days Googlebot gave up on trying to fetch the url. It looks like the rest of the site is fine, so I suspect that if DMOZ gets 301/redirects for their root page sorted out on their webserver, we'll recrawl and index the page pretty quickly. DLPerry, keep the faith. If you read back over the comments, several people (g1smd, jdevalk) suggested reasonable explanations instead of going right to "ZOMG! Google hatez da Moz!?!" :)
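To make the "301-to-itself" idea concrete, here is a minimal Python sketch of a redirect follower that notices when a URL redirects back to a URL it has already visited. It works on a plain dict of redirects rather than live HTTP, and the names (`follow_redirects`, `max_hops`) are invented for this example; it is not a description of how Googlebot actually handles redirects.

```python
def follow_redirects(start_url, redirects, max_hops=10):
    """Follow a redirect chain described by a dict {url: redirect_target}.

    Returns (url, status), where status is:
      "ok"            - reached a URL that does not redirect
      "loop"          - came back to a URL already visited
      "too_many_hops" - gave up after max_hops redirects
    """
    seen = set()
    url = start_url
    for _ in range(max_hops):
        if url not in redirects:
            return url, "ok"
        if url in seen:
            return url, "loop"
        seen.add(url)
        url = redirects[url]
    return url, "too_many_hops"


if __name__ == "__main__":
    # The situation described above: the root page 301s to itself.
    self_loop = {"http://www.dmoz.org/": "http://www.dmoz.org/"}
    print(follow_redirects("http://www.dmoz.org/", self_loop))

    # A healthy canonicalisation redirect: non-www to www, then a real page.
    healthy = {"http://dmoz.org/": "http://www.dmoz.org/"}
    print(follow_redirects("http://dmoz.org/", healthy))
```

A browser does much the same thing and stops with a "redirection limit" style error, which is why a self-loop served to everyone would normally be noticed quickly.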
Ah, a voice of reason :)
Odd, they aren't redirecting currently, it shows a 200, then again I'm not Googlebot, was only Googlebot getting the redirect? http://oyoy.eu/page/headers/?full=1&url=http://www.dmoz.org/
JohnWeb, just from a cursory glance (i.e. take this with a grain of salt), it looked like we might have been able to fetch a valid page earlier today, so they might have already made a change. It's always possible that they were doing something specific for Google's IP range, but my hope is that folks on the DMOZ side have figured it out themselves and this will sort itself out without too much extra trouble.
That makes sense, 16 hours ago DMOZ (if that's his/her real name) said they were looking into it. I've seen dozens of sites in GWHG with their homepage removed and would never have thought of "Googlebot gave up" as the reason. Live and learn I guess.
I was working with a client whose home page poofed. In his case, he had one url 301 redirecting to the home page. He requested that redirecting URL to be removed. A few days later, the removed URL was showing 11K backlinks in Webmaster tools, as if it was the home page. And then the home page was nowhere to be found in the SERPs. So I told him to remove the redirect and have the url issue a 404 instead, and then I told him to request a URL inclusion. Few days after that, his home page came back. Can't really say what fixed the problem though, but that was the first time I saw the home page just disappear from the SERPs.
Halfdeck, what is an "URL inclusion" request? A reconsideration request? or a URL submission?
I think this is a pretty good example of why people need to scream "manual penalty!" second, and explore all other reasonable options first. :)
Thanks, Matt. Wow -- lesson learned, never 301 back to the same page. Not that I would.
And remember -- never type Google into Google or you'll break the internet: http://googlified.com/2007if-you-type-google-into-google/
I am not sure how an infinite loop could possibly have happened. I have looked at the canonical (www) root page several times per day in the last few weeks, and it did not ever redirect for me. There were redirects set up for (www.)newhoo.com and for (www.)dmoz.com and for dmoz.org. All of those pointed to www.dmoz.org/ . No-one would have been able to access the Root Index page at all had it been redirecting. Users would have been simply presented with a "Redirection Limit Exceeded" error message from their browser. Such a redirect would have been noticed long ago. There have been no such reports. Matt. Are you sure you didn't misread the logs and were actually looking at requests for www.dmoz.com/ being redirected to www.dmoz.org/ perhaps?
As you know, the ODP made some (hardware) infrastructure changes almost a month ago, and as a part of that, a non-canonical URL was accidentally exposed for indexing. I gave some very big clues in the search strings above. One of them returns the root page and 105 000 categories. The Root and the whole of the directory Top Levels are now all fully indexed under that alternative domain. The URL is that of one of the load-balancing servers. See: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz and that is an error. Google has been picking up those URLs since at least the beginning of August.
AOL server techs are well aware of the issues and are working on various fixes. It just goes to show that when doing a large amount of work, server upgrades, and implementing various load-balancing changes, as well as starting to sort out various domain canonicalisation issues, that something can easily go wrong if you do things in the wrong order or miss out a step.
One issue that is already being addressed is that some internal links (mostly on informational pages) are hard-coded to point to dmoz.org (non-www) URLs and those are now all being edited to point to the www version instead.
That will be ongoing for many weeks.
g1smd Another great find again. Canonical URLs & Google ....so many miss this fix straight off they spend years wondering why Google doesn't give them the love they think they should get. Keep up the good work and remember Matt C works for Google.com, so not all he posts is 100% factual. Peace!
g1smd latest post looks strange in my Firefox. I have tried to gather it.
g1smd wrote: I am not sure how an infinite loop could possibly have happened. I have looked at the canonical (www) root page several times per day in the last few weeks, and it did not ever redirect for me. There were redirects set up for (www.)newhoo.com and for (www.)dmoz.com and for dmoz.org. All of those pointed to www.dmoz.org/ . No-one would have been able to access the Root Index page at all had it been redirecting. Users would have been simply presented with a "Redirection Limit Exceeded" error message from their browser. Such a redirect would have been noticed long ago. There have been no such reports. Matt. Are you sure you didn't misread the logs and were actually looking at requests for www.dmoz.com/ being redirected to www.dmoz.org/ perhaps?
As you know, the ODP made some (hardware) infrastructure changes almost a month ago, and as a part of that, a non-canonical URL was accidentally exposed for indexing. I gave some very big clues in the search strings above. One of them returns the root page and 105 000 categories. The Root and the whole of the directory Top Levels are now all fully indexed under that alternative domain. The URL is that of one of the load-balancing servers. See: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz and that is an error. Google has been picking up those URLs since at least the beginning of August.
AOL server techs are well aware of the issues and are working on various fixes. It just goes to show that when doing a large amount of work, server upgrades, and implementing various load-balancing changes, as well as starting to sort out various domain canonicalisation issues, that something can easily go wrong if you do things in the wrong order or miss out a step.
One issue that is already being addressed is that some internal links (mostly on informational pages) are hard-coded to point to dmoz.org (non-www) URLs and those are now all being edited to point to the www version instead. That will be ongoing for many weeks.
There is a Sphinn bug if you go back and edit a post. I already reported that one over at http://sphinn.com/story/6814#c9643
So essentially IF Matt's post on what Googlebot encountered was not Googlebot getting it totally wrong, then DMOZ has to have been cloaking it to serve it only to Google (and possibly other search engines). Otherwise tons more people would have noticed the redirecting issue because it effectively closes down that page - I've done it before when testing :) DMOZ/AOL, please weigh in on this. It'd be interesting to find out if googlebot just royally stuffed up a 301 or if it really was an incorrectly implemented cloaked 301.
What would happen if the Webmaster Tools preferences were set to "non-www" and the site then had the non-www to www redirect implemented a few months later, without changing the Webmaster Tools setting?
Whatever was going on with (www.)?(dmoz|newhoo).(com|org) this search is the key http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz to unravelling the overall effect.
"What would happen if the Webmaster Tools preferences were set to "non-www" and the site then had the non-www to www redirect implemented a few months later, without changing the Webmaster Tools setting?" Interesting question. I don't think cloaking 301s like this is standard practice so it'll be interesting to find out if this was indeed a googlebot bug or not....
So who is in control of 301s at DMOZ, the Metas? (sorry I couldn't resist). The 301 loop sounds about on-par for how DMOZ is run in general. I continue to stand amazed that no one touches AOL's unwanted step child, as she stands broken and abused by the "system". For those that speculated that Google is anti-DMOZ, it's far from the truth, and in fact, on the contrary, they are "in bed" with DMOZ... and that's the only divorce I will openly support.
Matt's on this thread, so perhaps he can shed some light as to why Google would partner with DMOZ amidst all of the negative PR in the industry, and the fact that it is WAY outdated because no one is really working to keep things in order. I am just amazed that they would consider it a quality resource given that over the past few years, thousands of sites have not been added, or even worse, removed (not speculation, first hand info here - I am a former editor).
JohnWeb, just to be 100% clear, "Googlebot gave up" is not the root reason. I was just introducing a bit of levity. The real reason was of course the infinite redirect loop that lasted for days. If I 301 page A to point back to page A and do that infinite loop for a week (or more), it's probably a bad user experience to return that infinite loop to users. But if the loop stops, then our system is set up to get the page again fairly quickly. iBrian, well-said. Danny, it's pretty rare to see a site do an infinite redirect loop like that, but it does happen. g1smd, I'm pretty sure that I was looking at www.dmoz.org, not dmoz.com, but I was just doing a quick/lightweight check, so I won't claim to be 100% positive.
Matt: I see no evidence of any sort of loop being created for www.dmoz.org at any time in the last few months, so I am quite perplexed. I have been keeping a careful eye out for redirect and canonicalisation issues as the changes have been made, as you might imagine. I do see Google gobbling up alternative URLs for one of the load-balancing servers for the last couple of months though.
MattCutts, thanks once again for your clarification on my miscommunication.
g1smd, I only did a cursory dig and that's what it looked like at that point. I've been asking about it more, and it looks like dmoz's 301 might have interacted badly with a heuristic on Google's side. I'm still keeping an eye on it and I'll bug the crawl team until everything looks good.
Out of curiosity, I've managed to create a page that will return a 301 but not redirect so that the browser will show the content of the page. http://www.jlh-design.com/2007/09/googlebot-gave-up/#comment-5301 Could DMOZ screw up their code this much? I'd imagine it's possible. Curiously enough, though a browser shows the page content (appearing normal to a user), the online header checker I used makes an "assumption" that the page should redirect to itself.
This may not have been the exact mechanism for the DMOZ page dropping out of Google, but at least I can replicate it somewhat. Coming up next week on Myth Busters, Jamie and JohnWeb...
Thanks Matt. I'll ping the AOL server techs with your comments. I do now see stuff like this reappearing in a site:www.dmoz.org search: www.dmoz.org/Arts/ - 14 hours ago - Similar pages
All is right with the world again. Full story here: http://blog.dmoz.org/2007/09/26/the-search-for-dmoz/
"part of an index recognizing, adjusting and updating in real time" You really believe the people here are going to fall for that?
JLH, by "URL inclusion" I mean going into Webmaster Tools, going to the "Removed URLs" console and clicking on "reinclude" or something like that. I haven't tried it myself so I'm not sure what the UI is called exactly. "part of an index recognizing, adjusting and updating in real time" You really believe the people here are going to fall for that?
What's amusing to me is the blog post is tagged "Truth" :)
Halfdeck, Okay, thanks for the answer, I knew I was missing something.
Well, about 80 000 pages from www.dmoz.org reappeared overnight (UK time) in Google's index, starting about the time that Matt Cutts posted here... so how much more realtime do you wanna get with this stuff?
Matt, I think the issue here is bigger than what initially appears.
An incredible number of pages whose URLs have been changed to http://www.dmoz.org/... from http://dmoz.org/... during their process of canonicalisation are currently not in Google cache and their links are not being considered by Google. As an example:
A google search for "Academy of Canadian Cinema and Television" site:www.dmoz.com, brings no results, although these words are clearly on the page: http://www.dmoz.org/Regional/North_America/Canada/Arts_and_Entertainment/
So at the moment Google does not see this link from DMOZ, together with millions of other links in other pages on DMOZ.
This is affecting millions of websites and of course an incredible amount of Google search results, until the canonicalised http://www.dmoz.org/... pages are indexed and cached in Google.
I think that the engineers at Google should have a proper look into this, trying to index all the DMOZ pages with new URLs as soon as possible. In fact a huge number of searches and even tests at Google on new algorithms might be altered by this effect.
I would give them a couple of weeks or more to discover everything.
At one page per second, they can spider 86 400 pages per day.
The ODP has about 20 times that amount of pages (also counting category descriptions, guidelines, FAQ pages, profiles, etc).
See also: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz
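The arithmetic in that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the one-page-per-second rate and the "about 20 times" multiplier are the figures quoted in the comment above, not measured values, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(pages_per_second=1):
    """Pages a crawler can fetch in one day at a steady rate."""
    return pages_per_second * SECONDS_PER_DAY


def days_to_crawl(total_pages, pages_per_second=1):
    """Days needed to fetch total_pages at a steady rate."""
    return total_pages / pages_per_day(pages_per_second)


if __name__ == "__main__":
    daily = pages_per_day()        # 86 400 at one page per second
    estimated_site = 20 * daily    # "about 20 times that amount"
    print(daily, estimated_site, days_to_crawl(estimated_site))
```

At that steady rate a site of roughly 1.7 million pages takes about 20 days to cover once, which is in line with "a couple of weeks or more".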
The ODP content was previously available through more than 30 different domains and direct IP addresses. These had been hosted at various times by Netscape, Mozilla, and AOL.
A few months ago, along with some necessary hardware changes and upgrades, everything was reconfigured so that just www.dmoz.org became the canonical domain.
At first, there were a few glitches showing in the listings within Google SERPs. Several domains were missed in the canonicalisation fixes, and were rapidly indexed in preference to www.dmoz.org by Google.
Once those holes were plugged, Google began to slowly re-index the other non-canonical versions of the directory. Some of the URLs dropped into the Supplemental Index, but most of them were de-indexed.
After just a few months, there are just a few hundred incorrect URLs showing up. Most of the problem URLs have now been completely de-indexed.
The main listings for www.dmoz.org show almost one million URLs indexed in Google when using the site:www.dmoz.org search.
The job is now just about complete.
Some of the comments above are now displayed in the wrong order after being edited to remove some formatting issues.
The correct order can be deduced from the post number (behind the # link on each post) rather than from the post date.
Everything is now back on track.
See that the Duplicate Content has fallen to almost zero URLs indexed:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore.dmoz.aol.com
http://www.google.com/search?num=100&q=site%3Adirectory.mozilla.org
http://www.google.com/search?num=100&q=site%3A207.200.81.183
http://www.google.com/search?num=100&q=site%3A207.200.81.184
The Canonical Domain now has almost a million pages indexed:
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Some Supplemental Results can hang around for a very long time:
http://www.google.com/search?num=100&q=site%3A207.200.81.154
The & bug mashes the URLs, and stops them working.
Remove the amp; bit from the URL to get it to work.
Everything is now back on track for ODP site re-indexing.
See that the Duplicate Content has fallen to almost zero URLs indexed:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Anewhoo.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.org
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore.dmoz.aol.com
http://www.google.com/search?num=100&q=site%3Adirectory.mozilla.org
http://www.google.com/search?num=100&q=site%3Agnuhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.gnuhoo.com
http://www.google.com/search?num=100&q=site%3Agnuhoo.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.gnuhoo.org
http://www.google.com/search?num=100&q=site%3A207.200.81.135
http://www.google.com/search?num=100&q=site%3A207.200.81.139
http://www.google.com/search?num=100&q=site%3A207.200.81.140
http://www.google.com/search?num=100&q=site%3A207.200.81.175
http://www.google.com/search?num=100&q=site%3A207.200.81.183
http://www.google.com/search?num=100&q=site%3A207.200.81.184
http://www.google.com/search?num=100&q=site%3A207.126.111.202
http://www.google.com/search?num=100&q=site%3A207.126.111.231
The Canonical Domain now has almost a million pages indexed:
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Some Supplemental Results can hang around for a very long time:
http://www.google.com/search?num=100&q=site%3A207.200.81.154
That IP address has been out of use for a long time.
Including the direct IP address accesses, and various sub-domain and load-balancer URLs, there used to be ~34 ways to get to ODP content as hosted by Netscape/AOL servers. Now there is only one way.
The & bug mashes the URLs, and stops them working.
Remove the amp; bit from the URL to get it to work.
The last few hundred listed URLs have now become the last few dozen to still show.
I am guessing that the problem will be completely fixed in the next few weeks.
I gave up on DMOZ long ago, they never accept any of my sites ;-)
@CharlesBM - yeah, i stopped trying to get in dmoz too