Posted By: rjonesx 337 days ago
Topic Type: News Story (Jump to http://www.thegooglecache.com)
Category: Searching
39 Comments
Comments
There are certain classes of HTML coding error that ensure that parts of a page do not get indexed, and others that ensure that links on the page after the error occurs aren't followed. As a minimum you would want to avoid those.
As a professional coder declaring in the DOCTYPE that you have used HTML 4.01 Strict (or whatever) why would you then not code the page to the standard that you declared it as being? That makes no sense to me. When I do code QA on a site it always leads to discovery of other problems.
There's a well-known SEO whose "contact us" page fails to render in both Opera and Safari due to a simple coding error. It's been like that for over a year now too. It will be interesting to see how long that lasts.
I think the issue is actually whether it is reasonable for a search engine to concern itself with the ability of a webmaster to code properly. Unless there is a strong, causal relationship between proper code and relevancy or popularity, it should not be used as a ranking factor.
Moreover, it is difficult to make a lot of user-generated content validate due to character set issues (someone tries to comment in a foreign language, or use non UTF-8 characters) or the use of HTML.
Regardless, I am glad to see that someone took the time to actually study the issue and see whether it matters or not. I still think validation is good practice. But too many people promote validation as a method of SEO, which it is not.
It is a no brainer. W3C Validation is ONLY good for solving a few obscure display bugs. It has absolutely nothing whatsoever to do with crawlability or with ranking.
You sure about that?
Really sure?
I've fixed a number of sites where a coding error has previously stopped the bot dead in its tracks.
Googlers have made it clear several times there's a limit to what Googlebot can deal with. Try coding a parser - you'll find there's a million ways crappy code can throw off an HTML parser.
Does HTML code need to be bulletproof? Obviously not, but you'd be wrong to believe anything goes.
Everyone is welcome to visit my Natural Organic SEO blog where I discuss the W3C Validation issue in detail on several different pages and posts. Further, I quote Google and a Googler as saying that Validation flat out does not count in Google.
In short, W3C Validation is a total scam that is being promoted literally on thousands of different SEO websites and blogs. W3C Validation does absolutely nothing but tear down the reputation of the SEO Industry in general.
Here again, want to debate this issue? Then bring it up on my SEO blog.
The suggestion isn't that W3C Validation is a ranking factor in terms of algorithmic weight, but that non-valid markup can result in crawling issues.
As g1smd said, if you declare a DOCTYPE, then you are effectively stating "these are the rules that this markup will adhere to". If you're not going to bother ensuring that it follows those rules, why bother?
And I fail to see how W3C Validation is a "scam". It's an effort to promote standards and interoperability on the web. Anyone who has any meaningful experience in web development would know that.
Believe me I have studied the validation issue in great detail. Belief in the value of W3C Validation is based upon superstitious behavior and just plain bad observational skills. The link to the crawlability issue is based upon a total misread of Google's position.
The promotion of W3C Validation is a scam because its promoters are making totally outrageous claims without offering any proof whatsoever.
The best proof of my position comes from the success of Amazon.com, whose home page has 1,000+ validation errors. If the merits of the W3C Validation argument had any truth to them at all, Amazon.com would be dead in the water rather than making millions of dollars.
"Believe me I have studied the validation issue in great detail."
Funny, besides you I've hardly ever seen anyone on Sphinn - not even Rand Fishkin or Danny Sullivan - defend a position by appealing to hir own authority. Even with all the success behind and ahead of him, Rand can't get away with it. So what makes you think you can?
@JohnHGohde
Amazon is always going to be a particularly poor choice for proving a point. Most of us know of several big-name sites who are openly buying links or breaking other Google guidelines. Does that prove that these activities are safe? No. It just shows that some well-known trusted sites are treated differently.
On this point, I know that Amazon have often been mentioned in relation to trusted feeds. I don't know the extent of this but maybe they no longer care about normal crawling anyway?
You can't argue with g1smd and HD - if the spider cannot read the code, it will leave the site. That has to negatively impact the ranking whichever way you look at it. If you meet that basic level then fine, XHTML 1.1 Strict compliance will not give you any kind of rankings boost or advantage.
Personally though I think it's an important point for usability. Rankings and traffic are pointless if the user can't use your site when they arrive. Especially as people are starting to use a wider range of platforms (phones/ PDAs/ Ubuntu) these days. Standard compliant sites are less likely to have display errors.
@Halfdeck
I have placed a challenge on a new blog discussion in Sphinn. And, then there is my SEO blog.
I never appealed to my own authority, here. I simply made a statement of fact.
I have studied the validation issue in enough detail to write 2 separate pages, at least one post, and numerous references to, plus a page or two on crawlability on my SEO blog. These are simple statements of facts.
Now, how about producing some proof of your assertions?
@NickWilsdon
This is beginning to get tiresome. First, you cannot casually discount the example of Amazon.com due to both its extreme number of validation errors and its extreme level of success, by any standard. Next, you have made a number of assertions and assumptions, none of which have been supported by proof of any kind.
This will be my last comment on this Sphinn. Please take it to my blog discussion on Sphinn or to my SEO blog.
You guys are talking about two different issues.
Of course HTML errors that cause a page not to load properly stand a good chance of preventing it from being indexed properly.
But that's a completely different issue than W3C HTML validation. Sure, validating might help spot problem areas, but so does viewing your page in your browser or in multiple browsers.
Even better, view it in Google's text cache or a lynx browser.
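For anyone without lynx to hand, a rough approximation of a text-only view can be sketched with Python's standard `html.parser`. This is only an illustration of simple text extraction, not how Google's cache or lynx actually work; the `TextOnly` class and `text_only` function are made-up names for this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def text_only(html):
    """Return the text a very simple text-only view would show."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body>"
print(text_only(page))
```

If a chunk of copy you can see in a browser is missing from a view like this, that is a hint that the markup around it deserves a closer look.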
@JohnHGohde: Proof of what assertions? That an HTML parser has limits to what it can successfully parse?
The fact is that search spiders are incredibly stupid. This is by design, because with the sheer number of pages that need to be crawled, time and processing resources must be utilized as efficiently as possible. Consequently, a search spider is not going to have the same tolerance for malformed (X)HTML that a user browser such as Firefox or IE would, which are designed to assume the "intent" of a document that has syntactical errors in the markup.
As it has been said already, no one is suggesting that syntactically valid markup has any algorithmic weight. It is a crawlability issue, plain and simple.
Furthermore, writing two pages and a blog post does not constitute proof of your assertions. Quite the opposite, I discovered after reading your article on W3C Validation. Case in point (from the article): "According to them, anybody who cannot knock out HTML coding in notepad don’t know what they are doing. This of course is total nonsense."
That statement, as well as the rest of the article, only proves a lack of any meaningful understanding of the technical implementation of a website. "Nonsense"? HTML is so simple to learn that anyone could create a basic page in notepad after spending an hour or two reading up on it. The only quackery going on here is found in your arguments. Try coding a web spider or HTML parser before you make any more ignorant arguments and continue making a fool of yourself.
@ColinCochrane
As I previously stated, this is no place for a blog discussion. I have posted my response in the blog discussion in Sphinn that I have set up for this purpose.
"As I previously stated, this is no place for a blog discussion."
Why wouldn't this be the place for it?
Let's keep the discussion the same place.
"Proof means, show me. Show me an indexed webpage where the text-only version of Google's cache copy supports your assertion that Google search spider is incredibly stupid.
My assertion is that text-version of Google's cache copy does NOT get there by magic. It gets there by way of their rather intelligent search spider."
Wasting time hunting for such an example is unnecessary. Sufficiently malformed markup will not be parsed properly.
"Thank you for supporting my argument. So far you have shown a total lack of the concept of proof. The people making a claim, ie those who advocate the value of W3C Validation, have the responsibility of proving their assertions. I do not have to prove anything."
If you want to start lecturing us on proper argument, so be it. That said, let's take a look at the first paragraph of your W3C Validation post, shall we?
"The basic crux of the W3C Validation argument is that all modern webpage coding should comply with the original coding standards. In other words, its basic argument is that all progress is inherently bad. Therefore the basic W3C Validation concept is ludicrous."
We can break this down to make the argument clearer:
- The W3C Validation argument states that all modern webpage coding shall comply with the original coding standards.
- The W3C Validation argument states that all progress is inherently bad.
- Therefore, the W3C Validation concept is ludicrous.
The W3C Validation argument states that all modern webpage coding shall comply with the original coding standards.
Well, you seem to like talking about proving assertions, so why did you find it unnecessary to state or reference this supposed argument? You did not define what this supposed "W3C Validation argument" even is. Strike one.
The W3C Validation argument states that all progress is inherently bad.
See above. Strike two.
Therefore, the W3C Validation concept is ludicrous.
Therefore...nothing. This does not follow. Strike three, and you're out. The rules of logic would like a word with you.
"Most of my Natural Health website validates perfectly. But obviously rather than support the value of W3C Validation those who choose to promote this total nonsense always choose to tear down those who kindly ask them to actually prove the value of their superstitious nonsense."
And the final nail in the coffin. For someone who is so irrationally opposed to the W3C, it seems odd that you would take the time to ensure the markup on your site is compliant.
do not feed the trolls.
First, I am very happy that this Sphinn has gone Hot. Now, all this off-topic discussion won't hurt rjonesx.
@MikeDammann
It is an IQ test.
@ColinCochrane
Not only is providing proof not a waste of time, but you then proceeded to make yet another totally unproven assertion in the same paragraph.
My health accredited natural health website validates perfectly at:
naturalhealthperspective.com
Only a couple of days ago, it managed to rank as high as #2 for natural health in the SERPs. The reason I validated it was because I had gotten tired of being repeatedly attacked for having non-validating webpages. The response that I have gotten for my efforts in various seo forums was that I needed to validate my webpages, by the W3C Validation police.
The rest of your comments while cute, do not prove anything. While you seem happy to attack one of my SEO blog pages, you failed to mention that I have no fewer than six (6) external outbound links on that page that support my positions.
Rather than having discounted my argument, your current and prior comments are very revealing. To be more specific YOU stated previously.
No, I really do not think so. Google provides a text-only version of their Google cache copy which conclusively points out whether a given webpage is crawlable or not. And, exactly how Google sees the webpage in question. Yes, not only are you supposed to write markup code directly in notepad, but before you can start on your next post, everyone now has to write their own bot, according to you.
Actually, I am quite good at editing raw HTML source code. And, usually prefer to deal with raw HTML code over using these screwy visual editors (like the one on Sphinn). I am able to deal with extremely messy HTML markup coding, thank you.
There is my proof. Where is your proof? Your repeated reluctance to support any of your assertions with an actual example, from literally millions of offending non-validating webpages to choose from, is very telling.
Show me the beef! Show me your proof.
"Google provides a text-only version of their Google cache copy which conclusively points out whether a given webpage is crawlable or not."
Any brain-dead bot can extract text from HTML.
Google will not necessarily parse a malformed META description tag correctly, for example, as clearly stated by a Googler in a GGWH thread. If an A HREF element is screwed up enough, its components (URL, anchor text, ALT text of an IMG wrapped in an HREF) may not get parsed correctly. Those things have nothing to do with simple text extraction.
Though a somewhat separate issue, if you study Google's snippet generation bot's behavior, you will notice that complicated HTML can confuse the bot into extracting NAV elements as META description instead of content copy or page TITLE. Why does that happen? Because page structure defined by HTML elements (H1, div, TABLE, etc) needs to be referenced to guess at the meat of a page and to avoid extracting irrelevant stuff like copyright text. When those elements are screwed up and the bot can't make an educated guess, it just defaults to using text found at the top of HTML source.
There's more to crawling and indexing than just pulling text off an HTML page.
JohnMu, a Googler I'm sure you're familiar with, regarding broken HTML:
http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/e7ed055b74cb4aaa/
@Halfdeck
I can quote a more experienced Googler:
Then I can quote what Google has officially said in one of their own FAQ webpages.
@JohnMu's Search Comments
That search turned up about 16 webpages that are in fact in the Google Index. They in fact have a cache copy, which means that they are ALL crawlable. The only thing wrong here is their Snippet.
There are certainly faster ways to catch errors of this type than to validate the 3,000 webpages that have been indexed for that particular website.
This type of obscure snippet error certainly does not motivate me to validate my webpages. What it does do, if anything, is to motivate me to continue checking Google's listings for new webpages as they are crawled by Google. That method is a heck of a lot faster than validating 3,000 webpages, one at a time.
Ya thunk any of these pages have screwed up their ranking, bearing in mind the title element is one of the most important things on the page?
http://www.google.com/search?num=100&q=title+meta+name+description+content
http://www.google.com/search?num=100&q=meta+name+description+content
"we don’t have a signal like that in our algorithms"
Who ever claimed Google does?
"usually doesn’t matter that much to search engines"
You can turn that sentence into "validation doesn't matter at all to search engines, ever" in your head if you want, but then you'd be stuffing words into Matt's mouth.
John Mueller didn't say validation or lack of it is generally an issue. He said there are limits. That's a statement of fact, not a recommendation to validate pages, but just to say some broken HTML will make Googlebot choke. That statement doesn't conflict with Matt's statement that usually broken HTML isn't much of an issue with Google.
"There are certainly faster ways to catch errors of this type than to validate the 3,000 webpages"
Heh, who's talking about validating 3,000 pages, one at a time? A few HTML errors aren't anything to sweat over. You don't need to validate a single page if you write relatively clean code. Even if your pages are a mess, Google can deal with most of the errors without any hand-holding.
If the site uses a CMS, and some sort of layout template, then it is only necessary to check a small sample of pages, and then fix the template code in order to fix the entire website.
@ColinCochrane
Actually, I have. I once wrote a Turbo Pascal program that parsed TP programs in DOS and printed out a fancy formatted source code listing. It would be a simple matter of switching from parsing TP to parsing HTML. Nothing particularly difficult about it at all.
@JohnMu search comments
I am currently running an automatic program test.
In some eight years of writing static webpages, I have never once made that type of coding error. Could have happened from the improper use of a text search and replace program.
But, checking for improperly closed title tags is something that any person could do manually. And, I have already provided a number of ways that this type of error can be detected without use of W3C Validation.
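A check of this narrow kind can also be scripted. The following is a minimal sketch, assuming the only thing being tested is whether a page has a title element with a properly formed closing tag; it is not a validator and catches nothing else. The function name `title_is_well_formed` is made up for illustration.

```python
import re

# A title element counts as well formed here only if it has an opening tag,
# a complete closing tag, and no stray "<" swallowed into the title text.
TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_is_well_formed(html):
    """Return True if the page has a properly closed title element."""
    match = TITLE.search(html)
    if match is None:
        return False
    return "<" not in match.group(1)


good = "<html><head><title>Natural Health</title></head></html>"
bad = '<head><title>Natural Health</title <meta name="description" content="x"></head>'

print(title_is_well_formed(good))
print(title_is_well_formed(bad))
```

The second page has the closing bracket missing from the title's end tag, which is the sort of error that leaks markup into the listing.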
It boils down to looking out for a small handful of errors versus going through the extremely time consuming nonsense of W3C Validation balking on so much nonsense that makes absolutely no difference whatsoever.
Currently, my list of coding errors to check for consists of 2 items. Whoopie Do!
@JohnHGohde:
"Actually, I have. I once wrote a Turbo Pascal program that parsed TP programs in DOS and printed out a fancy formatted source code listing. It would be a simple matter of switching from parsing TP to parsing HTML. Nothing particularly difficult about it at all."
Then you should be cognizant of the fact that you sacrifice performance when you increase tolerance for non-standard input. Remember the scale on which a search spider works; any small decrease in performance is multiplied significantly when calculated against the sheer amount of pages being crawled.
"It boils down to looking out for a small handful of errors versus going through the extremely time consuming nonsense of W3C Validation balking on so much nonsense that makes absolutely no difference whatsoever."
It's not time-consuming if you know proper HTML. The standard is set by the W3C, and you either write the markup properly, or you don't. Many validation errors won't cause serious issues with compatibility, or crawling, but that's not the point. If you declare a DOCTYPE you are stating that the markup within that document will follow the rules defined in the DTD, and W3C Validation is the way to check whether or not the document actually follows those rules. If you don't care about those rules, then don't bother with a DOCTYPE.
Those of us who take pride in creating clean markup and code, and prefer to eliminate any possibility of parsing-related crawling issues, will ensure their documents validate. It has nothing to do with superstition, nor is it a desire to waste time. Personally, I find it faster and easier to simply use valid markup, rather than taking time to decide on what specific validation errors are acceptable, and which ones aren't.
*** I once wrote a Turbo Pascal program that parsed TP programs in DOS ***
Sure, but Turbo Pascal is very intolerant of any typos in code. If the code doesn't follow the published standards then it doesn't run at all.
Care to comment on my 2 million real life examples linked to from the searches above? Non-valid code, that Google didn't parse correctly.
@g1smd
Some types of Header Section errors produce a normal looking text-only version of the Google cache copy.
You did a horribly bad job of filtering out the perfectly valid webpages. So, you cannot state that there are 2 million webpages in Google with problems. Furthermore, Google's included numbers in searches are wildly exaggerated in general since Google tends to give you a huge number of totally erroneous hits at the tail end of most searches.
@Halfdeck
First, I am NOT your buddy.
Second, they have a lot in common (ie, the parsing of source code text).
"I once wrote a Turbo Pascal program that parsed TP programs in DOS"
Apples and oranges. Parsing HTML is trivial if all you're looking to do is extract copy. But what appears to be pretty straightforward, like pulling URL and anchor text, can get tricky even with valid HTML, and especially if the code is broken. They won't get parsed unless you anticipate a variety of ways people might mangle A HREF code.
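To make that concrete, here is a small sketch of a deliberately strict link extractor. It is an assumption-laden toy, not a description of how Googlebot parses links: the pattern `STRICT_LINK` and the function `extract_links` are invented for this example, and a real crawler is far more forgiving.

```python
import re

# Accepts only the textbook form: <a href="URL">anchor text</a>
STRICT_LINK = re.compile(r'<a\s+href="([^"<>]*)"\s*>([^<]*)</a>', re.IGNORECASE)


def extract_links(html):
    """Return (url, anchor_text) pairs for links in the strict form only."""
    return [(m.group(1), m.group(2)) for m in STRICT_LINK.finditer(html)]


page = (
    '<a href="/one">One</a>\n'
    '<a href="/two>Two</a>\n'      # closing quote missing
    '<a href=/three">Three</a\n'   # opening quote and final bracket missing
    '<a href="/four">Four</a>\n'
)

for url, text in extract_links(page):
    print(url, text)
```

Only the first and last links come out. Every mangled variant you want to recover needs its own extra handling, which is the point being made above.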
There was a case mentioned on SEOmoz where a URL ending in .0 was dumped by Googlebot. Guess when Google checks for file extensions? That's just one intricacy out of many that has nothing to do with coding TP programs in DOS.
*** you cannot state that there are 2 million webpages in Google with problems ***
Since 91 out of the first 100 results *DO* show the problem, and the SERP returns 2.7 million results, I think that 2 million is a conservative estimate. Even if it were 1 million, then that is still a heck of a lot. You assert there are no problems and I am showing you 1 million, 2 million, whatever, examples.
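The arithmetic behind that estimate is simple proportional scaling. A minimal sketch, assuming the first 100 results are representative of the whole set (a loud assumption, since they are the top-ranked results rather than a random sample, and reported result counts are themselves approximate); the function name `extrapolate` is made up for illustration.

```python
# Figures taken from the comment above.
sample_size = 100
sample_with_problem = 91
reported_results = 2_700_000


def extrapolate(hits, sample, total):
    """Scale the hit rate seen in a sample up to the full result count."""
    return total * hits // sample


estimate = extrapolate(sample_with_problem, sample_size, reported_results)
print(estimate)
```

At that hit rate the projection comes to roughly 2.46 million, so 2 million sits below it.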
*** Don't feed the trolls ***
Good advice. Wish I had paid attention to this gem of a comment a bit earlier.
I feel totally vindicated by this discussion, probably because I have been in fact vindicated.
So far, we see that only a very tiny number of malformed coding errors in the Header Section of your markup coding can at best degrade your search ranking in Google for a given web page by messing up the title and meta description tags for the respective web pages. These types of coding errors do in fact visibly show up in the Google listings with a totally screwed up Snippet as well as in a number of other ways that are likewise detectible. So, there is never any uncertainty as to whether or not some type of undetected W3C Validation problem is screwing with your ranking.
Further, these types of problems are most likely to affect the minority of websites on the Internet that are still using static web pages. Dynamically generated websites are not at all likely to be experiencing these problems. And if they do, one quick fix will automatically fix all of their web pages.
As previously stated there are a number of simple checks that anybody can manually perform that will quickly catch these types of markup coding errors.
Google is in fact totally aware of this problem and will point out these markup coding problems in a number of different ways, such as with a malformed Snippet and with the following.
Ergo, my previously stated position on W3C Validation being a total waste is still substantially correct. I only need to make a slight revision to my published position. All that remains is working out the logistics of publishing a few slight revisions (ie, on a New Blog or simply revise the old one).
Optionally, EVERYONE can safely STOP W3C Validating their web pages once they have passed the Header Section of each web page. Or in other words, the real guts of each web page do not need to be W3C validated at all. This line of reasoning provides a perfectly valid, logical explanation as to why Amazon.com's 1,000+ validation errors do not affect its profitability in the least.
The devil is in the details of the SEO scam, known as W3C Validation. Isn't it very revealing that the W3C Validation police do NOT publicize what website operators really need to know on this issue? I find that fact very revealing.
OK. Enough. I showed you the stuff with the malformed closing tag of the title element and the effect it has on the SERP.
I can't show you the really fucked up pages where the opening tag is malformed, and therefore the title is completely missed as being the title, because there is no search that can bring them up.
Likewise there are millions of pages where the on-page body tags are in error and chunks of content are missed by the bots, and not indexed. Vanessa Fox had one such error on her blog last year that I pointed out at the time.
@g1smd
Try using the text-only version of Google's cache copy, to check / review the text of newly added webpages.
Now visualize the time it would take to work through 1,666 Validation Errors particularly if you have never validated a webpage before.
As ColinCochrane has already pointed out, don't screw up your HTML and you won't have any problems. And, probably the biggest source of problems in that area for static websites would be improper use of search and replace utilities.
There is always a price to be paid for making HTML coding mistakes. The question is whether or not wasting a great deal of time on W3C Validation is going to offset the value of catching what should in most situations be a very low number of serious errors.
Back when I used to build completely static sites, I used to run every page of the site through the W3C HTML Validator. That was a time-consuming job, to cut and paste each URL into the checker box and hit the button.
Soon after that, I changed the way that I worked. I placed a link to the HTML Validator on every page of the site. This link had the /check/referrer/ parameter added on the end. That did away with the cut and paste action, saved a lot of time, and led to being able to check any page with just one click.
Once I moved on to dynamic sites, that link migrated to the common footer so that I didn't have to do any work to include it on new pages. It appeared automatically. By now it was rare to find any errors in my code, just the occasional typo or cut and paste error.
More recently I installed the HTML Validator extension for Mozilla Firefox. That checks every page as it is visited, and puts up a Green Tick if everything is in order, a Yellow Exclamation Mark if there are warnings to be read, and a Red Cross if there are errors to be fixed. It takes no extra time at all.
Validating a site is now as simple as clicking every link within the site to visit each page in turn and making sure the Green Tick comes up. For a dynamic site, especially template-driven, or CMS stuff, only a small sample of pages ever needs to be checked anyway.
hi..In short, W3C Validation is a total scam that is being promoted literally on thousands of different SEO websites and blogs. W3C Validation does absolutely nothing but tear down the reputation of the SEO Industry in general...
and one more thing is your site having good technical information
Perhaps you could tidy that post to show which bit is a quote and which is your thoughts?
You get 15 minutes from when you post to fix it.
Also interesting to think about is the possibility that sites that have many errors would tend to be sites that are more poorly developed overall with less internal linking,
@g1smd
That reminded me that I had the Web Developer ToolBar already installed which appears to do just about everything imaginable.
Checking Validation requires an extra step in this toolbar. Yields ho hum results, as expected. What I really need is a validator that would only flag major validation errors that could mess up your Google ranking.
One interesting tidbit that I came across was that the home page of my Google Page website validated perfectly. How unexpected?