Perhaps if Googlebot kept news feed files and regular web pages in separate databases, we would see fewer duplicate content issues pushing web pages into the Supplemental Index.
4 Comments
Comments
A well-reasoned post. Agreed that both pages and feeds serve human audiences, but in slightly different ways. So will it be the lesser of two evils: blocking feeds via robots.txt, which will certainly limit overall visibility, versus the potential decrease in visibility from pages going supplemental?
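For readers weighing that option, here is a minimal sketch of what blocking feeds via robots.txt could look like, checked with Python's standard `urllib.robotparser`. The paths `/feed/` and `/rss.xml` and the `example.com` URLs are assumptions for illustration only; a real site's feed locations will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block the feed URLs, leave post pages crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
Disallow: /rss.xml
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# The feed is blocked for every crawler, including Googlebot...
feed_allowed = parser.can_fetch("Googlebot", "http://example.com/feed/")
# ...while the HTML post page stays crawlable.
post_allowed = parser.can_fetch("Googlebot", "http://example.com/2007/some-post.html")

print("feed crawlable:", feed_allowed)
print("post crawlable:", post_allowed)
```

Note that this blocks the feed for all crawlers, which is exactly the visibility cost raised above.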
There certainly is that trade-off analysis to be done. However, if I'm understanding correctly, the RSS feed itself has only recently been included in the regular search. You still have the possibility that the regular post page will appear in the regular SERP. So it would be natural to have XML-type files (RDF, etc.) appearing only in Blogsearch and the full posts (HTML or whatever) as the only ones appearing in the regular search.
I've seen XML files like RSS feeds in Web search for a while now, probably 1-2 years. When an XML file comes up for a search, usually the source had an optimization problem, was not (yet) indexed, or went supplemental. When XML files appeared on the SERPs they were handled as an "unknown file type", meaning that the indexer just scored the textual contents, ignoring the XML structure and the meaning of XML tags referring to the source. Some XML files have a weird PageRank just from the linked buttons, so it "makes sense" that they rank better than the source. Also, especially with full feeds, the XML's textual contents sometimes *are* more relevant than the single item one wants to see on the SERP, for example when the search query, or its context, spans more than the actual post. I don't think that's a crawling problem, and it has nothing to do with the crawling engine's cache. The Web indexing process needs to "learn more" about XML and feed semantics ;) And when a feed pops up in a raw result set, the query engine should be able to look up the source (item) to provide the searcher with an indented result.
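To make the distinction in this comment concrete, here is a rough sketch using Python's standard `xml.etree.ElementTree`. It contrasts scoring a feed as one flat blob of text (the "unknown file type" treatment) with reading the feed's structure to find the source item's link. The sample feed, the URLs, and the function names `flat_text` and `source_for_query` are all made up for illustration; this is not a description of how any search engine actually works.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with two items.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Feeds and the Supplemental Index</title>
      <link>http://example.com/feeds-supplemental.html</link>
      <description>Why feeds can outrank their source posts.</description>
    </item>
    <item>
      <title>Blocking feeds with robots.txt</title>
      <link>http://example.com/blocking-feeds.html</link>
      <description>The trade-off between visibility and duplication.</description>
    </item>
  </channel>
</rss>
"""

def flat_text(feed_xml: str) -> str:
    """'Unknown file type' view: all text nodes joined, structure ignored."""
    root = ET.fromstring(feed_xml)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

def source_for_query(feed_xml: str, query: str):
    """Feed-aware view: return the <link> of the first item whose
    title or description contains every query term, else None."""
    root = ET.fromstring(feed_xml)
    terms = query.lower().split()
    for item in root.iter("item"):
        text = (item.findtext("title", "") + " "
                + item.findtext("description", "")).lower()
        if all(term in text for term in terms):
            return item.findtext("link")
    return None

print(flat_text(SAMPLE_FEED))
print(source_for_query(SAMPLE_FEED, "robots.txt"))
```

The flat view mixes both posts into one document, which is why a full feed can look more relevant than any single post; the feed-aware view recovers the source page that a searcher would presumably rather land on.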
I've mentioned this Open Letter to Matt Cutts in the comments to his Minty Fresh Indexing post. Seems to me what's in which index is a critical question in these matters.