Published: Nov 28, 2007 - 06:07 am
Story Found By: steaprok 1537 Days ago
Category: SEM
3 Comments
Comments
This seems like it conflicts with the "Spam Mass" patent that already exists. I believe it's held by Yahoo?
Thanks for sphinning this, steaprok.

I'm not sure that there's a large conflict with Yahoo's paper, but if they use it, the mass estimation approach is likely only part of what they do.

The patent describes a somewhat different approach than the Spam Detection Based on Mass Estimation paper proposes (http://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdf). But there do seem to be some similarities. Instead of grouping the web graph into nodes (or perhaps specific hosts or domains), like the Yahoo paper suggests, the Google patent groups them into clusters in a few possible ways. We're not given a lot of details in the patent on how that is done, exactly. But we are provided with a few possible alternatives, including one that determines clusters by "computing a dense bipartite subgraph of articles comprising doorway articles and target articles, wherein the doorway articles contain links to the target articles" (more on bipartite subgraphs below).

The Google approach described also seems like a broader overview, and looks at links and at content found on pages, while the Yahoo method is more focused, and seems to limit itself to link analysis. That doesn't mean that Yahoo wouldn't also use some kind of content analysis, and they even suggest that "we conjecture that many false positives could be eliminated by complementary (textual) content analysis."

I'm not sure that the Google patent provides enough detail on their clustering to really compare it to the Mass Estimation method. The Google approach seems a little broader, and may be influenced by the kind of thinking expressed in 1999's "Mining the Link Structure of the World Wide Web" (http://citeseer.ist.psu.edu/213063.html), in a boxed section in the middle of the paper titled "Trawling emerging cybercommunities automatically."
Monika Henzinger (one of the inventors on the Google patent) refers to that section of that paper as providing an example of the bipartite subgraphs in a paper that she wrote, "Challenges in Web Search" (http://www.sigir.org/forum/F2002/henzinger.pdf). She tells us there that:

"Typically, link-spam sites have certain patterns of links that are easy to detect, but these patterns can mutate in much the same way as link spam detection techniques. A less heuristic approach to discovering link spam is required. One possibility is, as in the case of text spam, to use a more global analysis of the web instead of merely local page-level or site-level analysis. For example, a cluster of sites that suddenly sprout thousands of new and interlinked webpages is a candidate link-spam site. The work by Ravi Kumar et al. [16] on finding small bipartite clusters in the web is a first step in this direction."

(The footnote leads to the "trawling" paper.)

That paper, which she draws on as her example of bipartite clusters, had a number of co-authors. It looks like at least three or four of them may presently be at Yahoo. :)

Google and Yahoo may not be doing the same things to combat Web spam, but the universe of search engineers at major search engines who work on spam issues is a pretty small one, and I'd think that we have to assume that while they may be using different approaches, they probably have some idea of what the folks at other search engines are doing to fight web spam.
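To make the bipartite subgraph idea a little more concrete, here is a minimal Python sketch of what "doorway pages that all link to the same target pages" could look like on a toy link map. This is only an illustration and not the method from the Google patent or the trawling paper; the function name `find_bipartite_cores`, its thresholds, and the sample page names are all made up for the example, and the brute-force search would not scale to a real web graph.

```python
from itertools import combinations


def find_bipartite_cores(links, min_doorways=3, min_targets=2):
    """Return (doorways, targets) pairs where every doorway links to every target.

    `links` maps a page name to the set of pages it links to.
    Hypothetical helper for illustration only.
    """
    # Invert the link map: target -> set of pages that link to it.
    inlinks = {}
    for page, targets in links.items():
        for target in targets:
            inlinks.setdefault(target, set()).add(page)

    # Only targets with enough inbound links can be part of a core.
    candidates = sorted(t for t, srcs in inlinks.items() if len(srcs) >= min_doorways)

    cores = []
    for target_group in combinations(candidates, min_targets):
        # Pages that link to every target in this group.
        doorways = set.intersection(*(inlinks[t] for t in target_group))
        if len(doorways) >= min_doorways:
            cores.append((sorted(doorways), sorted(target_group)))
    return cores


# Toy data (assumed): three doorway pages pointing at the same two targets.
example_links = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2"},
    "d3": {"t1", "t2", "x"},
    "blog": {"x"},
}

example_cores = find_bipartite_cores(example_links)
```

On the toy data, the only group that qualifies is the three "d" pages all linking to "t1" and "t2". A real system would presumably look for dense rather than complete subgraphs, and would combine this kind of signal with others.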
@Bill, my pleasure. It is a fantastically in-depth post, chock full of info.