"A patent granted to Google today explores Web spam and the manipulation of documents and links on the Web. It describes how the rankings of pages may be influenced if they are identified as 'manipulative.'"

Comments

from LilRascal 585 days ago #

This seems like it conflicts with the "Spam Mass" patent that already exists. I believe it's held by Yahoo?

from billslawski 585 days ago #

Thanks for sphinning this, steaprok.


I'm not sure that there's a large conflict with Yahoo's paper. If Yahoo does use the mass estimation approach, it is likely only part of what they do.

The patent describes a somewhat different approach than the Spam Detection Based on Mass Estimation paper proposes (http://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdf).  But there do seem to be some similarities.  Instead of grouping the web graph into nodes (or perhaps specific hosts or domains), like the Yahoo paper suggests, the Google patent groups them into clusters in a few possible ways. 


We're not given a lot of details in the patent on how that is done, exactly.  But we are provided with a few possible alternatives, including one that determines clusters by "computing a dense bipartite subgraph of articles comprising doorway articles and target articles, wherein the doorway articles contain links to the target articles" (more on bipartite subgraphs below).
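To make the doorway/target idea a little more concrete, here is a toy sketch of finding small complete bipartite "cores" in a link graph. This is not the patent's method, which isn't spelled out; the function name `find_bipartite_cores`, the `links` data, and the thresholds are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_bipartite_cores(links, min_doorways=3, num_targets=2):
    """Return (doorways, targets) pairs where every doorway links to every target.

    `links` maps a page to the set of pages it links to. The thresholds
    are hypothetical; the patent does not give any.
    """
    supporters = defaultdict(set)
    for page, outlinks in links.items():
        # Every combination of this page's outlinks is a candidate target set.
        for targets in combinations(sorted(outlinks), num_targets):
            supporters[targets].add(page)
    return [
        (sorted(doorways), list(targets))
        for targets, doorways in sorted(supporters.items())
        if len(doorways) >= min_doorways
    ]


# Hypothetical link data: three doorway pages pointing at the same two targets.
links = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2"},
    "d3": {"t1", "t2", "x"},
    "blog": {"x", "news"},
}
cores = find_bipartite_cores(links)
print(cores)
```

Enumerating combinations like this only works for small target sets and small graphs. Anything at web scale would need pruning or sampling, which this sketch doesn't attempt.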


The Google approach described also seems like a broader overview, and looks at links and at content found on pages, while the Yahoo method is more focused, and seems to limit itself to link analysis.  That doesn't mean that Yahoo wouldn't also use some kind of content analysis, and they even suggest that "we conjecture that many false positives could be eliminated by complementary (textual) content analysis."


I'm not sure that the Google patent provides enough detail on their clustering to really compare it to the Mass Estimation method.  The Google approach seems a little broader, and may be influenced by the kind of thinking expressed in 1999's "Mining the Link Structure of the World Wide Web" (http://citeseer.ist.psu.edu/213063.html), in a boxed section in the middle of the paper titled "Trawling emerging cybercommunities automatically."
 

Monika Henzinger (one of the inventors on the Google patent) refers to that section of that paper as an example of bipartite subgraphs in a paper that she co-wrote, Challenges in Web Search Engines (http://www.sigir.org/forum/F2002/henzinger.pdf).

She tells us there that:

Typically, link-spam sites have certain patterns of links that are easy to detect, but these patterns can mutate in much the same way as link spam detection techniques.  A less heuristic approach to discovering link spam is required.  One possibility is, as in the case of text spam, to use a more global analysis of the web instead of merely local page-level or site-level analysis.  For example, a cluster of sites that suddenly sprout thousands of new and interlinked webpages is a candidate link-spam site.  The work by Ravi Kumar et al. [16] on finding small bipartite clusters in the web is a first step in this direction.

(The footnote leads to the "trawling" paper.)
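The "suddenly sprout thousands of new pages" signal from that quote could be sketched roughly like this. It's only a guess at the general shape, comparing page counts from two crawl snapshots. The function name, the thresholds, and the sample sites are all assumptions, and it ignores the "interlinked" part of the signal entirely.

```python
def flag_sudden_growth(before, after, min_new_pages=1000, min_growth=10.0):
    """Flag sites whose page count jumped sharply between two crawl snapshots.

    `before` and `after` map a site to its page count. Both thresholds
    are hypothetical values chosen for the example.
    """
    flagged = []
    for site, new_count in after.items():
        old_count = before.get(site, 0)
        added = new_count - old_count
        # Treat a previously unseen site as having one page to avoid dividing by zero.
        growth = new_count / max(old_count, 1)
        if added >= min_new_pages and growth >= min_growth:
            flagged.append(site)
    return sorted(flagged)


# Hypothetical snapshots of page counts per site.
before = {"a.example": 40, "b.example": 5000}
after = {"a.example": 4200, "b.example": 5600, "c.example": 30}
suspects = flag_sudden_growth(before, after)
print(suspects)
```

A flagged site is only a candidate, as the quote says. A large legitimate site launch would trip the same check.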

That paper, from which she draws her example of bipartite clusters, had a number of co-authors.  It looks like at least three or four of them may presently be at Yahoo. :)

Google and Yahoo may not be doing the same things to combat Web spam, but the universe of search engineers at major search engines who work on spam issues is a pretty small one, and I'd think that we have to assume that while they may be using different approaches, they probably have some idea of what the folks at other search engines are doing to fight web spam.


from steaprok 585 days ago #

@Bill, my pleasure. It is a fantastically in-depth post, chock full of info.

