Published: Mar 08, 2008 - 02:27 pm
Story Found By: vangogh 1541 Days ago
Category: SEO
Ruud's How Search Really Works Series.
2 Comments
Comments
"New stuff you find on the web goes into another, more temporary index."

Not sure where you got this, Ruud. Freshness is a factor that helps pages remain in the main index, not the other way around. Also, many pages, from my experience, start off in the main index, before turning supplemental.

"In such an index you can’t randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update."

An interesting argument, but the conclusion at the very least assumes Google avoids re-sorting on every update.
"An interesting argument, but the conclusion at the very least assumes Google avoids re-sorting on every update."

Yup. And for the model described it's not an assumption but a factual conclusion :)

In some things that Google does, or did, you can recognize the fingerprints of setups common to information retrieval.

Building a sorted positional inverted index takes huge amounts of memory. The common way to deal with that is to sort it in parts and merge those parts back together.

The common problem with that setup is that once you have sorted the index it becomes harder to insert fresh documents and dictionary terms into it.

And the common way to handle that situation is to work with two indexes, two dictionaries: a main one and a supplemental one with fresh, new stuff that hasn't yet been folded back into the main index.

These are all basic facts and this is how Google has worked for most of its life. What they got much, much better at was folding updates back into the main index, up to the point where they didn't need a fresh crawl anymore.

This is just the basic technical setup; any search engine engineer can tweak or improve the way this works. For example, I can take less popular (= less useful) information out of the main index and into a supplemental one. It becomes a form of lossy compression: take the top 1000 postings for a query and discard the ones under it, keeping them ready to search only when the top 1000 doesn't answer the query.

In that version of the model freshness or PageRank or indeed any other factor can be used to determine whether it's worth the resources to keep working with a specific document.

"Freshness is a factor that helps pages remain in the main index, not the other way around. Also, many pages, from my experience, start off in the main index, before turning supplemental."

I think you're referring to supplemental results, not a or the supplemental index.

Besides which, although these days the basic framework is still pretty much as described in my post, something very essential has changed in a big (daddy) way: the way Google stores the index. The way they (now) generate and store the index and the dictionary has made main and supplemental as obsolete as ignoring stop words.

Of course, more about that in next week's post :)
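
For readers who want to see the generic setup from the reply in code, here is a minimal Python sketch of a sorted positional index built in parts, plus a small "fresh" index that is folded back into the main one later. This is textbook information retrieval, not Google's actual implementation. The names `build_run`, `merge_runs`, `lookup` and `TwoTierIndex` are made up for illustration, tokenizing is a plain whitespace split, and document updates and deletes are not handled.

```python
import bisect
import heapq


def build_run(docs):
    """Turn one batch of documents into a sorted run of postings.

    docs maps doc_id -> text; each posting is (term, doc_id, position).
    """
    postings = [
        (term, doc_id, pos)
        for doc_id, text in docs.items()
        for pos, term in enumerate(text.lower().split())
    ]
    postings.sort()  # only this one batch has to be sorted at a time
    return postings


def merge_runs(*runs):
    """Merge already-sorted runs into one sorted run without a full re-sort."""
    return list(heapq.merge(*runs))


def lookup(run, term):
    """Binary-search a sorted run; return {doc_id: [positions]} for term."""
    hits = {}
    # (term,) sorts just before every (term, doc_id, pos) posting for that term.
    i = bisect.bisect_left(run, (term,))
    while i < len(run) and run[i][0] == term:
        _, doc_id, pos = run[i]
        hits.setdefault(doc_id, []).append(pos)
        i += 1
    return hits


class TwoTierIndex:
    """Hypothetical main + fresh index pair (assumes doc ids never repeat)."""

    def __init__(self, batches):
        # Sort the collection in parts, then merge the parts back together.
        self.main = merge_runs(*(build_run(batch) for batch in batches))
        self.fresh = []  # small sorted run holding newly found documents

    def add(self, doc_id, text):
        # New documents only touch the small fresh run; main stays as it is.
        self.fresh = merge_runs(self.fresh, build_run({doc_id: text}))

    def search(self, term):
        # A query consults both indexes and combines the hits.
        return {**lookup(self.main, term), **lookup(self.fresh, term)}

    def fold(self):
        # Fold the fresh run back into main with one merge pass.
        self.main = merge_runs(self.main, self.fresh)
        self.fresh = []
```

In this sketch `add` is cheap because it only rebuilds the small fresh run, while `fold` pays for one pass over the whole main run. How often to fold is the trade-off the reply describes Google getting much better at.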