Posted By: vangogh 182 days ago
Topic Type: News Story (Jump to http://www.searchenginepeople.com)
Category: SEO
Ruud's How Search Really Works series.
2 Comments
Comments
"New stuff you find on the web goes into another, more temporary index."
Not sure where you got this, Ruud. Freshness is a factor that helps pages remain in the main index, not the other way around. Also, many pages, from my experience, start off in the main index, before turning supplemental.
"In such an index you can’t randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update."
An interesting argument, but the conclusion at the very least assumes Google avoids re-sorting on every update.
"An interesting argument, but the conclusion at the very least assumes Google avoids re-sorting on every update."
Yup. And for the model described it's not an assumption but a factual conclusion :)
In some things that Google does, or did, you can recognize the fingerprints of setups common to information retrieval.
Building a sorted positional inverted index takes huge amounts of memory. The common way to deal with that is to sort it in parts and merge those parts back together.
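To make the "sort in parts, merge back" idea concrete, here is a minimal Python sketch. It is a toy illustration of the general technique, not a description of Google's code; the function names, the whitespace tokenizer, and the `run_size` parameter are all assumptions made up for this example. A real system would write each sorted run to disk instead of keeping it in a list.

```python
import heapq

def build_runs(docs, run_size):
    """Cut the stream of (term, doc_id, position) postings into sorted runs.

    Only run_size postings are sorted at a time, which is what keeps
    memory use bounded. (Hypothetical helper; runs stay in memory here.)
    """
    runs, run = [], []
    for doc_id, text in docs:
        for pos, term in enumerate(text.lower().split()):
            run.append((term, doc_id, pos))
            if len(run) == run_size:
                runs.append(sorted(run))
                run = []
    if run:
        runs.append(sorted(run))
    return runs

def merge_runs(runs):
    """Merge the sorted runs into one index: term -> {doc_id: [positions]}."""
    index = {}
    # heapq.merge walks all runs in parallel and yields postings in sorted order
    for term, doc_id, pos in heapq.merge(*runs):
        index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

docs = [(1, "the quick brown fox"), (2, "the lazy dog"), (3, "quick quick dog")]
index = merge_runs(build_runs(docs, run_size=4))
```

After the merge, terms, document ids and positions all come out in sorted order, even though no single sort ever touched more than `run_size` postings.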
The common problem with that setup is that once you have sorted the index, it becomes harder to insert fresh documents and dictionary terms into it.
And the common way to handle that situation is to work with two indexes, two dictionaries: one main one and one supplemental one with fresh, new stuff that hasn't yet been folded back into the main index.
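A rough sketch of that two-index arrangement, in Python. This only illustrates the textbook pattern; the class name `TwoTierIndex`, the `merge_threshold` parameter and the fold-after-N-documents policy are assumptions invented for the example, not anything Google has documented.

```python
import bisect

class TwoTierIndex:
    """Sorted main index plus a small unsorted index for fresh documents."""

    def __init__(self, merge_threshold=3):
        self.main_terms = []     # sorted dictionary of the main index
        self.main_postings = {}  # term -> sorted list of doc ids
        self.fresh = {}          # term -> set of doc ids, cheap to update
        self.fresh_docs = 0
        self.merge_threshold = merge_threshold  # hypothetical folding policy

    def add(self, doc_id, text):
        # New documents only touch the small fresh index: no re-sort needed.
        for term in set(text.lower().split()):
            self.fresh.setdefault(term, set()).add(doc_id)
        self.fresh_docs += 1
        if self.fresh_docs >= self.merge_threshold:
            self.fold()

    def fold(self):
        # Fold the fresh index back into the main one, paying the sort cost once.
        for term, ids in self.fresh.items():
            merged = set(self.main_postings.get(term, [])) | ids
            self.main_postings[term] = sorted(merged)
        self.main_terms = sorted(self.main_postings)
        self.fresh.clear()
        self.fresh_docs = 0

    def search(self, term):
        # A query has to consult both indexes and combine the answers.
        term = term.lower()
        hits = set()
        i = bisect.bisect_left(self.main_terms, term)
        if i < len(self.main_terms) and self.main_terms[i] == term:
            hits.update(self.main_postings[term])
        hits.update(self.fresh.get(term, ()))
        return sorted(hits)
```

The less often `fold()` has to wait, the smaller the fresh index stays. In this toy model, that is the part a search engine would have to get better at.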
These are all basic facts and this is how Google has worked for most of its life. What they got much, much better at was folding updates back into the main index up to the point where they didn't need a fresh crawl anymore.
This is just the basic technical setup; any search engine engineer can tweak or improve the way this works. For example, I can take less popular (= less useful) information out of the main index and put it into a supplemental one. It becomes a form of lossy compression: take the top 1000 postings for a query and discard the ones below that, keeping them ready to search only when the top 1000 don't answer the query.
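As a toy illustration of that variation, the Python sketch below keeps only the top `k` postings per term in the main tier and consults the supplemental tier only when the main tier comes up short. The function names, the per-term cutoff, and the (score, doc_id) representation are assumptions for this example; the score could stand in for popularity, freshness, PageRank or anything else.

```python
def split_tiers(postings, k):
    """Split term -> [(score, doc_id), ...] into a main and a supplemental tier.

    The k best-scoring postings per term stay in the main tier; the rest
    move to the supplemental tier instead of being thrown away.
    """
    main, supplemental = {}, {}
    for term, plist in postings.items():
        ranked = sorted(plist, reverse=True)  # highest score first
        main[term] = ranked[:k]
        if ranked[k:]:
            supplemental[term] = ranked[k:]
    return main, supplemental

def search(term, main, supplemental, wanted):
    """Answer from the main tier; fall back only if it has too few results."""
    results = list(main.get(term, []))
    if len(results) < wanted:
        results += supplemental.get(term, [])
    return [doc_id for _, doc_id in results[:wanted]]

postings = {"seo": [(0.9, "a"), (0.2, "b"), (0.7, "c"), (0.1, "d")]}
main, supplemental = split_tiers(postings, k=2)
```

With `k=2`, a request for two results never touches the supplemental tier, while a request for three has to dip into it.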
In that version of the model, freshness, PageRank, or indeed any other factor can be used to determine whether it's worth the resources to keep working with a specific document.
"Freshness is a factor that helps pages remain in the main index, not the other way around. Also, many pages, from my experience, start off in the main index, before turning supplemental."
I think you're referring to supplemental results, not a (or the) supplemental index.
Besides, although the basic framework these days is still pretty much as described in my post, something very essential has changed in a big (daddy) way: the way Google stores the index.
The way they (now) generate and store the index and the dictionary has made main and supplemental as obsolete as ignoring stop words.
Of course more about that in next week's post :)