Indexing

Once the engine has finished processing the document, it must tokenize the content and add the tokens to a database for fast retrieval later (when a user conducts a search).

Tokenization is a process in which words are reduced to their root token (a step often called stemming): for instance, “location”, “locating”, and “located” are all reduced to the token “locat”. Tokenizing words to their core meaning reduces the resources needed to store and retrieve pages, because without it a separate record would have to be kept for every variant of a word. On some occasions, words may be tokenized to a completely different word where the two are perfect synonyms. For instance, “diapers” and “nappies” may be considered duplicates, so they are merged into a single “diap” token.
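The reduction described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how any particular engine works: the suffix list, the synonym table, and the `to_token` helper are all assumptions invented for this example, chosen only so that the words mentioned in the text come out as described.

```python
# Hypothetical suffix list and synonym table, chosen only to reproduce
# the examples in the text; real stemmers use far richer rules.
SUFFIXES = ("ing", "ion", "ers", "ed", "es", "s")
SYNONYMS = {"nappies": "diapers"}


def to_token(word: str) -> str:
    """Reduce a word to a crude root token."""
    word = word.lower()
    # Map perfect synonyms onto one canonical word first.
    word = SYNONYMS.get(word, word)
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("location", "locating", "located", "diapers", "nappies"):
    print(w, "->", to_token(w))
```

Because every variant collapses to one token, the index needs only one row for “locat” rather than three.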

Stopwords: During the tokenization process, “stopwords” are discarded. These are extremely common words such as “the”, “and”, and “when”, which carry no relevancy clues.
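Stopword removal can be pictured as a simple filter. The sketch below assumes a three-word stopword set taken from the examples in the text; the `remove_stopwords` name and the sample input are illustrative only.

```python
# A tiny illustrative stopword set; real lists are much longer.
STOPWORDS = {"the", "and", "when"}


def remove_stopwords(words: list[str]) -> list[str]:
    """Drop stopwords, comparing case-insensitively."""
    return [w for w in words if w.lower() not in STOPWORDS]


print(remove_stopwords(["The", "purple", "potato", "and", "London"]))
```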

While each specific engine has its own indices and ways of storing information, traditional compression and data storage conventions suggest that there are likely to be three main types of index: reverse (often called an inverted index), forward and full-text.

Reverse Index

The content is tokenized and the document’s ID is added to a database in which each token has its own row, listing every document that contains it. Multiple reverse indices may exist to facilitate the storage of other data, for instance core concepts, traditional quality scores or page intent types.

Document Index

Token     Documents
potato    2332, 346, 54577, 32344, 377, 956…
purple    54577, 344, 267, 95, 1222, 55…
london    372, 467, 2333, 8864, 54577…

Page Intent Index

Intent           Documents
Purchase         433536, 3434, 2356, 3565, 54577, 532, 444…
Informational    123, 3333, 56378, 26954, 3434…

In this way, it is easy for the search engine to later retrieve all documents that may fit a user’s search query (e.g. potatoes in London, with purchase intent): it only needs to find the document IDs that appear in every matching row.
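A minimal sketch of this lookup, assuming each row is held as a set of document IDs, is shown below. The function names are hypothetical, and the sample data uses only the IDs visible in the tables above (the elided ones are left out), so the result says nothing about a real index.

```python
from collections import defaultdict


def build_reverse_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each token to the set of document IDs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def lookup(index: dict[str, set[int]], keys: list[str]) -> set[int]:
    """Return the IDs present in every requested row."""
    rows = [index.get(key, set()) for key in keys]
    return set.intersection(*rows) if rows else set()


# Only the IDs visible in the tables above; the elided ones are omitted.
token_index = {
    "potato": {2332, 346, 54577, 32344, 377, 956},
    "purple": {54577, 344, 267, 95, 1222, 55},
    "london": {372, 467, 2333, 8864, 54577},
}
intent_index = {
    "purchase": {433536, 3434, 2356, 3565, 54577, 532, 444},
    "informational": {123, 3333, 56378, 26954, 3434},
}

# "potatoes in London, with purchase intent"
candidates = lookup(token_index, ["potato", "london"]) & lookup(intent_index, ["purchase"])
print(candidates)
```

With this sample data, document 54577 is the only ID common to all three rows, so it is the only candidate passed on for ranking.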

Forward Index

Another common database is the forward index, which is, unsurprisingly, the opposite of a reverse index: the words and concepts found within a document are stored on a single row. The forward index is useful for quickly re-analysing documents at the point of ranking.

Document (ID)    Tokens
54577            purple, potato, london, tottenham, court, …
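As a rough illustration of re-analysing a document at ranking time, the sketch below stores one row per document and counts how many query tokens it contains. The `matched_terms` helper and the scoring idea are assumptions made for this example, and the row holds only the tokens visible in the example above.

```python
# One row per document; only the tokens visible in the example row are used.
forward_index: dict[int, list[str]] = {
    54577: ["purple", "potato", "london", "tottenham", "court"],
}


def matched_terms(doc_id: int, query_tokens: list[str]) -> int:
    """Count how many query tokens appear in the document's row."""
    doc_tokens = set(forward_index.get(doc_id, []))
    return sum(1 for token in query_tokens if token in doc_tokens)


print(matched_terms(54577, ["potato", "london", "red"]))
```

The reverse index answers "which documents contain this token?", while the forward index answers "which tokens does this document contain?" in a single row read.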

Cache/Full Text Index

While the forward and reverse indices store only tokens and concepts, the cache index stores a highly compressed version of the full text of the document (including all HTML markup and metadata). This is saved and may be used later for advanced change analysis, re-indexing, or serving to a user (such as when a page cache is requested). While regular indices have many copies (to reduce access time), a large search engine may store a full-text cache in only one or two locations worldwide, as it is relatively low priority.

Because the computing expense of decompressing and parsing full-text caches is so high, they are not typically used in real-time ranking algorithms.
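The trade-off can be sketched with Python's standard `zlib` module. This is only a stand-in: the text does not say which compression scheme any engine uses, and the page source below is invented, with repetition added so the size difference is visible.

```python
import zlib

# Hypothetical page source; the repetition just makes compression visible.
page = "<html><body>" + "<p>purple potato london</p>" * 50 + "</body></html>"

raw = page.encode("utf-8")
cached = zlib.compress(raw, 9)  # level 9 favours smaller output over speed

# Reading the cache back means decompressing the whole document first,
# and the HTML would still need parsing before it could be analysed.
restored = zlib.decompress(cached).decode("utf-8")

print(len(raw), "bytes raw,", len(cached), "bytes cached")
```

The stored copy is much smaller, but every read pays for decompression and parsing, which is why token-based indices are the better fit for real-time ranking.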