Once a document has been crawled, it is saved into a temporary database to be later processed by the indexer. Depending on the search engine, indexing involves applying multiple algorithms to determine the topic of the page, its quality, and how useful a user is likely to find it. In general, the processes involved are:
Extracting actual unique content – this algorithm attempts to separate the website’s boilerplate template and navigational elements from the page’s actual unique content.
Visual rendering – this algorithm renders the page in the same way a regular browser would in order to determine where individual elements are placed on the page, which elements are more prominent than others and where the actual content sits.
Quality analysis – this algorithm analyses the content of the page to determine its quality and how a user might react to it. While these algorithms are good at determining the true value of the content (for instance, analysing the text content for reading age, grammar or spelling mistakes), search engines have traditionally used alternative measures to determine quality. For instance, if pages with prominent headings, good paragraph spacing, and images throughout have historically resulted in a good user experience, a page with a similar layout may be given a higher quality score than a thousand-word page with no paragraph spacing or other design elements.
Semantic processing – this algorithm attempts to determine what the page is actually about. It extracts the core concepts of the content and connects them with semantically related concepts – for instance, a block of content about bathroom sinks will probably be closely associated with bathroom basins.
User intent – this algorithm attempts to assign an ideal intent to the page. For instance, a product page for a DVD player that features specifications, reviews and a prominent ‘add to cart’ button will be assigned a ‘purchase’ intent. However, an article that talks about how DVD players work, features many images and contains references to other similar documents will be assigned an ‘informational’ intent. This intent classification is then used during the information retrieval and ranking process.
Spam filters – search engines also contain filters which have been built up to combat content spam. These look for keyword stuffing (unnaturally repeating a keyword in the hope that it will make the page rank better), hidden content (text that search engines can see but users cannot) and footprints that may indicate the website has been hacked. This algorithm will generally assign a ‘content spam’ rating to the document that can be used during the ranking process.
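To make the unique-content extraction step more concrete, here is a minimal Python sketch of one possible heuristic: treat any text block that appears on most pages of a site as template boilerplate. This is an illustration only, not how any particular search engine works. The function name `unique_blocks`, the `max_share` cut-off and the sample pages are all assumptions made for the example.

```python
from collections import Counter

def unique_blocks(pages, max_share=0.5):
    """Keep, for each page, the text blocks found on at most max_share of all pages.

    `pages` maps a URL to the list of text blocks extracted from that page.
    The 0.5 cut-off is an arbitrary illustrative value, not a known standard.
    """
    # Count how many pages contain each block; set() makes a block that is
    # repeated within one page count only once for that page.
    page_counts = Counter(
        block for blocks in pages.values() for block in set(blocks)
    )
    limit = max_share * len(pages)
    return {
        url: [block for block in blocks if page_counts[block] <= limit]
        for url, blocks in pages.items()
    }

# Hypothetical site: the navigation bar and footer repeat on every page.
pages = {
    "/sinks": ["Home | Products | Contact", "Our ceramic sinks come in three sizes.", "(c) Example Ltd"],
    "/taps": ["Home | Products | Contact", "Brass taps resist corrosion.", "(c) Example Ltd"],
    "/baths": ["Home | Products | Contact", "Steel baths retain heat well.", "(c) Example Ltd"],
}
result = unique_blocks(pages)
```

Here the navigation bar and footer appear on all three pages and are dropped, leaving one content block per page. A real indexer would work on the rendered DOM rather than exact string matches, so this only shows the underlying idea.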
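The keyword stuffing check can be illustrated in a similar way. The sketch below measures what share of a page's words is taken up by its single most frequent term and flags the page when that share passes a threshold. The stopword list, the 0.25 threshold and the function names `top_term_share` and `looks_stuffed` are assumptions chosen for the example. Real spam filters are not public and are likely to combine many more signals.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
             "is", "are", "our", "we", "with", "on"}

def top_term_share(text):
    """Return (most frequent non-stopword, its share of all non-stopwords)."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOPWORDS]
    if not words:
        return None, 0.0
    term, count = Counter(words).most_common(1)[0]
    return term, count / len(words)

def looks_stuffed(text, threshold=0.25):
    """Flag text whose top term exceeds the (assumed) share threshold."""
    _, share = top_term_share(text)
    return share > threshold

stuffed = ("Cheap sinks. Buy cheap sinks online. Best cheap sinks, "
           "cheap sinks deals, cheap sinks sale.")
natural = ("Our ceramic sinks come in three sizes and fit most bathroom "
           "worktops. Installation takes about an hour with standard tools.")

stuffed_flag = looks_stuffed(stuffed)
natural_flag = looks_stuffed(natural)
```

With these sample strings, the stuffed text is flagged and the natural text is not. A score like this would feed into the ‘content spam’ rating rather than decide it alone, since short pages or legitimately focused pages can also show a high share for one term.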