Crawling

Search crawlers, or ‘spiders’, move across the Internet by downloading a document (for practical purposes, each unique URL can be treated as a document), extracting any citations or links to other documents, and deciding which of those links to add to the crawl queue. The spider is the search engine’s only point of contact with the rest of the Internet; once a document’s source code has been retrieved, it is saved in a temporary database and later processed by the indexer.
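
To make this concrete, here is a minimal sketch of that fetch–extract–enqueue loop in Python. The fetch_page() stub, the seed URL and the should_crawl() policy are illustrative placeholders, not any engine’s actual implementation.

    from collections import deque
    from urllib.parse import urljoin
    import re

    HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

    def fetch_page(url):
        """Placeholder for an HTTP fetch; a real spider would use an HTTP client here."""
        return "<html><a href='/about.html'>About</a></html>"

    def should_crawl(url, seen):
        """Hypothetical policy: only enqueue URLs we have not already seen."""
        return url not in seen

    def crawl(seed):
        queue = deque([seed])   # the crawl queue
        seen = {seed}
        store = {}              # temporary store handed off to the indexer later
        while queue:
            url = queue.popleft()
            html = fetch_page(url)                 # 1. download the document
            store[url] = html                      # 2. save its source code
            for link in HREF_RE.findall(html):     # 3. extract links/citations
                absolute = urljoin(url, link)
                if should_crawl(absolute, seen):   # 4. decide what to enqueue
                    seen.add(absolute)
                    queue.append(absolute)
        return store

    print(list(crawl("http://www.domain.com/")))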

URL citations vs links

Both regular links and plain-text URL citations (e.g. the bare text http://www.domain.com/page.html) are treated as hyperlinks and will pass link equity.
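
In practice this means the href targets of anchor tags and bare URLs found in the page text can feed the same link graph. The regular expressions and the sample snippet below are simplified assumptions, not a description of any particular crawler’s parser.

    import re

    ANCHOR_RE = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    PLAIN_URL_RE = re.compile(r'https?://[^\s<>"\']+')

    def extract_link_targets(html):
        anchors = ANCHOR_RE.findall(html)
        # Plain-text citations that are not already anchor targets
        citations = [u for u in PLAIN_URL_RE.findall(html) if u not in anchors]
        # Both kinds are returned together because both pass link equity
        return anchors + citations

    sample = ('<p>See <a href="http://www.domain.com/page.html">this page</a> '
              'or visit http://www.domain.com/other.html directly.</p>')
    print(extract_link_targets(sample))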


Crawling is a relatively cheap and easy process, and to make it even more efficient search engines typically enforce rules so that they do not crawl pages unnecessarily:

The first rule is that they will only crawl a URL if it passes a ‘link equity’ threshold. Link equity is an arbitrary trust value that passes between documents through hyperlinks. A document accrues link equity from the other documents linking to it – the more high-quality links a document has, the higher its link equity score will be. If a page has a low link equity score, it is assumed to be too unpopular to appear in search results, and so does not need to be crawled.
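
As a toy illustration of such a threshold, the snippet below sums a hypothetical equity score from a page’s inbound links and compares it against an assumed cutoff; the scores, the cutoff value and the simple summing model are inventions for demonstration only.

    # Hypothetical equity contributed to each URL by the pages linking to it
    inbound_equity = {
        "http://www.domain.com/popular.html": [0.4, 0.3, 0.2],  # several strong links
        "http://www.domain.com/obscure.html": [0.01],           # one weak link
    }

    CRAWL_THRESHOLD = 0.1  # assumed cutoff below which a URL is not worth crawling

    def worth_crawling(url):
        score = sum(inbound_equity.get(url, []))  # equity accrues from inbound links
        return score >= CRAWL_THRESHOLD

    for url in inbound_equity:
        print(url, "->", "crawl" if worth_crawling(url) else "skip")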

The second rule crawlers follow when deciding whether or not to crawl a page concerns how often the document has historically changed: even the most popular page in the world will not be crawled often if it changes only once every few years, whereas less popular pages may be crawled more often if their content changes frequently.
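
The sketch below illustrates one way such a schedule could work, shortening the revisit interval for documents that changed since the last visit and lengthening it for those that did not. The halving-and-doubling heuristic and the bounds are assumptions, not any search engine’s actual policy.

    from datetime import timedelta

    def next_crawl_interval(previous_interval, content_changed):
        """Shorten the interval if the page changed since the last visit, otherwise lengthen it."""
        if content_changed:
            return max(previous_interval / 2, timedelta(hours=1))
        return min(previous_interval * 2, timedelta(days=365))

    # A frequently changing page converges on short revisit intervals...
    interval = timedelta(days=30)
    for changed in [True, True, True]:
        interval = next_crawl_interval(interval, changed)
    print("frequently changing page:", interval)

    # ...while a static page, however popular, drifts towards long ones.
    interval = timedelta(days=30)
    for changed in [False, False, False]:
        interval = next_crawl_interval(interval, changed)
    print("static page:", interval)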