The robots.txt file is a directive file that tells search crawlers which parts of a site they can and cannot access. It should always live at the root of the domain (or subdomain) and be publicly accessible from the web. Search crawlers will normally try to download your robots.txt file before requesting any other page on the site; if the file returns a server error, crawlers such as Googlebot will typically delay their crawl of the site until the file can be fetched successfully (a 404, by contrast, is generally treated as permission to crawl everything).
Robots.txt URL: http://www.example.com/robots.txt.
The robots.txt file is split into sections based on the robot’s user-agent. Only the single most specific matching section applies to a given bot – if a `googlebot`-specific section exists, Googlebot will follow it and ignore the `User-agent: *` section entirely.
| Directive | Description |
|---|---|
| `User-agent` | The user-agent substring that the next set of directives applies to, e.g. `googlebot`, `googlebot-images`. (`*` means all robots.) |
| `Disallow` / `Allow` | Disallow (or allow) the bot from accessing specific files or directories. |
| `NoIndex` (unofficial) | Supported unofficially by Google; allows URL patterns to be deindexed via robots.txt. |
| `Sitemap` | A reference to the XML sitemap. |
| `*` | The wildcard character, which can be used within the URI, e.g. `/search/*?text=`. |
```
User-agent: *
Disallow: /admin
Disallow: /files/secret-plans.pdf

User-agent: googlebot
Disallow: /admin
Disallow: /files/secret-plans.pdf
Disallow: /dont-show-to-google
Disallow: /*?search=

User-agent: googlebot-images
Disallow: /admin

Sitemap: http://www.example.com/sitemap.xml
```
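You can sanity-check section precedence with Python's standard-library robots.txt parser. This is a minimal sketch using a trimmed copy of the example above: the stdlib parser implements plain path-prefix matching only, so the `*` wildcard rule (and `NoIndex`, `Sitemap`) from the full example are left out, and the bot name `somebot` is an arbitrary stand-in for any crawler without its own section.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the example robots.txt -- the stdlib parser does not
# implement Google's `*` path wildcards or the NoIndex extension.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /files/secret-plans.pdf

User-agent: googlebot
Disallow: /admin
Disallow: /files/secret-plans.pdf
Disallow: /dont-show-to-google
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own section, which blocks /dont-show-to-google...
print(parser.can_fetch("googlebot", "http://www.example.com/dont-show-to-google"))
# ...while a bot with no matching section falls back to the `*` rules,
# which do not block that path.
print(parser.can_fetch("somebot", "http://www.example.com/dont-show-to-google"))
# Both sections block /admin.
print(parser.can_fetch("somebot", "http://www.example.com/admin"))
```

Note how the `googlebot` section repeats the `/admin` and `/files/secret-plans.pdf` rules: because the most specific section replaces `User-agent: *` rather than adding to it, any shared rules must be duplicated.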
Uses of robots.txt
Control crawling (Disallow:)
Robots.txt is most commonly used to prevent search engine crawlers from attempting to access certain parts of a website.
- Preventing access to dynamically generated pages (such as search result pages) and other low-value or duplicated content
- Preventing access to development and staging environments
- Stopping search crawlers from accessing crawler traps (such as a never-ending calendar)
- Preventing crawling of potentially insecure files (such as the WordPress plugins folder)
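Taken together, these uses might look like the following sketch. The paths (`/search/`, `/calendar/`, `/wp-content/plugins/`) are illustrative assumptions, not recommendations for every site:

```
User-agent: *
# Dynamically generated, low-value search result pages
Disallow: /search/
# Crawler trap: never-ending calendar pagination
Disallow: /calendar/
# Potentially insecure WordPress plugin files
Disallow: /wp-content/plugins/
```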
Control indexing (NoIndex:)
While very little has been written about it, our empirical studies show that Google unofficially supports the “NoIndex” robots.txt directive. This is distinct from a page-level noindex (which lives in the HTML head or an HTTP header) and lets SEOs encourage the removal of URLs from the index en masse, without forcing a crawler to visit every page individually. Because the directive is unofficial and undocumented, it may stop working without notice; page-level noindex remains the reliable method.
```
User-agent: *
NoIndex: /path/*?*sortby=
```