Help:Robots.txt

What is robots.txt?

Robots.txt is a plain-text file that tells search engine web crawlers whether they may visit a website's pages and include them in the search engine's index.

How can I modify it?

You can customize robots.txt for your wiki by editing the page MediaWiki:Robots.txt on your wiki. Whatever you put there is appended to our global robots.txt. MediaWiki will never allow indexing of special pages or api.php, regardless of the rules you add.

What can I put in it?

Robots.txt supports many indexing-related keywords. These include:

User-agent: [Required, one or more per group] Specifies the name of the automated client (the search engine crawler) that the rule applies to. This is the first line of any rule group. Google user agent names are listed in the Google list of user agents. An asterisk (*) matches all crawlers except the various AdsBot crawlers, which must be named explicitly.^[1]
Disallow: [At least one Disallow or Allow entry per rule] A directory or page, relative to the root domain, that you don't want the user agent to crawl. If the rule refers to a page, it should be the full page name as shown in the browser; if it refers to a directory, it should end in a / mark.^[1]
Allow: [At least one Disallow or Allow entry per rule] A directory or page, relative to the root domain, that the user agent just mentioned may crawl. This is used to override a Disallow directive so that a subdirectory or page inside a disallowed directory can still be crawled. For a single page, specify the full page name as shown in the browser. For a directory, the rule should end in a / mark.^[1]
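
To see how an Allow entry can carve an exception out of a Disallow entry, here is a minimal sketch that uses Python's standard urllib.robotparser module. The rule group, the crawler name ExampleBot, the host example.org, and the helper may_crawl are all made up for illustration. This parser is not what search engines run, so treat the result as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule group: block /private/ but keep one page inside it crawlable.
# The Allow line is listed first because urllib.robotparser has traditionally
# applied the first matching rule, whereas some crawlers pick the most
# specific (longest) matching rule instead. This ordering works for both.
RULES = """\
User-agent: *
Allow: /private/public-page
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_crawl(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the hypothetical crawler `agent` may fetch `path`."""
    return parser.can_fetch(agent, "https://example.org" + path)


print(may_crawl("/private/public-page"))  # the Allow exception
print(may_crawl("/private/secret"))       # covered by Disallow: /private/
print(may_crawl("/wiki/Main_Page"))       # not mentioned by any rule
```

Under these assumptions the first and third paths are reported as crawlable and the second is not.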

What is the format I should use?

Common practice is to put each rule on its own line. Some search engines also recognize pattern-matching characters in robots.txt. For example, Disallow: /*example$ matches any URL path that ends in example. The * is a wildcard that matches any sequence of characters in the URL, and $ indicates that the URL must end at that point.
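
The * and $ behaviour described above can be approximated with a regular expression. The following is a rough sketch, not any search engine's actual matcher; the helper names robots_pattern_to_regex and path_matches are hypothetical, and real crawlers may treat edge cases (such as percent-encoding) differently.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex (illustrative)."""
    # A trailing $ means the URL path must end exactly where the pattern ends.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * matches any run of characters; everything else is taken literally.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def path_matches(pattern: str, path: str) -> bool:
    """Return True if `path` is covered by `pattern`, matching from the start."""
    return robots_pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*example$", "/wiki/my-example"))    # ends in "example"
print(path_matches("/*example$", "/wiki/example/page"))  # does not end in "example"
print(path_matches("/*example", "/wiki/example/page"))   # no $, so a prefix match is enough
```

Without the $, a rule behaves as a prefix match, which is why the last call succeeds while the second one does not.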

What are some examples?

This example disallows crawling of the URL string given in Disallow for the user agent named in User-agent:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

This example disallows crawling of the /example/ directory for all supported user agents:

User-agent: *
Disallow: /example/

This example disallows crawling of all content for all supported user agents:

User-agent: *
Disallow: /
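
Before saving rules like these, it can help to check which paths they would block. The sketch below feeds the last two examples to Python's standard urllib.robotparser module; the helper crawlable, the crawler name ExampleBot, and the host example.org are assumptions made for illustration, and real crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from the examples above.
BLOCK_DIRECTORY = ["User-agent: *", "Disallow: /example/"]
BLOCK_EVERYTHING = ["User-agent: *", "Disallow: /"]


def crawlable(rules: list[str], path: str) -> bool:
    """Return True if a hypothetical crawler may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch("ExampleBot", "https://example.org" + path)


for path in ("/example/page", "/example", "/wiki/Main_Page"):
    print(path, crawlable(BLOCK_DIRECTORY, path), crawlable(BLOCK_EVERYTHING, path))
```

Note that /example (without the trailing slash) is not covered by Disallow: /example/, which is why directory rules should end in a / mark.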

Where can I find it?

Robots.txt can always be found at subdomain.miraheze.org/robots.txt or mycustomdomain.tld/robots.txt. If your wiki was recently switched to a custom domain, it may take a few days before robots.txt is available from the new domain.

References

[1] Create a robots.txt file – Google Developers