Manual:Robots.txt

The robots.txt file tells web robots how to index your site. The robots.txt file is placed in the root of your site, so that it is served at /robots.txt (this is the folder where your MediaWiki is, if the wiki is installed in the document root).

The robots.txt file is part of the Robots Exclusion Standard.

Exclude all robots from the server
To exclude all robots from your server, add the following to the robots.txt file and upload it to the root of your site:

User-agent: *
Disallow: /

Using URL rewriting
If you use a setup like Wikipedia's, where plain pages are at /wiki/Some_title and everything else is at /w/index.php?title=Some_title&someoption=blah, restricting bots is easy:

User-agent: *
Disallow: /w/

Be careful, though! If you put this line by accident:

Disallow: /w

you'll block access to the /wiki directory as well, because Disallow rules match by prefix and /wiki starts with /w. Search engines will then drop your wiki!
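The difference between the two rules can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper name allowed, the agent name ExampleBot and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The intended rule set and the accidental one (missing trailing slash).
CORRECT = "User-agent: *\nDisallow: /w/\n"
MISTAKE = "User-agent: *\nDisallow: /w\n"

# Hypothetical example URLs in the Wikipedia-style layout.
ARTICLE = "/wiki/Some_title"
SCRIPT = "/w/index.php?title=Some_title&action=edit"

if __name__ == "__main__":
    for name, rules in (("correct", CORRECT), ("mistake", MISTAKE)):
        print(name, "article:", allowed(rules, ARTICLE), "script:", allowed(rules, SCRIPT))
```

With the trailing slash, the article path stays fetchable and only the script path is refused; without it, both are refused.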

Long page names (URLs)
If you use a robots.txt file, it is wise to deny robots access to the script directory. This includes the diffs, old revisions, contribution lists, etc. If bots are allowed to access these URLs, they can severely raise the load on your server as they crawl and index them.

If your MediaWiki web addresses have not been shortened as described in Manual:Short URL, restricting robots requires more lines in the robots.txt file.

Here is an aggressive example of keeping robots out of non-core pages:

User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
Disallow: /index.php?title=Help
Disallow: /index.php?title=Image
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins

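The lines without a colon at the end also restrict those namespaces' talk pages: Disallow rules match by prefix, so /index.php?title=Help covers both Help: and Help_talk: titles.

In addition, non-English wikis may need to add various translations of the above lines.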

We also add Disallow: /*& because some robots, such as Googlebot, accept this wildcard extension to the robots.txt standard. The rule matches any URL that contains an ampersand, which stops most of what we don't want robots sifting through, just like the /w/ solution above.
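To see what these rules cover, here is a minimal sketch of how a wildcard-aware robot might match a Disallow pattern against a URL. It assumes the commonly described semantics ("*" matches any run of characters, a trailing "$" anchors the end, anything else is a prefix match); the function name rule_matches is invented for this example and real crawlers may differ in details.

```python
import re


def rule_matches(pattern, url):
    """Return True if a Disallow pattern covers the given path-plus-query URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an un-anchored pattern is a prefix match.
    return re.match(regex, url) is not None


if __name__ == "__main__":
    print(rule_matches("/*&", "/index.php?title=Foo&action=edit"))
    print(rule_matches("/*&", "/index.php?title=Foo"))
    print(rule_matches("/index.php?title=Help", "/index.php?title=Help_talk:Contents"))
```

The last call illustrates plain prefix matching: a pattern without a trailing colon also covers the matching talk namespace.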

Problems
Unfortunately, there are three big problems with robots.txt:

Rate control
You can only specify which paths a bot is allowed to spider, not how quickly it may request them. Even allowing just the plain page area can be a huge burden when two or three pages per second are being requested by one spider over two hundred thousand pages.

Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
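As an illustration, the sketch below embeds a robots.txt with a Crawl-delay line and reads it back with Python's standard urllib.robotparser (its crawl_delay method needs Python 3.6 or later). The user-agent token Slurp, the 10 second value and the name ExampleBot are assumptions chosen for the example, not recommendations.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: one crawler gets an explicit delay, everyone else does not.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /w/

User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay in seconds for the named crawler, or None when no delay is declared.
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("ExampleBot")

if __name__ == "__main__":
    print("Slurp:", slurp_delay)
    print("ExampleBot:", other_delay)
```

Robots that do not implement this extension simply ignore the line, so it cannot be relied on as the only form of rate control.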

Bots that don't behave well by default could be forced into line with some sort of request throttling.
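One possible shape for such throttling is a sliding-window counter per client, sketched below. The class name RequestThrottle, its parameters and the choice of keying on a client string (for example an IP address) are all assumptions for illustration; this is not an existing MediaWiki or web server feature.

```python
import time
from collections import defaultdict, deque


class RequestThrottle:
    """Allow at most `max_requests` per client within any `per_seconds` window."""

    def __init__(self, max_requests, per_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.hits = defaultdict(deque)  # client -> timestamps of recent allowed requests

    def allow(self, client):
        now = self.clock()
        recent = self.hits[client]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.per_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # the caller would answer with an error such as HTTP 429
        recent.append(now)
        return True
```

A caller would create one instance, for example RequestThrottle(2, 1.0) for two requests per second, and refuse any request for which allow(client) returns False.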

Don't index vs don't spider

 * Note: it seems this section may be outdated. At any rate,  match no documents, while  match loads.

Most search engine spiders will treat a match on a robots.txt 'Disallow' entry as meaning that they should not return that URL in search results. Google is a rare exception, which is technically within the specification but very annoying: it will index such URLs and may return them in search results, albeit without being able to show the content or title of the page or anything other than the URL.

This means that sometimes "edit" URLs will turn up in Google results, which is very VERY annoying.

The only way to keep a URL out of Google's index is to let Google crawl the page and see a robots meta tag with the value "noindex", that is, <meta name="robots" content="noindex">. Although this meta tag is already present on the edit page HTML template, Google does not spider the edit pages (because they are forbidden by robots.txt) and therefore does not see the meta tag.
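To make the mechanism concrete, here is a small sketch of how a crawler could detect the noindex directive once it has actually fetched a page. It uses Python's standard html.parser; the names RobotsMetaParser and has_noindex are invented for this example, and real crawlers may interpret the tag more liberally.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex,nofollow".
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

The check can only run on pages the crawler was allowed to download, which is exactly why a robots.txt Disallow prevents the tag from being seen.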

With our current system, this would be difficult to special case. It would be technically possible to exclude the edit pages from the disallow line in robots.txt, but this would require reworking some functions.

Evil bots
Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

Consider also request throttling.

Probably the best option in the case of 'bad bots' is to write a spider trap. The idea is that you deny a yourdomain.tld/trap/ directory to robots in robots.txt, then write a small script that logs any IP that tries to access the /trap/ directory and adds that IP to a block list enforced by the server (for example in the .htaccess file of the parent folder; robots.txt itself cannot ban an IP). Thus, any robot ignoring robots.txt is IP banned permanently!

A somewhat outdated description of a spider trap is available here.
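The logging half of such a trap might look like the sketch below. It assumes the script behind /trap/ receives the client address as a string and keeps a plain text block list, one address per line; the function name record_offender and the file layout are hypothetical, and turning the list into actual server deny rules is left out.

```python
import ipaddress
from pathlib import Path


def record_offender(ip_text, blocklist_path):
    """Append a client IP to the block list; return True if it was newly added."""
    # Validate and normalise the address; raises ValueError on malformed input,
    # which keeps arbitrary text out of the block list.
    ip = str(ipaddress.ip_address(ip_text))
    path = Path(blocklist_path)
    known = set(path.read_text().split()) if path.exists() else set()
    if ip in known:
        return False
    with path.open("a") as handle:
        handle.write(ip + "\n")
    return True
```

Because the ban is permanent, it is worth double-checking that well-behaved crawlers really are told to stay out of /trap/ in robots.txt before enabling anything like this.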

Blocking via .htaccess
If the robot does not obey robots.txt, we can still enforce, for example, the above Disallow: /*& line via Apache's .htaccess file:

RewriteEngine on
RewriteCond %{QUERY_STRING} &
RewriteCond %{HTTP_USER_AGENT} http://.*\.com [OR]
RewriteCond %{REMOTE_ADDR} ^124\.115\.0
RewriteRule . - [F]

This blocks access to any page with a "&" in the URL, both for user agents that look like robots and for the specific nasty IP address range matched above.

We are guessing here that all HTTP_USER_AGENTs with a URL embedded are robots (although the converse is not true: not every robot embeds a URL).

One could also guess that all HTTP_USER_AGENTs with an email address are robots, and match on "@" too.
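The same decision logic can be written out in Python to make the conditions easier to follow. This is only a restatement of the heuristic for illustration, not something Apache runs; the function name should_block and the match_email switch are made up for the example.

```python
import re

# Same patterns as the RewriteCond lines: a URL embedded in the user agent,
# and the one specific address range.
URL_IN_AGENT = re.compile(r"http://.*\.com")
BLOCKED_IP = re.compile(r"^124\.115\.0")


def should_block(query_string, user_agent, remote_addr, match_email=False):
    """Return True when the request would be refused under the heuristic."""
    # Only URLs with a "&" in the query string are affected at all.
    if "&" not in query_string:
        return False
    looks_like_robot = URL_IN_AGENT.search(user_agent) is not None
    # Optional extra guess: an email address in the user agent also means "robot".
    if match_email and "@" in user_agent:
        looks_like_robot = True
    return looks_like_robot or BLOCKED_IP.search(remote_addr) is not None
```

Both guesses can misfire in either direction, so a rule like this is best treated as a blunt instrument.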

Revenge via .htaccess
Some go further and actually take revenge via .htaccess. However, such schemes might end up becoming springboards.