Examples
Prevent all indexing
This code prevents all bots that obey robots.txt from indexing any page on your site:
User-agent: *
Disallow: /
If you only want to block a certain spider, replace the asterisk with the spider's user agent.
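As a quick sanity check, a rule set like this can be tested offline with Python's standard `urllib.robotparser` module. This is only a sketch: `ExampleBot` and `OtherBot` are made-up user agents and `is_allowed` is a hypothetical helper, not something MediaWiki provides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a single spider ("ExampleBot" is invented).
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def is_allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # The named spider is blocked everywhere; other spiders are unaffected.
    print(is_allowed("ExampleBot", "/wiki/Main_Page"))
    print(is_allowed("OtherBot", "/wiki/Main_Page"))
```

Replacing `ExampleBot` with `*` in the sample text would make the rule apply to every spider that has no more specific entry.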
Prevent indexing of non-article pages
MediaWiki generates many pages that are only useful for live humans: old revisions and diffs tend to duplicate content found in articles. Edit pages and most special pages are dynamically generated, which makes them useful only to human editors and relatively expensive to serve. If not directed otherwise, spiders may try to index thousands of similar pages, overloading the webserver.
With short URLs
It is easy to prevent spiders from indexing non-article pages if you are using Wikipedia-style short URLs. Assuming articles are accessible through /wiki/Some_title and everything else is available through /w/index.php?title=Some_title&someoption=blah:
User-agent: *
Disallow: /w/
Be careful, though! If you put this line (without the trailing slash) by accident:
Disallow: /w
you'll block access to the /wiki directory as well, because rules match by prefix, and search engines will drop your wiki!
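The trailing slash matters because robots.txt rules are matched as path prefixes. The sketch below, which assumes Python's standard `urllib.robotparser` and a hypothetical helper named `blocked`, compares a rule with and without the slash against the example paths used above.

```python
from urllib.robotparser import RobotFileParser


def blocked(disallow_rule, path):
    """Return True if a single 'Disallow' rule for all agents blocks `path`."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_rule}"])
    # "ExampleBot" is an invented agent name; it falls under the '*' entry.
    return not parser.can_fetch("ExampleBot", path)


if __name__ == "__main__":
    for rule in ("/w/", "/w"):
        for path in ("/w/index.php", "/wiki/Some_title"):
            print(rule, path, "blocked" if blocked(rule, path) else "allowed")
```

With the slash, only the script path is blocked. Without it, the prefix `/w` also matches `/wiki/Some_title`, so the articles themselves are blocked.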
Without short URLs
If you are not using short URLs, restricting robots is a bit harder. If you are running PHP as CGI and you have not beautified URLs, so that articles are accessible through /index.php?title=Some_title:
User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
Disallow: /index.php?title=Help
Disallow: /index.php?title=Image
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins/
If you are running PHP as an Apache module and you have not beautified URLs, so that articles are accessible through /index.php/Some_title:
User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/
The lines without the colons (:) at the end restrict those namespaces' talk pages.
Non-English wikis may need to add various translations of the above lines.
You can also try
because some robots like Googlebot accept this wildcard extension to the robots.txt standard, which stops most of what we don't want robots sifting through, just like the /w/ solution above.
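As a rough illustration of how such a wildcard rule behaves, here is a hand-written sketch of Googlebot-style matching, in which `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The pattern `/*&` is an assumed example (it matches any URL containing an ampersand), and `rule_matches` is a hypothetical helper, not part of any library.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern with '*' and '$' matches `path`."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes "any characters"; everything else is matched literally.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if anchored:
        regex += "$"
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for path in ("/index.php?title=Foo&action=edit", "/index.php?title=Foo"):
        print(path, rule_matches("/*&", path))
```

Under these assumptions, URLs that carry extra parameters after an ampersand are matched, while a plain article URL is not. Robots that do not implement the extension will treat the `*` literally, so the rule has no effect for them.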
Problems
Unfortunately, there are two big problems with robots.txt:
Rate control
You can only specify what paths a bot is allowed to spider. Even allowing just the plain page area can be a huge burden when two or three pages per second are being requested by one spider over two hundred thousand pages.
Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
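To show what such a line might look like, the sketch below embeds a sample robots.txt with a `Crawl-delay` entry and reads it back with Python's standard `urllib.robotparser` (its `crawl_delay` method needs Python 3.6 or later). The agent name `ExampleBot`, the 10-second value, and the helper `crawl_delay_for` are all assumptions for illustration; only bots that honour this non-standard directive will respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one spider gets a delay, everyone else only path rules.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10

User-agent: *
Disallow: /w/
"""


def crawl_delay_for(agent, robots_txt=ROBOTS_TXT):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


if __name__ == "__main__":
    print(crawl_delay_for("ExampleBot"))
    print(crawl_delay_for("OtherBot"))
```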
Evil bots
Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.
More generally, request throttling can stop such bots without requiring your repeated intervention.
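One common form of request throttling is a per-client sliding window: each client may make at most a fixed number of requests within a time span, and anything beyond that is refused. The following is a minimal in-memory sketch of that idea, not a description of any particular server module; the class name `RequestThrottle`, its parameters, and the sample IP address are all hypothetical.

```python
import time
from collections import defaultdict, deque


class RequestThrottle:
    """Allow at most `max_requests` per client within `per_seconds`."""

    def __init__(self, max_requests, per_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client -> timestamps of recent requests

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.per_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: refuse (e.g. answer with HTTP 429)
        hits.append(now)
        return True


if __name__ == "__main__":
    throttle = RequestThrottle(max_requests=2, per_seconds=1.0)
    # Three back-to-back requests from one address; the third exceeds the limit.
    print([throttle.allow("192.0.2.1") for _ in range(3)])
```

In practice this logic usually lives in the web server or a reverse proxy rather than in application code, and a real deployment would also need to expire idle clients from memory.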
An alternative or complementary strategy is to deploy a spider trap.