Manual:Robots.txt/zh

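robots.txt is part of the Robots Exclusion Standard. It describes how web spiders should index a site. The robots.txt file must be placed in the root directory of the website.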

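Prevent all indexing
To prevent all robots from indexing the pages of your site, disallow the root path for every user agent. If you only want to block certain spiders, replace the "*" with that spider's User Agent.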
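A minimal sketch of such a file, assuming it is served at /robots.txt in the site root:

```
# Applies to every robot; replace * with a specific User-agent to target one spider
User-agent: *
Disallow: /
```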

Prevent indexing of non-article pages
MediaWiki generates many pages that are only useful for live humans: old revisions and diffs tend to duplicate content found in articles. Edit pages and most special pages are dynamically generated, which makes them useful only to human editors and relatively expensive to serve. If not directed otherwise, spiders may try to index thousands of similar pages, overloading the webserver.

Using short URLs
It is easy to prevent spiders from indexing non-article pages if you are using Wikipedia-style short URLs. Assuming articles are accessible through /wiki/Some_title and everything else is available through /w/index.php?title=Some_title&someoption=blah, it is enough to disallow the /w/ path. Be careful, though! Rules are matched as path prefixes, so if you leave off the trailing slash by accident and disallow /w instead, you'll block access to the /wiki directory, and search engines will drop your wiki!
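As a sketch, assuming the /wiki/ and /w/ layout described above, the rule could look like this:

```
User-agent: *
# Keep the trailing slash: "Disallow: /w" is a prefix match and would also cover /wiki/
Disallow: /w/
```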

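Without short URLs
If you are not using short URLs, restricting robots is a bit harder, because article and non-article pages share the same path. The rules you need depend on how article URLs are formed.

If you are running PHP as CGI and you have not beautified URLs, articles are accessible through /index.php?title=Some_title. In that case you have to disallow the individual non-article URL patterns one by one.

If you are running PHP as an Apache module and you have not beautified URLs, articles are accessible through /index.php/Some_title. In that case you can disallow everything under /index.php? and then the non-article namespaces.

In either case, a rule that names a namespace without the colon at the end also restricts that namespace's talk pages.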
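The two cases might be sketched as follows. The namespace names and the /skins/ path are illustrative assumptions based on a default English-language installation; adjust them to your own wiki.

For PHP as CGI (articles at /index.php?title=Some_title):

```
User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
# No trailing colon, so Help_talk, Template_talk, etc. are matched too
Disallow: /index.php?title=Help
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins/
```

For PHP as an Apache module (articles at /index.php/Some_title):

```
User-agent: *
# Every query-string URL is a non-article view in this layout
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/
```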

Non-English wikis may need to add translated versions of the lines above.

You could also try a wildcard Disallow rule, because some robots like Googlebot accept this wildcard extension to the robots.txt standard. Such a rule stops most of what we don't want robots sifting through, just like the /w/ solution above.
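One possible form of such a wildcard rule, assuming that non-article views are the URLs carrying extra query parameters (and therefore an ampersand), is:

```
User-agent: *
# "*" matches any run of characters; robots that do not support wildcards may ignore or misread this line
Disallow: /*&
```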

Allow indexing of raw pages by the Internet Archiver
You may wish to allow the Internet Archiver to index raw pages so that the raw wikitext of pages will be on permanent record. This way, if the wiki goes down, it will be easier for people to put the content on another wiki. To do so, add a rule for the Internet Archiver's user agent that allows the raw-page URLs.
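A sketch of such a rule follows. It assumes the archiver identifies itself as ia_archiver, that raw wikitext is served through action=raw URLs, and that the crawler understands both the Allow directive and wildcards:

```
# Let the Internet Archiver fetch action=raw so the raw wikitext is stored
User-agent: ia_archiver
Allow: /*&action=raw
```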

Rate control
With the standard robots.txt directives you can only specify which paths a bot is allowed to spider, not how fast it requests them. Even allowing just the plain page area can be a huge burden when one spider requests two or three pages per second across two hundred thousand pages.

Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
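As an illustration, a Crawl-delay line could be added like this. The value of 10 seconds is an arbitrary example, and the directive is a non-standard extension, so robots that do not support it will simply ignore it:

```
User-agent: *
# Minimum delay in seconds between hits, for robots that honor Crawl-delay
Crawl-delay: 10
```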

Misbehaving bots
Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

More generally, request throttling can stop such bots without requiring your repeated intervention.

An alternative or complementary strategy is to deploy a spider trap.

Spidering vs. indexing
While robots.txt stops (non-evil) bots from downloading a URL, it does not stop them from indexing it. This means that blocked pages might still show up in the results of Google and other search engines, as long as there are external links pointing to them. (What's worse, since the bots do not download such pages, noindex meta tags placed in them will have no effect.) For single wiki pages, a magic word that excludes the page from indexing might be a more reliable option for keeping them out of search results.