Manual talk:Robots.txt

evil google ?
The paragraph about the evil Googlebot (in the "evil bots" section) seems completely nonsensical. The author apparently does not understand what he is talking about. His referenced robots.txt is syntactically wrong, and the pages he refers to are either nonexistent or don't say what he claims they say. I suggest removing that remark, since it is either wrong or outdated.

Don't index vs don't spider
"The only way to keep a URL out of Google's index is to let Google slurp the page and see a meta tag specifying robots="noindex". With our current system, this would be difficult to special case."


 * As nonexistent articles mostly bring up an edit page, can we not just set that robots="noindex" meta tag on the edit page HTML template? This way, the meta tag would be there on all edit pages, so none of them will get indexed. Ropers 18:15, 28 Aug 2004 (UTC)


 * We already do. The issue discussed above is that Google returns search results including URLs that are forbidden by robots.txt. Because they are forbidden by robots.txt, Google does not spider the pages and does not see the meta tag. --Brion VIBBER 21:19, 28 Aug 2004 (UTC)


 * Ah. I misunderstood earlier. But then, can we not just do away with any mention of edit pages in robots.txt (which is what I think was proposed above by "letting Google slurp the page")? Ropers 21:30, 28 Aug 2004 (UTC)


 * This would require giving all edit URLs a distinct prefix that can be excluded from the Disallow line in robots.txt. It is possible, but some functions would need reworking. --Brion VIBBER 00:35, 29 Aug 2004 (UTC)
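
The point made in this thread (a page forbidden by robots.txt is never fetched, so its meta tag is never read) can be sketched in a few lines of Python. This is only an illustration, assuming a well-behaved crawler that consults robots.txt before every fetch. The robots.txt rule, the example.org URL, the HTML snippet and the function name `noindex_seen` are all made up for the example and are not taken from the live configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids everything under /w/ (where edit URLs live).
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

# Hypothetical edit page that carries the noindex meta tag.
EDIT_PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
)


def noindex_seen(robots_txt, url, page_html, agent="ExampleBot"):
    """Return True or False depending on whether a fetched page carries a
    noindex meta tag, or None if robots.txt forbids fetching it at all."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        # The page is never downloaded, so the meta tag stays invisible.
        return None
    return 'name="robots"' in page_html and "noindex" in page_html


edit_url = "http://example.org/w/wiki.phtml?title=Foo&action=edit"
print(noindex_seen(ROBOTS_TXT, edit_url, EDIT_PAGE_HTML))  # blocked by robots.txt
print(noindex_seen("", edit_url, EDIT_PAGE_HTML))          # no robots.txt rules
```

With the Disallow rule in place the function returns None, because the crawler never gets as far as reading the tag. Only when the rule is dropped does the noindex tag become visible to it.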

Evil Bots
Why not ban evil bots using htaccess? I have adapted MediaWiki's robots.txt file for my wiki and wondered whether it would be possible to find out which user agents and IP addresses MediaWiki has found to be evil, so that I can also ban them from my wiki using htaccess. Yes, I know it would be possible to build a list from server logs, but having an existing list to start with would be even better. Lavishluau

Random vs. Randompage
robots.txt on all MediaWikis contains:

Disallow: /wiki/Special:Randompage
Disallow: /wiki/Special%3ARandompage

But the link to Random page is /wiki/Special:Random, so shouldn't robots.txt contain the lines underneath?

Disallow: /wiki/Special:Random
Disallow: /wiki/Special%3ARandom

The current setting lets Google index random pages. Now when you select a search result, Google will serve you another (random) page - NOT the page as suggested in the search results. For example, search Google for:

allinanchor:"Special:Random" site:wikipedia.org

and click on link '../wiki/Special:Random'.

Fvue 08:10, 28 July 2006 (UTC)
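
The difference between the two rule sets can be checked with Python's standard-library robots.txt parser. This is a rough sketch: it assumes the crawler does plain prefix matching the way `urllib.robotparser` does, which may not match every real crawler, and the helper name `allowed` and the agent name `ExampleBot` are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_paths, path, agent="ExampleBot"):
    """Check a path against a robots.txt built from the given Disallow paths."""
    lines = ["User-agent: *"] + [f"Disallow: {p}" for p in disallow_paths]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, path)


# Rules as quoted at the top of this section, and the proposed replacement.
current = ["/wiki/Special:Randompage", "/wiki/Special%3ARandompage"]
proposed = ["/wiki/Special:Random", "/wiki/Special%3ARandom"]

for path in ("/wiki/Special:Random", "/wiki/Special:Randompage"):
    print(path,
          "| current rules allow:", allowed(current, path),
          "| proposed rules allow:", allowed(proposed, path))
```

Because Disallow is a prefix match, the shorter rule /wiki/Special:Random also covers /wiki/Special:Randompage, whereas the longer rule does not cover the shorter URL.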

how to set up a url like in wikipedia?

 * If using a system like on Wikipedia where plain pages are arrived at via /wiki/Some_title and anything else via /w/wiki.phtml?title=Some_title&someoption=blah, it's easy:

But how would I do that? I couldn't find any instructions. The closest I found was Using a very short URL, which isn't exactly this.

Answer to: how to set up a url like in wikipedia?

I have just set this up for my site http://www.wikisuccess.org/. Here is how you can do it:


 * MOVE your site from / to /w/


 * ADD/CHANGE in DefaultSettings.php & LocalSettings.php:

Use this format for every require or include ( $IP has to be there! ):
require_once( "$IP/includes/someincludedfile.php" );


 * DefaultSettings.php (logo & icon path):

$wgLogo = 'http://'.$wgServerName.'/w/skins/common/images/wiki.jpg';
$wgFavicon = 'http://'.$wgServerName.'/w/icon.gif';


 * LocalSettings.php:

$wgScriptPath      = "/w";
$wgScript          = "$wgScriptPath/index.php";
$wgRedirectScript  = "$wgScriptPath/redirect.php";

$wgArticlePath = "/wiki/$1";


 * .htaccess: (rewrite & redirect site.com to site.com/w/ )

DirectoryIndex w/index.php
RewriteEngine On
RewriteRule ^(images|skins)/ - [L]
RewriteRule \.php$ - [L]
RewriteRule ^wiki/?(.*)$ w/index.php?title=$1 [L,QSA]

 * robots.txt:

User-agent: *
Disallow: /w/


 * works for me nicely here http://www.wikisuccess.org/

Viktorados 23:47, 18 February 2007 (UTC)
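
To see what the three RewriteRules in the .htaccess step do to a request, here is a simplified Python model of them. It is only a sketch under stated assumptions, not how Apache actually evaluates rules: it assumes per-directory (.htaccess) matching, where the path is tested without its leading slash, it ignores the query string that the QSA flag would append, and the function name `rewrite` is made up for the example.

```python
import re

# Same pattern as the last RewriteRule in the .htaccess step above.
WIKI_RULE = re.compile(r"^wiki/?(.*)$")


def rewrite(path):
    """Mimic the three RewriteRules for a request path (no leading slash)."""
    # Rules 1 and 2: a '-' substitution with [L] leaves the path unchanged.
    if re.search(r"^(images|skins)/", path) or re.search(r"\.php$", path):
        return path
    # Rule 3: /wiki/Some_title is served by w/index.php?title=Some_title.
    match = WIKI_RULE.match(path)
    if match:
        return "w/index.php?title=" + match.group(1)
    return path


for request in ("wiki/Main_Page", "skins/common/wiki.css", "w/index.php"):
    print(request, "->", rewrite(request))
```

Readers and crawlers therefore only see /wiki/ URLs for plain pages, while every other action goes through /w/, which the robots.txt step disallows.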

Robots.txt and user pages
I'm not sure whether this is the best place to raise it, but I was just taking a look at the robots.txt file at en.wp here. Why was the exclusion of user pages from the Wayback Machine commented out? There are pretty good reasons to delete and oversight some user pages out there. Anyone know? Thanks. 70.112.5.220 06:26, 18 December 2007 (UTC)