Manual talk:Robots.txt/Archive 1

From mediawiki.org

2004

Don't index vs don't spider

"The only way to keep a URL out of Google's index is to let Google slurp the page and see a meta tag specifying robots="noindex". With our current system, this would be difficult to special case."

As nonexistent articles mostly bring up an edit page, can we not just set that robots="noindex" meta tag on the edit page HTML template? This way, the meta tag would be there on all edit pages, so none of them will get indexed. Ropers 18:15, 28 Aug 2004 (UTC)

We already do. The issue discussed above is that Google returns search results including URLs that are forbidden by robots.txt. Because they are forbidden by robots.txt, Google does not spider the pages and does not see the meta tag. --Brion VIBBER 21:19, 28 Aug 2004 (UTC)
Ah. I misunderstood earlier. But then, can we not just do away with any mention of edit pages in robots.txt (which is what I think was proposed above by "letting Google slurp the page")? Ropers 21:30, 28 Aug 2004 (UTC)
This would require giving all edit URLs a distinct prefix that can be left out of the disallow line in robots.txt. Possible, but some functions would need reworking. --Brion VIBBER 00:35, 29 Aug 2004 (UTC)
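The interaction described in this thread can be sketched in Python. This is only an illustration under assumptions: the host `wiki.example.org`, the `/w/` disallow rule, and the sample HTML are hypothetical, and the crawler is modeled as strictly obeying robots.txt via the standard `urllib.robotparser` module.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries a robots meta tag containing "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if tag == "meta" and name == "robots" and "noindex" in content:
            self.noindex = True


def crawler_sees_noindex(robots_txt, url, html):
    """Return True only if a compliant crawler may fetch url AND the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("*", url):
        return False  # the page is never downloaded, so the meta tag is never read
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


# Hypothetical edit URL and page body, for illustration only.
EDIT_URL = "https://wiki.example.org/w/index.php?title=Foo&action=edit"
EDIT_HTML = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'

blocked = crawler_sees_noindex("User-agent: *\nDisallow: /w/\n", EDIT_URL, EDIT_HTML)
unblocked = crawler_sees_noindex("", EDIT_URL, EDIT_HTML)
print(blocked, unblocked)
```

With the disallow rule in place the tag is never seen, which is why a forbidden URL can still turn up in results as a bare link.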

http://www.okawards.com

I disallow any pages with "redlink" in the URL. Seems to work. ErkDemon (talk) 11:04, 6 September 2012 (UTC)
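The exact rule used is not given above. As a rough sketch, matching "redlink" anywhere in a URL needs the `*` wildcard, an extension that crawlers such as Googlebot support but that simple prefix-only parsers do not. The pattern `/*redlink`, the function names, and the sample URLs below are all hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and an optional trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let '*' stand for any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query, disallow_patterns):
    """True if any pattern matches from the start of the path (plus query string)."""
    return any(robots_pattern_to_regex(p).search(path_and_query) for p in disallow_patterns)


RULES = ["/*redlink"]  # hypothetical rule, i.e. "Disallow: /*redlink"

print(is_disallowed("/w/index.php?title=Missing_page&action=edit&redlink=1", RULES))
print(is_disallowed("/wiki/Existing_page", RULES))
```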

2005

Evil Bots

Why not ban evil bots using htaccess? I have adapted MediaWiki's robots.txt file for my Wiki and wondered if it would be possible to find out what user agents and IP addresses MediaWiki has discovered are evil so that I also can ban them from my wiki using htaccess. Yes I know it would be possible to build a list using server logs, but having an existing list to start with would be even better. Lavishluau 28 September 2005

2006

Random vs. Randompage

robots.txt on all MediaWikis contains:

Disallow: /wiki/Special:Randompage
Disallow: /wiki/Special%3ARandompage

But the link to Random page is /wiki/Special:Random, so shouldn't robots.txt contain the lines underneath?

Disallow: /wiki/Special:Random
Disallow: /wiki/Special%3ARandom

Fvue 08:10, 28 July 2006 (UTC)
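Because Disallow lines are matched as path prefixes, the longer `Randompage` rule cannot cover the shorter `Random` URL, while the shorter rule covers both. A quick check with Python's standard `urllib.robotparser` illustrates this; the host `wiki.example.org` and the helper name `blocked_paths` are assumptions for the sketch.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, agent="*"):
    """Return the subset of paths that the given robots.txt text forbids."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [p for p in paths if not rules.can_fetch(agent, "https://wiki.example.org" + p)]


PATHS = ["/wiki/Special:Random", "/wiki/Special:Randompage", "/wiki/Special%3ARandom"]

# Rules as quoted at the top of this topic.
OLD = "User-agent: *\nDisallow: /wiki/Special:Randompage\nDisallow: /wiki/Special%3ARandompage\n"
# Rules as proposed in the question.
NEW = "User-agent: *\nDisallow: /wiki/Special:Random\nDisallow: /wiki/Special%3ARandom\n"

print("old rules block:", blocked_paths(OLD, PATHS))
print("new rules block:", blocked_paths(NEW, PATHS))
```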

The current setting lets Google index random pages. Now when you select a search result, Google will serve you another (random) page - NOT the page as suggested in the search results. For example, search Google for:
  allinanchor:"Special:Random" site:wikipedia.org
and click on link '../wiki/Special:Random'. --25 October 2006

how to set up a url like in wikipedia?

If using a system like on Wikipedia where plain pages are arrived at via /wiki/Some_title and anything else via /w/wiki.phtml?title=Some_title&someoption=blah, it's easy:
But how would I do that? I couldn't find any instructions. The closest I found was "Using a very short URL", which isn't exactly this. --16 October 2006

Answer: to how to set up a url like in wikipedia?
I have just made it for my site http://www.wikisuccess.org/. Here is how you can do it:
  • MOVE your site from / to /w/
  • ADD/CHANGE in DefaultSettings.php & LocalSettings.php:
APPLY for all require or include this format ( $IP has to be there! ):
   require_once( "$IP/includes/someincludedfile.php" );
DefaultSettings.php (logo & icon path):
   $wgLogo = 'http://'.$wgServerName.'/w/skins/common/images/wiki.jpg'; 
   $wgFavicon = 'http://'.$wgServerName.'/w/icon.gif';
LocalSettings.php:
   $wgScriptPath       = "/w";
   $wgScript           = "$wgScriptPath/index.php";
   $wgRedirectScript   = "$wgScriptPath/redirect.php";
   $wgArticlePath = "/wiki/$1";
.htaccess: (rewrite & redirect site.com to site.com/w/ )
   DirectoryIndex w/index.php
   RewriteEngine On
   RewriteRule ^(images|skins)/ - [L]
   RewriteRule \.php$ - [L]
   RewriteRule ^wiki/?(.*)$ w/index.php?title=$1 [L,QSA]
robots.txt:
   User-agent: *
   Disallow: /w/
works for me nicely here http://www.wikisuccess.org/
Viktorados 23:47, 18 February 2007 (UTC)
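To see what the three RewriteRule lines above do to a request, here is a rough Python approximation. It is a simplified model, not Apache's actual behavior: it ignores query strings (so `QSA` is not modeled), it does not re-run the rules after a rewrite, and the function name `rewrite` is made up for this sketch.

```python
import re

# Patterns copied from the .htaccess rules above; "-" with [L] means "leave unchanged and stop".
PASS_THROUGH = [re.compile(r"^(images|skins)/"), re.compile(r"\.php$")]
WIKI_RULE = re.compile(r"^wiki/?(.*)$")


def rewrite(request_path):
    """Return the internal target that the rules would map a request path to."""
    # In .htaccess context the pattern is matched without the leading slash.
    path = request_path[1:] if request_path.startswith("/") else request_path
    if any(rule.search(path) for rule in PASS_THROUGH):
        return "/" + path
    match = WIKI_RULE.match(path)
    if match:
        return "/w/index.php?title=" + match.group(1)
    return "/" + path


print(rewrite("/wiki/Main_Page"))
print(rewrite("/w/index.php"))
print(rewrite("/skins/common/images/wiki.jpg"))
```

Pretty URLs under /wiki/ end up served by /w/index.php, so the `Disallow: /w/` line keeps crawlers off the script path while leaving /wiki/ pages crawlable.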

2007

Robots.txt and user pages

I'm not sure whether this is the best place to raise it, but I was just taking a look at the Robots.txt file at en.wp here. Why was the exclusion of user pages from the Wayback Machine commented out? There are pretty good reasons to delete and oversight some user pages out there. Anyone know? Thanks. 70.112.5.220 06:26, 18 December 2007 (UTC)

2008

Who's in charge of maintaining this project's robots.txt?

Is this how I should contact the person who edits the global file for this project, or, if different, the person who would make modifications on behalf of fy.wikipedia.org (the Frisian wikipedia site)?

(If I'm supposed to ask on the Frisian site itself, I'd appreciate being given an idea where. So far I've gotten virtual blank stares. )

If you're wondering why, here's one example. Please compare the google results of:

site:wikipedia.org inurl:"wiki/Special:Random"

Good result.

with those of

site:wikipedia.org inurl:"wiki/Wiki:Random"

Oeps!

I.e., unlike the English site's syntax (wiki/Special:*) the Frisian syntax for the dynamically generated pages "search" and "load random page" (wiki/Wiki:*) is not covered in the "user agent * disallow" section.

Thanks much! (Tige tank! )

Winter 19:25, 24 March 2008 (UTC)

I would try the wikitech-l mailing list; you could also try OTRS if your request doesn't get picked up. If you can catch the right person on #wikimedia-tech (IRC), you might be able to get this done very quickly. —Emufarmers(T|C) 21:22, 24 March 2008 (UTC)
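One way to close the kind of gap Winter describes would be to generate the Disallow lines once per namespace alias, so that localized prefixes are covered alongside `Special:`. The sketch below is only an assumption about how that could look: the alias `Wiki` comes from the example above, while the page names and the function name `disallow_lines` are hypothetical.

```python
def disallow_lines(namespace_aliases, page_names, prefix="/wiki/"):
    """Build Disallow lines for every alias/page pair, in both ':' and '%3A' spellings."""
    lines = []
    for alias in namespace_aliases:
        for page in page_names:
            lines.append(f"Disallow: {prefix}{alias}:{page}")
            lines.append(f"Disallow: {prefix}{alias}%3A{page}")
    return lines


# "Random" and "Search" are placeholder page names, not the real Frisian titles.
generated = ["User-agent: *"] + disallow_lines(["Special", "Wiki"], ["Random", "Search"])
print("\n".join(generated))
```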

Images indexing

I use short URLs, so I did as suggested:

User-agent: *
Disallow: /w/

However, this prevents all the images from being indexed and/or crawled by Google Images, since the images are uploaded to /w/images/.

Does someone know how to fix this? Thanks. --Wikypedista 21:44, 22 December 2010 (UTC)

In theory, you can add an additional override "Allow" line for Google that deals with /w/images/. But not all search engines understand "Allow", and Google Images seems to have other issues with MediaWiki that can prevent indexing. ErkDemon (talk) 11:09, 6 September 2012 (UTC)
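A minimal sketch of the "Allow" override, checked with Python's standard `urllib.robotparser`. Several things here are assumptions: the rule is placed under `User-agent: *` rather than in a Google-specific section, the host and file paths are hypothetical, and the Allow line is listed before the Disallow line because Python's parser applies the first matching rule in file order, whereas Google documents that the most specific (longest) matching rule wins. Listing the more specific rule first should satisfy both.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /w/images/
Disallow: /w/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_fetch(path, agent="Googlebot-Image"):
    """True if the rules above let the given crawler fetch the path."""
    return rules.can_fetch(agent, "https://wiki.example.org" + path)


print(may_fetch("/w/images/a/ab/Example.png"))          # uploaded image
print(may_fetch("/w/index.php?title=Foo&action=edit"))  # script path
print(may_fetch("/wiki/Foo"))                           # short URL
```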