Manual:Robots.txt/fr

robots.txt files are part of the Robots Exclusion Standard and can help with search engine optimization. They tell web robots how to crawl a site. A robots.txt file must be placed in the web root of a domain.

Prevent all indexing
To prevent every robot from indexing any page of your site, disallow the root path / for all user agents (User-agent: *).
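A minimal sketch of such a file in standard robots.txt syntax. The asterisk matches every user agent, and a Disallow value of / covers every path on the site:

```text
# Applies to all robots
User-agent: *
# Block every URL on the site
Disallow: /
```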

If you only want to block a certain spider, replace the asterisk with the spider's user agent.

Prevent indexing of non-article pages
MediaWiki generates many pages that are only useful for live humans: old revisions and diffs tend to duplicate content found in articles. Edit pages and most special pages are dynamically generated, which makes them useful only to human editors and relatively expensive to serve. If not directed otherwise, spiders may try to index thousands of similar pages, overloading the webserver.

With short URLs
It is easy to prevent spiders from indexing non-article pages if you are using Wikipedia-style short URLs. Assuming articles are accessible through /wiki/Some_title and everything else is available through /w/index.php?title=Some_title&someoption=blah, it is enough to disallow the /w/ directory for all user agents.
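Under that layout, a rule along these lines keeps compliant spiders out of everything served from /w/ while leaving /wiki/ crawlable. The paths are the ones assumed in the paragraph above and should be adjusted to your own configuration:

```text
User-agent: *
# The trailing slash limits the match to the /w/ directory
Disallow: /w/
```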

Be careful, though! If you accidentally leave out the trailing slash and write the rule as /w, you will also block access to the /wiki directory, and search engines will drop your wiki!
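For illustration, the mistaken form looks like the sketch below. Rule values are matched as plain prefixes, so /w covers /wiki/Some_title as well as /w/index.php:

```text
User-agent: *
# WRONG: without a trailing slash this prefix also matches /wiki/...
Disallow: /w
```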

Be aware that blocking /w/ also blocks CSS, JavaScript, and image files, so search engines like Google will not be able to render previews of wiki articles. To work around this, instead of blocking the /w directory in its entirety, only index.php needs to be blocked.
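A sketch of that narrower rule, assuming the same /w/ script path as above. The exact pattern is an assumption; the trailing ? restricts the match to requests that carry a query string:

```text
User-agent: *
# Block only the dynamic entry point; /w/load.php stays reachable
Disallow: /w/index.php?
```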

This works because CSS and JavaScript are retrieved via /w/load.php. Alternatively, you could follow the approach used on the Wikimedia farm.
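The following sketch is modeled on that approach: block /w/ as a whole, but explicitly re-allow load.php. It assumes the crawler honors the Allow directive, which older or simpler crawlers may not support:

```text
User-agent: *
# Keep CSS and JavaScript reachable
Allow: /w/load.php?
# Block everything else under the script path
Disallow: /w/
```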

Without short URLs
If you are not using short URLs, restricting robots is a bit harder. If you are running PHP as CGI and you have not beautified URLs, so that articles are accessible through /index.php?title=Some_title:
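A sketch for this case. The namespace names are illustrative assumptions (the default English names); list whichever namespaces your wiki should keep out of search engines. Because these are prefix matches, a rule such as /index.php?diff= only catches URLs in which diff is the first query parameter:

```text
User-agent: *
# Diffs and old revisions
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
# Non-article namespaces
Disallow: /index.php?title=Help
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
# Optional; omit if crawlers should be able to fetch skin images
Disallow: /skins/
```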

If you are running PHP as an Apache module and you have not beautified URLs, so that articles are accessible through /index.php/Some_title:
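A sketch for the path-style layout, under the same assumptions about namespace names as before:

```text
User-agent: *
# Diffs and old revisions
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
# Non-article namespaces
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
# Optional; omit if crawlers should be able to fetch skin images
Disallow: /skins/
```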

Lines without a colon at the end also restrict the talk pages of those namespaces, since, for example, Help is also a prefix of Help_talk.

Non-English wikis may need to add various translations of the lines above.

You may wish to omit the restriction on /skins/, because it prevents access to the images belonging to the skin. Search engines that render page previews, such as Google, will show articles with missing images if they cannot access the /skins/ directory.

You can also try a wildcard rule, because some robots, such as Googlebot, accept a wildcard extension to the robots.txt standard. Such a rule stops most of what we don't want robots sifting through, just like the /w/ solution above. It does, however, suffer from the same limitation: it blocks access to CSS, which prevents search engines from correctly rendering preview images. It may be possible to solve this by adding another line, Allow: /load.php; however, at the time of writing this is untested.
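For illustration only, a wildcard rule of this kind might look like the sketch below. The pattern /*& is an assumption chosen for the example: it matches any URL containing an ampersand, typically a request with more than one query parameter, while leaving /index.php?title=Some_title crawlable. The Allow line is the untested workaround mentioned above.

```text
User-agent: *
# Wildcard extension: not understood by every crawler
Disallow: /*&
# Untested workaround intended to keep CSS and JavaScript reachable
Allow: /load.php
```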

Allow indexing of raw pages by the Internet Archiver
You may wish to allow the Internet Archiver to index raw pages so that the raw wikitext of pages will be on permanent record. This makes it easier for people to move the content to another wiki if the wiki goes down. To do this, add an Allow rule for raw-page URLs (action=raw) under the archiver's user agent.
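A sketch of such a rule. The user-agent token ia_archiver and the wildcard pattern are assumptions; check the archiver's current documentation for the token it actually sends:

```text
# Let the Internet Archiver fetch action=raw so raw wikitext is archived
User-agent: ia_archiver
Allow: /*&action=raw
```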

Rate control
The standard robots.txt directives only specify which paths a bot is allowed to spider; they do not limit how fast it does so. Even allowing just the plain page area can be a huge burden when one spider requests two or three pages per second across two hundred thousand pages.

Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
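A sketch of such a line, assuming a crawler that honors Crawl-delay. The user-agent token Slurp (the name of Inktomi's crawler) and the 30-second value are illustrative:

```text
User-agent: Slurp
# Wait at least 30 seconds between requests
Crawl-delay: 30
```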

Evil bots
Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

More generally, request throttling can stop such bots without requiring your repeated intervention.

An alternative or complementary strategy is to deploy a spider trap.

Spidering vs. indexing
While robots.txt stops (non-evil) bots from downloading a URL, it does not stop them from indexing it. This means that blocked pages might still show up in the results of Google and other search engines, as long as there are external links pointing to them. (What's worse, since the bots do not download such pages, noindex meta tags placed in them will have no effect.) For single wiki pages, the __NOINDEX__ magic word might be a more reliable option for keeping them out of search results.