Manual:Robots.txt/hu

robots.txt files are part of the Robots Exclusion Standard. They tell web robots how to crawl a site. A robots.txt file must be placed in the web root of a domain.

Examples


Prevent all crawling
This code prevents all bots from crawling all pages on your site:
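As a minimal sketch, a rule set of this kind can be checked with Python's standard `urllib.robotparser` module. The bot name `ExampleBot` and the page path are illustrative assumptions, not values from this manual.

```python
from urllib.robotparser import RobotFileParser

# Rule set that disallows every path for every user agent.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

def can_crawl(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# "ExampleBot" and the page path are made-up values for illustration.
print(can_crawl(BLOCK_ALL, "ExampleBot", "/wiki/Main_Page"))  # prints False
```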

If you only want to block a certain spider, replace the asterisk with the spider's user agent.
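A sketch of the single-spider variant, again checked with `urllib.robotparser`. The names `BadBot` and `OtherBot` are hypothetical user agents chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Only the (hypothetical) spider "BadBot" is shut out; no rule covers other agents.
BLOCK_ONE_BOT = """\
User-agent: BadBot
Disallow: /
"""

def can_crawl(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(can_crawl(BLOCK_ONE_BOT, "BadBot", "/wiki/Main_Page"))    # prints False
print(can_crawl(BLOCK_ONE_BOT, "OtherBot", "/wiki/Main_Page"))  # prints True
```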

Prevent crawling of non-article pages
MediaWiki generates many pages that are only useful for live humans: old revisions and diffs tend to duplicate content found in articles. Edit pages and most special pages are dynamically generated, which makes them useful only to human editors and relatively expensive to serve. If not directed otherwise, spiders may try to index thousands of similar pages, overloading the webserver.



With short URLs
It is easy to prevent spiders from crawling non-article pages if you are using Wikipedia-style short URLs. Assuming articles are accessible through  and everything else is available through  :

Be careful, though! If you put this line by accident:

you'll block access to the /wiki directory, and search engines will drop your wiki!
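The difference comes from prefix matching: a rule path blocks every URL that starts with it. The sketch below assumes, for illustration only, that articles live under `/wiki/` and everything else under `/w/`, and compares a rule with and without the trailing slash.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: articles under /wiki/, scripts and other pages under /w/.
INTENDED = """\
User-agent: *
Disallow: /w/
"""

# Same rule without the trailing slash: "/w" is also a prefix of "/wiki/...".
MISTAKE = """\
User-agent: *
Disallow: /w
"""

def can_crawl(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

for name, rules in (("intended", INTENDED), ("mistake", MISTAKE)):
    article_ok = can_crawl(rules, "ExampleBot", "/wiki/Main_Page")
    script_ok = can_crawl(rules, "ExampleBot", "/w/index.php")
    print(name, article_ok, script_ok)
# prints:
# intended True False
# mistake False False
```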

Be aware that this solution will also cause CSS, JavaScript, and image files to be blocked, so search engines like Google will not be able to render previews of wiki articles. To work around this, instead of blocking the entire  directory, only   need be blocked:
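A minimal sketch of such a narrower rule, under the assumption that the script directory is `/w/`, that page views go through `index.php`, and that stylesheets and scripts are served by `load.php`. These paths are assumptions made for the example; adjust them to your own installation.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: only the page-generating script is blocked,
# so other files under /w/ stay reachable.
NARROW_RULES = """\
User-agent: *
Disallow: /w/index.php
"""

def can_crawl(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(can_crawl(NARROW_RULES, "ExampleBot", "/w/index.php"))     # prints False
print(can_crawl(NARROW_RULES, "ExampleBot", "/w/load.php"))      # prints True
print(can_crawl(NARROW_RULES, "ExampleBot", "/wiki/Main_Page"))  # prints True
```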

This works because CSS and JavaScript is retrieved via. Alternatively you could do it as it is done on the Wikimedia farm:



Without short URLs
If you are not using short URLs, restricting robots is a bit harder. If you are running PHP as CGI and you have not beautified URLs, so that articles are accessible through :

If you are running PHP as an Apache module and you have not beautified URLs, so that articles are accessible through :

The lines without the colons at the end restrict those namespaces' talk pages.

Non-English wikis may need to add various translations of the above lines.

You may wish to omit the  restriction, as this will prevent images belonging to the skin from being accessed. Search engines which render preview images, such as Google, will show articles with missing images if they cannot access the  directory.

You can also try

because some robots like Googlebot accept this wildcard extension to the robots.txt standard, which stops most of what we don't want robots sifting through, just like the /w/ solution above. This does, however, suffer from the same limitations in that it blocks access to CSS, preventing search engines from correctly rendering preview images. It may be possible to solve this by adding another line,  however at the time of writing this is untested.
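To see what a wildcard rule would match before deploying it, the matching can be sketched in a few lines. The code below follows the wildcard semantics described in RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the URL); it is not Googlebot's actual implementation, and the pattern `/*&` and the sample URLs are assumptions chosen for illustration. Python's `urllib.robotparser` does not interpret these wildcards, so the matcher is written by hand.

```python
import re

def wildcard_rule_to_regex(rule):
    """Compile a robots.txt path pattern with '*' and optional trailing '$'."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" where '*' stood.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def rule_matches(rule, path):
    """Return True if the pattern matches the URL path from its start."""
    return wildcard_rule_to_regex(rule).match(path) is not None

# Hypothetical pattern: match any URL that contains an ampersand.
print(rule_matches("/*&", "/index.php?title=Foo&action=edit"))  # prints True
print(rule_matches("/*&", "/index.php?title=Foo"))              # prints False
```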

Allow indexing of raw pages by the Internet Archiver
You may wish to allow the Internet Archiver to index raw pages so that the raw wikitext of pages will be on permanent record. This way, it will be easier, in the event the wiki goes down, for people to put the content on another wiki. You would use:

Problems


Rate control
You can only specify what paths a bot is allowed to spider. Even allowing just the plain page area can be a huge burden when two or three pages per second are being requested by one spider over two hundred thousand pages.

Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line, which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
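A sketch of a `Crawl-delay` rule and how a parser that supports the directive reads it, using `urllib.robotparser.RobotFileParser.crawl_delay` (available since Python 3.6). The user agent `Slurp`, the 15-second value, and the empty `Disallow` line are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Ask one crawler to wait at least 15 seconds between requests.
# An empty Disallow line places no path restriction on it.
DELAY_RULES = """\
User-agent: Slurp
Crawl-delay: 15
Disallow:
"""

parser = RobotFileParser()
parser.parse(DELAY_RULES.splitlines())

print(parser.crawl_delay("Slurp"))     # prints 15
print(parser.crawl_delay("OtherBot"))  # prints None (no rule applies)
```

Crawlers that do not implement the directive simply ignore the line, so it is a request rather than an enforcement mechanism.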



Bad bots
Sometimes a custom-written bot is not very smart, or is outright malicious, and does not obey robots.txt at all (or it obeys the path restrictions but crawls very fast, slowing down the site). It may be necessary to block specific user agents or the IP addresses of offenders.

More generally, request throttling can stop such bots without requiring your repeated intervention.
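One common form of request throttling is a sliding-window limit per client. The class below is a minimal in-memory sketch of that idea, not part of MediaWiki; the class name, the limits, and the sample IP address are all hypothetical. In practice this is usually handled by the web server or a reverse proxy rather than application code.

```python
from collections import defaultdict, deque

class RequestThrottle:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def allow(self, client, now):
        """Record a request at time `now` and return whether to serve it."""
        hits = self._hits[client]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

# Hypothetical limit: two requests per second per client address.
throttle = RequestThrottle(max_requests=2, window_seconds=1.0)
results = [throttle.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 1.05)]
print(results)  # prints [True, True, False, True]
```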

An alternative or complementary strategy is to deploy a spider trap.

Spidering vs. indexing
While robots.txt stops (non-malicious) bots from downloading a URL, it does not stop them from indexing it. This means that such URLs might still show up in the results of Google and other search engines, as long as there are external links pointing to them. (What's worse, since the bots do not download such pages, noindex meta tags placed in them will have no effect.) For single wiki pages, the  magic word might be a more reliable option for keeping them out of search results.