Manual:Search MediaWiki systems

From MediaWiki.org

This article suggests ways to configure external search engines, such as htDig, to search and spider MediaWiki-based systems efficiently.

Spidering information

Use the robots.txt file to tell spiders which pages to index and which to skip. Take a look at Wikipedia's own robots.txt.

You probably do not want spiders to follow the dynamic index.php pages, only the basic /wiki/article content pages. MediaWiki already outputs <meta name="robots" content="noindex,nofollow" /> in the HTML of "Edit this page" and "History" pages.
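As a rough illustration, robots.txt rules along these lines can be checked with Python's standard urllib.robotparser module before a spider is pointed at the wiki. The rules, the host name wiki.example.org, and the helper build_parser are assumptions made up for this sketch; they are not taken from Wikipedia's robots.txt, and a real wiki's paths may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep spiders out of the dynamic index.php pages
# and let them fetch the plain /wiki/ article pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php
Allow: /wiki/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text (already in memory) into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# wiki.example.org is a placeholder host, not a real site.
article_ok = parser.can_fetch("*", "http://wiki.example.org/wiki/Main_Page")
edit_ok = parser.can_fetch(
    "*", "http://wiki.example.org/index.php?title=Main_Page&action=edit"
)

print("article page allowed:", article_ok)
print("edit page allowed:", edit_ok)
```

A check like this only tells you how a well-behaved spider should read the rules; it does not replace the noindex,nofollow meta tag that MediaWiki emits on the edit and history pages.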

Specific Page Sections

You don't want search engines to index the boilerplate that appears on every page in the navigation sidebar and footer. Otherwise, searching for "privacy" or "navigation" will return every single page.

The old way to do this was to put <NOSPIDER>...</NOSPIDER> around such HTML sections. This is invalid XHTML unless you declare a namespace for it, but I don't know whether search engines that still look for nospider will handle a prefixed form such as <i:NOSPIDER>. Google instead uses comments for the same purpose: <!--googleoff: index--> ... <!--googleon: index-->.
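To make the effect of these markers concrete, here is a minimal sketch of how an indexer might drop the marked sections before indexing a page. It is an assumption about how a spider could behave, not a description of what htDig or Google actually do; the function name strip_unindexed_sections and the sample HTML are invented for the example, and the regular expressions do not handle nested or unbalanced markers.

```python
import re

# Matches <NOSPIDER>...</NOSPIDER> in any letter case, across line breaks.
NOSPIDER_RE = re.compile(r"<nospider>.*?</nospider>", re.IGNORECASE | re.DOTALL)

# Matches everything between the googleoff and googleon index comments.
GOOGLEOFF_RE = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_unindexed_sections(html: str) -> str:
    """Return the HTML with both kinds of marked sections removed."""
    html = NOSPIDER_RE.sub("", html)
    return GOOGLEOFF_RE.sub("", html)


# Invented sample page: one content block plus marked boilerplate.
sample = (
    "<div id='content'>Article text about caching.</div>"
    "<!--googleoff: index--><div id='footer'>Privacy policy</div>"
    "<!--googleon: index-->"
    "<NOSPIDER><div id='sidebar'>navigation</div></NOSPIDER>"
)

cleaned = strip_unindexed_sections(sample)
print(cleaned)
```

With the boilerplate removed this way, a search for "privacy" would match only pages whose article text actually mentions it.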

I believe the MediaWiki software should output these tags around boilerplate. Googling the English Wikipedia for 'privacy' returns 2,920,000 pages! (Bug 5707 has been filed.)

Related extensions
