Manual:Internet Archive

The Internet Archive, a non-profit digital library with the stated mission of "universal access to all knowledge", is a repository where one can store and retrieve wiki content. Wikis frequently go down when their owners die or otherwise become unable to keep paying the server bills, or when the owners give up because they cannot cultivate a vibrant community or are not equal to the task of coping with spam and other problems. The Internet Archive then becomes the only way to get the content, since most wiki owners do not share the database with the general public.

However, the Internet Archive does not always make the raw wikitext available for those who want to import it into their own wikis. Many wikis set robot policies that prevent some or all of their content from being archived; for example, the English Wikipedia excludes deletion debates from being archived.[1] Also, sometimes when a site goes down, the new domain owner sets a robot policy that prevents archives of the old content from being viewed.

Getting data from your wiki into the Internet Archiver

You need to either (1) have an archived AllPages list that leads to the raw text of each page (see, e.g., Extension:OneColumnAllPages#Raw), or (2) have a link on each page that the Internet Archiver can use to reach the raw text. For the latter, you can add the following to MediaWiki:Sidebar:

**{{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext

Or add such a link to your skin. (TODO: Explain how to do this. Something involving SkinTemplate, perhaps?)

(TODO: Provide a sample cron job to export everything but database fields containing sensitive data to an sql.tar.gz file. Also provide instructions on how to automatically have this put in the Internet Archive.)
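Until that TODO is filled in, here is a minimal sketch of what such a job might look like, written in Python rather than as a shell script. It is an assumption-laden outline, not a tested recipe: the database name wikidb, the list of tables treated as sensitive, and the archive.org item identifier are all placeholders, and the upload step relies on the third-party internetarchive package, configured beforehand with `ia configure`.

#!/usr/bin/env python3
# Minimal sketch, not a tested backup solution. Assumes MySQL credentials are
# available to mysqldump (e.g. via ~/.my.cnf), a database named "wikidb" with
# no table prefix, and an archive.org account already configured for the
# "internetarchive" package. The sensitive-table list is illustrative only.
import gzip
import shutil
import subprocess
from datetime import date

from internetarchive import upload

DB = "wikidb"                                    # assumption: your database name
SENSITIVE_TABLES = ["user", "user_newtalk",      # assumption: adjust to your schema
                    "watchlist", "ipblocks", "objectcache"]

dump_file = f"{DB}-{date.today().isoformat()}.sql"

# Dump everything except the tables that hold private data.
cmd = ["mysqldump", DB] + [f"--ignore-table={DB}.{t}" for t in SENSITIVE_TABLES]
with open(dump_file, "w") as out:
    subprocess.run(cmd, stdout=out, check=True)

# Compress the dump.
with open(dump_file, "rb") as src, gzip.open(dump_file + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload to a (hypothetical) archive.org item.
upload("examplewiki-sql-dumps",                  # assumption: your item identifier
       files=[dump_file + ".gz"],
       metadata={"title": "Example wiki database dump", "mediatype": "data"})

Scheduling it weekly from cron could then look like 0 3 * * 0 /usr/bin/python3 /path/to/dump_and_upload.py, with the script path again being a placeholder.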

Importing content from the Internet Archive to your wiki

Ideally, content from all public wikis should be retrievable from the Internet Archive in some sort of XML or SQL format. Most wikis that have gone down never published such dumps, so it will be necessary to get the data some other way. (TODO: How to retrieve an AllPages list?)
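One stopgap, sketched below under stated assumptions, is to ask the Wayback Machine's CDX API for every capture under the dead wiki's host and reduce that to a de-duplicated list of page titles. It is not a real AllPages list, since it only covers whatever the archive happened to crawl, and the youthrights.net URL pattern is used purely as an example.

# Minimal sketch: list archived pages of a dead wiki via the Wayback Machine's
# CDX API. Assumes the wiki used index.php?title=... URLs; the host below is
# only an example, and pages the crawler never reached will be missing.
import json
import urllib.parse
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "youthrights.net/index.php",   # assumption: the wiki's URL pattern
    "matchType": "prefix",                # all captures under that prefix
    "output": "json",
    "fl": "original,timestamp",
    "collapse": "urlkey",                 # one row per distinct URL
}

with urllib.request.urlopen(CDX + "?" + urllib.parse.urlencode(params)) as resp:
    rows = json.load(resp)

titles = set()
for original, timestamp in rows[1:]:      # first row is the field header
    query = urllib.parse.urlparse(original).query
    title = urllib.parse.parse_qs(query).get("title", [None])[0]
    if title:
        titles.add(title)

for t in sorted(titles):
    print(t)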

Ugly URLs

The upside of a wiki that has ugly URLs is that the archived copies usually provide access to either action=edit or action=raw.

The following case study deals with retrieving data from the archived Youth Rights Network site. The site ran MediaWiki 1.15.1, but for some reason action=raw isn't available. (TODO: When did action=raw become available? 2004? What MW version was that? Oh dear, it's going to be necessary to look in SVN.)

It is, however, possible to get the revision text by crawling the archived edit pages, e.g. https://web.archive.org/web/20110701091737/http://youthrights.net/index.php?title=Main_Page&action=edit . The top level of the AllPages hierarchy is accessible, but it doesn't let you actually drill down to the list of pages. (TODO: Now what? Do I design a bot from scratch to get this data, or is a script already available that can be adapted for this purpose?)
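As a starting point before writing or adapting a full bot, a short script can pull the wikitext out of a single archived edit page. The sketch below assumes the standard wpTextbox1 textarea that MediaWiki edit forms have long used, and it omits error handling and crawl delays; combined with a URL listing such as the CDX query above, it could be looped over every page.

# Minimal sketch: extract wikitext from one archived action=edit page.
# Assumes the edit form still uses the standard "wpTextbox1" textarea and
# that the snapshot URL below is representative; politeness delays, retries,
# and encoding detection are omitted.
import html
import re
import urllib.request

SNAPSHOT = ("https://web.archive.org/web/20110701091737/"
            "http://youthrights.net/index.php?title=Main_Page&action=edit")

with urllib.request.urlopen(SNAPSHOT) as resp:
    page = resp.read().decode("utf-8", errors="replace")

# The revision text sits inside the edit box; unescape HTML entities afterwards.
match = re.search(
    r'<textarea[^>]*\bname=["\']wpTextbox1["\'][^>]*>(.*?)</textarea>',
    page, re.DOTALL)
if match:
    print(html.unescape(match.group(1)))
else:
    print("No edit box found in this snapshot")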

See also

  • task T64468 — Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

External link