Extension:Memento
From MediaWiki.org
| Field | Value |
|---|---|
| Release status | experimental |
| Implementation | Data extraction, User interface |
| Description | Implements support of X-Accept-Datetime header |
| Author(s) | Harihar Shankar and Robert Sanderson |
| Last Version | 0.3 |
| MediaWiki | 1.6.0+ |
| License | GPL |
| Download | no link |
What can this extension do?
The Memento extension implements support of the X-Accept-Datetime HTTP header to perform content negotiation in the date-time dimension, built on the principles of RFC 2295 [1]. This enables MediaWiki to be used as a web archive.[2]
The extension works in three simple steps:
- Check for the existence of an X-Accept-Datetime header in the client's request.
- If the X-Accept-Datetime header exists, redirect the client to the version of the requested resource that was live at the date-time given as the value of the header.
- If the X-Accept-Datetime header does not exist, handle the client's request as usual. Nothing out of the ordinary happens.
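The three steps can be sketched as a small, self-contained decision function. This is a minimal illustration, not the extension's actual code: the function name `mementoDecide`, the returned array shape, and the `$findRevision` callback (which stands in for the database lookup) are all assumptions made for the example. It also assumes the braces shown in the documented header syntax should be stripped before the date is parsed.

```php
<?php
// Hypothetical helper, not part of the extension.
// $server mimics $_SERVER. $findRevision is an assumed callback that maps
// (title, 'YmdHis' timestamp) to a revision id, or null when none exists.
function mementoDecide( array $server, string $title, callable $findRevision ): array {
    // Step 1: check whether the header is present.
    if ( !isset( $server['HTTP_X_ACCEPT_DATETIME'] ) ) {
        // Step 3: no header, so the request is handled as usual.
        return array( 'action' => 'passthrough' );
    }

    // Assumption: the value is wrapped in braces, which are removed before parsing.
    $raw = trim( $server['HTTP_X_ACCEPT_DATETIME'], "{} \t" );
    $ts = strtotime( $raw );
    if ( $ts === false ) {
        return array( 'action' => 'error', 'status' => 400 );
    }

    // gmdate() keeps the result independent of the server's timezone setting.
    $revId = $findRevision( $title, gmdate( 'YmdHis', $ts ) );
    if ( $revId === null ) {
        return array( 'action' => 'error', 'status' => 406 );
    }

    // Step 2: redirect to the revision that was live at the requested time.
    return array(
        'action' => 'redirect',
        'status' => 302,
        'location' => 'index.php?title=' . rawurlencode( $title ) . '&oldid=' . $revId,
    );
}
```

A caller would pass `$_SERVER`, the requested title, and a lookup function, then act on the returned `action` value.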
This plug-in uses the same handlers that MediaWiki does to connect to the database and hence all the existing database permissions and page access permissions are honored. This plug-in only uses a 'DB_SLAVE' database connection, which means that the database connection can only read from the tables. Hence, this plug-in makes no changes to the database.
Installation
To install this extension, add the following to LocalSettings.php:
if (!$wgCommandLineMode) { require_once "$IP/extensions//memento.php"; }
Configuration parameters
$wgMementoConfigDeleted = true/false; # Toggles the feature to do datetime content negotiation for deleted pages.
Usage
Once installed, the extension can be tested and used in two ways:
- Using a Firefox browser: Install the Modify Headers Firefox extension. Then set the X-Accept-Datetime header from the Tools/Modify Headers menu option. The syntax to use is X-Accept-Datetime: {Sat, 03 Oct 2009 10:00:00 GMT}. Set it to a date-time at which your wiki was already generating history pages. Then enter the URL of a page from your wiki that has history pages around the date-time you chose. If all is well, you should immediately retrieve the history page that was the active version at the date-time you picked.
- Using the UNIX command line tool curl: To achieve the equivalent of the Firefox test above using curl, the command would be:
curl -o null.html -D headers.txt -H \
"X-Accept-Datetime: {Sat, 03 Oct 2009 10:00:00 GMT}" \
http://your.wikiserver.here/your-title-here
And then look in headers.txt to make sure it looks similar to:
HTTP/1.1 302 Found
Date: Tue, 13 Oct 2009 20:07:27 GMT
Server: Apache
Location: http://your.wikiserver.here?title=your-title-here&oldid=123456
TCN: choice
Vary: negotiate, X-Accept-Datetime
X-Archival-Interval: {Fri, 02 Oct 2009 08:00:00 GMT} -
{Mon, 05 Oct 2009 22:45:12 GMT}
Alternates:
{"http://your.wikiserver.here?title=your-title-here&oldid=123322" 0.8
{type text/html} {dt Fri, 02 Oct 2009 08:00:00 GMT}} ,
{"http://your.wikiserver.here?title=your-title-here&oldid=123842" 0.8
{type text/html} {dt Mon, 05 Oct 2009 22:45:12 GMT}}
Content-Type: text/html; charset=UTF-8
Namespaces
The extension renders the requested page the same way MediaWiki does. It queries the wiki database table page with the requested title. Both MediaWiki reserved namespaces and custom namespaces are accounted for by retrieving the namespace_id from the object $wgTitle. If the namespace does not exist, the plug-in treats the namespace prefix as part of the title. For example, if the requested title is 'Memento:Main_Page', the plug-in first checks whether a namespace exists for "Memento" and, if so, retrieves its corresponding namespace_id and queries the page table for the title 'Main_Page' with that namespace_id. Otherwise, it treats "Memento" as part of the title and searches the page table for the title 'Memento:Main_Page'. If the title cannot be found in the database, an HTTP/1.1 404 Not Found is returned.
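The namespace fallback described above can be illustrated with a short sketch. It is only an approximation of the behaviour: the function name `mementoSplitTitle` and the `$knownNamespaces` map (namespace name to namespace id) are assumptions for the example, whereas the extension itself reads the namespace from $wgTitle. The id 100 used for "Memento" in the usage note is likewise a made-up custom namespace id.

```php
<?php
// Hypothetical helper, not part of the extension.
// $knownNamespaces is an assumed map of namespace name => namespace id.
function mementoSplitTitle( string $requested, array $knownNamespaces ): array {
    $pos = strpos( $requested, ':' );
    if ( $pos !== false && $pos > 0 ) {
        $prefix = substr( $requested, 0, $pos );
        if ( isset( $knownNamespaces[$prefix] ) ) {
            // The prefix is a real namespace: look up the remainder of the title in it.
            return array(
                'namespace' => $knownNamespaces[$prefix],
                'title' => substr( $requested, $pos + 1 ),
            );
        }
    }
    // No matching namespace: keep the whole string as the title in the
    // main namespace, whose id is 0 in MediaWiki.
    return array( 'namespace' => 0, 'title' => $requested );
}
```

With `array( 'Memento' => 100 )` as the map, 'Memento:Main_Page' resolves to the title 'Main_Page' in namespace 100. With an empty map, the same request resolves to the title 'Memento:Main_Page' in namespace 0.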
Templates
By default, MediaWiki retrieves the most recent version of a template when it is transcluded in an article. This extension cannot perform datetime content negotiation on transcluded templates. As a quick fix, the following code can be added to the file Parser.php in the directory path/to/wiki/includes/parser/ to perform this operation.
# Query the database for the rev_id of the template at the requested datetime.
foreach ( $_SERVER as $key => $value ) {
    // Check for the occurrence of the accept-datetime header.
    if ( strcasecmp( $key, 'HTTP_X_ACCEPT_DATETIME' ) == 0 ) {
        // Convert the header value to Unix time, then to the format the wiki database uses.
        $dt = strtotime( $value );
        if ( $dt === false ) {
            // Unparseable date: leave $id unchanged.
            break;
        }
        $dt = date( 'YmdHis', $dt );
        $pg_id = $title->getArticleID();
        $dbr = wfGetDB( DB_SLAVE );
        $dbr->begin();
        $tbl_rev = $dbr->tableName( 'revision' );
        $res = $dbr->query( "SELECT DISTINCTROW rev_id FROM $tbl_rev
            WHERE rev_page = $pg_id
            AND rev_timestamp <= $dt
            ORDER BY rev_id DESC
            LIMIT 0,1"
        );
        if ( $res ) {
            $row = $dbr->fetchObject( $res );
            // fetchObject() returns false when no revision matches.
            if ( $row ) {
                $id = $row->rev_id;
            }
        }
    }
}
Paste the code above into the function statelessFetchTemplate(...), immediately after the variable
$id = false;
is declared. This code fetches the rev_id of the template for the requested datetime and directs MediaWiki to fetch that revision instead of the latest version of the template by title. This code was written for MediaWiki version 1.8+.
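The selection rule behind the SQL query (the newest revision at or before the requested time) can also be shown without a database. The sketch below is an illustration under assumptions: the function name `mementoPickRevision` and the in-memory `$revisions` array are invented for the example, and ties on the timestamp are broken by the higher rev_id.

```php
<?php
// Hypothetical helper, not part of the extension.
// $revisions: list of array( 'rev_id' => int, 'rev_timestamp' => 'YmdHis' ).
// Returns the rev_id of the newest revision at or before $requested, or null.
function mementoPickRevision( array $revisions, string $requested ): ?int {
    $best = null;
    foreach ( $revisions as $rev ) {
        // 14-digit timestamps have a fixed width, so string comparison
        // orders them chronologically.
        if ( strcmp( $rev['rev_timestamp'], $requested ) > 0 ) {
            continue; // Revision is newer than the requested time.
        }
        if ( $best === null ) {
            $best = $rev;
            continue;
        }
        $cmp = strcmp( $rev['rev_timestamp'], $best['rev_timestamp'] );
        if ( $cmp > 0 || ( $cmp === 0 && $rev['rev_id'] > $best['rev_id'] ) ) {
            $best = $rev;
        }
    }
    return $best === null ? null : $best['rev_id'];
}
```

A null result corresponds to the case in which no revision existed yet at the requested time, so the latest-version lookup would be left in place.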
Caching
By default, MediaWiki searches its cache for templates using the title and retrieves the most recent version. For best results, it is recommended that caching be disabled for templates so that MediaWiki always queries the database for the revision. This can be done either by commenting out the respective lines in the function getTemplateDom in Parser.php or by writing a small piece of code that skips the caching part if the X-Accept-Datetime header is detected.
Special Pages
Special pages under the URL http://your.wikiserver.here/index.php/Special:SpecialPages do not have a history, i.e. there are no revisions to these pages. Hence, the Memento extension will return an HTTP/1.1 406 Not Acceptable.
Deleted Contributions
To do date-time negotiations for the deleted revisions in MediaWiki, most installations require "Administrator" privileges. Even with administrative access, MediaWiki can only show the revisions in "Edit" mode.
To enable this feature, set the configuration variable $wgMementoConfigDeleted to true.
Timestamps
This extension finds and retrieves the revisions of an article using the timestamp at which each revision was created. Timestamps are not unique identifiers, so an article can have more than one revision with the same timestamp. The extension handles this situation by returning an HTTP/1.1 300 Multiple Choices response with the list of URIs that were created at the same time. For deleted revisions, MediaWiki URIs identify a revision by its timestamp rather than by its revision id. Hence, we could not come up with a way to resolve the situation in which more than one deleted revision has the same timestamp.
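The handling of tied timestamps can be sketched in two small steps: collect the revisions that share the newest matching timestamp, then list them in an Alternates header value in the RFC 2295 style. This is a rough illustration only. The function names, the row shape, and the base URI `http://wiki.example/` in the usage note are assumptions, and the quality value 0.9 simply mirrors the one used in the extension code.

```php
<?php
// Hypothetical helpers, not part of the extension.
// $rows mirrors a query result ordered newest first; each row is
// array( 'revid' => int, 'timestamp' => 'YmdHis' ).
function mementoTiedRevisions( array $rows ): array {
    $ids = array();
    foreach ( $rows as $row ) {
        // Keep every revision that shares the timestamp of the newest match.
        if ( $row['timestamp'] === $rows[0]['timestamp'] ) {
            $ids[] = $row['revid'];
        }
    }
    return $ids;
}

// Builds an Alternates header value listing one variant per revision id.
function mementoAlternates( array $revIds, string $baseUri, string $title ): string {
    $parts = array();
    foreach ( $revIds as $id ) {
        $uri = $baseUri . 'index.php?title=' . rawurlencode( $title ) . '&oldid=' . $id;
        $parts[] = '{"' . $uri . '" 0.9 {type text/html}}';
    }
    return implode( ', ', $parts );
}
```

If `mementoTiedRevisions()` returns more than one id, the response would be a 300 Multiple Choices carrying the value built by `mementoAlternates( $ids, 'http://wiki.example/', 'Main_Page' )`. With a single id, the usual redirect applies.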
Code
<?php
$mmScriptPath = $wgScriptPath . '/extensions/memento';
$wgExtensionFunctions[] = 'mmSetupExtension';
$wgExtensionCredits['specialpage'][] = array(
  'name' => 'Special:Memento',
  'description' => 'Retrieve archived versions of the article using HTTP datetime headers.',
  'url' => 'http://mementoweb.org/, http://lanlsource.lanl.gov',
  'author' => 'Harihar Shankar, Herbert Van de Sompel, Robert Sanderson',
  'version' => '0.1',
);

function mmSetupExtension() {
  global $wgHooks;
  $wgHooks['BeforePageDisplay'][] = 'mmVerifyDateTime';
  return true;
}

function mmVerifyDateTime() {
  global $wgTitle;
  global $wgMementoConfigDeleted;
  // Check the header only on the main title page (no query string in the URI).
  if ( !stripos( $_SERVER['REQUEST_URI'], '?' ) ) {
    foreach ( $_SERVER as $key => $value ) {
      // Check for the occurrence of the accept-datetime header.
      if ( strcasecmp( $key, 'HTTP_X_ACCEPT_DATETIME' ) == 0 ) {
        // Set the default timezone to UTC.
        date_default_timezone_set( 'UTC' );
        $serveruri = $_SERVER['HTTP_HOST'];
        // Get the title of the page from the request URI.
        $requri = explode( "/", $_SERVER['REQUEST_URI'] );
        $l = count( $requri );
        $title = $requri[$l-1];
        // Build the history URI in stages.
        $historyuri = "http://" . $serveruri;
        for ( $i = 0; $i < ( $l - 2 ); $i++ ) {
          $historyuri .= $requri[$i] . "/";
        }
        // Get the datetime from the HTTP header, convert it to Unix time and then
        // to the format the wiki database understands.
        // If the datetime input is not valid, the result is 1970, which is rejected below.
        $req_dt = $_SERVER["$key"];
        $dt = strtotime( $_SERVER["$key"] );
        $dt = date( 'YmdHis', $dt );
        $wgMementoReqDateTime = $dt;
        $current = date( 'YmdHis', time() );
        if ( $dt != 19700101000000 ) {
          // The date must be later than 1990 and in the past; otherwise a 406 is returned.
          if ( substr( $dt, 0, 4 ) > 1990 && $current > $dt ) {
            // Create a db object to retrieve the old revision id from the db.
            $dbr = wfGetDB( DB_SLAVE );
            $dbr->begin();
            // Retrieve the "page_id" using the title of the article.
            // This page_id is a unique id for every article.
            $tbl_pg = $dbr->tableName( 'page' );
            $tbl_ar = $dbr->tableName( 'archive' );
            // This section checks for namespaces in the title and picks up the corresponding id
            // to query the database. The default value is 0: the "page" table is then checked
            // assuming ":" is part of the title and not a namespace.
            $page_namespace_id = $wgTitle->getNamespace();
            $new_title = $title;
            if ( stripos( $title, ':' ) ) {
              $ns_title = explode( ":", $title );
              $namespace = $ns_title[0];
              $new_title = $ns_title[1];
            }
            // The title comes from the request URI, so it is quoted and escaped before use in SQL.
            $safe_title = $dbr->addQuotes( $new_title );
            $res_pg = $dbr->query( "SELECT DISTINCTROW page_id FROM $tbl_pg WHERE page_title = $safe_title AND page_namespace = $page_namespace_id" );
            if ( $res_pg ) {
              $row_pg = $dbr->fetchObject( $res_pg );
              // fetchObject() returns false when the title is not in the page table.
              $pg_id = $row_pg ? $row_pg->page_id : 0;
              if ( $pg_id > 0 ) {
                // Using the page_id and the timestamp from the header, query the revision table
                // for the matching revisions.
                $tbl_rev = $dbr->tableName( 'revision' );
                $res = $dbr->query( "SELECT DISTINCTROW rev_id, rev_timestamp FROM $tbl_rev WHERE rev_page = $pg_id AND rev_timestamp <= $dt ORDER BY rev_id DESC" );
                if ( $res ) {
                  $arr = array();
                  while ( $row = $dbr->fetchObject( $res ) ) {
                    $values['timestamp'] = $row->rev_timestamp;
                    $values['revid'] = $row->rev_id;
                    $arr[] = $values;
                  }
                  // Collect the revisions that share the timestamp of the newest match.
                  $multiple = false;
                  $mul_rev_id = array();
                  for ( $i = 1; $i < count( $arr ); $i++ ) {
                    if ( $arr[0]['timestamp'] == $arr[$i]['timestamp'] ) {
                      $multiple = true;
                      $mul_rev_id[$i] = $arr[$i]['revid'];
                    }
                  }
                  $sem = false;
                  if ( $multiple ) {
                    $mul_rev_id[0] = $arr[0]['revid'];
                    $alt_header = '';
                    $body = "<html><head><title>300 Multiple Choices</title></head>";
                    $body .= "<body><h3>Multiple Choices</h3>";
                    $body .= "<p>Memento found more than one resource for the specified time. Please choose a resource from the list below.<br/>";
                    foreach ( $mul_rev_id as $rev_id ) {
                      if ( !$sem ) {
                        $sem = true;
                      }
                      else {
                        $alt_header .= ",";
                      }
                      $uri = $historyuri . "index.php?title=" . $title . "&oldid=" . $rev_id;
                      $body .= "<a href='$uri'>$uri</a><br/>";
                      $alt_header .= "{'" . $uri . "' 0.9 {type text/html} {language en}}";
                    }
                    $body .= "</p></body></html>";
                    // Headers are sent before any output.
                    header( "Alternates: $alt_header" );
                    // Query the database for the first revision of this article for the x-archive-interval header.
                    $xares = $dbr->query( "SELECT DISTINCTROW rev_timestamp FROM $tbl_rev WHERE rev_page = $pg_id AND rev_timestamp <= $dt ORDER BY rev_timestamp ASC LIMIT 0,1" );
                    $xarow = $dbr->fetchObject( $xares );
                    $firstRevTS = $xarow->rev_timestamp;
                    header( "X-ARCHIVE-INTERVAL: {" . $firstRevTS . "} - {" . date( 'YmdHis', time() ) . "}" );
                    header( 'HTTP', TRUE, 300 );
                    echo $body;
                    exit();
                  }
                  else {
                    $oldid = isset( $arr[0] ) ? $arr[0]['revid'] : 0;
                    if ( $oldid > 0 ) {
                      $historyuri .= "index.php?title=" . $title . "&oldid=" . $oldid;
                      // Query the database for the first revision of this article for the x-archive-interval header.
                      $xares = $dbr->query( "SELECT DISTINCTROW rev_timestamp FROM $tbl_rev WHERE rev_page = $pg_id AND rev_timestamp <= $dt ORDER BY rev_timestamp ASC LIMIT 0,1" );
                      $xarow = $dbr->fetchObject( $xares );
                      $firstRevTS = $xarow->rev_timestamp;
                      header( "X-ARCHIVE-INTERVAL: {" . $firstRevTS . "} - {" . date( 'YmdHis', time() ) . "}" );
                      header( "Location: $historyuri" );
                      exit();
                    }
                    else {
                      header( "HTTP", TRUE, 406 );
                      echo "Error 406: Cannot find resources for the requested date $req_dt.";
                      exit();
                    }
                  }
                }
                else {
                  header( "HTTP", TRUE, 406 );
                  echo "Error 406: Cannot find resources for the requested date $req_dt.";
                  exit();
                }
              }
              // If the title was not found in the page table, the archive table is checked for deleted
              // versions of that article, provided the variable $wgMementoConfigDeleted is set to true
              // in the LocalSettings.php file.
              elseif ( $wgMementoConfigDeleted == true && $res_ar = $dbr->query( "SELECT ar_timestamp FROM $tbl_ar WHERE ar_title = $safe_title AND ar_namespace = $page_namespace_id ORDER BY ar_timestamp ASC LIMIT 0,1" ) ) {
                // Check whether a revision exists for the requested date.
                if ( $res_ar_ts = $dbr->query( "SELECT ar_timestamp FROM $tbl_ar WHERE ar_title = $safe_title AND ar_namespace = $page_namespace_id AND ar_timestamp <= $dt ORDER BY ar_timestamp DESC LIMIT 0,1" ) ) {
                  $row_ar_ts = $dbr->fetchObject( $res_ar_ts );
                  $ar_ts = $row_ar_ts ? $row_ar_ts->ar_timestamp : false;
                  if ( $ar_ts ) {
                    // Redirect to the "special page" for deleted articles.
                    $historyuri .= "index.php?title=Special:Undelete&target=" . $title . "&timestamp=" . $ar_ts;
                    // The first revision of this article is retrieved for the x-archive-interval header.
                    $row_ar = $dbr->fetchObject( $res_ar );
                    $firstRevTS = $row_ar->ar_timestamp;
                    header( "X-ARCHIVE-INTERVAL: {" . $firstRevTS . "} - {" . date( 'YmdHis', time() ) . "}" );
                    header( "Location: $historyuri" );
                    exit();
                  }
                  else {
                    header( "HTTP", TRUE, 406 );
                    echo "Error 406: Cannot find resources for the requested date $req_dt.";
                    exit();
                  }
                }
                else {
                  header( "HTTP", TRUE, 406 );
                  echo "Error 406: Cannot find resources for the requested date $req_dt.";
                  exit();
                }
              }
              else {
                header( "HTTP", TRUE, 404 );
                echo "Error 404: Resource does not exist for the title '$title'.";
                exit();
              }
            }
            else {
              header( "HTTP", TRUE, 404 );
              echo "Error 404: Either the resource does not exist or the namespace is not understood by memento for the title '$title'.";
              exit();
            }
          }
          else {
            header( "HTTP", TRUE, 406 );
            echo "Error 406: Requested date $req_dt out of range.";
            exit();
          }
        }
        else {
          header( "HTTP", TRUE, 400 );
          echo "Error 400: Requested date $req_dt not parseable.";
          exit();
        }
      }
    }
  }
  return true;
}
See also
- [1] RFC 2295. http://www.ietf.org/rfc/rfc2295.txt
- [2] Van de Sompel et al.: Memento: Time Travel for the Web. http://arxiv.org/pdf/0911.1112v2 (preprint)