Extension:WaybackMachine

Purpose
The goal of this extension is to let users browse the past versions of a given website. It could be really useful for wikis that cover digital culture and the history of the Internet. What did Google.com look like in 1998? Please note that this release is really experimental. I'm not putting it here because I want you to use it as it is; I would like to get the help of an experienced programmer to complete the job, because I feel it's beyond my abilities. The extension does work, but here's what's left to do:


 * Clean up the code (it is very dirty; this is my second PHP program ever, the first one was Extension:WikiToWordPress).
 * Create a special page that lets users do maintenance on the extension (create MySQL tables, delete MySQL tables, refresh all data from archive.org for all wiki pages).
 * Create a setup process so that the user does not have to create the MySQL tables manually.
 * Make a cleaner database structure. I'm sure it can be more efficient. For example, instead of just recording the date and URL in the database, we could generate the months, the * data, etc. ahead of time, so that everything is ready when we want to render it.
 * There is currently no mechanism to delete database entries for websites that have been removed from wiki pages and are no longer needed.
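
The idea in the fourth item, preparing month data before render time, could look roughly like the sketch below. It assumes each stored arch value is a 14-digit archive.org timestamp (YYYYMMDDhhmmss), which matches the VARCHAR(14) column used in the code on this page. The function name groupArchivesByMonth is made up for illustration and is not part of the extension.

```php
<?php
// Hypothetical helper: group 14-digit archive timestamps (YYYYMMDDhhmmss)
// by year and month so a renderer does not have to re-parse them.
function groupArchivesByMonth(array $timestamps)
{
    $grouped = array();
    foreach ($timestamps as $ts) {
        // Skip anything that is not exactly 14 digits (e.g. placeholder rows).
        if (!preg_match('/^\d{14}$/', $ts)) {
            continue;
        }
        $year  = substr($ts, 0, 4);
        $month = substr($ts, 4, 2);
        $grouped[$year][$month][] = $ts;
    }
    return $grouped;
}

$example = groupArchivesByMonth(array('19981111184551', '19981202230410', 'NULL'));
print_r($example);
```

Whether the grouped result should be stored in its own table or cached some other way is left open here.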

If you are interested in helping, I will be happy to answer any of your questions and help if I can; just contact [mailto:jeanfrancois.gariepy@gmail.com me].

Usage
Just include <WayBackMachine>http://www.yoururl.com</WayBackMachine> in your wikitext and the extension will display a table listing all the archived versions of that website available at http://www.archive.org. The user can then browse the archived versions of the website.

Every time an article containing <WayBackMachine>http://www.yoururl.com</WayBackMachine> is saved (new or edited), the server downloads the information from archive.org and imports the data into the MySQL database used by the MediaWiki engine. It then creates the tables for the user to view based on this stored data.
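
As a rough illustration of the first step on save, the sketch below pulls the URL out of the text between the opening and closing WayBackMachine tags. It is a simplified stand-in for the strpos/substr logic in the Code section, not the extension's own function; the name extractWayBackUrl is hypothetical, and only the first tag pair is handled.

```php
<?php
// Hypothetical helper: return the URL between <WayBackMachine> and
// </WayBackMachine>, or null when the tag pair is missing.
function extractWayBackUrl($text)
{
    $open  = '<WayBackMachine>';
    $close = '</WayBackMachine>';

    $start = strpos($text, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);

    // Look for the closing tag only after the opening one.
    $end = strpos($text, $close, $start);
    if ($end === false) {
        return null;
    }
    return trim(substr($text, $start, $end - $start));
}

var_dump(extractWayBackUrl('See <WayBackMachine>http://www.yoururl.com</WayBackMachine> here.'));
```

Comparing against false with === matters because strpos returns 0 when the tag sits at the very start of the text.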

Installation
1. Include this line in your php.ini: allow_url_fopen = on. WARNING: Some people consider this a security risk. The option is not a risk in itself, but be sure you know what you're doing, because turning it on alongside weak PHP scripts can be dangerous. If you're only running your MediaWiki engine on your server, I don't see a problem.

2. Create a table called waybackmachine_archives in your MediaWiki database. Create just one field in this table, named placeholder: SMALLINT, NOT NULL, auto-increment. Make it the primary key. Insert 3176 rows so that this field holds the values 1, 2, 3, ... up to 3176.

3. Create a table called waybackmachine_ext in your MediaWiki database. All of the following fields are NOT NULL. Create one field named count: SMALLINT, UNSIGNED, auto-increment. Make it the primary key. Create a field named url: TEXT, utf8_general_ci. Then create one field per year, named 1996, 1997, 1998 and so on up to 2007. These should be SMALLINT, NOT NULL, default 0.

4. Copy the code below to a file and call it WayBackMachineExtension.php. Place this file in /extensions/.

5. Include the extension in MediaWiki by adding this line to your LocalSettings.php:

6. The Wayback Machine is now installed. Use the tags <WayBackMachine>http://www.anyurl.com</WayBackMachine> in any wiki page.
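
Step 5 does not show the line itself. Assuming the file name and location from step 4 and the usual MediaWiki convention for single-file extensions, it would probably look like the following; treat the exact path as an assumption and adjust it to your installation.

```php
<?php
// In LocalSettings.php. Assumption: the file from step 4 sits in the
// extensions/ directory under the wiki root ($IP).
require_once("$IP/extensions/WayBackMachineExtension.php");
```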

As you can see, one of the things left to do is to make setup easier for the user. If you know how to set up MySQL tables and how to integrate these capabilities into a special page in MediaWiki, you're welcome to help. If you want to clean up the code and make it more efficient, I have no problem with that; you can publish your corrections right here.
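
As a starting point for such a setup routine, the sketch below builds the SQL for the two tables described in steps 2 and 3. It only returns strings; how they get executed (a special page, an installer script) is left open. The function name buildWayBackSetupSql and its parameters are hypothetical, and the column definitions are one reading of the steps above, not a tested schema.

```php
<?php
// Hypothetical setup helper: build the SQL for the two tables described
// in installation steps 2 and 3. It only returns strings; running them
// against the wiki database is left to the caller.
function buildWayBackSetupSql($firstYear = 1996, $lastYear = 2007, $rows = 3176)
{
    $statements = array();

    // Step 2: one auto-increment placeholder column.
    $statements[] = 'CREATE TABLE `waybackmachine_archives` ('
        . '`placeholder` SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY)';

    // Step 2, continued: rows 1..$rows in a single multi-row INSERT.
    $values = array();
    for ($n = 1; $n <= $rows; $n++) {
        $values[] = '(' . $n . ')';
    }
    $statements[] = 'INSERT INTO `waybackmachine_archives` (`placeholder`) VALUES '
        . implode(', ', $values);

    // Step 3: id, url, and one counter column per year.
    $columns = array(
        '`count` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY',
        '`url` TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL',
    );
    for ($year = $firstYear; $year <= $lastYear; $year++) {
        $columns[] = '`' . $year . '` SMALLINT NOT NULL DEFAULT 0';
    }
    $statements[] = 'CREATE TABLE `waybackmachine_ext` (' . implode(', ', $columns) . ')';

    return $statements;
}

// Print the beginning of each statement for inspection.
foreach (buildWayBackSetupSql() as $sql) {
    echo substr($sql, 0, 120), "\n";
}
```

Passing a later $lastYear would add more year columns, though the rest of the extension code is written for 1996 to 2007 only.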

Code
';

return $wgOut; }

function fnUpdateWayBackMachineDatabase(&$article, &$user, &$text, &$summary, $minor, $watch, $sectionanchor, &$flags) {
//This function updates the database values by downloading what's available from archive.org
$SearchString='<WayBackMachine>';									//When an article is saved, the program will look for this string in the content
$SearchString2='</WayBackMachine>';									//It will look for this one as well

$BeginningPosition=strpos($text,$SearchString);						//Position of the beginning of <WayBackMachine> in the mediawiki text, false if not present
$EndingPosition=strpos($text,$SearchString2);						//Position of the beginning of </WayBackMachine>, false if not present

if ($BeginningPosition !== false) {									//If false, then there is nothing to do (no tags present in the text)
if ($EndingPosition !== false) {									//If false, then there is nothing to do (no tags present in the text)

$loopcondition = 1;													//These variables will be used later on
$sposition = 1;
$eposition = 0;
$i = 0;

$WBSeparator1='';				//This is a good separator to use, it precedes pages data in archive.org htmls
$WBSeparator2='pages ';										//This one follows pages data in archive.org (page data means number of pages / year, which can vary from 1 to more than 300)
$WBSeparator3='';												//This follows the important part of the URL for each available archive in archive.org htmls

$BeginningPosition += 16;											//<WayBackMachine> is 16 characters long so add 16 to the string position
$Length = $EndingPosition-$BeginningPosition;						//Calculate the number of characters between <WayBackMachine> and </WayBackMachine>

$url = substr($text, $BeginningPosition, $Length);					//Store the string that is located between <WayBackMachine> and </WayBackMachine> in $url

$isUrlAlreadyRetrived = 0;											//This variable changes the behaviour of the program: 1 if we already have information about the url in the database, 0 if we need to create everything

$result = mysql_query("SELECT url FROM waybackmachine_ext");		//Scan all urls stored in the database and see if we already have this url. If yes, set $isUrlAlreadyRetrived to 1
while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
	if ($row['url'] == $url) {
		$isUrlAlreadyRetrived = 1;
	}
}

$content = file_get_contents('http://web.archive.org/web/*/'.$url);	//Download the content for the website we're looking for from archive.org

while ($loopcondition != 0) {										//This loop stores the values for the number of pages in 1996, 1997, etc... stored at archive.org
	$eposition += 1;
	$loopcondition = strpos($content, $WBSeparator1, $sposition);
	$sposition = $loopcondition + 37;
	$eposition = strpos($content, $WBSeparator2, $eposition);
	$Length = $eposition - $sposition - 1;
	$yearsarray[$i] = substr($content, $sposition, $Length);
	$i += 1;
}

if ($isUrlAlreadyRetrived == 0) {									//If the url is completely new, create a new entry in the database

$query = 'INSERT INTO `waybackmachine_ext` (`count`, `url`, `1996`, `1997`, `1998`, `1999`, `2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`) VALUES (NULL, \''.$url.'\', \''.$yearsarray[0].'\', \''.$yearsarray[1].'\', \''.$yearsarray[2].'\', \''.$yearsarray[3].'\', \''.$yearsarray[4].'\', \''.$yearsarray[5].'\', \''.$yearsarray[6].'\', \''.$yearsarray[7].'\', \''.$yearsarray[8].'\', \''.$yearsarray[9].'\', \''.$yearsarray[10].'\', \''.$yearsarray[11].'\');';
$results = mysql_query($query);

$dbr =& wfGetDB(DB_SLAVE);											//Read from the database to find what is the index number assigned to the given url (good for newly added and old entries)
$res = $dbr->select( 'waybackmachine_ext', array('count'), array( 'url' => $url ));
$row = $dbr->fetchObject( $res );
$websiteid = $row->count;
$dbr->freeResult( $res );

} else {															//If the url is already known (for example, from another page), just update the values from the downloaded information

$dbr =& wfGetDB(DB_SLAVE);											//Read from the database to find what is the index number assigned to the given url (good for newly added and old entries)
$res = $dbr->select( 'waybackmachine_ext', array('count'), array( 'url' => $url ));
$row = $dbr->fetchObject( $res );
$websiteid = $row->count;
$dbr->freeResult( $res );

$query = 'UPDATE `waybackmachine_ext` SET `1996` = \''.$yearsarray[0].'\', `1997` = \''.$yearsarray[1].'\', `1998` = \''.$yearsarray[2].'\', `1999` = \''.$yearsarray[3].'\', `2000` = \''.$yearsarray[4].'\', `2001` = \''.$yearsarray[5].'\', `2002` = \''.$yearsarray[6].'\', `2003` = \''.$yearsarray[7].'\', `2004` = \''.$yearsarray[8].'\', `2005` = \''.$yearsarray[9].'\', `2006` = \''.$yearsarray[10].'\', `2007` = \''.$yearsarray[11].'\' WHERE `waybackmachine_ext`.`count` = '.$websiteid.' LIMIT 1;';
$results = mysql_query($query);

}

$sposition = 5800;													//This is the position where the scan is going to start in the html file from archive.org. 5800 is good because it avoids a bad entry that comes at 5200 which is a link we're not interested in. It's also faster like this.
$loopcondition = 1;
$i = 1;

if ($isUrlAlreadyRetrived == 0) {									//If URL is new, creates 2 new columns, arch[n] and star[n], in waybackmachine_archives

$query = 'ALTER TABLE `waybackmachine_archives` ADD `arch'.$websiteid.'` VARCHAR( 14 ) NULL;';
mysql_query($query);
$query = 'ALTER TABLE `waybackmachine_archives` ADD `star'.$websiteid.'` BINARY NULL;';
mysql_query($query);

}

while ($loopcondition != 0) {										//This finds the URL data to put in arch[n] and the star data (whether or not there is a * on archive.org's display)
	$loopcondition = strpos($content, $WBSeparator3, $sposition);
	$sposition = $loopcondition + 36;
	$dates = substr($content, $sposition, 14);
	$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = \''.$dates.'\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
	mysql_query($query);
	$starposition = strpos($content, $WBSeparator4, $sposition) + 5;
	if (substr($content, $starposition, 1) == '*') {
		$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'1\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
		mysql_query($query);
	} else {
		$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'0\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
		mysql_query($query);
	}
	$i += 1;
}

while ($i < 3174) {													//Fills the rest of the database field with null values (important if there was a previous version but changes happened). Using < instead of != so the loop also ends if $i is already past 3174
	$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = \'NULL\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
	mysql_query($query);
	$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'NULL\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
	mysql_query($query);
	$i += 1;
}

//For some reason I had to delete 1 entry that was systematically incorrect in the database. I can't find the reason for this bad entry in my code so I'm assigning NULL to it manually here
$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'NULL\' WHERE `waybackmachine_archives`.`arch'.$websiteid.'` = \'p-equiv="conte\' LIMIT 1;';
mysql_query($query);
$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = \'NULL\' WHERE `waybackmachine_archives`.`arch'.$websiteid.'` = \'p-equiv="conte\' LIMIT 1;';
mysql_query($query);

} }

return true;													//Hook functions need to be terminated by Return true for mediawiki to continue working.

}

?>
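
One possible cleanup for the long INSERT and UPDATE strings in the function above: they repeat one fragment per year, so a small helper could build that part in a loop. The sketch below is a suggestion under that assumption, not part of the extension; the name buildYearAssignments is hypothetical. It also casts each count to an integer, so text scraped from archive.org cannot end up inside the SQL as anything but a number.

```php
<?php
// Hypothetical helper: build the "`1996` = 12, `1997` = 0, ..." part of the
// UPDATE statement from the per-year page counts.
function buildYearAssignments(array $yearsarray, $firstYear = 1996, $lastYear = 2007)
{
    $parts = array();
    for ($year = $firstYear; $year <= $lastYear; $year++) {
        $index = $year - $firstYear;
        // intval() forces a number, so scraped text cannot inject SQL;
        // years with no data fall back to 0.
        $count = isset($yearsarray[$index]) ? intval($yearsarray[$index]) : 0;
        $parts[] = '`' . $year . '` = ' . $count;
    }
    return implode(', ', $parts);
}

echo buildYearAssignments(array('3', '15'), 1996, 1998), "\n";
```

The url value would still need its own escaping, for example through the MediaWiki database wrapper that the function already uses for its SELECT.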

Related extensions

 * WikiToWordPress