Extension:OAIRepository

OAIRepository

Release status: experimental

Implementation: Data extraction
Description: Provides an OAI-PMH repository interface
Author(s): Brion Vibber
Latest version: continuous updates
MediaWiki: 1.4+
Database changes: Yes
License: GPL v2+
Download
README
Parameters
  • $oaiAgentRegex
  • $oaiAuth
  • $oaiAudit
  • $oaiAuditDatabase
  • $oaiChunkSize
  • $oaiDeleteIds
Hooks used
  • ArticleSaveComplete
  • ArticleDelete
  • ArticleDeleteComplete
  • TitleMoveComplete
  • ParserTestTables
  • ArticleUndelete
  • LoadExtensionSchemaUpdates


About

Note: the installation documentation below is grossly out of date and will fail as written. The SQL does not work as-is and needs to be adjusted locally for your database names and prefixes, and the separate-database method for OAI will most likely fail unless you really know what you're doing. It can be made to work, but the installation docs badly need to be rewritten.

  • This worked for me: download the OAIRepository files to your extensions folder, apply update_table.sql to your MediaWiki database (mysql -u wikiuser -pPassword wikidb < update_table.sql), and add the include line to your LocalSettings.php.
(from the README)

This is an extension to MediaWiki to provide an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) repository interface by which page updates can be snarfed in a relatively sane fashion to a mirror site.

OAI-PMH protocol specs: http://www.openarchives.org/OAI/openarchivesprotocol.html

A harvester script forms the client half. Apply oaiharvest_table.sql to clients to allow saving a checkpoint record; this ensures consistent update ordering.


At the moment this script is quite experimental; it may not implement the whole spec yet, and hooks for actually updating may not be complete.

The extension adds an 'updates' table which associates last-edit timestamps with cur_id values. A separate table is used so it can also hold entries for cur rows which have been deleted, allowing this to be explicitly mentioned to a harvester even if it comes back after quite a while.

Clients will get only the latest current update; this does not include complete old page entries by design, as basic mirrors generally don't need to maintain that extra stuff.


As of May 19, 2008, the updater will attempt to update the links tables on edits, and can fetch uploaded image files automatically.

(Uploads must be enabled locally with $wgEnableUploads = true; or no files will be fetched. The image table records will be updated either way.)

Metadata formats

The extension supports the following metadata formats/schema:

Dublin Core

name            value
dc:title        Prefixed title
dc:language     $wgContLanguageCode
dc:type         'Text'
dc:format       $wgMimeType
dc:identifier   Output of the page's getFullURL()
dc:contributor  Username of the revision
dc:date         Datestamp of the revision
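
As a rough illustration, these fields are what an ordinary OAI-PMH ListRecords request with the standard Dublin Core prefix oai_dc returns. The sketch below assumes the repository is exposed through Special:OAIRepository and uses a placeholder wiki URL and start date; adjust both to your installation:

# Hypothetical base URL and date; lists Dublin Core records for pages changed since the given time.
curl "http://localhost/wiki/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=oai_dc&from=2008-05-19T00:00:00Z"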

Install on Wiki

Warning: Most of these instructions appear to be obsolete. Installation is now much simpler. See Extension_talk:OAIRepository#classnotfound exception.

Settings

From the talk page: this comes from CommonSettings.php (the equivalent of LocalSettings.php in most MediaWiki installations) on the actual Wikimedia servers.

Add to LocalSettings.php:

# OAI repository for update server
@include( $IP.'/extensions/OAI/OAIRepo.php' );
$oaiAgentRegex = '/experimental/';
$oaiAuth = true; # broken... squid? php config? wtf
$oaiAudit = true;
$oaiAuditDatabase = 'oai';
$wgDebugLogGroups['oai'] = '/home/wikipedia/logs/oai.log';
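
Once the extension is included, you can check that the repository endpoint responds. This is only a sketch: it assumes the repository is served through Special:OAIRepository at a standard index.php URL, so adjust it to your wiki's layout:

# Hypothetical base URL; the Identify verb should return basic repository metadata.
curl "http://localhost/wiki/index.php?title=Special:OAIRepository&verb=Identify"

If $oaiAuth is enabled, requests will additionally have to present the credentials stored in the oaiuser table (set up below).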

MySQL part

I did this from the command line, so bear with me and/or adapt it to the graphical tools. It's assumed here that you know the MySQL root password.

  • Replace /*$wgDBprefix*/ in update_table.sql with the actual value of the prefix (which is set in LocalSettings.php).
  • update_table.sql is applied to the wiki database (replace wikidb with your wiki database name if necessary). NOTE: this will take a significant amount of time on rather large wikis.
mysql wikidb -uroot -p < update_table.sql
  • oaiuser_table.sql, oaiharvest_table.sql and oaiaudit_table.sql are applied to an OAI database to which the wiki DB user must have access.
    If you want everything in the same database, follow option 1; otherwise follow option 2.
    1. EITHER Change the following in LocalSettings.php:
      $oaiAuditDatabase = 'wikidb'; //your wiki database name which is probably in $wgDBname
      
    2. OR Create a separate database for the oai info.
      • Log in to MySQL:
         mysql -uroot -p
        
      • Once inside, create the oai database and give your "wiki" user (the login used in your LocalSettings.php for mySQL connections) all rights on it
        CREATE DATABASE oai;
        
        GRANT ALL PRIVILEGES ON oai.* TO 'wikiuser'@'localhost';
        
        FLUSH PRIVILEGES;
        
        exit
        
  • Go into the remaining .sql files and replace each instance of /*$wgDBprefix*/ with your table prefix, i.e. the value of $wgDBprefix in your LocalSettings.php (a sed one-liner for this is sketched after this list).
  • Create the tables by feeding the commands to mysql (where "oai" is the database you are putting the data into and "root" is your MySQL user):
mysql oai -uroot -p < oaiaudit_table.sql
mysql oai -uroot -p < oaiharvest_table.sql
mysql oai -uroot -p < oaiuser_table.sql
  • To be able to log in to the OAIRepository, you'll have to add a login to the oaiuser table. It doesn't need to match $wgDBuser and $wgDBpassword, but you will need these credentials in the next section, where they are added to lsearch.conf (again, remember to replace /*$wgDBprefix*/ with the table prefix for your wiki).
echo  "INSERT INTO /*$wgDBprefix*/oaiuser(ou_name, ou_password_hash) VALUES ('thename', md5('thepassword') );" | mysql oai -uroot -p
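
Rather than editing each .sql file by hand, the /*$wgDBprefix*/ placeholder can be filled in with a one-liner (a sketch using GNU sed; "wiki_" stands for whatever prefix your LocalSettings.php sets in $wgDBprefix, possibly the empty string):

# Hypothetical prefix "wiki_"; rewrites the placeholder in all four schema files in place.
sed -i 's|/\*\$wgDBprefix\*/|wiki_|g' update_table.sql oaiuser_table.sql oaiharvest_table.sql oaiaudit_table.sql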

Install on Lucene-search server

Note: these instructions are for lucene-search server version 2.0. In version 2.1, the appropriate configuration is generated by ./configure and should be ready to use; once you have included OAIRepo.php in your LocalSettings.php, put ./update in a crontab to incrementally update the index (a sketch follows).
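A minimal crontab sketch for the 2.1 case (the installation path, log file and schedule are placeholders):

    # Hypothetical install path; run the incremental index update every 15 minutes.
    */15 * * * * cd /usr/local/lucene-search-2.1 && ./update >> /var/log/lsearch-update.log 2>&1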
  1. Create a new MySQL database, e.g. lsearch, and make sure it is a UTF-8 database. It is needed to store the article ranking data, which is normally recalculated by the importer at each import. This can be done by issuing the MySQL command:
    CREATE DATABASE lsearch DEFAULT CHARACTER SET utf8;
    
  2. Set up the Storage section in the local configuration (lsearch.conf). These should be the username/password and administrative username/password used for accessing the databases.
    Warning: this applies only to sites which use master/slave replication. If you use the load balancing provided by the Storage.slaves options, you will need to make sure that the lsearch database created above is also replicated as part of your master/slave setup. This can be done by adding a line to my.cnf on your master and slave servers: the slave needs a new line saying replicate-do-db=lsearch, and the master a new line saying binlog-do-db=lsearch. Do not just add the database name to an existing [whatever]-do-db line; each database should have its own line.
  3. Supply the username/password in the "Log, ganglia, localization" section as the OAI.username and OAI.password settings. This should be the username/password you created above in the step where you inserted a row into the oaiuser table (a sketch of the lsearch.conf lines appears after this list).
  4. Rebuild the article rank data. You can put this on a cron job once a week or once a month (article ranks typically change very slowly):
    php maintenance/dumpBackup.php --current --quiet > wikidb.xml &&
    
    java -Xmx2048m -cp LuceneSearch.jar org.wikimedia.lsearch.ranks.RankBuilder wikidb.xml wikidb
    
    The "-Xmx2048m" is optional and should only be used if you have 2gigs of RAM to devote to the loading. If you don't include this setting at all, you will likely run out of heap-space during the update. If you don't have as much RAM to devote, just put in a smaller number of megs instead of 2048.
  5. Create the initial version of the index - you can do this using the importer described on the Lucene-search server page.
  6. Set up the OAI repository for the incremental updater: in the global config (lsearch-global.conf), set up a mapping of dbname : host, and in the local settings supply the username/password in OAI.username/OAI.password if any.
  7. Set up the [OAI] section in the global config (lsearch-global.conf) like this:
    [OAI]
    wikidb : http://localhost/wiki/index.php
    <default> : http://localhost/wiki/index.php
    <dbSuffix> : <base url (to index.php)>
  8. Start incremental updater with:
    java -Xmx1024m -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -dt start_time wikidb
    
    The parameters are:
    • -n - wait for notification from the indexer that articles have been successfully added
    • -d - daemonize, i.e. run updates in an infinite loop
    • -s 600 - after one round of updates, sleep for 600 seconds (10 minutes)
    • -dt timestamp - default timestamp (e.g. 2007-06-17T15:00:00Z). This is the timestamp of your initial index build. You need to pass this parameter the first time you start the incremental updater, so it knows from what time to start the updates. Afterwards the incremental updater keeps the timestamp of the last successful update in indexes/status/wikidb.
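
Referring back to step 3, a sketch of the lsearch.conf additions; OAI.username and OAI.password are the property names referenced in this guide, and 'thename'/'thepassword' stand for the credentials inserted into the oaiuser table earlier:

    # Hypothetical values; append the OAI credentials to the local configuration.
    echo 'OAI.username=thename' >> lsearch.conf
    echo 'OAI.password=thepassword' >> lsearch.conf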

An alternative to steps (2), (3) and (4) is not to use ranking at all. You can do this by passing the --no-ranks parameter to the incremental updater, and it won't try to fetch ranks from the MySQL database (see the sketch below). If your wiki is small, with only some hundreds of pages, you probably don't need any ranking; but if you have or plan to have hundreds of thousands of pages, you will definitely benefit from the ranking data.
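
For example (a sketch, combining --no-ranks with the flags shown in step 8):

    # Same incremental-updater invocation as above, but without fetching rank data from MySQL.
    java -Xmx1024m -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -dt start_time --no-ranks wikidb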


The above only sets up incremental updates to the index. To instruct the indexer to periodically make a snapshot of the index (which gets picked up by the searchers), put this into your cron job:

   curl http://indexerhost:8321/makeSnapshots

The indexer has an HTTP command interface; other commands include getStatus, flushAll, etc.
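
For example, as an hourly crontab entry (the schedule is a placeholder; indexerhost:8321 is the indexer's HTTP interface used above):

   # Hypothetical schedule; ask the indexer to publish a fresh index snapshot every hour.
   0 * * * * curl -s http://indexerhost:8321/makeSnapshots > /dev/null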

NOTES

  • Warning: The current version of OAI won't work with MediaWiki 1.12 or lower, since the addition of wfGetLB() (the LBFactory abstract class) in rev:32578. To use it with 1.12, download this version of the files. Make sure you use the ExtensionDistributor by going to the "download snapshot" link in the infobox to help you get the right version.
  • Uploads must be enabled locally with $wgEnableUploads = true; or no files will be fetched. The image table records will be updated either way.

See also