Extension:SphinxSearch/SQLite configuration

From MediaWiki.org
Jump to: navigation, search

First off, many thanks for the extension. For me it is a real problem-solver.

With a few changes to the sphinx config file and a couple of helper scripts (php), SphinxSearch works with SQLite based MediaWikis, too. Thought this might be of interest, seeing as vanilla search does not work with SQLite in the current stable MediaWiki (1.15.1).

My setup:

MediaWiki 1.15.1
PHP 5.1.6
SphinxSearch 0.6.1
Sphinx 0.9.8.1

First, the helper scripts. These are php scripts that are run by the sphinx indexer and do the following:

  • Connect to the MediaWiki SQLite database file using php_pdo (which must be available to run MW on SQLite anyhow)
  • Run the indexer queries
  • Translate the results into XML for sphinx to process as an xmlpipe2 type source

These can go anywhere you want - I put mine in the ./data/ directory alongside the wikidb.sqlite file. Probably they should go in ./maintenance where all the other command-line PHP scripts are.

This is the main update script:

/path/to/wiki/data/sphinx_sqlite_main.php:

<?php
   // This is the path to your SQLite MediaWiki database file
   $wikidb='/path/to/wiki/data/wikidb.sqlite';

   // Bail if $wikidb is not a file
   if (!is_file($wikidb)) {
       exit;
   }

   // Bail if PDO constuctor fails
   if (!$db = new PDO("sqlite:".$wikidb)) {
       exit;
   }

   // Build the query
   $qry  = "SELECT page_id,page_title,page_namespace,old_id,old_text ";
   $qry .= "FROM page,revision,text WHERE rev_id=page_latest AND old_id=rev_text_id";

   // Run the query
   $res=$db->query($qry);

   // Parse the results into XML
   $xmlout  = '<?xml version="1.0" encoding="utf-8"?>'."\n"; // Should be true as SQLite downconverts UTF-16
   $xmlout .= '<sphinx:docset>'."\n";

   while($arr=$res->fetch(PDO::FETCH_ASSOC)) {
       $xmlout .= '<sphinx:document id="'.$arr['page_id'].'">'."\n";
       foreach ($arr as $strField => $strVal) {
           $xmlout .= '<'.$strField.'>';
           $xmlout .= $strVal;
           $xmlout .= '</'.$strField.'>'."\n";
       }
       $xmlout .= '</sphinx:document>'."\n";
   }

   $xmlout .= '</sphinx:docset>'."\n";

   // Deconstruct
   $db = null;

   // Return the XML
   if ($xmlout != '') {
       print $xmlout;
   }
?>

This is the incremental update script. The only difference is the extra AND clause in the query. Probably these should be combined, but I didn't want to bother with command line options. I'm lazy.

/path/to/wiki/data/sphinx_sqlite_incremental.php:

<?php
   // This is the path to your SQLite MediaWiki database file
   $wikidb='/path/to/wiki/data/wikidb.sqlite';

   // Bail if $wikidb is not a file
   if (!is_file($wikidb)) {
       exit;
   }

   // Bail if PDO constuctor fails
   if (!$db = new PDO("sqlite:".$wikidb)) {
       exit;
   }

   // Build the query
   $qry  = "SELECT page_id,page_title,page_namespace,old_id,old_text ";
   $qry .= "FROM page,revision,text WHERE rev_id=page_latest AND old_id=rev_text_id ";
   $qry .= "AND page_touched >= ".strftime("%Y%m%d")."070000 "; // adjust to the time of the main indexing job in UTC

   // Run the query
   $res=$db->query($qry);

   // Parse the results into XML
   $xmlout  = '<?xml version="1.0" encoding="utf-8"?>'."\n"; // Should be true as SQLite downconverts UTF-16
   $xmlout .= '<sphinx:docset>'."\n";

   while($arr=$res->fetch(PDO::FETCH_ASSOC)) {
       $xmlout .= '<sphinx:document id="'.$arr['page_id'].'">'."\n";
       foreach ($arr as $strField => $strVal) {
           $xmlout .= '<'.$strField.'>';
           $xmlout .= $strVal;
           $xmlout .= '</'.$strField.'>'."\n";
       }
       $xmlout .= '</sphinx:document>'."\n";
   }

   $xmlout .= '</sphinx:docset>'."\n";

   // Deconstruct
   $db = null;

   // Return the XML
   if ($xmlout != '') {
       print $xmlout;
   }
?>

You can test these with php -e script.php and they should spit out XML.

We will use these to feed the xmlpipe2 sources in sphinx.conf. I chose to define the fields and attributes in the config file, though you can also do this in the XML itself. Here is what your source containers look like. These replace the MySQL ones of the same names.

sphinx.conf:

source src_wiki_main
{
    type            = xmlpipe2
    xmlpipe_command = php -e /path/to/wiki/data/sphinx_sqlite_main.php

    xmlpipe_field   = page_id
    xmlpipe_field   = page_title
    xmlpipe_field   = old_text

    xmlpipe_attr_uint   = page_namespace
    xmlpipe_attr_uint   = old_id
}

# data source definition for the incremental index
source src_wiki_incremental : src_wiki_main
{
    xmlpipe_command = php -e /path/to/wiki/data/sphinx_sqlite_incremental.php
    # all other parameters are copied from the parent source,
}

You will likely need to specify the full path to the PHP command line executable in the xmlpipe_command directives if this isn't in cron's exection path. (i.e., xmlpipe_command = /usr/bin/php -e [...]).

That's it. All else is as it appears on the Extension Page.

NB: This is functional on a dev box with a ten page wiki. It has not been production tested. That is your job <wink>.

-Jef (jef at lfaccess dot net - checked infrequently)

UPDATES -- Eshe 1/05/2011:

The following were modifications that I made in order to get the SQLite installation working correctly:

  • In sphinx_sqlite_incremental.php and sphinx_sqlite_main.php, you will need to update 2 areas of the code, see below with comments marked enpicket

....

  // fix to resolve Warning: strftime() -- enpicket 1/04/2011
  date_default_timezone_set('America/Phoenix'); //change to your timezone

...

// Parse the results into XML
  $xmlout  = '<?xml version="1.0" encoding="utf-8"?>'."\n"; // Should be true as SQLite downconverts UTF-16

  // Fix for error: XML Parsing Error: prefix not bound to a namespace -- enpicket 1/05/2011
  // Define name spaces
  $xmlout .= '<sphinx:docset xmlns:sphinx="http:<your site>/sphinx">'."\n"; //the url does not need to be valid, the parser doesn't check

  while($arr=$res->fetch(PDO::FETCH_ASSOC)) {
      $xmlout .= '<sphinx:document id="'.$arr['page_id'].'">'."\n";
      foreach ($arr as $strField => $strVal) {
          $xmlout .= '<'.$strField.'>';
          // added CDATA  because html data in the pages breaks the xml parser. -- enpicket 1/05/2011
          $xmlout .= "<![CDATA[\n" . $strVal . "\n]]>";

          $xmlout .= '</'.$strField.'>'."\n";
      }
      $xmlout .= '</sphinx:document>'."\n";
  }

  $xmlout .= '</sphinx:docset>'."\n";

Hope this helps anyone else who is using a SQLite MediaWiki Installation!

- Eshe (eshe dot n dot pickett at intel dot com)