Extension:SphinxSearch/SQLite configuration

From mediawiki.org

With a few changes to the sphinx config file and a helper script (php), SphinxSearch works with SQLite based MediaWikis too. SphinxSearch works with any XML-based source.

First, the helper script. This php script is run by the sphinx indexer and does the following:

  • Connects to the MediaWiki SQLite database file using php_pdo (which must be available to run MW on SQLite anyhow)
  • Runs the indexer queries
  • Translate the results into XML for sphinx to process as an xmlpipe2 type source

These can go anywhere you want - /dev or C:\Program Files. Probably they should go in ./maintenance where all the other command-line PHP scripts are.

/usr/share/mediawiki/extensions/SphinxSearch/sqlite.php:

 <?php
  $wikidb = '/var/data/wikidb.sqlite';
  $xmlns_sphinx = 'http://sphinxsearch.com/sphinx';
  $cdata_esc = ['old_text'];
  $html_esc = ['page_title'];
  $incremental = false;
  date_default_timezone_set("America/New_York");
  $opts = getopt(null, ['incremental', 'wikidb:']);
  if (array_key_exists('incremental', $opts)) { $incremental = true; }
  if (array_key_exists('wikidb', $opts)) { $wikidb = $opts['wikidb']; }

  if (!is_file($wikidb)) { exit; }
  if (!$db = new PDO("sqlite:".$wikidb)) { exit; }

  $qry  = "SELECT page_id,page_title,page_namespace,old_id,old_text ";
  $qry .= "FROM page,revision,text WHERE rev_id=page_latest AND old_id=rev_text_id ";
  if($incremental) {
      $qry .= "AND page_touched >= ".strftime( "%Y%m%d" )."040000 ";
  }
  $res=$db->query($qry);

  print '<?xml version="1.0" encoding="utf-8"?>'."\n";
  print '<sphinx:docset xmlns:sphinx="'.$xmlns_sphinx.'">'."\n";
  while ($arr=$res->fetch(PDO::FETCH_ASSOC)) {
    $xmlout = '<sphinx:document id="'.$arr['page_id'].'">'."\n";
    foreach ($arr as $strField => $strVal) {
      $xmlout .= '<'.$strField.'>';
      if ( in_array($strField, $cdata_esc)
        || in_array($strField, $html_esc)) {
        if (in_array($strField, $cdata_esc ) && !strstr( $strVal, ']]>')) {
          $xmlout .= "<![CDATA[\n" . $strVal . "\n]]>";
        } else {
          $xmlout .= htmlspecialchars( $strVal );
        }
      }
      $xmlout .= '</'.$strField.'>'."\n";
    }
    $xmlout .= '</sphinx:document>'."\n";
    print $xmlout;
  }
  print '</sphinx:docset>'."\n";
?>

Incremental update requires the --incremental option. The only difference is the extra AND clause in the query.

You can test these with php -e sqlite.php and they should spit out XML.

We will use these to feed the xmlpipe2 sources in sphinx.conf. The fields and attributes are defined in the config file, although you can also do this in the XML itself. Here is what your source containers look like. These replace the MySQL ones of the same names.

/etc/sphinx/sphinx.conf:

 source src_wiki_main
 {
     type            = xmlpipe2
     xmlpipe_command = php -e /usr/share/mediawiki/extensions/SphinxSearch/sqlite.php
 
     xmlpipe_field   = page_id
     xmlpipe_field   = page_title
     xmlpipe_field   = old_text
 
     xmlpipe_attr_uint   = page_namespace
     xmlpipe_attr_uint   = old_id
 }
 
 # data source definition for the incremental index
 source src_wiki_incremental : src_wiki_main
 {
     xmlpipe_command = php -e /usr/share/mediawiki/extensions/SphinxSearch/sqlite.php --incremental
     # all other parameters are copied from the parent source,
 }

You will likely need to specify the full path to the PHP command line executable in the xmlpipe_command directives if this isn't in cron's exection path. (i.e., xmlpipe_command = /usr/bin/php -e [...]).

That's it. All else is as it appears on the Extension Page.