Extension:FileIndexer

This extension requires patches to core MediaWiki code when used with MediaWiki 1.8 and before. Extensions implemented using patches may be disabled by or interfere with upgrades and security patches. If you would like to use this extension, we strongly recommend you upgrade to a version later than 1.8. Alternatively, if a suitable alternative without a patch is available, we recommend you use that extension instead.


FileIndexer

Release status: unknown

Implementation: Special page
Description: Makes uploaded document files searchable
Author(s): MHart, Flominator, Hxwiki
MediaWiki: 1.9.0 - 1.11 (earlier versions require patching)
License: No license specified
Download: no link

History

MHart modified the standard upload page so that the contents of uploaded Microsoft Word, Microsoft Excel, and Adobe PDF documents become indexable. In addition, other document types (PowerPoint, RTF, and OpenOffice.org files, as handled in the code on this page) are now indexable.

He started by downloading and installing several Linux command-line utilities that read one of these formats and output plain text: antiword, xls2csv, pdftotext, and ppt2text (catppt).

He then modified SpecialUpload.php at the point where it has confirmed a successful upload, just before it inserts the uploaded file's information into the database. The change placed the extracted text of the document in an HTML comment block inside the description text of the file's description page.

Two years later Flominator found the hook UploadForm:BeforeProcessing and created an extension out of it. Then Hxwiki came and modified the code to work with MediaWiki 1.11.

Caution: Users must change their search preferences to include the Image namespace; otherwise the file description pages, and with them the indexed text, are not searched.
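
The comment-block approach described above can be sketched in isolation. This is a minimal illustration rather than the extension's own code: wrapTextInComment is a hypothetical helper name, and the sample description and lines are made up.

```php
<?php
// Hypothetical helper, not part of the extension: appends extracted text
// lines to an upload description, hidden inside an HTML comment.
function wrapTextInComment($description, array $lines, $removeTags = false)
{
    $result = $description . "\r\n" . "<!-- ";
    foreach ($lines as $line) {
        // Drop any "-->" so the extracted text cannot close the comment early.
        $line = str_replace("-->", "", $line);
        if ($removeTags) {
            // Useful for XML-based formats such as OpenOffice.org content.xml.
            $line = strip_tags($line);
        }
        $result .= "\r\n" . $line;
    }
    return $result . "\r\n" . " -->";
}

// Made-up sample input for illustration.
$text = wrapTextInComment("Quarterly report", array("Revenue rose", "See --> appendix"));
echo $text;
```

Because the text sits in a comment, it is stored with the page and can be found by the search index, but it is not displayed when the description page is rendered.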

Installation

Save the code for your MediaWiki version as extensions/FileIndexer/FileIndexer.php:

MediaWiki >= 1.11.0

<?php
$wgHooks['UploadForm:BeforeProcessing'][] = 'ScanFileForIndex';
 
//Testing if these commands are available (only when $wgFileIndexerDebug is set in LocalSettings.php)
if(!empty($wgFileIndexerDebug))
{
	isCommandPresent("/usr/bin/pdftotext");
	isCommandPresent("/usr/bin/iconv");
	isCommandPresent("/usr/bin/antiword");
	isCommandPresent("/usr/bin/xls2csv");
	isCommandPresent("/usr/bin/catppt");
	isCommandPresent("/usr/bin/strings");
}
 
function ScanFileForIndex($uploadFormObj) 
{
 
         $NewDesc = '';
         $toexec = '';  //stays empty for unsupported file types
         $RemoveTags = false;  //remove HTML-Tags created during conversion?
 
         $extension = substr(strrchr($uploadFormObj->mDesiredDestName, '.'),1); //extract the extension of the destination filename
 
          switch(strtolower($extension)) //methods for text extraction 
          {
                case "pdf": 
                {                 
                        //using XPDF and iconv for conversion purposes
                        $toexec = "/usr/bin/pdftotext  -raw -nopgbrk " . $uploadFormObj->mTempPath . " -";
                        $toexec.="| /usr/bin/iconv -f ISO-8859-1 -t UTF-8";
                        break;
                }
 
                case "dot": {}
                case "doc": 
                {
                        //using antiword
                        $toexec = "/usr/bin/antiword -s ".$uploadFormObj->mTempPath;
                        break;
                }
 
                case "xls":
                {
                        $toexec = "/usr/bin/xls2csv ".$uploadFormObj->mTempPath;
                        break;
                }
 
                case "ppt":
                {
                        $toexec = "/usr/bin/catppt ".$uploadFormObj->mTempPath; 
                        break;
                }
 
                case "rtf": # any file extension with text in it will be okay here
                {
                        $toexec = "/usr/bin/strings ".$uploadFormObj->mTempPath; # the output of strings isn't neat, but it works.
                        break;
                }
 
                //OpenOffice.org documents
                case "ods": {}
                case "odp": {}
                case "odg": {}
                case "odt":
                {
                        $toexec = "unzip -p " . $uploadFormObj->mTempPath . " content.xml";
                        $RemoveTags = true;
                        break;
                }
        }
        if ($toexec != "")
        {
                exec($toexec, $DocText);
                $NewDesc = $uploadFormObj->mComment . "\r\n" . "<!-- ";
                foreach ($DocText as $DocLine) 
                {
                        if($RemoveTags == false)
                        {
                                $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
                        }
                        else
                        {
                                $NewDesc .= "\r\n" . strip_tags(str_replace("-->","",$DocLine));
                        }
                }
                $NewDesc .= "\r\n" . " -->";
                $uploadFormObj->mComment = $NewDesc;
        }
        return true; //hook handlers return true so that MediaWiki continues processing the upload
}
 
function isCommandPresent($command)
{
	if(file_exists($command)==false)
	{
		//extract the command from the path
		$lastSlash = strrpos($command, '/');
		if($lastSlash!==false) //strrpos returns false when there is no slash; 0 is a valid position
		{
			$commandWithoutSlashes = substr($command, $lastSlash+1);
		}
		else
		{
			$commandWithoutSlashes = $command;
		}
 
		$toexec = "whereis $commandWithoutSlashes";
		//lookup the command
		exec($toexec, $whereis);
		echo "FileIndexer: The file $command is missing ... whereis result: $whereis[0] <br>";
	}
}
 
/**
  * Add extension information to Special:Version
 */
$wgExtensionCredits['other'][] = array(
        'name' => 'FileIndexer',
        'author' => 'MHart and Flominator',
        'description' => 'makes uploaded documents searchable',
        'url' => 'http://www.mediawiki.org/wiki/Extension:FileIndexer'
        );
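
The listing above concatenates the temporary path directly into the shell command. Upload temp names are typically generated by PHP itself, so this is normally harmless, but quoting the path with escapeshellarg() is a cheap safeguard. A minimal sketch of the command selection with quoting, assuming a Unix-like server; buildExtractionCommand is a hypothetical helper and not part of the extension:

```php
<?php
// Hypothetical helper: returns the shell command for a file extension,
// or null when the extension is not handled. Tool paths follow the listing above.
function buildExtractionCommand($extension, $tempPath)
{
    // escapeshellarg() quotes the path so that spaces or shell
    // metacharacters in it cannot alter the command.
    $file = escapeshellarg($tempPath);

    switch (strtolower($extension)) {
        case 'pdf':
            return "/usr/bin/pdftotext -raw -nopgbrk $file - | /usr/bin/iconv -f ISO-8859-1 -t UTF-8";
        case 'dot':
        case 'doc':
            return "/usr/bin/antiword -s $file";
        case 'xls':
            return "/usr/bin/xls2csv $file";
        case 'ppt':
            return "/usr/bin/catppt $file";
        case 'rtf':
            return "/usr/bin/strings $file";
        case 'ods':
        case 'odp':
        case 'odg':
        case 'odt':
            return "unzip -p $file content.xml";
        default:
            return null;
    }
}

// Made-up path for illustration.
echo buildExtractionCommand('DOC', '/tmp/my upload.doc'), "\n";
```

Returning null for unknown extensions gives the caller an explicit "nothing to do" signal instead of relying on an unset variable.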

pre MediaWiki 1.11.0

<?php
$wgHooks['UploadForm:BeforeProcessing'][] = 'ScanFileForIndex';
 
 
//Testing if these commands are available (only when $wgFileIndexerDebug is set in LocalSettings.php)
if(!empty($wgFileIndexerDebug))
{
	isCommandPresent("/usr/bin/pdftotext");
	isCommandPresent("/usr/bin/iconv");
	isCommandPresent("/usr/bin/antiword");
	isCommandPresent("/usr/bin/xls2csv");
	isCommandPresent("/usr/bin/catppt");
	isCommandPresent("/usr/bin/strings");
}
 
function ScanFileForIndex($uploadFormObj) 
{
	 $NewDesc = '';
	 $toexec = '';  //stays empty for unsupported file types
	 $RemoveTags = false;  //remove HTML-Tags created during conversion?
 
	 $extension = substr(strrchr($uploadFormObj->mDestFile, '.'),1); //extract the extension of the destination filename
 
	  switch(strtolower($extension)) //methods for text extraction 
	  {
		case "pdf": 
		{
			//using XPDF and iconv for conversion purposes
			$toexec = "/usr/local/bin/pdftotext  -raw -nopgbrk " . $uploadFormObj->mUploadTempName . " -";
			// Alternative: $toexec = "/usr/bin/pdftotext  -raw -nopgbrk " . $uploadFormObj->mUploadTempName . " -";
			$toexec.="| iconv -f ISO-8859-1 -t UTF-8";
			break;
		}
 
		case "dot": {}
		case "doc": 
		{
			//using antiword 
			$toexec = "/usr/bin/antiword -s ".$uploadFormObj->mUploadTempName;
			break;
		}
 
		case "xls":
		{
			$toexec = "/usr/bin/xls2csv ".$uploadFormObj->mUploadTempName;
			break;
		}
 
		case "ppt":
		{
			$toexec = "/usr/bin/catppt ".$uploadFormObj->mUploadTempName; 
			break;
		}
 
		case "rtf": # any file extension with text in it will be okay here
		{
			$toexec = "/usr/bin/strings ".$uploadFormObj->mUploadTempName; # the output of strings isn't neat, but it works.
			break;
		}
 
		//OpenOffice.org documents
		case "ods": {}
		case "odp": {}
		case "odg": {}
		case "odt":
		{
			$toexec = "unzip -p " . $uploadFormObj->mUploadTempName . " content.xml";
			$RemoveTags = true;
			break;
		}
	}
	if ($toexec != "")
	{
		exec($toexec, $DocText);
		$NewDesc = $uploadFormObj->mUploadDescription . "\r\n" . "<!-- ";
		foreach ($DocText as $DocLine) 
		{
			if($RemoveTags == false)
			{
				$NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
			}
			else
			{
				$NewDesc .= "\r\n" . strip_tags(str_replace("-->","",$DocLine));
			}
		}
		$NewDesc .= "\r\n" . " -->";
		$uploadFormObj->mUploadDescription = $NewDesc;
	}
	return true; //hook handlers return true so that MediaWiki continues processing the upload
}
 
function isCommandPresent($command)
{
	if(file_exists($command)==false)
	{
		//extract the command from the path
		$lastSlash = strrpos($command, '/');
		if($lastSlash!==false) //strrpos returns false when there is no slash; 0 is a valid position
		{
			$commandWithoutSlashes = substr($command, $lastSlash+1);
		}
		else
		{
			$commandWithoutSlashes = $command;
		}
 
		$toexec = "whereis $commandWithoutSlashes";
		//lookup the command
		exec($toexec, $whereis);
		echo "FileIndexer: The file $command is missing ... whereis result: $whereis[0] <br>";
	}
}
 
/**
  * Add extension information to Special:Version
 */
$wgExtensionCredits['other'][] = array(
	'name' => 'FileIndexer',
	'author' => 'MHart and Flominator',
	'description' => 'makes uploaded documents searchable',
	'url' => 'http://www.mediawiki.org/wiki/Extension:FileIndexer'
	);

pre MediaWiki 1.9.0

Insert the hook into SpecialUpload.php on MediaWiki 1.6.7 around line 158:

	function processUpload() {
		global $wgUser, $wgOut;
 
		if( !wfRunHooks( 'UploadForm:BeforeProcessing', array( &$this ) ) )
		{
			wfDebug( "Hook 'UploadForm:BeforeProcessing' broke processing the file." );
			return false;
		}
 
		/* Check for PHP error if any, requires php 4.2 or newer */

All versions

Edit LocalSettings.php file

Add the following lines to LocalSettings.php:

#Makes uploaded documents searchable
include("extensions/FileIndexer/FileIndexer.php");

Install appropriate indexing tools in /usr/bin/ directory

How you install these tools depends on your server's operating system. If your wiki runs on web space rented from a hosting company, ask them for help, since you usually cannot install software on their servers yourself. The extension code calls most tools by absolute path (mostly under /usr/bin/), so adjust the paths in FileIndexer.php if your system installs them elsewhere.

On Debian-like GNU/Linux distributions, simply use the command:

apt-get install <toolname>
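
Since the two code versions on this page already disagree on where pdftotext lives (/usr/bin versus /usr/local/bin), it can help to check where each tool actually ended up after installation. The following is a rough sketch, not part of the extension: findTool is a hypothetical helper, and the directory list is an assumption to adapt to your server.

```php
<?php
// Hypothetical helper: returns the first executable match for $tool
// in the given directories, or null if none is found.
function findTool($tool, array $directories)
{
    foreach ($directories as $dir) {
        $candidate = rtrim($dir, '/') . '/' . $tool;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Directory list is an assumption; extend it to match your server.
$searchDirs = array('/usr/bin', '/usr/local/bin');
$tools = array('pdftotext', 'iconv', 'antiword', 'xls2csv', 'catppt', 'strings', 'unzip');
foreach ($tools as $tool) {
    $path = findTool($tool, $searchDirs);
    echo $tool . ': ' . ($path === null ? 'not found' : $path) . "\n";
}
```

Run it from the command line with the PHP CLI; any path it reports that differs from the one hard-coded in FileIndexer.php needs to be changed there.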

Changelog

  • 2007-08-08 basic extension created from the hack below
  • 2007-11-28 support for MediaWiki 1.11 added
  • 2008-05-14 debugging added

Problems and solutions

Debug

If you want to know which of the tools are installed, simply add the following line to LocalSettings.php (before the one mentioned above):

$wgFileIndexerDebug = true;
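
The order matters because the tool check runs while FileIndexer.php is being included. As an illustrative excerpt (assuming the include line from the installation step is used unchanged), the relevant part of LocalSettings.php would look roughly like this:

```php
// LocalSettings.php (excerpt)

// Must be set before the extension file is loaded, because the
// tool check runs at include time.
$wgFileIndexerDebug = true;

#Makes uploaded documents searchable
include("extensions/FileIndexer/FileIndexer.php");
```

The check prints a message for every missing tool directly into the page output, so it is probably best to remove the debug line again once all tools are found.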

Any questions

For more hints and a place to ask your questions, see Extension talk:FileIndexer.
