Extension:FileIndexer
From MediaWiki.org
| FileIndexer | Release status: unknown |
|---|---|
| Implementation | Special page |
| Description | Makes uploaded document files searchable |
| Author(s) | MHart, Flominator, Hxwiki |
| MediaWiki | 1.9.0 - 1.11 (earlier versions require patching) |
| License | No license specified |
| Download | no link |
History
MHart modified the standard upload page so that the contents of uploaded Microsoft Word, Microsoft Excel, and Adobe PDF documents become indexable. In addition, Microsoft PowerPoint documents are now indexable.
He started by downloading and installing various Linux command line utilities that each take one of the above formats and output its text: antiword, xls2csv, pdftotext, and ppt2text (catppt).
Then he modified SpecialUpload.php at the point where it tests for a successful upload, just before it inserts the uploaded file information into the database. The modification placed the extracted text of the document in an HTML comment block within the description text of the file's page.
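The comment-wrapping step described above can be illustrated in isolation. This is a minimal sketch rather than the extension's actual code path: `wrapAsHtmlComment` is a hypothetical helper name, and the input lines stand in for the output of a tool such as antiword.

```php
<?php
// Hypothetical helper: appends extracted text to a description as an HTML comment block.
function wrapAsHtmlComment($description, array $lines, $stripTags = false)
{
    $result = $description . "\r\n<!-- ";
    foreach ($lines as $line) {
        // Remove any "-->" so the extracted text cannot close the comment early.
        $line = str_replace('-->', '', $line);
        if ($stripTags) {
            // Used for XML-based formats, where the extracted lines still contain markup.
            $line = strip_tags($line);
        }
        $result .= "\r\n" . $line;
    }
    return $result . "\r\n -->";
}

// Assumed sample input, not real tool output.
echo wrapAsHtmlComment('Quarterly report', array('Revenue rose', 'Costs fell'));
```

Because the text sits inside a comment, it does not show up on the rendered file page, but it is still part of the page text that the search index sees.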
Two years later Flominator found the hook UploadForm:BeforeProcessing and created an extension out of it. Then Hxwiki came and modified the code to work with MediaWiki 1.11.
Caution: To find the indexed text, a user must change their search preferences to include Images, because the text is stored on the file's description page.
Installation
Save the following code as extensions/FileIndexer/FileIndexer.php, using the variant that matches your MediaWiki version.
MediaWiki >= 1.11.0
    <?php
    $wgHooks['UploadForm:BeforeProcessing'][] = 'ScanFileForIndex';

    // Test whether the external commands are available (debug mode only)
    if (!empty($wgFileIndexerDebug)) {
        isCommandPresent("/usr/bin/pdftotext");
        isCommandPresent("/usr/bin/iconv");
        isCommandPresent("/usr/bin/antiword");
        isCommandPresent("/usr/bin/xls2csv");
        isCommandPresent("/usr/bin/catppt");
        isCommandPresent("/usr/bin/strings");
    }

    function ScanFileForIndex($uploadFormObj)
    {
        $NewDesc = '';
        $toexec = '';        // stays empty for unsupported file types
        $RemoveTags = false; // remove HTML tags created during conversion?
        // Escape the temporary path before passing it to the shell
        $tempPath = escapeshellarg($uploadFormObj->mTempPath);
        // Extract the extension of the destination filename
        $extension = substr(strrchr($uploadFormObj->mDesiredDestName, '.'), 1);

        switch (strtolower($extension)) { // methods for text extraction
            case "pdf":
                // using XPDF and iconv for conversion purposes
                $toexec = "/usr/bin/pdftotext -raw -nopgbrk " . $tempPath . " -";
                $toexec .= " | /usr/bin/iconv -f ISO-8859-1 -t UTF-8";
                break;
            case "dot":
            case "doc":
                // using antiword
                $toexec = "/usr/bin/antiword -s " . $tempPath;
                break;
            case "xls":
                $toexec = "/usr/bin/xls2csv " . $tempPath;
                break;
            case "ppt":
                $toexec = "/usr/bin/catppt " . $tempPath;
                break;
            case "rtf": # any file extension with text in it will be okay here
                $toexec = "/usr/bin/strings " . $tempPath; # string's output isn't neat, but it works.
                break;
            // OpenOffice.org documents
            case "ods":
            case "odp":
            case "odg":
            case "odt":
                $toexec = "unzip -p " . $tempPath . " content.xml";
                $RemoveTags = true;
                break;
        }

        if ($toexec != "") {
            exec($toexec, $DocText);
            $NewDesc = $uploadFormObj->mComment . "\r\n" . "<!-- ";
            foreach ($DocText as $DocLine) {
                if ($RemoveTags == false) {
                    $NewDesc .= "\r\n" . str_replace("-->", "", $DocLine);
                } else {
                    $NewDesc .= "\r\n" . strip_tags(str_replace("-->", "", $DocLine));
                }
            }
            $NewDesc .= "\r\n" . " -->";
            $uploadFormObj->mComment = $NewDesc;
        }
        return true; // let the upload continue
    }

    function isCommandPresent($command)
    {
        if (file_exists($command) == false) {
            // extract the command name from the path
            $lastSlash = strrpos($command, '/');
            if ($lastSlash !== false) {
                $commandWithoutSlashes = substr($command, $lastSlash + 1);
            } else {
                $commandWithoutSlashes = $command;
            }
            $toexec = "whereis $commandWithoutSlashes"; // look up the command
            exec($toexec, $whereis);
            echo "FileIndexer: The file $command is missing ... whereis result: $whereis[0] <br>";
        }
    }

    /**
     * Add extension information to Special:Version
     */
    $wgExtensionCredits['other'][] = array(
        'name' => 'FileIndexer',
        'author' => 'MHart and Flominator',
        'description' => 'makes uploaded documents searchable',
        'url' => 'http://www.mediawiki.org/wiki/Extension:FileIndexer'
    );
pre MediaWiki 1.11.0
    <?php
    $wgHooks['UploadForm:BeforeProcessing'][] = 'ScanFileForIndex';

    // Test whether the external commands are available (debug mode only)
    if (!empty($wgFileIndexerDebug)) {
        isCommandPresent("/usr/bin/pdftotext");
        isCommandPresent("/usr/bin/iconv");
        isCommandPresent("/usr/bin/antiword");
        isCommandPresent("/usr/bin/xls2csv");
        isCommandPresent("/usr/bin/catppt");
        isCommandPresent("/usr/bin/strings");
    }

    function ScanFileForIndex($uploadFormObj)
    {
        $NewDesc = '';
        $toexec = '';        // stays empty for unsupported file types
        $RemoveTags = false; // remove HTML tags created during conversion?
        // Escape the temporary path before passing it to the shell
        $tempPath = escapeshellarg($uploadFormObj->mUploadTempName);
        // Extract the extension of the destination filename
        $extension = substr(strrchr($uploadFormObj->mDestFile, '.'), 1);

        switch (strtolower($extension)) { // methods for text extraction
            case "pdf":
                // using XPDF and iconv for conversion purposes
                $toexec = "/usr/local/bin/pdftotext -raw -nopgbrk " . $tempPath . " -";
                // Alternative: $toexec = "/usr/bin/pdftotext -raw -nopgbrk " . $tempPath . " -";
                $toexec .= " | iconv -f ISO-8859-1 -t UTF-8";
                break;
            case "dot":
            case "doc":
                // using antiword
                $toexec = "/usr/bin/antiword -s " . $tempPath;
                break;
            case "xls":
                $toexec = "/usr/bin/xls2csv " . $tempPath;
                break;
            case "ppt":
                $toexec = "/usr/bin/catppt " . $tempPath;
                break;
            case "rtf": # any file extension with text in it will be okay here
                $toexec = "/usr/bin/strings " . $tempPath; # string's output isn't neat, but it works.
                break;
            // OpenOffice.org documents
            case "ods":
            case "odp":
            case "odg":
            case "odt":
                $toexec = "unzip -p " . $tempPath . " content.xml";
                $RemoveTags = true;
                break;
        }

        if ($toexec != "") {
            exec($toexec, $DocText);
            $NewDesc = $uploadFormObj->mUploadDescription . "\r\n" . "<!-- ";
            foreach ($DocText as $DocLine) {
                if ($RemoveTags == false) {
                    $NewDesc .= "\r\n" . str_replace("-->", "", $DocLine);
                } else {
                    $NewDesc .= "\r\n" . strip_tags(str_replace("-->", "", $DocLine));
                }
            }
            $NewDesc .= "\r\n" . " -->";
            $uploadFormObj->mUploadDescription = $NewDesc;
        }
        return true; // let the upload continue
    }

    function isCommandPresent($command)
    {
        if (file_exists($command) == false) {
            // extract the command name from the path
            $lastSlash = strrpos($command, '/');
            if ($lastSlash !== false) {
                $commandWithoutSlashes = substr($command, $lastSlash + 1);
            } else {
                $commandWithoutSlashes = $command;
            }
            $toexec = "whereis $commandWithoutSlashes"; // look up the command
            exec($toexec, $whereis);
            echo "FileIndexer: The file $command is missing ... whereis result: $whereis[0] <br>";
        }
    }

    /**
     * Add extension information to Special:Version
     */
    $wgExtensionCredits['other'][] = array(
        'name' => 'FileIndexer',
        'author' => 'MHart and Flominator',
        'description' => 'makes uploaded documents searchable',
        'url' => 'http://www.mediawiki.org/wiki/Extension:FileIndexer'
    );
pre MediaWiki 1.9.0
These versions require patching SpecialUpload.php so that it calls the hook. On MediaWiki 1.6.7, insert the wfRunHooks call at the start of processUpload(), around line 158:
    function processUpload() {
        global $wgUser, $wgOut;

        if( !wfRunHooks( 'UploadForm:BeforeProcessing', array( &$this ) ) )
        {
            wfDebug( "Hook 'UploadForm:BeforeProcessing' broke processing the file." );
            return false;
        }

        /* Check for PHP error if any, requires php 4.2 or newer */
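The 1.11 and pre-1.11 variants of FileIndexer.php above differ mainly in the property names they read from the upload form (`mDesiredDestName`, `mTempPath`, `mComment` versus `mDestFile`, `mUploadTempName`, `mUploadDescription`). As a sketch only, a single function could pick whichever name exists on the object it receives. `firstExistingProperty` is a hypothetical helper that is not part of the extension, and the `stdClass` objects merely stand in for the real upload form.

```php
<?php
// Hypothetical helper: returns the name of the first property found on $obj, or null.
function firstExistingProperty($obj, array $candidates)
{
    foreach ($candidates as $name) {
        if (property_exists($obj, $name)) {
            return $name;
        }
    }
    return null;
}

// Stand-ins for the upload form object of each MediaWiki generation.
$newForm = new stdClass();
$newForm->mTempPath = '/tmp/phpA1';
$oldForm = new stdClass();
$oldForm->mUploadTempName = '/tmp/phpB2';

$candidates = array('mTempPath', 'mUploadTempName');
echo firstExistingProperty($newForm, $candidates), "\n"; // mTempPath
echo firstExistingProperty($oldForm, $candidates), "\n"; // mUploadTempName
```

Whether a merged file is worth the extra indirection is a judgment call; keeping two separate variants, as this page does, is simpler to read.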
All versions
Edit LocalSettings.php file
Add the following lines to LocalSettings.php:
    #Makes uploaded documents searchable
    include("extensions/FileIndexer/FileIndexer.php");
Install appropriate indexing tools in /usr/bin/ directory
- /usr/bin/pdftotext - http://www.foolabs.com/xpdf/
- /usr/bin/iconv - http://www.gnu.org/software/libiconv/
- /usr/bin/antiword - http://www.winfield.demon.nl/
- /usr/bin/xls2csv - catdoc
- /usr/bin/catppt - catdoc
- /usr/bin/strings
Installing these tools depends on the operating system of your server. If you use webspace on the server of a hosting company, you might ask them for help, since you usually can't install software on their servers by yourself.
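On Debian-like GNU/Linux distributions, install each tool with the command below. The package name can differ from the tool name; for example, xls2csv and catppt are both part of catdoc.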
apt-get install <toolname>
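Not every wiki needs all of the tools. The sketch below transcribes the extension-to-tool mapping from the switch statement in FileIndexer.php (1.11 variant) and lists the tools needed for a given set of upload types. `toolsNeededFor` is a hypothetical helper name. Note that unzip, which the code calls without a path for OpenOffice.org files, does not appear in the list above.

```php
<?php
// Tools per file extension, transcribed from the switch statement in FileIndexer.php.
$toolsByExtension = array(
    'pdf' => array('/usr/bin/pdftotext', '/usr/bin/iconv'),
    'dot' => array('/usr/bin/antiword'),
    'doc' => array('/usr/bin/antiword'),
    'xls' => array('/usr/bin/xls2csv'),
    'ppt' => array('/usr/bin/catppt'),
    'rtf' => array('/usr/bin/strings'),
    'ods' => array('unzip'),
    'odp' => array('unzip'),
    'odg' => array('unzip'),
    'odt' => array('unzip'),
);

// Hypothetical helper: lists the distinct tools needed for the given extensions.
function toolsNeededFor(array $extensions, array $toolsByExtension)
{
    $needed = array();
    foreach ($extensions as $extension) {
        $key = strtolower($extension);
        if (isset($toolsByExtension[$key])) {
            $needed = array_merge($needed, $toolsByExtension[$key]);
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($needed));
}

echo implode(', ', toolsNeededFor(array('pdf', 'DOC', 'dot'), $toolsByExtension)), "\n";
// /usr/bin/pdftotext, /usr/bin/iconv, /usr/bin/antiword
```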
Changelog
- 2007-08-08 basic extension created from the SpecialUpload.php hack described under History
- 2007-11-28 support for MediaWiki 1.11 added
- 2008-05-14 debugging added
Problems and solutions
Debug
To see which of the tools are missing, add the following line to LocalSettings.php, before the include line mentioned above:
$wgFileIndexerDebug = true;
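The same kind of check can also be run outside MediaWiki, for example from the command line. The following is a minimal sketch under stated assumptions: `findMissingTools` is a hypothetical helper, and it uses `is_executable` where the extension's own `isCommandPresent` uses `file_exists`. The result reflects the permissions of the user running the script, which may differ from those of the web server.

```php
<?php
// Hypothetical helper: returns the entries of $paths that are not executable files.
function findMissingTools(array $paths)
{
    $missing = array();
    foreach ($paths as $path) {
        if (!is_executable($path)) {
            $missing[] = $path;
        }
    }
    return $missing;
}

// Paths taken from the tool list on this page.
$tools = array(
    '/usr/bin/pdftotext',
    '/usr/bin/iconv',
    '/usr/bin/antiword',
    '/usr/bin/xls2csv',
    '/usr/bin/catppt',
    '/usr/bin/strings',
);

$missing = findMissingTools($tools);
echo $missing ? 'Missing: ' . implode(', ', $missing) . "\n" : "All tools found.\n";
```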
Any questions
For more hints and a place to ask your questions, see Extension talk:FileIndexer.

