Extension talk:FileIndexer
From MediaWiki.org
Any user names refer to users of that site, who are not necessarily users of MediaWiki.org (even if they share the same username).
Please, can you tell me the exact position where to insert your script in "SpecialUpload.php"? I use the wiki on a local domain.
Thanks, casio
Unique Words
Perhaps for efficiency, it might be worth using a script to attach only the unique words to the wiki page rather than the full file contents. I might give it a go once I learn how.
I took a crack at this. Here is an example of my hack for PDFs; it should work with other formats too:
.
.
$toexec = "/usr/local/bin/pdftotext " . $this->mSavedFile . " - | tr -d ',\"<>/()\:\?\;1234567890!@#$%^&*{}[]+' | tr ' ' '\n' | sort | uniq -i";
.
.
Works in my 1.8.2.
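The same idea can also be approximated in PHP itself, without the tr/sort/uniq pipeline. The following is only a sketch: uniqueWords() is a hypothetical helper, not part of the patch, and it strips all ASCII punctuation and digits, which is a somewhat broader set than the characters listed in the tr -d call above.

```php
<?php
// Hypothetical helper for illustration; not part of the FileIndexer patch.
function uniqueWords(string $text): array
{
    // Delete ASCII punctuation and digits (a broader set than the tr -d list).
    $clean = preg_replace('/[[:punct:][:digit:]]+/', '', $text);
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $seen = array();
    foreach ($words as $word) {
        // Lowercasing gives the case-insensitive de-duplication of "uniq -i".
        $seen[strtolower($word)] = true;
    }
    $unique = array_map('strval', array_keys($seen));
    sort($unique, SORT_STRING);
    return $unique;
}
```

Calling uniqueWords() on the extracted text and joining the result with spaces would give one lowercase word list per file. Note that strtolower() only handles ASCII letters reliably, so accented characters would need extra care.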
Catdoc
I installed catdoc 0.94 and found the xls and ppt executables in /usr/local/bin rather than in /usr/bin as shown above. (The PDF and Word executables installed as shown above for me.) --Erik Heidt, 31 July 2005
large files
Regarding MySQL error "1153: Got a packet bigger than 'max_allowed_packet' bytes"
While using the above patch to upload large PDFs, I encountered MySQL error 1153. This error occurs when you send MySQL a SQL statement that is larger than the server's max_allowed_packet setting. (For more information, see the "Packet too large" page on the MySQL website.)
Rather than increase the size of allowable packets, I decided to truncate the text which is returned in $NewDesc to a value large enough that I "probably" get a sample of text for good searches, but small enough that I don't (1) get this error or (2) commit tons of db storage to a single file's index text.
After some research I set the limit at 512K of text. Here is the code I inserted into the MHart patch from above:
foreach ($DocText as $DocLine) {
    $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
}
# eth: check to see if NewDesc is very large, and truncate it if it is...
$tooLarge = 524288; # 512K
if (strlen($NewDesc) > $tooLarge) $NewDesc = substr($NewDesc, 0, $tooLarge);
# eth: end of the large summary change.
$NewDesc .= "\r\n" . " -->";
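A hard cut at a byte offset can split the last word in half, which would leave a bogus term in the search index. As a rough sketch of one way around that, the hypothetical helper below (truncateIndexText() is not part of the patch) backs up to the last whitespace before the limit. The default of 524288 bytes is 512 * 1024.

```php
<?php
// Hypothetical helper for illustration; not part of the MHart patch.
function truncateIndexText(string $text, int $maxBytes = 524288): string
{
    if (strlen($text) <= $maxBytes) {
        return $text; // already small enough
    }
    $cut = substr($text, 0, $maxBytes);
    // If the character right after the cut is whitespace, no word was split.
    if (strpos(" \n\r\t", $text[$maxBytes]) !== false) {
        return $cut;
    }
    // Otherwise find the last whitespace inside the kept part.
    $lastBreak = -1;
    foreach (array(" ", "\n", "\r", "\t") as $ws) {
        $pos = strrpos($cut, $ws);
        if ($pos !== false && $pos > $lastBreak) {
            $lastBreak = $pos;
        }
    }
    // No whitespace at all: fall back to the plain hard cut.
    if ($lastBreak < 0) {
        return $cut;
    }
    return substr($cut, 0, $lastBreak);
}
```

In the patch this would stand in for the substr() line, e.g. $NewDesc = truncateIndexText($NewDesc, $tooLarge);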
MW 1.9.3 Problem
I encountered a problem using this extension with MediaWiki 1.9.3: no description was sent to SpecialUpload.php and the description field was empty. This was resolved by removing the "\r\n" when appending to $NewDesc:
$NewDesc .= " " . str_replace("-->","",$DocLine);
...
$NewDesc .= " " . " -->";
--Erik Heidt, 1 September 2005
Note that the above conversation may have been edited or added to since the transfer. If in doubt, check the edit history.
FileIndexing does not work correctly
I've installed the extension and it works (halfway).
When I upload a PDF file, some text is inserted into the comments field. It looks correct, but the content is limited to 255 characters. So when I upload any PDF, I get something like this in the comments field:
"<!- - Leitfaden zur Nutzung der MP-Protokolle Seite 1 von 6 05.11.2007 Leitfaden zur Nutzung der MP-Protokolle Das Programm zum Anlegen/Bearbeiten der MP-Protokolle ist in Ferryt an der folgenden Stelle zu finden: Menü +Personal ->MP_Protokoll Abbildung"
The problem is that there is much more text than that. When I search for e.g. "Leitfaden" with the search function (with all checkboxes activated), I get nothing. Any idea how to solve this problem?
- I was having the same problem, but figured out why it happens. It occurs only when you upload a newer version of an existing file: the text in the comments field is truncated and searching does not work, as you pointed out, because the old comments page is preserved. Try uploading that PDF file under a different name. It will be indexed correctly and will be searchable.
Windows 2003
I had this working on a Linux install. I had to move to a Windows 2003 server, and now either the exec paths are wrong or the file permissions are wrong, or something else. Has anyone got this running on Windows?
Thanks,
bruceWayne
New Variant
Hi. I needed to modify this extension for my own purposes and decided to share the result. Feel free to use it or modify it yourself.
It has been tested with odt and pdf on MediaWiki 1.12.0 (and since I didn't touch that part of the code much, it should work with all the other formats as the original does).
CHANGES:
- External programs are called from /usr/local/bin (this fits our company's setup better).
- $wgFileIndexerPrefix and $wgFileIndexerPostfix are placed directly before and after the generated index, instead of strictly using an HTML comment tag, which Lucene, for example, does not index.
- The index isn't stored in the upload comment anymore!
- Each word should now appear only once in the generated index (lowercase letters only, and no pure numbers).
- ATTENTION: Non-Germans, please check the code here - I had to handle German umlauts!
- Updating a file should now update the index, too.
- For files that are filtered by stripping tags, words should no longer run together across lines.
- When uploading a file you may choose whether or not to create a (new) index; by default no index is created! To create one, write the string "FI::MakeIndex" in the upload description of the upload form (it is removed automatically afterwards).
- This is my very first attempt at writing an extension; if someone wants to help, please feel free to replace this with an additional checkbox in the upload form! :-)
- $wgFileIndexerMinWordLen sets the minimum length of any word in the index (default = 3).
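For orientation, the settings named in this list might be set in LocalSettings.php roughly as follows. This is only a sketch: the marker strings and the install path are placeholders made up for illustration, not values given by the author. $wgFileIndexerDebug is checked when the extension file is loaded, so it has to be set before the include.

```php
<?php
// LocalSettings.php fragment (sketch). All values are illustrative placeholders.

// Checked at include time by the command-availability test, so set it first.
$wgFileIndexerDebug = false;

// Markers written directly before and after the generated word list.
$wgFileIndexerPrefix = "== File index ==\n";
$wgFileIndexerPostfix = "\n<!-- end of file index -->";

// Words shorter than this are left out of the index (the code falls back to 3).
$wgFileIndexerMinWordLen = 3;

// Hypothetical install path; adjust to wherever the extension file was saved.
$fileIndexerFile = __DIR__ . '/extensions/FileIndexer/FileIndexer.php';
if (file_exists($fileIndexerFile)) {
    require_once $fileIndexerFile;
}
```

Both markers need to be strings that do not otherwise occur on the file description page, because the extension looks them up with strpos() to find the old index.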
KNOWN BUGS
- When a warning page appears after the upload form, no index is created! I think that is why the original stored the index in the comment, but I don't think it is worth the extra data overhead, so I will come back to this later.
- For some unknown reason pdftotext does not return any minus signs, so some words run together.
Have fun, raZe --195.216.198.100 13:51, 17 June 2008 (UTC)
<?php
/**
 * Extension: FileIndexer
 *
 * REVISION-AUTHOR: Ramon Dohle aka 'raZe'
 * VERSION: 0.1.3
 *
 * This extension is based on the wiki extension 'FileIndexer' as of 15.05.2008.
 * Like its predecessor, it is meant to make files indexable.
 * This version of the extension was developed under MediaWiki 1.12.0.
 *
 * CHANGES FROM THE ORIGINAL:
 * - Expected directory for filter programs: /usr/local/bin/
 * - $wgFileIndexerPrefix and $wgFileIndexerPostfix surround the generated index.
 *   In my experience the Lucene search engine does not store (invisible) comments in its index.
 *   I therefore made it possible to place the index in a section of its own, outside a comment.
 * - Instead of the complete text, only the words found are stored, *once each*, in lowercase (without purely numeric values).
 * - The index is no longer stored redundantly in the upload comment.
 * - When a file is uploaded again, the stored index is now updated.
 * - For files from which tags were removed, words from different fields should no longer run together.
 * - IMPORTANT: An index is only created if the upload comment requests it with the marker 'FI::MakeIndex'!
 *   TODO: I may come back to this and add a checkbox for this purpose.
 * - $wgFileIndexerMinWordLen sets the minimum length of index words (default: 3 characters).
 *
 * TODO:
 * - Replace the 'FI::MakeIndex' command with a checkbox in the upload form
 * - Carry the index across warnings shown after the upload form
 *
 * KNOWN WEAKNESSES:
 * - pdftotext does not return minus signs "-", so some words unfortunately run together.
 */

$wgHooks['UploadForm:BeforeProcessing'][] = 'ScanFileForIndex';
$wgHooks['ArticleSave'][] = 'gmArticleSaveFileIndexer';
$wgHooks['UploadComplete'][] = 'gmUploadCompleteFileIndexer';

// Testing if these commands are available
if (!empty($wgFileIndexerDebug)) {
    isCommandPresent("/usr/local/bin/pdftotext");
    isCommandPresent("/usr/local/bin/iconv");
    isCommandPresent("/usr/local/bin/antiword");
    isCommandPresent("/usr/local/bin/xls2csv");
    isCommandPresent("/usr/local/bin/catppt");
    isCommandPresent("/usr/local/bin/strings");
}

$gm_NewFileIndex = false;

/**
 * This hook function is called after a file has been uploaded successfully.
 * It triggers an update of the article belonging to the file if new content
 * for the index section has been prepared.
 */
function gmUploadCompleteFileIndexer(&$image) {
    global $gm_NewFileIndex;

    if ($gm_NewFileIndex !== false) {
        $article = new Article($image->mLocalFile->getTitle());
        $article->loadContent();
        $article->doEdit($article->mContent, "gmFileIndexer: Datei hochgeladen.\n");
    }
    return true;
}

/**
 * This hook function updates the index section if the save belongs to a file upload
 * and new content for this section has been prepared.
 * In every case the global variable holding the index section content is cleared again.
 */
function gmArticleSaveFileIndexer(&$article, &$user, &$text, &$summary, $minor, $watch, $sectionanchor, &$flags) {
    global $gm_NewFileIndex, $wgFileIndexerPrefix, $wgFileIndexerPostfix;

    if ($gm_NewFileIndex !== false) {
        // Set the new index: find and replace the file index area
        $sOldDescription = $text;
        $text = "";

        // Look for $wgFileIndexerPrefix and the postfix and carry over the old text before them...
        $iPostFileIndexPos = false;
        $iFileIndexPos = strpos($sOldDescription, $wgFileIndexerPrefix);
        if ($iFileIndexPos === false) {
            // Put the complete content before the new index, because apparently no index existed before...
            $text = $sOldDescription;
        } else {
            // Put everything before the old index into the new content...
            $text = substr($sOldDescription, 0, $iFileIndexPos);
            // The postfix is only searched for if a prefix was found...
            $iPostFileIndexPos = strpos($sOldDescription, $wgFileIndexerPostfix, $iFileIndexPos);
        }

        // Append the index itself (when it is first created, make sure it starts on a new line)
        if ($iPostFileIndexPos === false && (substr($text, -1) != "\n")) {
            $text .= "\n";
        }
        $text .= $gm_NewFileIndex;

        // Re-attach the rest of the description...
        if ($iPostFileIndexPos !== false) {
            $text .= substr($sOldDescription, $iPostFileIndexPos + strlen($wgFileIndexerPostfix));
        }

        // Note in the summary that a new index was inserted here...
        $summary .= ((substr($summary, -1) == "\n") ? "" : "\n") . "gmFileIndexer: Neuer Index erstellt.\n";
    }

    // Clear the index again...
    $gm_NewFileIndex = false;
    return true;
}

/**
 * Looks in the comment for the marker that requests an index and builds the index using external programs.
 * The index is first stored in a global variable so that other functions can process it later.
 */
function ScanFileForIndex(&$uploadFormObj) {
    global $gm_NewFileIndex, $wgFileIndexerPrefix, $wgFileIndexerPostfix, $wgFileIndexerMinWordLen;

    $wgFileIndexerMinWordLen = ($wgFileIndexerMinWordLen > 0) ? $wgFileIndexerMinWordLen : 3;
    $SIGN_CREATE_INDEX = "FI::MakeIndex";

    // Check the upload comment to see whether an index should be created.
    // If so, remove the marker from the comment; otherwise leave without doing any work.
    $iSignCreateIndexPos = strpos($uploadFormObj->mComment, $SIGN_CREATE_INDEX);
    if ($iSignCreateIndexPos === false) {
        return true;
    } else {
        $uploadFormObj->mComment = substr($uploadFormObj->mComment, 0, $iSignCreateIndexPos)
            . substr($uploadFormObj->mComment, $iSignCreateIndexPos + strlen($SIGN_CREATE_INDEX));
    }

    $NewDesc = '';
    $toexec = '';
    $DocText = array();
    $RemoveTags = false; // remove HTML tags created during conversion?
    $path = escapeshellarg($uploadFormObj->mTempPath); // quote the path for the shell
    $extension = substr(strrchr($uploadFormObj->mDesiredDestName, '.'), 1); // extract the extension of the destination filename

    switch (strtolower($extension)) { // methods for text extraction
        case "pdf":
            // using XPDF and iconv for conversion purposes
            $toexec = "/usr/local/bin/pdftotext -raw -nopgbrk " . $path . " -";
            $toexec .= "| /usr/local/bin/iconv -f ISO-8859-1 -t UTF-8";
            break;
        case "dot":
        case "doc":
            // using antiword
            $toexec = "/usr/local/bin/antiword -s " . $path;
            break;
        case "xls":
            $toexec = "/usr/local/bin/xls2csv " . $path;
            break;
        case "ppt":
            $toexec = "/usr/local/bin/catppt " . $path;
            break;
        case "rtf": # any file extension with text in it will be okay here
            $toexec = "/usr/local/bin/strings " . $path; # strings' output isn't neat, but it works.
            break;
        // OpenOffice.org documents
        case "ods":
        case "odp":
        case "odg":
        case "odt":
            $toexec = "unzip -p " . $path . " content.xml";
            $RemoveTags = true;
            break;
    }

    if ($toexec != "") {
        exec($toexec, $DocText);
        $gm_NewFileIndex = $wgFileIndexerPrefix;
        $aIndex = array();
        foreach ($DocText as $DocLine) {
            if ($RemoveTags) {
                // Remove tags... Insert a space before every "<" first so that no words run together!
                $DocLine = strip_tags(str_replace("<", " <", $DocLine));
            }
            // Remove special characters...
            // ATTENTION: German only! strtolower does not convert umlauts to lowercase...
            $DocLine = str_replace(array("Ä", "Ö", "Ü"), array("ä", "ö", "ü"), $DocLine);
            $DocLine = strtolower(preg_replace(
                "/[[:punct:]][[:space:]]|[[:space:]][[:punct:]]|[[:punct:]][[:punct:]]/",
                " ",
                $DocLine
            ));
            // Filter the words and put them into the index...
            $aSplit = explode(" ", $DocLine);
            foreach ($aSplit as $sWord) {
                if ($sWord != "" && !is_numeric($sWord) && strlen($sWord) >= $wgFileIndexerMinWordLen) {
                    $aIndex[$sWord] = true;
                }
            }
        }
        // Set the index globally...
        foreach (array_keys($aIndex) as $skeyword) {
            $gm_NewFileIndex .= $skeyword . " ";
        }
        $gm_NewFileIndex .= $wgFileIndexerPostfix;
    }
    return true;
}

function isCommandPresent($command) {
    if (file_exists($command) == false) {
        // extract the command from the path
        $lastSlash = strrpos($command, '/');
        if ($lastSlash !== false) {
            $commandWithoutSlashes = substr($command, $lastSlash + 1);
        } else {
            $commandWithoutSlashes = $command;
        }
        $toexec = "whereis $commandWithoutSlashes"; // look up the command
        exec($toexec, $whereis);
        echo "FileIndexer: The file $command is missing ... whereis result: $whereis[0] <br>";
    }
}

/**
 * Add extension information to Special:Version
 */
$wgExtensionCredits['other'][] = array(
    'name' => 'FileIndexer',
    'version' => '0.1.3',
    'author' => 'Ramon Dohle (raZe) | Original: MHart and Flominator',
    'description' => 'Index-Erzeugung aus hochgeladenen Dateien zur Erfassung durch Suchfunktionen',
    'url' => 'http://www.mediawiki.org/wiki/Extension:FileIndexer'
);
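The prefix/postfix handling in gmArticleSaveFileIndexer can be tried out in isolation. The sketch below pulls that logic into a standalone function; replaceIndexSection() and its parameter names are hypothetical and only mirror what the hook appears to do, with $newSection assumed to already contain the prefix and postfix, as $gm_NewFileIndex does. Both markers are assumed to be non-empty.

```php
<?php
// Hypothetical standalone version of the section replacement; for illustration only.
function replaceIndexSection(string $oldText, string $newSection, string $prefix, string $postfix): string
{
    // Locate the old section: the postfix is only searched for after a prefix.
    $start = strpos($oldText, $prefix);
    $end = ($start === false) ? false : strpos($oldText, $postfix, $start);

    // Keep everything before the old section (or the whole text if there is none).
    $text = ($start === false) ? $oldText : substr($oldText, 0, $start);

    // When no complete old section is replaced, start the new one on its own line.
    if ($end === false && substr($text, -1) !== "\n") {
        $text .= "\n";
    }
    $text .= $newSection;

    // Re-attach whatever followed the old section.
    if ($end !== false) {
        $text .= substr($oldText, $end + strlen($postfix));
    }
    return $text;
}
```

One consequence worth knowing: if a prefix is found but the postfix is missing, everything after the prefix is dropped, so the two markers should not be edited by hand on the description page.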

