Extension talk:FileIndexer

Please, can you tell me the exact position where to insert your script in "SpecialUpload.php"? I use the wiki on a local domain.

Thanks, casio

Unique Words
Perhaps for efficiency, it might be worth using a script to attach only the unique words to the wiki, as opposed to the full file contents. I might give it a go once I learn how.

I took a crack at this. Here is an example of my hack for PDFs; it should work with other formats:

. . $toexec = "/usr/local/bin/pdftotext ". $this->mSavedFile. " - | tr -d ',\"<>/\:\?\;1234567890!@#$%^&*{}[]+' | tr ' ' '\n' | sort | uniq -i"; ..

Works in my 1.8.2.
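
For anyone who wants to try the unique-words idea outside the shell, here is a minimal Python sketch of what the pipeline above is meant to do: drop punctuation and digits, split on whitespace, and keep each word once regardless of case. The function name `unique_words` and the constant `STRIP_CHARS` are made up for this illustration and are not part of the extension.

```python
# Characters removed by the `tr -d` step of the pipeline (assumed equivalent set).
STRIP_CHARS = ',"<>/:?;1234567890!@#$%^&*{}[]+'


def unique_words(text):
    """Return the words of `text`, unique ignoring case, sorted by lowercase form."""
    cleaned = text.translate(str.maketrans("", "", STRIP_CHARS))
    seen = {}
    for word in cleaned.split():
        # Keep the first spelling encountered for each case-insensitive word.
        seen.setdefault(word.lower(), word)
    return [seen[key] for key in sorted(seen)]
```

The result could then be joined with spaces and used as the description text instead of the full file contents.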

Catdoc
I performed the catdoc 0.94 installation and found the xls and ppt executables in /usr/local/bin rather than in /usr/bin as shown above. (The pdf and word files installed as shown above for me.) --Erik Heidt, 31 July 2005

Large files
Regarding MySQL error "1153: Got a packet bigger than 'max_allowed_packet' bytes"

While using the above patch to upload large PDFs, I encountered MySQL error 1153. This error results when you attempt to execute an SQL statement against MySQL that is larger than the configured max_allowed_packet limit. (For more information, see the error explanation on the MySQL website -> MySQL Packet-too-large page.)

Rather than increase the size of allowable packets, I decided to truncate the text which is returned in $NewDesc to a value large enough that I "probably" get a sample of text for good searches, but small enough that I don't (1) get this error or (2) commit tons of db storage to a single file's index text.

After some research I set the value at 512K of text. Here is the code I inserted into the MHart patch from above:
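
As an illustration of the truncation idea only (this is not the code from the MHart patch), here is a minimal Python sketch. It assumes the 512K limit is counted in bytes of UTF-8 text, since max_allowed_packet is a byte limit; the names `truncate_index_text` and `MAX_INDEX_BYTES` are hypothetical.

```python
MAX_INDEX_BYTES = 512 * 1024  # 512K, the value mentioned above (assumed to mean bytes)


def truncate_index_text(text, max_bytes=MAX_INDEX_BYTES):
    """Cut `text` so that its UTF-8 encoding is at most `max_bytes` long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that was cut in half
    # at the boundary instead of raising an exception.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Cutting on the encoded bytes rather than on characters keeps the stored text under the byte limit even when the document contains many multi-byte characters such as umlauts.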

MW 1.9.3 Problem
I encountered a problem using this extension with MediaWiki 1.9.3: no description was sent to SpecialUpload.php and the description field was empty. This was resolved by removing "\r\n" when sending it to $NewDesc.

--Erik Heidt, 1 September 2005

FileIndexing does not work correctly
I've installed the extension and it works (half).

When I upload a PDF file, some text is being inserted in the comments field. It looks correct, but the content is limited to just 255 characters. So when I'm uploading any PDF, I'm getting something like this in the comments field:

"<!- - Leitfaden zur Nutzung der MP-Protokolle Seite 1 von 6 05.11.2007 Leitfaden zur Nutzung der MP-Protokolle Das Programm zum Anlegen/Bearbeiten der MP-Protokolle ist in Ferryt an der folgenden Stelle zu finden: Menü +Personal ->MP_Protokoll Abbildung"

The problem is, there's much more text than that. When I'm searching e.g. for "Leitfaden" with the search function (with all checkboxes activated), I'm getting nothing. Any idea how to solve this problem?


 * I was having the same problem, but figured out why it happens. It occurs only when you upload a newer version of an existing file: the text in the comments field is limited and searching does not work, as you pointed out, because the old comments page is preserved. Try uploading that PDF file under a different name. It will be indexed correctly and will be searchable.

Windows 2003
I had this working with a Linux install. I had to move to a Windows 2003 server, and now the exec paths are wrong or the file permissions are wrong or... Windows, anyone? Thanks, bruceWayne

New Variant
Hi. I needed to modify this extension for my own purposes and decided to let you know the result. Feel free to use it or modify it yourself. It is tested for odt and pdf on a 1.12.0 wiki (and, as I didn't touch this part of the code too much, it should work with all other formats as the original does).

CHANGES:
 * External programs are called from /usr/local/bin (fits better for our company).
 * $wgGMFileIndexerPrefix and $wgFileIndexerPostfix are used directly before/after the created index instead of strictly using an HTML comment tag, which Lucene, for example, does not interpret.
 * The index isn't stored in the upload comment anymore!
 * Each word should now be unique in the created index (there should be only lowercase letters and no pure numbers).
 * ATTENTION: Non-Germans, please check the code here - I had to handle German umlauts here!
 * Updating a file should now update the index, too.
 * For files filtered by deleting tags, words should no longer collapse after each line.
 * When uploading a file you may choose whether or not you want to create a (new) index; by default no index is created! To create an index, request it in the upload description of the upload form by writing the string "FI::MakeIndex" (which is automatically removed afterwards).
 * This is my very first try at creating an extension, and if someone wants to help, please feel free to replace this with an additional checkbox in the upload form! :-)
 * $wgFileIndexerMinWordLen sets the minimum length of any word in the index (default = 3).
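
The word-filtering rules in the list above (lowercase only, each word once, no pure numbers, a minimum word length, prefix and postfix around the index) can be sketched as follows. This is a rough Python illustration of the rules as described, not the extension's PHP code; `build_index` and its parameters are hypothetical names, with `min_word_len` standing in for $wgFileIndexerMinWordLen.

```python
import re


def build_index(text, min_word_len=3, prefix="", postfix=""):
    """Build a space-separated index of unique lowercase words from `text`."""
    # \w matches Unicode letters in Python 3, so umlauts stay inside words.
    words = re.findall(r"\w+", text.lower())
    seen = set()
    unique = []
    for word in words:
        if len(word) < min_word_len or word.isdigit() or word in seen:
            continue  # too short, a pure number, or already indexed
        seen.add(word)
        unique.append(word)
    return prefix + " ".join(unique) + postfix
```

Python's Unicode-aware `lower()` takes care of the umlaut handling here; the PHP code of that era needed explicit handling, which is presumably what the ATTENTION note refers to.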

KNOWN BUGS
 * When you receive a warning page after the upload form, no index will be created! I think that's the reason the original stored the index in the comment, but I don't see that it's worth the extra data overhead, so I will come back to this later.
 * For some unknown reason pdftotext does not return any minus signs. Because of that, some words collapse.

Have fun raZe --195.216.198.100 13:51, 17 June 2008 (UTC)

Request
It works well in my 1.12.

Request: handling of PostScript (.ps) files.
 * Which one? The one above or the one on Extension:FileIndexer? Do you know any command line tool that turns ps-files to plain text? --Flominator 12:39, 16 August 2008 (UTC)

MW 1.13.1: does it work?
I tested it and it does not work :(

Does anyone have the same problem?

Greetings Thomas (15.09.2009)
 * Hi Thomas, what happens? Have you tried Extension:FileIndexer? --Flominator 16:59, 15 September 2008 (UTC)

Special Characters
I've been using the variant above, with a few tweaks.

One thing I noticed was that pdftotext sometimes returns a warning if the PDF files are version 1.6. It still works, but I decided it'd be a good idea to redirect stderr and ignore it; see near line 150:

//using XPDF and iconv for conversion purposes
$toexec = "/usr/bin/pdftotext -raw -nopgbrk ". $uploadFormObj->mTempPath. " - 2> /dev/null"; //modified this line
$toexec.="| /usr/bin/iconv -f ISO-8859-1 -t UTF-8";

Attaching the index text sometimes fails when extended characters are left in the index text, so I added something to remove that, after line 211:

$DocLine = eregi_replace("[©°ÃÂ·\|¦?­]|space:.space:", "", $DocLine); //added this line
// Worte filtern und in Index packen...
$aSplit = split(" ", $DocLine);

I've not quite got the line numbers right, so I've left in some surrounding lines for context. --Mark P01 22:16, 20 November 2008 (UTC)

pdftotext on Windows
Works well on 1.11.0 on Windows. I had lots of problems with pdftotext (Xpdf), which failed miserably and quietly on many PDFs; pdfbox works on the various PDFs in our setup. I was also confused for a little while, thinking that only about 230-odd characters were being inserted into the image index. This is not so: it's just the default display for the comment field, and you can see the whole comment when you edit the page. I also changed $wgUploadSizeWarning in LocalSettings.php to a much more reasonable value, as the default image size warning of 150K ends up throwing a CGI error page when you confirm the upload.

Can also index SVG
Since we're also using Inkscape to embed SVG images on our Windows MediaWiki (because IE doesn't support SVG natively), we've been able to use the FileIndexer extension to index our SVG diagrams. The SVG files merely require all the tags stripped apart from tspan and text.
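
To illustrate what keeping only the text and tspan content amounts to, here is a minimal Python sketch using the standard library's XML parser. It is an assumption about the approach, not the setup described above; `svg_text` is a hypothetical name, and the standard parser should only be used on SVG files you trust.

```python
import xml.etree.ElementTree as ET


def svg_text(svg_source):
    """Return the readable text of an SVG document as one string."""
    root = ET.fromstring(svg_source)
    pieces = []
    for element in root.iter():
        # Tags are namespaced, e.g. "{http://www.w3.org/2000/svg}text".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            # itertext() also yields the content of nested tspan elements,
            # in document order; split/join normalizes the whitespace.
            content = " ".join("".join(element.itertext()).split())
            if content:
                pieces.append(content)
    return " ".join(pieces)
```

Everything outside text elements (shapes, paths, metadata) is ignored, so only the labels of a diagram end up in the index.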