Extension talk:FileIndexer

Unique Words
Perhaps for efficiency, it might be worth using a script to attach unique words only to the Wiki as opposed to the file contents. I might give it a go once I learn how.

I took a crack at this. Here is an example of my hack for PDFs; it should work with other formats:

. . $toexec = "/usr/local/bin/pdftotext ". $this->mSavedFile. " - | tr -d ',\"<>/\:\?\;1234567890!@#$%^&*{}[]+' | tr ' ' '\n' | sort -f | uniq -i"; ..

Works in my 1.8.2.
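
As a minimal sketch of the same idea outside PHP, the filter below reads extracted text on standard input and prints each distinct word once, lowercased. The function name `unique_words` is hypothetical and not part of the extension, and the `tr` character classes only cover ASCII letters here, so umlauts and other non-ASCII words would need extra handling:

```shell
# Hypothetical helper, not part of the extension: reduce text on stdin
# to a sorted list of distinct lowercase words (ASCII letters only).
unique_words() {
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" collapse
    tr -cs '[:alpha:]' '\n' |      # turn every run of non-letters into one newline
    sed '/^$/d' |                  # drop the empty line a leading separator leaves
    sort -u                        # sort and keep one copy of each word
}

# Example (pdftotext writes to stdout when the output file is "-"):
# pdftotext some.pdf - | unique_words
```

Folding case before sorting avoids the situation where `uniq -i` misses duplicates because a case-sensitive sort did not place them next to each other.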

Catdoc
I performed the catdoc 0.94 installation, and found the xls and ppt executables in /usr/local/bin rather than /usr/bin as shown above. (The PDF and Word tools installed as shown above for me.) --Erik Heidt, 31 July 2005

large files
Regarding MySQL error "1153: Got a packet bigger than 'max_allowed_packet' bytes"

While using the above patch to upload large PDFs, I encountered MySQL error 1153. This error results when you attempt to execute a SQL statement against MySQL that is larger than the configured max_allowed_packet limit. (For more information see the error explanation on the MySQL website -> MySql Packet-too-large page)

Rather than increase the size of allowable packets, I decided to truncate the text which is returned in $NewDesc to a value large enough that I "probably" get a sample of text for good searches, but small enough that I don't (1) get this error or (2) commit tons of db storage to a single file's index text.

After some research I set the limit at 512K of text. Here is the code I inserted into the MHart patch from above:
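
A minimal sketch of the truncation idea, written as a shell filter rather than the PHP from the patch. The function name `truncate_index_text` is an assumption, as is applying it to the extractor output before it reaches $NewDesc. Note that `head -c` counts bytes, so a multi-byte UTF-8 character at the cut point may be split:

```shell
# Hypothetical helper: pass through at most N bytes of stdin.
# The default of 524288 bytes corresponds to the 512K limit mentioned above.
truncate_index_text() {
    head -c "${1:-524288}"
}

# Example:
# pdftotext large.pdf - | truncate_index_text
```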

MW 1.9.3 Problem
I have encountered a problem using this extension with MediaWiki 1.9.3 -> no description was sent to SpecialUpload.php and description field was empty. This was resolved by removing "\r\n" when sending it to $NewDesc

--Erik Heidt, 1 September 2005

FileIndexing does not work correctly
I've installed the extension and it works (half).

When I upload a PDF file, some text is inserted in the comments field. It looks correct, but the content is limited to 255 characters. So when I upload any PDF, I get something like this in the comments field:

"<!- - Leitfaden zur Nutzung der MP-Protokolle Seite 1 von 6 05.11.2007 Leitfaden zur Nutzung der MP-Protokolle Das Programm zum Anlegen/Bearbeiten der MP-Protokolle ist in Ferryt an der folgenden Stelle zu finden: Menü +Personal ->MP_Protokoll Abbildung"

The problem is, there's much more text than that. When I search e.g. for "Leitfaden" with the search function (with all checkboxes activated) I get nothing. Any idea how to solve this problem?


 * I was having the same problem, but figured out why it happens. It occurs only when you upload a newer version of an existing file: the text in the comments field is limited and searching does not work, as you pointed out, because the old comments page is preserved. Try uploading that PDF under a different name. It will be indexed correctly and will be searchable.

Windows 2003
I had this working with a Linux install. I had to move to a Windows 2003 server, and now the exec paths are wrong, or the file permissions are wrong, or... Windows, anyone? Thanks, bruceWayne

MW 1.13.1: does it work?
I tested it and it does not work :(

Does anyone have the same problem?

Greetings Thomas (15.09.2009)
 * Hi Thomas, what happens? Have you tried Extension:FileIndexer? --Flominator 16:59, 15 September 2008 (UTC)


 * I installed it on my 1.13. The stripping of the text, and adding it to the document's comment works. But when I do a search, it just doesn't return anything from there. I even rebuilt the index from the maintenance script manually..


 * solved

Special Characters
I've been using the variant above, with a few tweaks.

One thing I noticed was that sometimes pdftotext returns a warning if the PDF files are version 1.6. It still works, but I decided it'd be a good idea to redirect stderr and ignore it; see near line 150:

//using XPDF and iconv for conversion purposes
$toexec = "/usr/bin/pdftotext -raw -nopgbrk ". $uploadFormObj->mTempPath. " - 2> /dev/null"; //modified this line
$toexec.="| /usr/bin/iconv -f ISO-8859-1 -t UTF-8";

Attaching the index text sometimes fails when extended characters are left in the index text, so I added something to remove them, after line 211:

$DocLine = eregi_replace("[©°ÃÂ·\|¦?­]|space:.space:", "", $DocLine); //added this line
// Worte filtern und in Index packen...
$aSplit = split(" ", $DocLine);

I've not quite got the line numbers right, so I've left in some surrounding lines for context. --Mark P01 22:16, 20 November 2008 (UTC)

pdftotext on windows
Works well on 1.11.0 on Windows. I had lots of problems with pdftotext (Xpdf), which failed miserably and quietly on many PDFs; PDFBox works on the various PDFs in our setup. I was also confused for a little while, thinking that only about 230 characters were being inserted into the image index. This is not so: it's just the default display for the comment field, and you can see the whole comment when you edit the page. I also changed $wgUploadSizeWarning in LocalSettings.php to a much more reasonable value, as the default image size warning of 150K ends up throwing a CGI error page when you confirm the upload.

Can also index SVG
Since we're also using Inkscape to embed SVG images on our Windows MediaWiki (because IE doesn't support SVG natively), we've been able to use the FileIndexer extension to index our SVG diagrams. The SVG files merely require all the tags stripped apart from tspan and text.

pdf upload custom icon/thumb for page?
I've downloaded Ghostscript and can generate a thumbnail of the PDF's first page on the command line. I can't figure out how to integrate it into this extension, so as to have a nice icon/thumbnail for the PDF instead of the ugly Adobe logo one.
 * Theoretically you'd have to do another file upload with the image file, so you can include it into the image description. Another way would be to put the image somewhere and include it as html image tag (of course you'd have to enable the usage of external images first). --Flominator 16:04, 24 January 2009 (UTC)


 * I don't know enough about MediaWiki's workings. However, I found the code where it normally generates the standard Adobe icon (on the fly). My idea is to add code to display a PNG version of the same name if it exists and fall back to the Adobe icon, and the FileIndexer extension can generate and store the PNG as well as doing the text extract. I'm having difficulty working out which variables and image directory to use, so as to make it work similarly to other (PNG etc.) images. Or do I just call a MediaWiki STORE method? It has to be nice and easy for everyone, all part of the same simple file upload operation. -- Chris 26 January 2009.

Question about specialUpload.php
Please, can you tell me the exact position where to insert your script in "specialUpload.php"? I use the wiki on a local domain.

thanks casio
 * Hi Casio, you have to insert it somewhere above these two lines:

--Flominator 12:34, 27 June 2009 (UTC)

New Variant
Until today, my first version of this extension was here. As I released a much more advanced version on the main page, I felt that it was only bloating this talk page. For people wondering, I wanted to leave this note... feel free to visit an earlier revision or please try the current version on the main page. It should work properly for users of the version just removed. --Razqubik 08:33, 2 July 2009 (UTC)

- Feel like posting a link to the old version? Because I'm finding loads of bugs in your new version. You need to escape the pipe character in your $sExecutionCommand var for PDFs, for example. I can't seem to get anywhere; the $sDocLine in the for loop shows up empty. I have tested it with this PDF and it still doesn't work: http://finaid.georgetown.edu/sample.pdf - Thanks, Brendan

Problem with upload FileIndexer.php
When I upload a file using the importImages.php maintenance script I get these two errors. I have also noticed that the path names for the tools are hardcoded; wouldn't these be better as variables? On my CentOS box the tools go to /usr/bin and the script expects them in /usr/local/bin. Not a big deal, I created symlinks, but something to think about. Also, can you provide a tar.gz/zip release? It sucks to have to copy and paste, and it can cause errors when pasting into a terminal running vim, as I found out...

Notice: Undefined variable: Article in /export/home/www/default/wiki/extensions/FileIndexer/FileIndexer.php on line 394
PHP Notice: Trying to get property of non-object in /export/home/www/default/wiki/extensions/FileIndexer/FileIndexer.php on line 394

So far I have not managed to get a single one of my 1400 files indexed, despite having the FileIndexer extension set to index every file uploaded. I thought it was because I was using the maintenance script, but it looks like FileIndexer.php is being called. Could it be related to the error shown above? Thanks Brendan

Edit : Looks like the error is because the var $Author should be named $oAuthor....
 * Thanks for the advice - you are right (even if it is $oArticle ;-) ) - also sorry for being absent for such a long time --RaZe 08:47, 8 April 2010 (UTC)
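
On the hardcoded tool paths raised above: one way to avoid symlinks is to look the tools up in a list of candidate directories. This is only a sketch of that idea in shell; the function name `find_tool` and the directory list are assumptions, and the extension itself does not provide anything like it:

```shell
# Hypothetical helper: print the full path of the first executable named $1
# found in the directories given as the remaining arguments.
find_tool() {
    tool=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$tool" ]; then
            printf '%s\n' "$dir/$tool"
            return 0
        fi
    done
    return 1   # not found in any candidate directory
}

# Usage: check both common install locations before giving up.
PDFTOTEXT=$(find_tool pdftotext /usr/local/bin /usr/bin) || echo "pdftotext not found" >&2
```

The resulting path could then be written into a configuration variable instead of being edited into the PHP source on each machine.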

Script is running but no index written
System: Windows 2000, XAMPP 1.7.1, MediaWiki 1.15.0

- Installed pdftotext, iconv, catdoc, catppt. Works on command line.
- Switched on $wgFiCreateIndexOnAllUploads = true;
- $wgFiRequestIndexCreationFile = "\tmp";

The script is running: FiAutoIndex is showing up in the comment area of the uploaded file. If $wgFiCheckSystem is set to true, programs for conversion are found (no error messages for the programs displayed).

Can there be an issue with the paths, Linux vs. Windows (for example with $sFileHashPath)? What would the path to the conversion programs have to look like?

Any help is appreciated...

Thanks in advance!

- This also happens on Linux in MediaWiki 1.15.1. The old version for MediaWiki 1.11 still works for version 1.15.1.
 * As mentioned later by someone there are some fixes to my extension to run with 1.15.x . See http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#Update_for_MediaWiki_1.15.1_.3F
 * --RaZe 08:54, 8 April 2010 (UTC)

'''Thanks to the writer of the tip above, it worked! Just changed the dirs for the applications to 'D:\xxxxxxxx\pdftotext.exe '... Everything else left as is.'''
 * Yes, this is coded for Linux, not Windows, but changing the paths should have helped, though I didn't check if all apps use a corresponding CLI. --RaZe 08:54, 8 April 2010 (UTC)

Where did everything go?
I'm using the Ubuntu 8.04 based Turnkey Mediawiki appliance.

MediaWiki 1.14.0
PHP 5.2.4-2ubuntu5.5 (apache2handler)
MySQL 5.0.51a-3ubuntu5.4

I apt-got the prerequisite packages.

I added the line to my LocalSettings.php.

I created the three files in the FileIndexer directory.

I browsed to my main page and I am rewarded with...

A blank default page and no trace that my previous wiki content ever existed.

Help!
 * I don't know what to say other than that I do not think this can be caused by this extension... hopefully you figured out the problem already --RaZe 08:57, 8 April 2010 (UTC)

How do you index existing documents
Didn't see this on the ext page??


 * Try going to the page Special:FileIndexer, put the article title in the edit box and click create. It explains how on the page as well. Hope this helps. Johannekie


 * I will remember to make this clearer on the extension page next time ;-) --RaZe 08:59, 8 April 2010 (UTC)

Searching not working

 * MediaWiki = 1.13.3


 * PHP = 5.2.6-3ubuntu4.2 (apache2handler)


 * MySQL = 5.0.75-0ubuntu10.2

I created the files as described in this extension, and added the include to LocalSettings.php. I'm not sure if I had to add the part under Historical somewhere, but I didn't. And nothing is happening.

I uploaded a PDF file and searched for a word that I know is in that file, but it is just not looking inside. Also, will it be possible for me to search in a docx file (once I have this working)?

PLEASE help me.

- I've found that if I go to the Special:FileIndexer page and try to do a re-index, I get the following error:

Fatal error: Call to undefined method Article::newfromid in /var/lib/mediawiki/extensions/FileIndexer/FileIndexer_body.php on line 131

- Is this the reason that it is not working, and how do I fix it?
 * 1.13.x doesn't have Article::newfromid. Please update your installation to 1.14 --Flominator 18:31, 9 February 2010 (UTC)

Update for MediaWiki 1.15.1 ?
I have tried version v0.2.2.00 with MW 1.15.1 but I did not get it to work. After lots of testing I then tried the historical version for MW >= 1.11.0. After this everything was working. Of course I would like to use the "new features" that were implemented, especially the new "Special Page" to re-index a PDF. Does anyone have a solution, or maybe the author is willing to help me (us) ;-) --MKeyler 14:51, 23 February 2010 (UTC)
 * I have changed line 245 of FileIndexer.php. The problem was " " vs. "_"; now it generates the correct MD5 hash:

$sFilepath = $wgUploadDirectory. "/" . FileRepo::getHashPathForLevel($oArticle->mTitle->mDbkeyform, 2). $oArticle->mTitle->mDbkeyform ;
//              $sFilepath = $wgUploadDirectory. "/" . FileRepo::getHashPathForLevel($oArticle->mTitle->mTextform, 2). $oArticle->mTitle->mTextform;
 * --Swus 05:46, 26 February 2010 (UTC)
 * Great work Swus. Thank you. You were right! Now the "Special Page" works when I enter "File:foo.pdf" or, in a German wiki setup, "Datei:foo.pdf".
 * I still had problems when uploading a new file (e.g. xls). It did not get indexed automatically. In the FileIndexer.php file I changed these lines:

// $wgFiCreateIndexOnAllUploads = false;
$wgFiCreateIndexOnAllUploads = true;
 * Now this is also fixed for me!
 * --MKeyler 13:35, 26 February 2010 (UTC)
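
To see why the " " vs. "_" difference matters, here is a minimal shell sketch of how a two-level hashed upload path is derived from the MD5 of the file name. The function name `mw_hash_path` is hypothetical, `md5sum` is assumed to be available (GNU coreutils), and the sketch ignores other title normalisation such as capitalising the first letter:

```shell
# Hypothetical helper: print the relative upload path "a/ab/Name" for a file name,
# using the underscore ("DB key") form of the title, as in the fix above.
mw_hash_path() {
    dbkey=$(printf '%s' "$1" | tr ' ' '_')               # spaces become underscores
    hash=$(printf '%s' "$dbkey" | md5sum | cut -c1-32)   # hex MD5 of the DB key
    first=$(printf '%s' "$hash" | cut -c1)               # level 1: first hex digit
    second=$(printf '%s' "$hash" | cut -c1-2)            # level 2: first two hex digits
    printf '%s/%s/%s\n' "$first" "$second" "$dbkey"
}

# Both spellings resolve to the same directory once normalised:
# mw_hash_path 'Foo bar.pdf'
# mw_hash_path 'Foo_bar.pdf'
```

Hashing the text form (with spaces) instead would generally yield a different MD5 and therefore a directory where the file does not exist.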

Explain Special:FileIndexer page

 * Mediawiki 1.14.1
 * PHP 5.2.6-3ubuntu4.5 (apache2handler)
 * MySQL 5.0.75-0ubuntu10.3

Can someone maybe better explain to me what this means (on the Special:FileIndexer page):


 * On this form you may specify an articletitle : to fill with an index of all words used in an allready uploaded file. The name of the article an the file have to be equal! Even so the namespace may differ from 'image'!

If I put in the following (example of an uploaded file):

File:Uploaded_doc.docx

It says:

The article:

File:Uploaded_doc.docx

And the same as above again.

Which I assume is correct, but now when I do a search, it still doesn't search within the document.

--Johannekie 10:12, 24 February 2010

I'm still having trouble, can't even search in a normal PDF. Johannekie 09:49, 1 March 2010
 * Yes I did. And I'm quite happy. On what platform are you? Linux (which distribution) or Windows? Maybe I can help?!? --MKeyler 14:07, 1 March 2010 (UTC)
 * I just saw your earlier post. So you are on Ubuntu. Same as me. Did you install the following (aptitude install xpdf iconv catdoc antiword)? Then you also have to change all the paths from /usr/local/bin to /usr/bin/ in the FileIndexer_body.php file. After this I also had to change this: Now it should work... --MKeyler 14:12, 1 March 2010 (UTC)
 * Hi, thanks so much. I've been trying to get this to work for a month. But unfortunately, after I made the changes you described... still nothing. Do I need to restart something maybe? I did install those programs and I checked that they are all in /usr/bin. Then I did a reindex on an uploaded PDF. I searched for something inside, but still nothing. So either I'm doing something wrong or something is still missing.
 * Hit the "edit" button after reindexing a PDF file (not a scan, of course). Then check whether any code was inserted... and give feedback. --MKeyler 16:07, 3 March 2010 (UTC)
 * Can you please look at my post on this subject at . It is much easier to have a discussion like this there. Thanks. I'll post there and give results here when the problem is fixed :)


 * I posted something on the thread above. Please have a look at it.

Johannekie

Only lowercase letters
Is this only in my installation or are all the converted words only in lowercase letters? Is this on purpose or am I doing something wrong? (MW 1.15.1) --MKeyler 14:32, 26 February 2010 (UTC)
 * Hi. I'm looking at this page again after a long time (lots of other work - sorry for all the other requests that I didn't answer, and thanks to all others who helped out!), so I will try to answer some requests now.
 * Yes, this is on purpose. MediaWiki internally searches case-insensitively. To minimize the text of the article, everything is switched to lowercase.
 * --RaZe 08:39, 8 April 2010 (UTC)

Index xlsx, docx, pptx files
Has anyone worked out how to index the MS Office 2007 file types (xlsx, docx, pptx)? I'm currently looking into xls2csv but I can't find anything about xlsx. --MKeyler 15:26, 26 February 2010 (UTC)
 * I think it might be similar to OOo files: Unzip them and take the largest xml file. --Flominator 20:17, 2 March 2010 (UTC)
 * Thank you Flominator for the reply. I unzipped the docx file and voilà: there are lots of XML files. One of them is named "document.xml" and is in the "word" folder. If anyone could implement this extraction process in FileIndexer it would be greatly appreciated, as I'm not able to do this :-( --MKeyler 16:04, 3 March 2010 (UTC)
 * Shouldn't be too complicated (I don't have any testing environment at hand). Something like that code derived from the OOo function should work:


 * Pretty cool. Sounds very good. I will try it tomorrow and will then give feedback. Thank you for your help! --MKeyler 20:00, 3 March 2010 (UTC)
 * Update: you were right! Just a little tweak (the name of the directory "word" was missing) and now it's working. --MKeyler 15:05, 4 March 2010 (UTC)

Glad to hear that. --Flominator 19:24, 4 March 2010 (UTC)
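
The approach discussed in this thread (unzip the .docx and read word/document.xml) can be sketched in shell as follows. This is not the code that was posted here; the function names `strip_xml_tags` and `docx_to_text` are hypothetical, and the sketch assumes the `unzip` tool is installed:

```shell
# Hypothetical helper: replace every XML tag on stdin with a space,
# then collapse whitespace so only the plain words remain.
strip_xml_tags() {
    sed -e 's/<[^>]*>/ /g' |
    tr -s '[:space:]' ' ' |
    sed -e 's/^ //' -e 's/ $//'
}

# Hypothetical helper: print the text of a .docx file.
# "unzip -p" writes the named archive member to stdout without unpacking to disk.
docx_to_text() {
    unzip -p "$1" word/document.xml | strip_xml_tags
}

# Example:
# docx_to_text report.docx
```

The same pattern would presumably apply to xlsx and pptx with a different member path inside the archive, but those paths are not covered in this thread.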

Problems getting FileIndexer to Work
I was able to get the code FiAutoIndex on the edit sections of my uploaded files, but when I try to use the native search on MediaWiki to look for something I know is in one of the files, it comes up with zero results.

I have all the requirements it asked for in the proper directories (pdftotext, antiword, etc.), so why isn't it searching my uploads? Is this the search engine I should be using? Not sure what I'm doing wrong; this extension is a bit frustrating.

Running MediaWiki 1.15.1, PHP 5.2.6-1+lenny3 (apache2handler), MySQL 5.0.51a-24+lenny2

--Neezy 17:31, 7 April 2010 (UTC)