Extension talk:FileIndexer/Pre.0.4.5.03

The following discussion has been transferred from Meta-Wiki.
Any user names refer to users of that site, who are not necessarily users of MediaWiki.org (even if they share the same username).

Unique Words

Perhaps for efficiency, it might be worth using a script to attach only the unique words from the file to the wiki page, rather than the full file contents. I might give it a go once I learn how.


I had taken a crack at this. Here is an example of my hack for PDFs. It should work with other formats:

.
.
$toexec = "/usr/local/bin/pdftotext " . $this->mSavedFile . " - | tr -d ',\"<>/()\:\?\;1234567890!@#$%^&*{}[]+' | tr ' ' '\n' | sort | uniq -i";
.
.

Works in my 1.8.2.
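
For anyone who would rather do the de-duplication in PHP than in the shell pipeline above, here is a minimal sketch. `uniqueWords()` is a made-up helper name, not part of the extension, and it only handles ASCII letters (accented characters would be treated as separators).

```php
<?php
// Sketch only: uniqueWords() is a hypothetical helper, not part of FileIndexer.
function uniqueWords(string $text): array
{
    // Lowercase first so "Wiki" and "wiki" collapse into one entry (like uniq -i).
    $text = strtolower($text);
    // Split on anything that is not an ASCII letter; digits and punctuation
    // are dropped, similar to the tr -d step in the shell version.
    $words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Remove duplicates, reindex, and sort alphabetically.
    $words = array_values(array_unique($words));
    sort($words);
    return $words;
}

echo implode(' ', uniqueWords("The cat, the hat and THE 3 cats!")), "\n";
// prints: and cat cats hat the
```

The result could be appended to the description instead of the raw text lines, which should keep the stored index much smaller for long documents.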

Catdoc

I performed the catdoc 0.94 installation and found the xls and ppt executables in /usr/local/bin rather than in /usr/bin as shown above. (The PDF and Word tools installed as shown above for me.) --Erik Heidt, 31 July 2005

large files

Regarding MySQL error "1153: Got a packet bigger than 'max_allowed_packet' bytes"

In using the above patch to upload large PDFs, I encountered MySQL error 1153. This error results when you attempt to execute a SQL statement against MySQL that is larger than the configured max_allowed_packet limit. (For more information see the error explanation on the MySQL website -> MySQL Packet-too-large page.)

Rather than increase the size of allowable packets, I decided to truncate the text which is returned in $NewDesc to a value large enough that I "probably" get a sample of text for good searches, but small enough that I don't (1) get this error or (2) commit tons of db storage to a single file's index text.

After some research I set the limit at 512K of text. Here is the code I inserted into the MHart patch from above:

    foreach ($DocText as $DocLine) {
      $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
    }

    # eth: check to see if NewDesc is very large, and truncate it if it is...
    $tooLarge = 524288; # 512K (512 * 1024 bytes)
    if (strlen($NewDesc)>$tooLarge)
          $NewDesc = substr($NewDesc,0,$tooLarge);
    # eth: end of the large summary change.

    $NewDesc .= "\r\n" . " -->";
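
A plain substr() can cut the last word in half. As a possible refinement, here is a small sketch that backs up to the previous whitespace before cutting. `truncateAtWord()` is a hypothetical helper name, and the limit is counted in bytes just like in the snippet above.

```php
<?php
// Hypothetical helper: cut $text to at most $maxBytes, backing up to the last
// whitespace so the final index word is not split in half.
function truncateAtWord(string $text, int $maxBytes): string
{
    if (strlen($text) <= $maxBytes) {
        return $text;
    }
    $cut = substr($text, 0, $maxBytes);
    // If the first dropped character is whitespace, the cut already ends on a whole word.
    if (strpos(" \t\r\n", $text[$maxBytes]) !== false) {
        return rtrim($cut);
    }
    // Otherwise drop the trailing partial word together with the whitespace before it.
    $trimmed = preg_replace('/\s+\S*$/', '', $cut);
    // If that would leave nothing (no earlier word), fall back to the hard cut.
    return ($trimmed === null || $trimmed === '') ? $cut : $trimmed;
}

echo truncateAtWord("alpha beta gamma", 8), "\n";   // prints: alpha
echo truncateAtWord("alpha beta gamma", 10), "\n";  // prints: alpha beta
```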

MW 1.9.3 Problem

I have encountered a problem using this extension with MediaWiki 1.9.3: no description was sent to SpecialUpload.php and the description field was empty. This was resolved by replacing "\r\n" with a space when appending to $NewDesc:

$NewDesc .= " " . str_replace("-->","",$DocLine);
...
$NewDesc .= " " . " -->";

--Erik Heidt, 1 September 2005

FileIndexing does not work correct

I've installed the extension and it works (half).

When I upload a PDF file, some text is being inserted in the comments field. It looks correct, but the content is limited to 255 characters. So when I upload any PDF, I get something like this in the comments field:

"<!- - Leitfaden zur Nutzung der MP-Protokolle Seite 1 von 6 05.11.2007 Leitfaden zur Nutzung der MP-Protokolle Das Programm zum Anlegen/Bearbeiten der MP-Protokolle ist in Ferryt an der folgenden Stelle zu finden: Menü +Personal ->MP_Protokoll Abbildung"

The problem is, there's much more text than that. When I search e.g. for "Leitfaden" with the search function (with all checkboxes activated), I get nothing. Any idea how to solve this problem?

I was having the same problem, but figured out why it happens. It occurs only when you upload a newer version of an existing file: the text in the comments field is limited and searching does not work, as you pointed out, because the old comments page is preserved. Try uploading that PDF file under a different name. It will be indexed correctly and will be searchable.

Windows 2003

I had this working with a Linux install. I had to move to a 2003 server and now the exec paths are wrong, or the file permissions are wrong, or... Windows, anyone?
Thanks,
bruceWayne

MW 1.13.1 does it works?

I tested it and it does not work :(

Does anyone have the same problem?

Greetings Thomas (15.09.2009)

Hi Thomas, what happens? Have you tried Extension:FileIndexer#Debug? --Flominator 16:59, 15 September 2008 (UTC)
I installed it on my 1.13. Stripping the text and adding it to the document's comment works. But when I do a search, it just doesn't return anything from there. I even rebuilt the index manually with the maintenance script.
solved

Special Characters

I've been using the variant above, with a few tweaks.

One thing I noticed was that pdftotext sometimes returns a warning if the PDF files are version 1.6. It still works, but I decided it'd be a good idea to redirect stderr and ignore it; see near line 150:

    //using XPDF and iconv for conversion purposes
    $toexec = "/usr/bin/pdftotext -raw -nopgbrk " . $uploadFormObj->mTempPath . " - 2> /dev/null";  //modified this line
    $toexec.="| /usr/bin/iconv -f ISO-8859-1 -t UTF-8";
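
A side note on the command above: the temp path is concatenated into the shell string as-is. A minimal sketch of the same command with the path quoted via PHP's escapeshellarg() follows. `buildPdfCommand()` is a hypothetical helper, and the binary paths are simply the ones quoted above, which may differ on your system.

```php
<?php
// Hypothetical helper; the tool paths are taken from the post above.
function buildPdfCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path so that spaces or shell metacharacters
    // in a file name cannot break (or hijack) the command line.
    return '/usr/bin/pdftotext -raw -nopgbrk ' . escapeshellarg($pdfPath) . ' - 2> /dev/null'
        . ' | /usr/bin/iconv -f ISO-8859-1 -t UTF-8';
}

echo buildPdfCommand('/tmp/my report.pdf'), "\n";
```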

Attaching the index text sometimes fails when extended characters are left in the index text, so I added something to remove that, after line 211:

    $DocLine = eregi_replace("[©°Ã·\|¦?­]|[[:space:]].[[:space:]]", "", $DocLine); //added this line
    // Worte filtern und in Index packen...
    $aSplit = split(" ", $DocLine);

I've not quite got the line numbers right, so I've left in some surrounding lines for context. --Mark P01 22:16, 20 November 2008 (UTC)
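
Note for newer PHP versions: eregi_replace() and split() were removed in PHP 7.0, so the clean-up above needs the PCRE functions there. Below is a rough equivalent as a sketch. `cleanIndexLine()` is a hypothetical name, the character set is only illustrative (not the exact class from the post), and it replaces isolated single characters with a space rather than with nothing, so that neighbouring words are not glued together.

```php
<?php
// Hypothetical helper illustrating the same clean-up with preg_* functions.
function cleanIndexLine(string $line): array
{
    // Strip a few symbol characters (illustrative set). With /u the input must be
    // valid UTF-8; on failure preg_replace() returns null, so keep the old value.
    $line = preg_replace('/[©°|¦]/u', '', $line) ?? $line;
    // Replace isolated single characters with one space.
    $line = preg_replace('/\s\S\s/u', ' ', $line) ?? $line;
    // Split on runs of whitespace and drop empty entries.
    return preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
}

echo implode(',', cleanIndexLine("Preis © 2008 x Seite")), "\n";
// prints: Preis,2008,Seite
```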

pdftotext on windows

Works well on 1.11.0 on Windows. I had lots of problems with pdftotext (Xpdf), which failed miserably and quietly on many PDFs. pdfbox works on the (various) PDFs in our setup. I was also confused for a little while, thinking that only about 230-odd characters were being inserted into the image index; this is not so, it's just the default display for the comment field, and you can see the whole comment when you edit the page. I also changed $wgUploadSizeWarning in LocalSettings.php to a much more reasonable value, as the default image size warning of 150K ends up throwing a CGI error page when you confirm the upload.

Can also index SVG

Since we're also using Inkscape to embed SVG images on our Windows MediaWiki (because IE doesn't support SVG natively), we've been able to use the FileIndexer extension to index our SVG diagrams. The SVG files merely require all the tags to be stripped apart from tspan and text.
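
As a rough illustration of the SVG idea, here is a sketch that keeps only what sits inside `<text>` elements (including nested `<tspan>`s) and throws everything else away. `svgTextContent()` is a hypothetical helper, not code from the extension, and the regex approach assumes reasonably well-formed SVG.

```php
<?php
// Hypothetical helper: collect the character data of all <text> elements.
function svgTextContent(string $svg): string
{
    // Grab everything inside each <text> element; <tspan> children are nested in there.
    preg_match_all('/<text\b[^>]*>(.*?)<\/text>/is', $svg, $matches);
    $parts = [];
    foreach ($matches[1] as $inner) {
        // Drop the <tspan> wrappers and decode entities such as &amp;.
        $plain = html_entity_decode(strip_tags($inner), ENT_QUOTES, 'UTF-8');
        $plain = trim(preg_replace('/\s+/', ' ', $plain));
        if ($plain !== '') {
            $parts[] = $plain;
        }
    }
    return implode(' ', $parts);
}

$svg = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="10"/>'
     . '<text x="5" y="5"><tspan>Login</tspan> <tspan>Server</tspan></text>'
     . '<text x="5" y="20">Database &amp; Cache</text></svg>';
echo svgTextContent($svg), "\n";
// prints: Login Server Database & Cache
```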

pdf upload custom icon/thumb for page?

I've downloaded Ghostscript and can generate a thumbnail of the PDF's first page on the command line. I can't figure out how to integrate it into this extension, so as to have a nice icon/thumbnail for PDFs instead of the ugly Adobe logo.

Theoretically you'd have to do another file upload with the image file, so you can include it in the image description. Another way would be to put the image somewhere and include it as an HTML image tag (of course you'd have to enable the usage of external images first). --Flominator 16:04, 24 January 2009 (UTC)
I don't know enough about MediaWiki's workings. However, I found the code where it normally generates the standard Adobe icon (on the fly). My idea is to add code that displays a PNG version of the same name if it exists and falls back to the Adobe icon; the FileIndexer extension could generate and store the PNG as well as doing the text extract. I'm having difficulty working out which variables and image directory to use, so as to make it work similarly to other (PNG etc.) images. Or do I just call a MediaWiki STORE method? It has to be nice and easy for everyone, all part of the same simple file upload operation. -- Chris 26 January 2009.

Question about specialUpload.php

Please, can you tell me the exact position where to insert your script in "specialUpload.php"? I use the wiki in a local domain.

thanks casio

Hi Casio, you have to insert it somewhere above these two lines:
function processUpload() {
		/* Check for PHP error if any, requires php 4.2 or newer */

--Flominator 12:34, 27 June 2009 (UTC)

New Variant

Until today my first version of this extension was posted here. As I have released a much more advanced version on the main page, I felt that it was only bloating this talk page. I wanted to leave this note for people wondering where it went: feel free to visit an earlier revision, or please try the current version on the main page. It should work properly for users of the version I just removed. --Razqubik 08:33, 2 July 2009 (UTC)

- Feel like posting a link to the old version? Because I'm finding loads of bugs in your new version. You need to escape the pipe character in your $sExecutionCommand var for PDFs, for example. I can't seem to get anywhere; the $sDocLine in the for loop shows up empty. I have tested it with this PDF and it still doesn't work: http://finaid.georgetown.edu/sample.pdf - Thanks, Brendan

Problem with upload FileIndexer.php

When I upload a file using the importImages.php maintenance script I get these two errors. I have also noticed that the path names for the tools are hardcoded; wouldn't these be better as variables? On my CentOS box the tools go to /usr/bin and the script expects them in /usr/local/bin. Not a big deal, I created symlinks, but something to think about. Also, can you provide a tar.gz/zip release? It sucks to have to copy and paste, and it can cause errors when pasting into a terminal using vim, as I found out...

Notice: Undefined variable: Article in /export/home/www/default/wiki/extensions/FileIndexer/FileIndexer.php on line 394
PHP Notice: Trying to get property of non-object in /export/home/www/default/wiki/extensions/FileIndexer/FileIndexer.php on line 394

So far I have not managed to get a single one of my 1400 files indexed, despite having the FileIndexer extension set to index every file uploaded. I thought it was because I was using the maintenance script, but it looks like FileIndexer.php is being called. Could it be related to the error shown above? Thanks Brendan

Edit: Looks like the error is because the var $Author should be named $oAuthor....

Thanks for the advice - you are right (even if it is $oArticle ;-) ) - and sorry for being absent for such a long time --RaZe 08:47, 8 April 2010 (UTC)

Script is running but no index written

System: Windows 2000, XAMPP 1.7.1, MediaWiki 1.15.0

- Installed pdftotext, iconv, catdoc, catppt. Works on command line.
- Switched on $wgFiCreateIndexOnAllUploads = true;
- $wgFiRequestIndexCreationFile = "\tmp";

Script is running,

<includeonly><noinclude>FiAutoIndex</noinclude></includeonly> <!-- FI:INDEX-START --><!-- FI:INDEX-ENDE -->

is showing up in the comment area of the uploaded file. If $wgFiCheckSystem is set to true, programs for conversion are found (no error messages for the programs displayed).

Can there be an issue with the paths on Linux vs. Windows (for example with $sFileHashPath)? What would the path to the conversion programs have to look like?

Any help is appreciated...

Thanks in advance!

- This also happens on Linux in MediaWiki 1.15.1. The old version for MediaWiki 1.11 still works for version 1.15.1.

As someone mentions later, there are some fixes that make my extension run with 1.15.x. See http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#Update_for_MediaWiki_1.15.1_.3F
--RaZe 08:54, 8 April 2010 (UTC)

Thanks to the writer of the tip above, it worked! I just changed the dirs for the applications to 'D:\xxxxxxxx\pdftotext.exe '... Everything else was left as is.

Yes, this is coded for Linux, not Windows, but changing the paths should have helped, though I didn't check whether all apps use a corresponding CLI. --RaZe 08:54, 8 April 2010 (UTC)

Where did everything go?

I'm using the Ubuntu 8.04 based Turnkey Mediawiki appliance.

  • MediaWiki 1.14.0
  • PHP 5.2.4-2ubuntu5.5 (apache2handler)
  • MySQL 5.0.51a-3ubuntu5.4

I apt-got the prerequisite packages.

I added the line to my LocalSettings.php.

I created the three files in the FileIndexer directory.

I browsed to my main page and I am rewarded with...

A blank default page and no trace that my previous wiki content ever existed.

Help!

I don't know what to say other than that I do not think this can be caused by this extension... hopefully you have figured out the problem already --RaZe 08:57, 8 April 2010 (UTC)

How do you index existing documents

Didn't see this on the extension page?

Try going to the page Special:FileIndexer, put the article title in the edit box and click create. It explains how on the page as well. Hope this helps. Johannekie
I will remember to make this clearer on the extension page next time ;-) --RaZe 08:59, 8 April 2010 (UTC)

Searching not working

  • MediaWiki = 1.13.3
  • PHP = 5.2.6-3ubuntu4.2 (apache2handler)
  • MySQL = 5.0.75-0ubuntu10.2

I created the files as described for this extension and added the include to LocalSettings.php. I'm not sure if I had to add the part under Historical somewhere, but I didn't. And nothing is happening.

I uploaded a PDF file and searched for a word that I know is in that file, but it is just not looking inside. Also, will it be possible for me to search in a docx file (once I have this working)?

PLEASE help me.

  - I've found that if I go to the Special:FileIndexer page and try to do a re-index, I get the following error:
    Fatal error: Call to undefined method Article::newfromid() in /var/lib/mediawiki/extensions/FileIndexer/FileIndexer_body.php on line 131
  
  - Is this the reason that it is not working, and how do I fix it?
1.13.x doesn't have Article::newfromid. Please update your installation to 1.14 --Flominator 18:31, 9 February 2010 (UTC)

Update for MediaWiki 1.15.1 ?

I have tried version v0.2.2.00 with MW 1.15.1 but I did not get it to work. After lots of testing I then tried the historical version for MW >= 1.11.0. After this everything was working. Of course I would like to use the "new features" that were implemented, especially the new "Special Page" to re-index a PDF. Does anyone have a solution, or maybe the author is willing to help me (us) ;-) --SmartK 14:51, 23 February 2010 (UTC)

I have changed line 245 of FileIndexer.php. The problem was " " vs. "_"; now it generates the correct md5 hash:
                 $sFilepath = $wgUploadDirectory . "/" . FileRepo::getHashPathForLevel($oArticle->mTitle->mDbkeyform , 2) . $oArticle->mTitle->mDbkeyform ;
//               $sFilepath = $wgUploadDirectory . "/" . FileRepo::getHashPathForLevel($oArticle->mTitle->mTextform, 2) . $oArticle->mTitle->mTextform;
--Swus 05:46, 26 February 2010 (UTC)
Great work Swus. Thank you. You were right! Now the "Special Page" works when I enter "File:foo.pdf" or, in a German wiki setup, "Datei:foo.pdf".
I still had problems when uploading a new file (e.g. xls). It did not get indexed automatically. In the FileIndexer.php file I changed these lines:
//  $wgFiCreateIndexOnAllUploads = false;
    $wgFiCreateIndexOnAllUploads = true;
Now this is also fixed for me!
--SmartK 13:35, 26 February 2010 (UTC)
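
For anyone wondering why " " vs. "_" matters in Swus's fix: as far as I understand it, the upload sub-directory is derived from the md5 of the file name, so the name with spaces and the name with underscores end up pointing at different directories. A standalone sketch of that idea follows. `hashPath()` is a hypothetical stand-in for FileRepo::getHashPathForLevel(), written from my understanding of its behaviour, and the file names are made up.

```php
<?php
// Standalone sketch; hashPath() is a hypothetical stand-in for
// FileRepo::getHashPathForLevel(), which needs a MediaWiki installation.
function hashPath(string $name, int $levels): string
{
    $hash = md5($name);
    $path = '';
    for ($i = 1; $i <= $levels; $i++) {
        // Level 1 uses the first hex digit, level 2 the first two, and so on.
        $path .= substr($hash, 0, $i) . '/';
    }
    return $path;
}

$dbKey = 'My_report.pdf';   // DB key form: spaces stored as underscores
$text  = 'My report.pdf';   // text form: display title with spaces

// The two names hash differently, so only the DB key form leads to the stored file.
echo hashPath($dbKey, 2) . $dbKey, "\n";
echo hashPath($text, 2) . $text, "\n";
```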

Explain Special:FileIndexer page

  • Mediawiki 1.14.1
  • PHP 5.2.6-3ubuntu4.5 (apache2handler)
  • MySQL 5.0.75-0ubuntu10.3

Can someone maybe better explain to me what this means (on the Special:FileIndexer page):

On this form you may specify an articletitle <namespace>:<name> to fill with an index of all words used in an allready uploaded file. The name of the article an the file have to be equal! Even so the namespace may differ from 'image'!

If I put in the following (example of an uploaded file):

File:Uploaded_doc.docx

It says:

The article:

File:Uploaded_doc.docx

And the same as above again.

Which I assume is correct, but now when I do a search, it still doesn't search within the document.

--Johannekie 10:12, 24 February 2010
I'm still having trouble; I can't even search in a normal PDF. Johannekie 09:49, 1 March 2010

Yes I did. And I'm quite happy. What platform are you on? Linux (which distribution) or Windows? Maybe I can help?!? --SmartK 14:07, 1 March 2010 (UTC)
I just saw your earlier post. So you are on Ubuntu, same as me. Did you install the following (aptitude install xpdf iconv catdoc antiword)? Then you also have to change all the paths from /usr/local/bin to /usr/bin/ in the FileIndexer_body.php file. After this I also had to change this: [1] Now it should work... --SmartK 14:12, 1 March 2010 (UTC)
Hi, thanks so much. I've been trying to get this to work for a month. But unfortunately, after I made the changes you described... still nothing. Do I need to restart something, maybe? I did install those programs and I checked that they are all in /usr/bin. Then I did a reindex on an uploaded PDF. I searched for something inside, but still nothing. So either I'm doing something wrong or something is still missing.
Hit the "edit" button after reindexing a PDF file (not a scan, of course). Then check whether any code was inserted... and give feedback. --SmartK 16:07, 3 March 2010 (UTC)
Can you please look at my post on this subject at [2]? It is much easier to have a discussion like this there. Thanks. I'll post there and give results here when the problem is fixed :)


I posted something on the thread above. Please have a look at it.

Johannekie

Only lowercase letters

Is this only in my installation, or are all the converted words in lowercase letters only? Is this on purpose or am I doing something wrong? (MW 1.15.1) --SmartK 14:32, 26 February 2010 (UTC)

Hi. Coming back to this page after a long time (lots of other work; sorry for all the requests I didn't answer, and thanks to everyone who helped out!), I will try to answer some requests now.
Yes, this is on purpose. MediaWiki internally searches case-insensitively. To minimize the text of the article, everything is switched to lowercase.
--RaZe 08:39, 8 April 2010 (UTC)
Thank you... great to know! Is there a "switch" where I could turn this off? --SmartK 09:53, 8 April 2010 (UTC)
No, there isn't, sorry. But if you give me a reason I will implement it in the next version (although only god knows when I will have the time to...) --RaZe 14:22, 17 May 2010 (UTC)
I have no reason so far, as the MediaWiki search algorithm does not care about capital letters. But that could change at some point later on. So why not ;-) --SmartK 10:02, 28 July 2010 (UTC)

Index xlsx, docx, pptx files

Has anyone worked out how to index the MS Office 2007 file types (xlsx, docx, pptx)? I'm currently looking into xls2csv but I can't find anything about xlsx. --SmartK 15:26, 26 February 2010 (UTC)

I think it might be similar to OOo files: unzip them and take the largest XML file. --Flominator 20:17, 2 March 2010 (UTC)
Thank you Flominator for the reply. I unzipped the docx file and voilà: there are lots of XML files. One of them is named "document.xml" and is in the "word" folder. If anyone could implement this extraction process in FileIndexer it would be greatly appreciated, as I'm not fit to do this :-( --SmartK 16:04, 3 March 2010 (UTC)
Shouldn't be too complicated (I don't have any testing environment at hand). Something like this code, derived from the OOo function, should work:
case "xlsx":
case "pptx":
case "docx":{
			$sExecutionCommand = "unzip -p " . $sFileHashPath . " document.xml";
			$bRemoveTags = true;
			break;
		}
Pretty cool. Sounds very good. I will try it tomorrow and will then give feedback. Thank you for your help! --SmartK 20:00, 3 March 2010 (UTC)
Update: you were right! Just a little tweak (the name of the directory "word" was missing) and now it's working. --SmartK 15:05, 4 March 2010 (UTC)
case "docx":{
			$sExecutionCommand = "unzip -p " . $sFileHashPath . " word/document.xml";
			$bRemoveTags = true;
			break;
		}

Glad to hear that. --Flominator 19:24, 4 March 2010 (UTC)

With your permission, Flominator, I will take that code into the next version. I think I will get to an update in the next few weeks (I hope so) --RaZe 09:03, 8 April 2010 (UTC)
Dear RaZe. I will gladly provide the code for all Office 2007 files; if you want it, that would be no problem. I also improved the German language part, as there were many mistakes in it, if you are interested. --SmartK 09:52, 8 April 2010 (UTC)
Dear SmartK. This would be very nice. Please leave a note on this page. Hopefully I will have some resources for this extension in the near future. --RaZe 14:10, 17 May 2010 (UTC)
This is the code for Office 2007 / 2010 documents, which has to be added to "FileIndexer.php" --SmartK 09:56, 28 July 2010 (UTC)
    case "docx":{
        $sExecutionCommand = "unzip -p " . $sFileHashPath . " word/document.xml";
        $bRemoveTags = true;
        break;
    }

    case "xlsx":{
        $sExecutionCommand = "unzip -p " . $sFileHashPath . " xl/sharedStrings.xml";
        $bRemoveTags = true;
        break;
    }

    case "pptx":{
        $sExecutionCommand = "unzip -p " . $sFileHashPath . " ppt/slides/slide*.xml";
        $bRemoveTags = true;
        break;
    }
And this is the slightly improved code for the German translation in the "FileIndexer.i18n.php" file --SmartK 09:59, 28 July 2010 (UTC)
	$messages['de'] = array(
		'fileindexer' => 'FileIndexer: Index einer Datei erstellen',
		'fileindexer_wrong_namespace' => 'Der angegebene Namensraum ist nicht zulässig!',
		'fileindexer_destination' => 'Titel des zu indexierenden Artikels:',
		'fileindexer_comment' => "In diesem Formular kann ein Artikeltitel <Namensraum>:<Name> angegeben werden (z.B. Datei:indexiere-mich.pdf). Die Datei wird dann indexiert und ist somit durchsuchbar.",
		'fileindexer_no_params' => 'Sie müssen einen gültigen Artikeltitel &lt;Namensraum&gt;:&lt;Name&gt; angeben und die zu indexierende Datei muss unter diesem Namen als hochgeladene Datei vorliegen!',
		'fileindexer_articlelink' => 'Indexierung abgeschlossen... Hier geht es zum Artikel:',
		'fileindexer_systemcheck_with_errors' => 'Eine Prüfung der Systemvoraussetzungen lieferte Fehler!',
		'fileindexer_submit_button' => 'Index erstellen'
	);
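
Regarding the Office 2007 cases earlier in this section: if calling the unzip binary is not an option (e.g. on Windows), the same XML part can probably be read with PHP's ZipArchive class instead. This is only a sketch, assuming the zip extension is installed. `officePartText()` is a hypothetical helper, not part of FileIndexer. Note that getFromName() does not expand wildcards, so for pptx each slide part would have to be requested by its own name.

```php
<?php
// Hypothetical helper: read one XML part from an Office 2007 file and strip its tags.
// Assumes the PHP zip extension (ZipArchive) is available.
function officePartText(string $file, string $part): string
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';                      // not a readable zip container
    }
    $xml = $zip->getFromName($part);    // e.g. "word/document.xml"
    $zip->close();
    if ($xml === false) {
        return '';                      // part not found in the archive
    }
    // Put a space in front of every tag so text from adjacent paragraphs does not
    // fuse into one word. The trade-off: a word split across runs gets split too.
    $text = strip_tags(str_replace('<', ' <', $xml));
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Usage (the path is a placeholder):
// echo officePartText('/path/to/report.docx', 'word/document.xml');
```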

Problems getting FileIndexer to Work

I was able to get the code <includeonly><noinclude>FiAutoIndex</noinclude></includeonly> <!-- FI:INDEX-START --><!-- FI:INDEX-ENDE --> on the edit sections of my uploaded files, but when I use the native MediaWiki search to look for something I know is in one of the files, it comes up with zero results.

I have all the required tools in the proper directories (pdftotext, antiword, etc.), so why isn't it searching my uploads? Is this the search engine I should be using? Not sure what I'm doing wrong; this extension is a bit frustrating.

Running MediaWiki 1.15.1, PHP 5.2.6-1+lenny3 (apache2handler), MySQL 5.0.51a-24+lenny2

--Neezy 17:31, 7 April 2010 (UTC)

Sorry, but the following questions are not clearly answered:
  • Did you follow http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#Update_for_MediaWiki_1.15.1_.3F ? (I have no 1.15 running yet, but I will soon.)
  • By saying "I was able to get the code <includeonly><noinclude>FiAutoIndex</noinclude></includeonly> <!-- FI:INDEX-START --><!-- FI:INDEX-ENDE --> on the edit sections of my uploaded files" do you mean using the extension or adding it manually?
  • Did you try to (re)index a file through the special page?
Just to be clear: this extension does not search the files on demand but at upload time (or when using the special page). When an index is created, all words found will stand between these tags: <!-- FI:INDEX-START --> and <!-- FI:INDEX-ENDE -->. As there is nothing between them, MediaWiki is not able to find the word you tried to get results for...
--RaZe 09:21, 8 April 2010 (UTC)
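
To check quickly whether a file page actually got an index, the text between the two markers mentioned above can be pulled out of the page source. A small sketch follows. `extractIndex()` is a hypothetical helper, and it assumes the markers appear exactly as quoted.

```php
<?php
// Hypothetical helper: return the words stored between the FileIndexer markers,
// or null when the markers are missing from the page text.
function extractIndex(string $wikitext): ?string
{
    $pattern = '/<!-- FI:INDEX-START -->(.*?)<!-- FI:INDEX-ENDE -->/s';
    if (preg_match($pattern, $wikitext, $m) !== 1) {
        return null;
    }
    return trim($m[1]);
}

$empty  = '<!-- FI:INDEX-START --><!-- FI:INDEX-ENDE -->';
$filled = '<!-- FI:INDEX-START --> leitfaden nutzung protokolle <!-- FI:INDEX-ENDE -->';

var_dump(extractIndex($empty) === '');   // bool(true): markers present, nothing indexed
echo extractIndex($filled), "\n";        // prints: leitfaden nutzung protokolle
```

An empty string here means the conversion step produced no words, so the search has nothing to match.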

FileIndexer not working with MediaWiki 1.16.0

  • I had already tried the betas of MW 1.16 but had no luck so far. Now that the final MW 1.16 is released, I was wondering if someone could look into the problem. When uploading any file I just get an empty screen. No error messages. --SmartK 09:51, 28 July 2010 (UTC)
    • PHP Fatal error: Cannot access protected property SpecialUpload::$mComment in /var/www/html/w/extensions/FileIndexer/FileIndexer.php on line 307


To let everyone know: I think you will be pleased to hear that I am really close to a totally new version of this extension, with a much improved special page and much more. It will also run on MW 1.16. SmartK is beta-testing it right now. Greetings --RaZe 14:29, 30 September 2010 (UTC)

The problem should be fixed with this patch. --Flominator 13:30, 22 December 2010 (UTC)

Index previously uploaded documents

  • Is there a way to index "ALL" my previously uploaded documents (uploaded before installing FileIndexer) at once? Let's say I want to index all my PDFs. I know I can do it one by one via the "Special:FileIndexer" page. But maybe someone has already written a script? This would be highly appreciated. --SmartK 15:07, 2 August 2010 (UTC)


Coming with the new version (as you already know :D) --RaZe 14:30, 30 September 2010 (UTC)

Problem with .doc files and special characters (antiword)

  • When you index .doc/.dot files with special characters, you often just get "?" instead of those special characters.
    • Solution: add this parameter to the line with "antiword" in the FileIndexer.php file: -m UTF-8.txt
    • So it should look like this:
case "dot":{}
case "doc":{
	$sExecutionCommand = "/usr/bin/antiword -m UTF-8.txt -s ".$sFileHashPath;
	break;
	}

--SmartK 14:00, 6 August 2010 (UTC)

New version takes that into account, too. --RaZe 14:31, 30 September 2010 (UTC)