Extension talk:FileIndexer

From mediawiki.org

Discussions Pre Version 0.4.5.03

Hi everyone.

To make maintenance work on this extension a bit easier for me, I decided to clear this talk page.
All discussions from before this version have been moved to a subpage: Discussions Pre Version 0.4.5.03.

So... party on.
--RaZe 16:21, 3 October 2010 (UTC)

German signs make code unreadable + File contents not ending with ?>

Hi, thanks for this great extension! I had some trouble getting it to work. First, I hadn't noticed that the 4 pages with the file contents do not end the code with ?>. Second, the line for German-only characters caused PHP to generate an error; apparently the characters were converted into something PHP could not read after I pasted the content into vi. I had to delete that whole part. -Bob

Hi Bob. To be honest, I had never seen PHP code without the closing tag until MediaWiki. Even now this is the only place where I use this feature, because it seemed to be normal for MediaWiki developers, as I found more and more extensions doing this.
About the German characters: I will try to find a better way soon/next time. To let others know, I will add this to the topic.
--RaZe 00:03, 20 November 2010 (UTC)
It is a part of our Coding_Conventions#PHP_pitfalls because closing tags can cause issues when the files are included, and we don't lose anything (benefit-wise) if they are left out. Peachey88 00:54, 20 November 2010 (UTC)

Getting no index

Hi, I'm running version 0.4.5.03 on MW 1.16.2 and PHP 5.3.3 on Ubuntu 10.10. I've carefully followed all the installation instructions and installed all the required tools.
The special page works and gives a list of articles for which the index update process was started. However, when I search for a word that I know is in one of the documents, nothing is found. If I go to a file page it shows:

File Index

The following index was taken from the files content:

{{{index}}}

Any ideas? How can I debug this?
Many thanks. --Mitchelln 11:32, 21 April 2011 (UTC)

Fixed. Of course you need to remove the <pre> and </pre> tags from the FileIndex template.
-Mitchelln 11:32, 21 April 2011 (UTC)

Anyone have this working with MediaWiki 1.17.0 and PHP 5.3.6?

MediaWiki 1.17.0 PHP 5.3.6 (cgi-fcgi) MySQL 5.1.48-log

Running with...

Installed software

  • MediaWiki 1.17.0
  • PHP 5.3.2 (apache2handler)
  • MySQL 5.1.52

Installed extensions

  • FCKeditor (Version 1.0.1) Allows editing using the FCKeditor WYSIWYG editor. Frederico Caldeira Knabben, Wiktor Walc, others and Jack Phoenix
  • FileIndexer (Version 0.4.5.03) Index generation from uploaded files so that search functions can pick them up. Ramon Dohle (raZe) | Original: MHart and Flominator
  • poppler-utils

Missing Checkbox "FileIndexer: [ ] Create/update index"

I'm missing the checkbox "FileIndexer: [ ] Create/update index" after installing FileIndexer. It is shown on the page "Special:Upload" but not on "Special:UploadWindow". How can I make the checkbox appear on "Special:UploadWindow"? - stevewilson 11:45, 18 November 2011 (UTC)

No index...

Added FileIndexer to my MediaWiki installation. Uploaded a .docx file, and tried to search for some text that was inside the file... and I got nothing.

Also, when I try to upload a .doc file, it says MIME type doesn't match file extension!?

Epistasis 16:20, 24 November 2011 (UTC)

Any websites running FileIndexer? (22dec2011)

Hi everybody, this extension seems a bit difficult for me to install. I love the idea of being able to search within pdfs, but don't want to expend all of the effort if the extension isn't working. Does anybody have a link to a wiki I can see where it is working? Thanks in advance

Hi, you are right. It's not that easy to install, but once it's working it is definitely worth it. I only have it running on 3 internal wikis (1.16 and 1.17). FileIndexer works great there! --SmartK 07:09, 23 December 2011 (UTC)
Any chance you can provide a link to any of the wikis so that I could get a sense of how it works?

Word documents not working

I've tried to upload and index .odt, .pdf and .doc files. The .odt and .pdf files get an index, the .doc file doesn't. Does anyone have a solution? - stevewilson 15:58, 29 December 2011 (UTC)

Have you installed "antiword"? And is it set up correctly, e.g. "/usr/local/bin/antiword"? --SmartK 16:44, 29 December 2011 (UTC)

Any Windows file reader?

This is obviously designed with Linux in mind. But there are still many out there who prefer to use Windows-based web servers. So my question is, does anyone know if there are any Windows file readers for the file types listed in the article? Jamesjiao 22:49, 31 January 2012 (UTC)

Hi Jamesjiao, sorry, I haven't researched this, but I am sure there are.
Right now I am not really sure whether there is anything in the code that would prevent this extension from running on Windows, except obviously the configuration. Maybe I will take a look at this when I try to fix the incompatibility issue with MediaWiki 1.18.x. --RaZe (talk) 12:36, 24 February 2012 (UTC)

FileIndexer fails in MediaWiki 1.18.1 :-( --> NOW working with update

  • This is the error code I get with 1.18.1:
"Fatal error: Call to a member function addMessages() on a non-object in /var/www/xyzwiki/extensions/FileIndexer/FileIndexer.php on line 133"
Hi SmartK,
Swus already asked for help on my talk page. As I told him, I will try that out as soon as possible. Sorry I can't say any more right now.--RaZe (talk) 12:31, 24 February 2012 (UTC)
Thank you RaZe. We should remove the phrase from the "extension page": "The author of this extension is no longer maintaining it!" I think it's great that you are still trying to help.... and I know it's a lot of work! --SmartK (talk) 13:47, 24 February 2012 (UTC)
SmartK has released a new version that works with 1.18.1 (I have not tested it, but you may try it out now).

Does this extension get all of the text from a pdf?

  • This extension seems great and I had no trouble installing it.
  • However, I notice that only some of the words in the pdf I upload go into the wiki. If I take the same file and use "pdftotext name.pdf text.txt" in the commandline, I get a text file with the full content of the pdf. But in the wiki, many words are missing. Is there a switch I'm missing?
  • Ideally, I'd want to just dump in the full-text of a pdf the same way it comes out when I run pdftotext by itself. Is that possible? In case it helps, I am also trying to start with a file with charset=binary. Would that potentially be causing a problem?
    • I looked in the source code but could not find this option. I think it would be a good idea to implement it, though. Maybe raZe can help?!? His idea at the time was that an "index" only needs each word once to be searchable, but I understand your idea. So let's hope he is willing to put some time into this.
    • You should change the following in the file "FileIndexer_cfg.php". This will not help to solve your problem but you will need this anyway.
$wgFiMinWordLen = 1;
$wgFiLowercaseIndex = false;
Yea, I did figure that out. I'd like to get around the removal of duplicate words though. Also, I'd rather preserve the formatting of the original document (as much as possible!). Since the default pdftotext option does exactly what I want, there has to be a way where I could just comment out a lot of the special indexing and just put the text file in directly.
I really hope I can find a way to fix this.
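Note: the behaviour discussed in this thread (short words dropped, optional lowercasing, every word kept only once) can be pictured with the small sketch below. It is not the extension's actual code; the function name fiSketchBuildIndex and its parameters are made up for illustration, and it only assumes that the index is built roughly the way the two settings above suggest.

```php
<?php
// Hypothetical illustration, NOT taken from FileIndexer.php.
// $minWordLen and $lowercase play the role that $wgFiMinWordLen and
// $wgFiLowercaseIndex are assumed to play in the extension.
function fiSketchBuildIndex($text, $minWordLen, $lowercase)
{
    $index = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if ($lowercase) {
            $word = strtolower($word);
        }
        if (strlen($word) < $minWordLen) {
            continue; // too short, dropped from the index
        }
        // Array keys are unique, so a repeated word is stored only once.
        $index[$word] = true;
    }
    return implode(' ', array_keys($index));
}

// Strict settings: short words are dropped and case is folded.
echo fiSketchBuildIndex("The cat saw the other Cat", 3, true) . "\n";
// Relaxed settings as suggested above: more words survive,
// but exact duplicates are still removed.
echo fiSketchBuildIndex("a bb a bb ccc", 1, false) . "\n";
```

With the relaxed settings more words survive, but word order and repetition are still lost, which is why the full text cannot be recovered from such an index.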
Hi, if I get you right the following is what you want.
Fragment of codefile FileIndexer.php:
 function wfFiGetIndex($sFileHashPath){

~~~ CUT ~~~

        foreach ($sDocText as $sDocLine){
            if(in_array($sFileExtension, $wgFiTypesToRemoveTags)){
                // Remove tags... First insert a space before every "<" so that no words run together!
                $sDocLine = strip_tags(str_replace("<", " <", $sDocLine));
            }

            // *** ADD THIS 1ST SHORT BLOCK IF YOU WANT THE FULL PDF CONTENT AS INDEX
            // (note the comparison with ==; a single = would assign "pdf" and always be true)
            if ($sFileExtension == "pdf"){
                $sReturn .= $sDocLine;
                continue;
            }
            // *** END OF 1ST BLOCK
 
~~~CUT~~~
        }

        // *** ADD THIS SHORT 2ND BLOCK IF YOU WANT THE FULL PDF CONTENT AS INDEX
        if ($sFileExtension == "pdf"){
            return $sReturn . $wgFiPostfix;
        }
        // *** END OF 2ND BLOCK

        // Set the index globally...
        foreach(array_keys($aIndex) as $skeyword){
            $sReturn .= $skeyword . " ";
        }
 
        $sReturn .= $wgFiPostfix;
    }
 
    return $sReturn;
}
Note that this solution only covers pdf files... feel free to add other file types.
Right now I don't have the resources to test this, so if there are typos I hope you can eliminate them. In general I think that should do it.
Maybe I will implement this as an option (by file type) in the next version (whenever that comes).
Regards --RaZe (talk) 11:07, 20 April 2012 (UTC)
Thank you RaZe for your fast answer. I also think it would be great to add this option as a variable in the config file (in the future). Let's hope it works and Mr. Anonymous is now happy ;-) --SmartK (talk) 11:18, 20 April 2012 (UTC)
Yes, thanks so much RaZe. I can't thank you enough for your kindness. I, Mr. Anonymous/too lazy to reset password am very very happy now! :)

Can't get my index to work

Hey, I hope I can get some help with this problem. I've got the extension to work; I can open the special page and select the files. When I select the "Main" namespace and click "Create", I get this message: "For the following list of articles the index creation process was started:". After that a new page is created with the file name (test.pdf) and an index, but the index is empty and there is no file on the page, and I am not able to search for anything within the pdf file.

If I could get some help I would be very happy :)

Regards, Mikkel

Hi Mikkel, there is no file on this page. That's correct. I don't link files to the created pages; these are still only linked in the images namespace (or files namespace, which is an alias).
That you can reach the special page only says that the extension's code is reachable. One problem (most likely) can be that the required tools are not reachable (see Extension:FileIndexer#Requirements). Maybe you have to adjust $wgFiCommandPaths in the file FileIndexer_cfg.php.
Did you try to set $wgFiCheckSystem = true in the file FileIndexer_cfg.php? Please send me the result.
Did you follow all instructions from the install description? --RaZe (talk) 14:18, 7 May 2012 (UTC)
Hi, I got all the other programs installed and working, and they should be set correctly in $wgFiCommandPaths.
Can you tell me where I can read the output from "$wgFiCheckSystem"?
Hi, the output is written on the special page when you call it and $wgFiCheckSystem is set to true, but only if a command is not reachable. What I am not really sure about right now is whether this really works as intended (I didn't really test this). I will check asap whether the which command really does what I expect it to do... I will get back to you on this --RaZe (talk) 11:53, 15 May 2012 (UTC)
Hi, that sounds good. I really hope you can help me get this plugin to work.
Mikkel 14:00
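Note: while the behaviour of $wgFiCheckSystem is unconfirmed, the reachability of the helper programs can be checked by hand with a small standalone script like the sketch below. It is not part of FileIndexer. The function name fiFindMissingCommands is made up, and the assumption that the configuration maps a tool name to a path or command name is only a guess at how $wgFiCommandPaths is organised. It is written for a current PHP version on a Unix-like system.

```php
<?php
// Hypothetical helper, NOT part of FileIndexer: returns the names of all
// configured helper programs that cannot be found or executed.
function fiFindMissingCommands(array $commandPaths)
{
    $missing = [];
    foreach ($commandPaths as $name => $path) {
        if (strpos($path, '/') !== false) {
            // A path was given: check that file directly.
            $found = is_executable($path);
        } else {
            // A bare command name: let the shell resolve it via PATH.
            // `command -v` is the POSIX counterpart of `which`.
            $output = shell_exec('command -v ' . escapeshellarg($path) . ' 2>/dev/null');
            $found = is_string($output) && trim($output) !== '';
        }
        if (!$found) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Example values only; replace them with the paths from your own
// FileIndexer_cfg.php.
$examplePaths = [
    'pdftotext' => '/usr/bin/pdftotext',
    'antiword'  => '/usr/local/bin/antiword',
];
foreach (fiFindMissingCommands($examplePaths) as $name) {
    echo "Not reachable: $name\n";
}
```

Run it as the same user the web server runs as, since PATH and file permissions may differ from those of your login shell.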

This extension changes my layout when in edit mode

I like this extension, which can index doc files and some docx, ppt, pdf and Excel files (I do not know why it is not working on some files; maybe the /usr/bin/xxx tools have a different usage syntax?). The biggest problem is that when I use this extension, my layout changes in edit mode, like this: wrong layout

But when I disable this extension in LocalSettings.php, things return to normal, like this: good_layout

Hi, I have never heard of that problem (layout changes) and I haven't experienced it myself so far... I will have to look into the code to see if there are some HTML bugs (there may be).
But for your 2nd problem (some docs not indexed): this might be a problem with MIME type recognition. I posted a small hack/fix for this earlier (though I don't really know if this is still the actual problem, because I still run an old wiki version). Look on my user page for the link --RaZe (talk) 12:09, 15 May 2012 (UTC)
Me again, I have three questions:
  1. Is it the upload form or the normal edit form that is broken?
  2. Which MediaWiki skin do you use?
  3. Maybe you can link an HTML file with the source code of the page where the layout is broken? --RaZe (talk) 12:18, 15 May 2012 (UTC)

PHP Notice: Undefined property: LocalRepo::$directory in FileIndexer.php on line 263

PHP Notice: Undefined property: LocalRepo::$directory in /var/www/html/wiki/extensions/FileIndexer/FileIndexer.php on line 263, referer: http://host/wiki/index.php/Special:Upload

Line 263: $sUploadedFilepath = $oImage->getLocalFile()->repo->directory . "/" . $oImage->getLocalFile()->hashPath . $oImage->getLocalFile()->name;
Does anybody know how to solve it?

Security warning?

Has there been any talk regarding addressing the code-injection potential of this extension?

What exactly is the code-injection potential of this extension? I can't see any other comment about it on this page, and the main page has been replaced with an "Extension Removed" page, without a history. Is the issue that PDF files could contain code, which is then executed somewhere? What is required to address the security risk? Alternatively, is there another extension that indexes PDFs, DOCs, etc.? Ismarandir (talk) 10:35, 24 June 2013 (UTC)
Can someone please explain the security issue, or link to the explanation?? --Sophivorus (talk) 00:20, 14 November 2014 (UTC)
I've uploaded the updated extension to https://github.com/Sophivorus/FileIndexer, but the safety issues haven't been solved, as no one explains what they are, and there seems to be no relevant bug report in bugzilla. --Sophivorus (talk) 14:07, 16 November 2014 (UTC)

Call to protected method FileRepo::getHashPathForLevel() from context ''

  • I'm receiving "Call to protected method FileRepo::getHashPathForLevel() from context" when attempting to upload and create the indexes of files. Currently using 1.20.2. Has something changed in the recent mediawiki version to prevent this function from being called from an extension?
This is how I solved it:
Change in FileIndexer.php (switch out the first line for the second one)
//$sFilepath = $wgUploadDirectory . "/" . FileRepo::getHashPathForLevel($oArticle->mTitle->mDbkeyform , $wgHashedUploadDirectory ? 2 : 0) . $oArticle->mTitle->mDbkeyform;
$sFilepath = $wgUploadDirectory . "/" . intGetHashPathForLevel($oArticle->mTitle->mDbkeyform , $wgHashedUploadDirectory ? 2 : 0) . $oArticle->mTitle->mDbkeyform;
Hope it helps... Let me know. --SmartK (talk) 08:24, 30 January 2013 (UTC)
Update: I updated the FileIndexer.php here in the "downloads".
I am still getting this error
  • Fatal error: Call to undefined function intGetHashPathForLevel() in /var/www/html/w/extensions/FileIndexer/FileIndexer.php on line 314
  • Getting the new file fixed the original problem, but caused the one above.
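Note: the "undefined function" error suggests that the updated file calls intGetHashPathForLevel() without defining it. A minimal sketch of such a helper follows, assuming it is meant to reproduce MediaWiki's hashed upload directory layout (first one, then first two hex characters of the MD5 of the file name). This is a guess at what the missing function was supposed to do, not SmartK's actual code.

```php
<?php
// Hypothetical stand-in for the missing helper. It assumes the usual hashed
// upload layout, e.g. images/a/ab/Name.pdf, where "a" and "ab" are the leading
// hex characters of md5("Name.pdf").
function intGetHashPathForLevel($name, $levels)
{
    if ($levels == 0) {
        return ''; // hashed upload directories are switched off
    }
    $hash = md5($name);
    $path = '';
    for ($i = 1; $i <= $levels; $i++) {
        $path .= substr($hash, 0, $i) . '/';
    }
    return $path;
}

// Prints the relative path under the upload directory for a file name.
echo intGetHashPathForLevel('Test.pdf', 2) . "Test.pdf\n";
```

If a helper like this is placed in FileIndexer.php above the line that calls it, the name passed in should be the database key form of the title (underscores instead of spaces), as in the call shown earlier in this section.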

No index generated for existing files

Greetings,

After a while I got the extension to work, it's great! When I upload a new file the index is saved in the table searchindex, and I can even find words with German characters. But it seems the index is not generated when a new version of an existing file is uploaded. Are you aware of this issue? From my point of view this functionality is very important.
In order to create a template as in step 6 of your installation document, I used the link http://localhost/wiki/index.php/Template:FileIndex, then I pasted the template content and saved it. But after I upload a file, the place where the index should be is empty; I get only the text message "The following index was taken from the files content:". Where am I going wrong?

Would it be possible to get your help? Thanks in advance. Feb 1 12:50 2013 (CET)

FileIndexer 0.4.6.01 / MediaWiki 1.15.5-7 / PHP 5.3.10-1ubuntu3.5 (apache2handler) / MySQL 5.5.29-0ubuntu0.12.04.1

Could you reproduce the issue and, if so, is there any plan to fix it? Otherwise I have to look for an alternative solution.
Please answer asap. Feb 14 14:37 2013 (CET)

Extension disabled for safety reasons !?

Safety, OK, I understand, but I do not understand why those who want to use this extension are not permitted to!!! Under these conditions, nobody will improve the safety of this extension.

It looks like what I was looking for ....
I still use it but it sadly does NOT WORK on MW 1.23 :-( --SmartK (talk) 13:30, 9 July 2014 (UTC)
I've uploaded the updated extension to https://github.com/Sophivorus/FileIndexer, but the safety issues haven't been solved, as no one explains what they are, and there seems to be no relevant bug report in bugzilla. --Sophivorus (talk) 14:06, 16 November 2014 (UTC)

Resubmit

We have updated the FileIndexer extension to work with MediaWiki 1.23.x and would like to resubmit it here. Would this be OK? The security risks should now be solved. --SmartK (talk) 09:32, 7 August 2014 (UTC)

"should now be solved" Are they solved, or not? :) Maybe you can upload the extension somewhere (github maybe?), so some developers can take a look at it. Maybe you can request assistance in #wikimedia-dev. --Florianschmidtwelzow (talk) 22:07, 18 October 2014 (UTC)
Can someone explain what the security issue with this extension is, or link to the explanation?? --Sophivorus (talk) 00:19, 14 November 2014 (UTC)
I've uploaded the updated extension to https://github.com/Sophivorus/FileIndexer, but the safety issues haven't been solved, as no one explains what they are, and there seems to be no relevant bug report in bugzilla. --Sophivorus (talk) 14:06, 16 November 2014 (UTC)

Security

The security issues arise from the fact that the "exec" command is used in several places, and the exact command that is executed is joined together from several variables that are defined and modified throughout the code. That way it is not possible to properly escape the command and the arguments, so the risk that some malicious code is executed is rather high. For example, think about uploading a document that is named "nothing | rm *.xls". When the file name is passed to the exec command (and is not properly escaped), files may get deleted (I didn't try it, but I think you get the idea).

The dependencies for the third-party-programs are a bit complicated, so I will use "Apache Tika" in the future to extract data from files. --Markus Kappe / DIX web.solutions; 28 August 2015
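Note: one common way to reduce the risk described above is to build the command in a single place and pass every part through PHP's escapeshellarg() before it reaches exec. The sketch below only illustrates that idea under stated assumptions; fiBuildCommand is a made-up name, the pdftotext path is an example, and this is not a reviewed fix for the extension.

```php
<?php
// Hypothetical helper, NOT part of FileIndexer: builds a shell command from a
// program path and a list of arguments, quoting each part individually so that
// shell metacharacters in a file name are treated as literal text.
function fiBuildCommand($program, array $arguments)
{
    $parts = [escapeshellarg($program)];
    foreach ($arguments as $argument) {
        $parts[] = escapeshellarg($argument);
    }
    return implode(' ', $parts);
}

// The file name from the example above stays one harmless argument.
// The trailing "-" asks pdftotext to write the text to standard output.
$command = fiBuildCommand('/usr/bin/pdftotext', ['nothing | rm *.xls', '-']);
echo $command . "\n";
```

On a Unix-like system this should print the file name wrapped in single quotes, so the pipe and the wildcard never reach the shell as syntax. Quoting only covers shell metacharacters; a file name that starts with a dash could still be read as an option by the called program, so it needs separate handling.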

Thank you Markus for the explanation. Is there anyone out there who is willing to help us "fix" / "improve" this extension here (https://github.com/Sophivorus/FileIndexer) so we can get rid of the security issues and can use it again? --SmartK (talk) 06:54, 22 October 2015 (UTC)

Is there any replacement for the FileIndexer extension?

I'm a long-time user of the FileIndexer extension; it worked very well for me. I read that, up to 2015, someone worked (or tried to work) on it in order to remove the declared security issue. After that, silence: the extension was totally abandoned by everybody. So I'm wondering if someone could tell me whether there is a more recent extension that can replace it and is capable of indexing uploaded files. I saw something similar with the CirrusSearch extension, but it seems to provide only pdf indexing, while I need to index pdf, doc, docx, xls, xlsx and ppt files. Thanks a lot in advance. Bye

We still use: https://github.com/Sophivorus/FileIndexer SmartK (talk) 09:34, 27 November 2017 (UTC)