Extension:FileIndexer

From MediaWiki.org
FileIndexer

Release status: beta
Implementation: Special page
Description: This extension makes uploaded document files searchable.
Author(s): MHart, Flominator, Hxwiki, raZe
Last version: 0.4.5.03
MediaWiki: New version: 1.14+; Flominator's version: 1.9.0 - 1.11 (earlier versions require patching)
License: No license specified
Download: No link; see section 'Changelog'
Parameters: see section 'Configuration'
Hooks used: EditPage::showEditForm:initial, UploadForm:initial, UploadForm:BeforeProcessing, ArticleSave, UploadComplete

History

MHart modified the standard upload page so that the contents of uploaded Microsoft Word, Microsoft Excel, Microsoft PowerPoint, and Adobe PDF documents become indexable.

He started by downloading and installing various Linux command-line utilities that take one of the above formats and output the text: antiword, xls2csv, ppt2text (catppt), and pdftotext.

Then he modified SpecialUpload.php at the point where it tests for a successful upload, just before it inserts the uploaded file information into the database. The modification placed the extracted text of the document in an HTML comment block in the description text of the file's description page.

Two years later Flominator found the hook UploadForm:BeforeProcessing and turned the modification into an extension. Hxwiki then modified the code to work with MediaWiki 1.11. See section 'Historical Versions'.

Until MediaWiki version 1.8 (2006) the extension required patches to core MediaWiki code. If you would like to use this extension, we strongly recommend upgrading to a version of MediaWiki later than 1.8.

In May 2008 I (raZe) published a new, much more advanced version of this extension as a complete rewrite.

Compatibility with MediaWiki

 MediaWiki version     | Tested by  | Date
 1.11 untested         | no one
 1.12 untested         | no one
 1.13 fails            | Johannekie | 23.02.2010
 1.14.x running 100%   | raZe       | 29.06.2009
 1.15.x running 100%   | raZe       | 03.10.2010
 1.16.0-5 running 100% | SmartK     | 16.08.2011
 1.17.0 running 100%   | SmartK     | 20.10.2011
 1.18.0 fails          | Swus       | 29.11.2011

Requirements

  • The extension uses external tools to read the content of supported file types. The default configuration expects these tools to be installed on a Linux server; the tools in use are the ones mapped in $wgFiCommandPaths (see section 'Configuration').
  • If you want to use this extension on a Windows server, you need to find and configure corresponding tools.

Installation

The following steps are needed for installation:

Step 1: Prepare Extension Directory

Create a folder 'extensions/FileIndexer' in MediaWiki's document root.

Step 2: Copy Code

Create all code files (currently four) inside this new folder and make sure the web server can read them.

Files:

Step 3: Configuration

Open the file FileIndexer_cfg.php in an editor and configure the extension for your needs. See section 'Configuration' for details.
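
As a rough orientation for a first installation, the fragment below sets a few of the parameters to the defaults listed in section 'Configuration', with $wgFiCheckSystem switched on as that section recommends for new installs. That FileIndexer_cfg.php uses plain global assignments like these is an assumption; check the actual file before copying anything.

```php
<?php
// Sketch only. Values are the documented defaults unless noted otherwise.
// Assumption: the configuration file sets plain global variables like these.

// Documented default is FALSE; TRUE is recommended while installing,
// so that missing external tools are reported.
$wgFiCheckSystem = true;

// Directory the web server must be able to write to (documented default).
$wgFiRequestIndexCreationFile = '/tmp';

// Index filtering (documented defaults).
$wgFiMinWordLen = 3;
$wgFiLowercaseIndex = true;

// Checkbox presets on the upload form and the edit form (documented defaults).
$wgFiCreateOnUploadByDefault = true;
$wgFiUpdateOnEditArticleByDefault = false;
```

Once the system check reports no problems, $wgFiCheckSystem can be set back to FALSE.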

Step 4: Temporary Files Directory

Make sure the web server has write access to the directory configured in the parameter '$wgFiRequestIndexCreationFile'.
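
One way to verify this is a small standalone check like the sketch below. It is not part of the extension; the function name fiSketchDirIsWritable is made up for illustration, and '/tmp' is simply the documented default of the parameter. The result is only meaningful if the script runs as the same user as the web server.

```php
<?php
/**
 * Hypothetical helper (not part of FileIndexer): reports whether $dir
 * exists, is a directory, and is writable for the current user.
 */
function fiSketchDirIsWritable( $dir ) {
    return is_dir( $dir ) && is_writable( $dir );
}

// '/tmp' is the documented default of $wgFiRequestIndexCreationFile.
$wgFiRequestIndexCreationFile = '/tmp';

if ( fiSketchDirIsWritable( $wgFiRequestIndexCreationFile ) ) {
    echo "OK: directory is writable\n";
} else {
    echo "Problem: directory is missing or not writable\n";
}
```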

Step 5: LocalSettings.php

# Makes uploaded documents searchable
include("$IP/extensions/FileIndexer/FileIndexer.php");
  • By default, the NS_IMAGE namespace is not searched. To include it in searches, set it to true in $wgNamespacesToBeSearchedDefault:
$wgNamespacesToBeSearchedDefault = array(
        NS_MAIN =>           true,
        NS_TALK =>           false,
        NS_USER =>           false,
        NS_USER_TALK =>      false,
        NS_PROJECT =>        false,
        NS_PROJECT_TALK =>   false,
        NS_IMAGE =>          true,
        NS_IMAGE_TALK =>     false,
        NS_MEDIAWIKI =>      false,
        NS_MEDIAWIKI_TALK => false,
        NS_TEMPLATE =>       false,
        NS_TEMPLATE_TALK =>  false,
        NS_HELP =>           false,
        NS_HELP_TALK =>      false,
        NS_CATEGORY =>       false,
        NS_CATEGORY_TALK =>  false
);
  • New users inherit the above settings. For pre-existing users, each user's options must be updated as well; the option for NS_IMAGE is searchNs6, because NS_IMAGE is namespace number 6:
/path/to/wiki# php maintenance/userOptions.php searchNs6 --new 1 --old ''

Step 6: Template:FileIndex

NOTE: This step is only needed if you use the template 'FileIndex' in the configuration parameters $wgFiPrefix/$wgFiPostfix (see section 'Configuration').

Create a template [[Template:FileIndex]] that formats the output of indexes to fit your needs.

The following is just a simple example that can be changed at any time later:

== File Index ==
The following index was taken from the files content:
<!-- {{{index}}} -->
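
To show how the template ties in with $wgFiPrefix and $wgFiPostfix, here is a minimal sketch of how an index block could be written into a page and later found again for an update. The two marker strings are the documented defaults; the function fiSketchUpdateIndexBlock and its behavior are assumptions made for illustration, not the extension's actual code.

```php
<?php
// Documented defaults of $wgFiPrefix and $wgFiPostfix.
$wgFiPrefix  = "<!-- FI:INDEX-START -->{{FileIndex |index=";
$wgFiPostfix = " }}<!-- FI:INDEX-ENDE -->";

/**
 * Hypothetical helper: replaces an existing index block in $pageText,
 * or appends a new block if no head marker is found.
 */
function fiSketchUpdateIndexBlock( $pageText, $index, $prefix, $postfix ) {
    $block = $prefix . $index . $postfix;

    $start = strpos( $pageText, $prefix );
    if ( $start === false ) {
        // No block yet: append one on its own line.
        return rtrim( $pageText ) . "\n" . $block;
    }

    $end = strpos( $pageText, $postfix, $start + strlen( $prefix ) );
    if ( $end === false ) {
        // Head marker without tail marker: leave the text untouched.
        return $pageText;
    }

    // Keep everything before and after the old block, swap the block itself.
    return substr( $pageText, 0, $start )
        . $block
        . substr( $pageText, $end + strlen( $postfix ) );
}

$page = fiSketchUpdateIndexBlock( "Quarterly report.", "budget forecast", $wgFiPrefix, $wgFiPostfix );
$page = fiSketchUpdateIndexBlock( $page, "budget forecast revenue", $wgFiPrefix, $wgFiPostfix );
echo $page, "\n";
```

Because the block is located purely by its marker strings, a lookup of this kind stops finding old blocks as soon as the markers are changed, which is why the configuration section warns about changing $wgFiPrefix and $wgFiPostfix on an existing wiki.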

Configuration

The following switches and parameters are implemented:

Each parameter is listed with its type, the developer's default, and a description.

$wgFiCheckSystem
Type: BOOL
Default: FALSE
If TRUE, the system is checked each time the special page is called or an index creation is started.
NOTE: If you install this extension for the first time, set this option to TRUE to check that all external tools are reachable.

$wgFiCommandPaths
Type: ARRAY
Default: see configuration file
Maps the fully qualified call path of each external tool to a short name.

$wgFiCommandCalls
Type: ARRAY
Default: see configuration file
Maps file extensions to a command-line template. A template uses the constant WC_FI_FILEPATH for the path to the file to be indexed. External tools are referenced by the constant WC_FI_COMMAND, followed by an opening '[', the short name mapped to the fully qualified call path (see parameter $wgFiCommandPaths), and a closing ']'.
Example:
'odt' => WC_FI_COMMAND . "[unzip] -p \"" . WC_FI_FILEPATH . "\" content.xml"

$wgFiTypesToRemoveTags
Type: ARRAY
Default: see configuration file
Lists all file extensions (file types) whose content uses tags, such as XML files. Tags are stripped from the content of these files.

$wgFiRequestIndexCreationFile
Type: STRING
Default: "/tmp"
Required path to a system directory where the web server has write access. It is used to leave a note during an upload when an index is to be created. Without it, any warning shown during the upload dialog would result in no index being created.

$wgFiPrefix
Type: STRING
Default: "<!-- FI:INDEX-START -->{{FileIndex |index="
Unique string that marks the head of the index block; it is needed to update the index automatically. It also formats the output of the index. As of version 0.4.5.03 it uses the template 'FileIndex', which should be used to format the final output of an index.
Caution: If you already used this extension before this release, either set this parameter to the value you used earlier or make sure your old indexes are updated with the new marker. Otherwise the extension will not be able to find your old indexes!

$wgFiPostfix
Type: STRING
Default: " }}<!-- FI:INDEX-ENDE -->"
Unique string that marks the tail of the index block; it is needed to update the index automatically. As of version 0.4.5.03 it closes the template 'FileIndex' (see also parameter $wgFiPrefix).
Caution: If you already used this extension before this release, either set this parameter to the value you used earlier or make sure your old indexes are updated with the new marker. Otherwise the extension will not be able to find your old indexes!

$wgFiArticleNamespace
Type: INT
Default: NS_IMAGE
Sets the namespace in which indexes are placed when a file is uploaded with index creation. For example, if this is set to the main namespace and you upload a file X with index creation, the index is saved in article X in the main namespace.
The configured namespace is also the default namespace selected on the special page.
The namespace is specified by its number, not its name!
NOTICE: If you use the namespace NS_IMAGE ('File:') for your indexes, make sure you configure your wiki to search this namespace, too.

$wgFiMinWordLen
Type: INT
Default: 3
The filtering algorithms are still very basic, so this value at least lets you specify the minimum length a string must have to be registered in the index. Values lower than 1 are replaced by 3.

$wgFiLowercaseIndex
Type: BOOL
Default: TRUE
Decides whether all words of the index are lowercased or left as in the original (which generally results in bigger indexes).

$wgFiSpDefaultWildcardSign
Type: CHAR
Default: "*"
Sets the wildcard sign that can be used on the special page to filter files.

$wgFiSpWildcardSignChangeable
Type: BOOL
Default: TRUE
If FALSE, the wildcard sign on the special page cannot be changed per request.

$wgFiSpNamespaceChangeable
Type: BOOL
Default: TRUE
If FALSE, the destination namespace on the special page cannot be changed per request; indexes can only be created in the namespace specified in $wgFiArticleNamespace, and only this namespace is searched for indexes in 'check mode'.

$wgFiCreateOnUploadByDefault
Type: BOOL
Default: TRUE
Determines whether the checkbox on the upload form to create or update an index is checked initially.

$wgFiUpdateOnEditArticleByDefault
Type: BOOL
Default: FALSE
Determines whether the checkbox on the edit form to update the index is checked initially when the edited article may be the index article of a file.
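
To make the interplay of $wgFiCommandPaths and $wgFiCommandCalls more concrete, the following sketch expands a command-line template into a shell command. Several things here are assumptions: the placeholder values of WC_FI_COMMAND and WC_FI_FILEPATH (the real constants are defined by the extension), the array layout "short name => path", the example path /usr/bin/unzip, and the helper fiSketchBuildCommand itself, which is not the extension's actual code. Only the 'odt' template is taken from the description above.

```php
<?php
// Assumed placeholder values; the extension defines the real constants.
define( 'WC_FI_COMMAND', '%COMMAND%' );
define( 'WC_FI_FILEPATH', '%FILEPATH%' );

// Assumed layout: short name => fully qualified call path (example path only).
$wgFiCommandPaths = array(
    'unzip' => '/usr/bin/unzip',
);

// Template from the $wgFiCommandCalls example in this section.
$wgFiCommandCalls = array(
    'odt' => WC_FI_COMMAND . "[unzip] -p \"" . WC_FI_FILEPATH . "\" content.xml",
);

/**
 * Hypothetical helper: builds the command for a file extension.
 * Returns null if the extension or a referenced tool is not configured.
 */
function fiSketchBuildCommand( $ext, $filePath, array $paths, array $calls ) {
    if ( !isset( $calls[$ext] ) ) {
        return null;
    }
    $command = $calls[$ext];

    // Find every WC_FI_COMMAND[name] reference and substitute the tool path.
    $pattern = '/' . preg_quote( WC_FI_COMMAND, '/' ) . '\[([^\]]+)\]/';
    if ( preg_match_all( $pattern, $command, $matches ) ) {
        foreach ( array_unique( $matches[1] ) as $name ) {
            if ( !isset( $paths[$name] ) ) {
                return null;
            }
            $command = str_replace( WC_FI_COMMAND . '[' . $name . ']', $paths[$name], $command );
        }
    }

    // Substitute the path of the file to be indexed.
    return str_replace( WC_FI_FILEPATH, $filePath, $command );
}

echo fiSketchBuildCommand( 'odt', '/srv/wiki/images/Report.odt', $wgFiCommandPaths, $wgFiCommandCalls ), "\n";
```

The sketch inserts the file path as-is and relies on the quotes in the template; code that passes such a command to a shell should additionally escape the path, for example with PHP's escapeshellarg().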

Open shortcomings

  • Aborting an upload leaves a small but useless file in the specified directory. A cron job can clean these up, but... :-(

Historical Versions

You can find the first version of this extension developed by Flominator in the following subarticle:

Changelog

Date | Version | Editor | Changes
08.08.2007 | n/a | Flominator
  • Basic extension created from the hack below
28.11.2007 | n/a | Flominator
  • Support for MediaWiki 1.11 added
14.05.2008 | n/a | Flominator
  • Debugging added
15.05.2008 | v0.1.0.00 | raZe
  • First complete reimplementation with better index word filtering and output control...
29.06.2009 | v0.2.1.00 | raZe
  • Complete reimplementation including a special page and many more configurable switches...
01.07.2009 | v0.2.2.00 | raZe
  • Bugfix: $wgFiAutoIndexMark now leads to an automatic update of the index when a new file version is uploaded
03.10.2010 | v0.4.5.03 | raZe
  • New file types configured (MS Office 2007)
  • Enhanced special page
    • Function to check which files are not yet indexed
    • Specification of more than one file to index at once...
    • ... using wildcards
  • Intelligent checkbox on the upload form...
  • ...and on the article edit form
  • Enhanced configuration parameters (see section 'Configuration')...
  • ...in an additional code file

Any Questions

For more hints and a place to ask your questions, see Extension talk:FileIndexer
