Extension:SmartIndex

The Smart Index extension allows the user to create an index on any desired wiki page. It consists of a parser tag, which is used where the user wants to create the index, and a special page, which is used to set up and update the associated database tables. After installation and before the use of the parser tag, the index words should be updated from the special page. This creates a database table in which the index words are stored.

The user can optionally specify characters to be stripped from the output. By default, the following characters are removed, in addition to the newline character: #.!'[]{}*,"=; -?†—:“”„ If additional characters are to be removed from the output, it is recommended to include the default characters in the list as well.
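The stripping step can be illustrated with a minimal Python sketch. This is not the extension's actual code: the function name `strip_chars` and the `extra_chars` argument are hypothetical, and the default list is read literally from the text above (so the space and the hyphen are assumed to be part of it).

```python
# Default characters as listed in the documentation, plus the newline character.
# Assumption: the list is taken literally, so space and hyphen are included.
DEFAULT_STRIP_CHARS = "#.!'[]{}*,\"=; -?†—:“”„\n"


def strip_chars(text, extra_chars=""):
    """Remove the default characters, and any extra ones, from text."""
    # The defaults are always combined with the extras, which mirrors the
    # advice to keep the default characters when adding more.
    table = str.maketrans("", "", DEFAULT_STRIP_CHARS + extra_chars)
    return text.translate(table)
```

For example, `strip_chars("[[Index]]")` returns `Index`, and passing `extra_chars="()"` additionally removes parentheses.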

The user also has the option of adding a list of words that should not be included in the index ('stop words'). The user creates a wiki page in namespace 0, which is likely the default, and lists the desired stop words one per line, separated by newline characters. The user then enters the name of this page in the.
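As a rough illustration of how a newline-separated stop word page could be applied, here is a small Python sketch. The helper names are hypothetical, and the case-insensitive matching is an assumption rather than documented behavior.

```python
def parse_stop_words(page_text):
    """Return the set of stop words listed one per line in page_text."""
    # Blank lines are skipped; lowercasing is an assumption of this sketch.
    return {line.strip().lower() for line in page_text.splitlines() if line.strip()}


def remove_stop_words(words, stop_words):
    """Keep only the words that are not in the stop word set."""
    return [w for w in words if w.lower() not in stop_words]
```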

Stop words can also be generated based on the frequency of words in the wiki. The advantage of this approach is that the user does not have to continuously update the list of stop words. Words that have become so common that they are no longer important will no longer be displayed by the index after the database is updated. Currently, Smart Index allows the user to choose between simple frequency (the number of times a word appears in the wiki) and inverse document frequency (based on the number of pages on which a given word appears). The easiest approach to finding a threshold value for stop words is to first generate the index as a table and then sort by the word's 'score'. Likely stop words will be clustered together in the table.
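The two metrics can be sketched as follows. The exact IDF formula used by Smart Index is not stated here, so this sketch assumes the common definition log(N / n), where N is the total number of pages and n is the number of pages containing the word. The function name `score_words` and the input layout are hypothetical.

```python
import math
from collections import Counter


def score_words(pages):
    """Map each word to (frequency, idf), given {page title: list of words}."""
    frequency = Counter()   # total occurrences of each word across the wiki
    page_count = Counter()  # number of pages on which each word appears
    for words in pages.values():
        frequency.update(words)
        page_count.update(set(words))
    total_pages = len(pages)
    # Assumed formula: idf = log(N / n). A word on every page scores 0.
    return {
        word: (frequency[word], math.log(total_pages / page_count[word]))
        for word in frequency
    }
```

Under this definition, very common words have a high frequency and a low IDF, which is why sorting the table by score tends to group likely stop words together.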

Tag parameters
The following parameters can be used with the parser tag: scoreMode, displayMode, template, freqCutoff, and IDFCutoff. The parameters should be within the opening and closing tags and separated by spaces or new line characters.

Example:  scoreMode=frequency displayMode=list freqCutoff=64 
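To make the separator rule concrete, the following Python sketch splits a tag body on spaces and newlines and collects the name=value pairs. It is only an illustration under stated assumptions: `parse_params` is a hypothetical name, and values are assumed to contain no whitespace.

```python
def parse_params(tag_body):
    """Parse 'name=value' pairs separated by spaces or newline characters."""
    params = {}
    for token in tag_body.split():  # split() breaks on any run of whitespace
        name, sep, value = token.partition("=")
        if sep and name and value:  # ignore tokens that are not name=value
            params[name] = value
    return params
```

With the example above, this yields `scoreMode` = `frequency`, `displayMode` = `list` and `freqCutoff` = `64`.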

scoreMode
This parameter determines which frequency metric is displayed. When this parameter is not set by the user, the value is 'default'. In this mode, the word appears without a score when the display mode is set to 'table', and the frequency of the word appears when the display mode is set to 'list'.

When the score mode is set to 'frequency', the frequency appears in both table and list modes.

The other option for this parameter is 'IDF'. When scoreMode is given this value, the word's inverse document frequency (a metric based on the number of pages in which the word appears) is displayed.

displayMode
This parameter determines whether the index is displayed as a sortable table (displayMode=table), or a more traditional index in the form of a list with letter headings (displayMode=list). The default value is 'list'.
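The list layout amounts to grouping the sorted index words under their first letter. A minimal Python sketch of that grouping, with the hypothetical helper `group_by_letter`, could look like this; it says nothing about how the extension renders the headings.

```python
from itertools import groupby


def group_by_letter(words):
    """Group unique, non-empty words under their uppercase first letter."""
    ordered = sorted({w for w in words if w}, key=str.lower)
    # groupby only merges adjacent items, so the words must be sorted first.
    return {
        letter: list(group)
        for letter, group in groupby(ordered, key=lambda w: w[0].upper())
    }
```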

template
This parameter allows the user to enter the name of a template for the construction of index entries. The user can determine the form the output will take. Note that the template will only be used if displayMode is set to 'list'.

freqCutoff
If the user gives a value for this parameter, words whose frequency is higher than this value will not be included in the index. The value should be given in the form of an integer. Note that only one of freqCutoff and IDFCutoff should be used. If the user provides a value for both, IDFCutoff is used, as inverse document frequency is generally the better metric.

IDFCutoff
If the user gives a value for this parameter, words that have an IDF score lower than this value will not be included in the index. The value can be an integer or a decimal number. Note that only one of freqCutoff and IDFCutoff should be used. If the user provides a value for both, IDFCutoff is used, as inverse document frequency is generally the better metric.
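The interaction between the two cutoffs can be summarized in a short Python sketch. The function `apply_cutoff` and the score layout are hypothetical, and treating a word exactly at the threshold as included is an assumption.

```python
def apply_cutoff(scores, freq_cutoff=None, idf_cutoff=None):
    """Return the sorted index words from {word: (frequency, idf)}."""
    if idf_cutoff is not None:
        # IDFCutoff takes precedence: drop words with a lower IDF score.
        return sorted(w for w, (_, idf) in scores.items() if idf >= idf_cutoff)
    if freq_cutoff is not None:
        # Drop words that appear more frequently than the cutoff.
        return sorted(w for w, (freq, _) in scores.items() if freq <= freq_cutoff)
    return sorted(scores)  # no cutoff given: keep every word
```

When both cutoffs are supplied, `freq_cutoff` is ignored, matching the precedence described above.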

Notes on the current version

 * 1) Depending on the size of the wiki, Smart Index can run rather slowly. It may be necessary to raise the maximum script execution time in LocalSettings.php.
 * 2) The current version ignores text found within templates.