User:TJones (WMF)/Notes/Stempel Analyzer Analysis/Recompiling the Stemmer

Recompiling the Stemmer
The original tables of stems which were used to train Stempel are not presently available. However, the process outlined in the Apache documentation for Stempel is probably reproducible, using Stempel itself.

Based on email from Leo Galamboš, one of the authors of Egothor, on which Stempel is built (see Apache docs above), the process for creating a table is as follows:
 * Download http://www.getopt.org/stempel/stempel-1.0.jar
 * Prepare a table: the first term on each line is the lemma/stem, and the rest of the line contains all of its variants (example from the English Egothor table):

 A-bomb   A-bombs
 abacus   abacuses
 abandon   abandons abandoning abandoned
 abase   abases abasing abased
 abate   abates abating abated
 abbess   abbesses

The compile step processes "en_table" using the "-0E2" method (Elasticsearch uses "-0ME2", which may not be better) and saves the result to the file en_table.out. However, there is also a class in the current Elasticsearch distribution that likely performs the same process, but with the correct headers.
 * Run
 * We cannot simply replace the old file with en_table.out, because en_table.out must first be converted to Stempel's format: it needs the UTF String header with the opt-method signature.
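
The header requirement in the last step can be illustrated with a minimal Python sketch. It rests on two assumptions that are not confirmed in these notes: that the "UTF String header" is a single string in the layout written by Java's DataOutputStream.writeUTF (a two-byte big-endian length followed by the encoded bytes), and that the string is the method signature, such as "0ME2". The function names are hypothetical, and only ASCII signatures are handled, since for those Java's modified UTF-8 and plain ASCII coincide.

```python
import struct


def java_utf_header(signature: str) -> bytes:
    """Encode an ASCII string in the layout of Java's DataOutputStream.writeUTF.

    Assumption: the Stempel header is exactly one such string.
    """
    if "\x00" in signature:
        # Java's modified UTF-8 encodes NUL as two bytes; not handled here.
        raise ValueError("NUL characters are not supported in this sketch")
    encoded = signature.encode("ascii")  # raises UnicodeEncodeError on non-ASCII
    if len(encoded) > 0xFFFF:
        raise ValueError("signature too long for a two-byte length prefix")
    # Two-byte unsigned big-endian length, then the string bytes.
    return struct.pack(">H", len(encoded)) + encoded


def prepend_header(compiled: bytes, signature: str = "0ME2") -> bytes:
    """Return the compiled table bytes with the assumed header in front."""
    return java_utf_header(signature) + compiled
```

Whether the bytes of en_table.out that follow the header are otherwise compatible with the Elasticsearch version of Stempel would still need to be checked against a table that is known to load.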

All words in the table will be transformed as the table specifies. Unknown words may be transformed incorrectly. So, if a word is stemmed incorrectly, we can add the correct word-stem pair to the table and recompile it, and the error will be fixed.
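
A rough sketch of that correction workflow on the uncompiled table, in Python. It assumes the table is whitespace-separated with one lemma per line, as in the example above. The function names are hypothetical, and removing the variant from any other lemma's line is a design assumption of this sketch, not something the Stempel documentation requires.

```python
def parse_table(text):
    """Parse 'lemma variant variant ...' lines into {lemma: [variants]}."""
    table = {}
    for line in text.splitlines():
        terms = line.split()
        if not terms:
            continue  # skip blank lines
        lemma, variants = terms[0], terms[1:]
        entry = table.setdefault(lemma, [])
        for variant in variants:
            if variant not in entry:
                entry.append(variant)
    return table


def add_pair(table, lemma, variant):
    """Record that `variant` should stem to `lemma`."""
    # Assumption: a variant should appear under one lemma only,
    # so drop it from any other line first.
    for other, variants in table.items():
        if other != lemma and variant in variants:
            variants.remove(variant)
    entry = table.setdefault(lemma, [])
    if variant not in entry:
        entry.append(variant)


def format_table(table):
    """Serialize the table back to one lemma per line."""
    lines = (" ".join([lemma] + variants) for lemma, variants in table.items())
    return "\n".join(lines) + "\n"
```

The output of format_table would then be passed through the compile step again.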

The problem is getting appropriate data for the table of stems/lemmas. As the Apache documentation discusses, this data was originally derived from tagged corpora and the output of a different stemmer (SAM). We could generate a similar corpus using data from English and Polish Wiktionary, and using the current version of Stempel the way Stempel used SAM.

Polish Wikipedia provides a ready source of frequency data for Polish words. Using lemmas for the most common words in Polish Wikipedia should provide ample training data for the stemmer, and having an uncompiled table available would allow us to update it to fix problem cases.
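
As a minimal sketch of how such a table could be assembled, assume we already have word counts from Polish Wikipedia and a word-to-lemma mapping (for example from Wiktionary, or from the current version of Stempel). Both inputs, the function name build_table, and the top_n cutoff are hypothetical, and the Polish words in the usage example are illustrative only.

```python
from collections import Counter, defaultdict


def build_table(word_counts, lemma_of, top_n):
    """Group the top_n most frequent words under their lemmas.

    word_counts: {word: frequency}, e.g. counted from a Wikipedia dump.
    lemma_of:    {word: lemma}, from whatever lemma source is available.
    Returns table lines of the form 'lemma variant variant ...'.
    """
    groups = defaultdict(list)
    for word, _count in Counter(word_counts).most_common(top_n):
        lemma = lemma_of.get(word)
        if lemma is None:
            continue  # no lemma known for this word; leave it out
        variants = groups[lemma]  # creates the lemma's line if needed
        if word != lemma and word not in variants:
            variants.append(word)
    return [" ".join([lemma] + variants)
            for lemma, variants in sorted(groups.items())]


if __name__ == "__main__":
    counts = {"domu": 50, "dom": 40, "domy": 30, "kota": 20, "kot": 5}
    lemmas = {"domu": "dom", "dom": "dom", "domy": "dom",
              "kota": "kot", "kot": "kot"}
    for line in build_table(counts, lemmas, top_n=4):
        print(line)
```

Words without a known lemma are skipped rather than guessed, so the cutoff and the coverage of the lemma source together determine how much training data the table ends up with.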