User:TJones (WMF)/Notes/Language Analysis Morphological Libraries

October 2017 — See TJones_(WMF)/Notes for other projects. See also T171652.

Background
I recently tested and implemented several third-party open-source Elasticsearch language analyzers, and saw that some are just simple wrappers around other third-party open-source language analysis software. So I went looking for more language analysis software with the potential to be similarly wrapped into Elasticsearch language analyzers that could benefit our wiki communities.

Themes
A few recurring themes emerged:
 * Some code is proprietary or has no licensing information, so even though it might work well, it’s not legally/philosophically available to us.
 * Open-source code gets abandoned or effectively abandoned (i.e., no longer being developed in a form that is useful for us), because:
   * the developer just moved on to other projects.
   * the developer commercialized the project and stopped open-source development.
   * the developer took the project in a direction not compatible with our needs (e.g., focusing on massively parallel cluster-based installations, or pulling in huge external libraries).
 * Lots of code is not well-documented in English. This isn't a huge surprise, but it means there may be other awesome software out there that we could use, without a good way for us to learn that it exists. If anyone knows of such awesome software, tell me about it!
 * Lots of code is not in Java.
   * It's not strictly required that the code be in Java, but Java code is the easiest to write a wrapper around. While some code has Java integration (through JNI, for example), and some programming languages besides Java use the JVM or have JVM implementations, most non-Java options greatly increase complexity. (See the wrapper sketch after this list.)
   * On the other hand, some algorithms are sufficiently straightforward that re-implementing them in Java (or any programming language) wouldn't be that hard, so that's always something to keep in mind.
 * For some languages, I found a fair number of research papers, but not with accompanying software or sufficient algorithmic description to actually implement anything.
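
To make the "wrapper" idea concrete, here is a minimal sketch of how a third-party stemmer could be hooked into the Lucene analysis chain that Elasticsearch uses. The TokenFilter and CharTermAttribute APIs are standard Lucene; ThirdPartyStemmer is a hypothetical stand-in for whatever library gets wrapped.

<syntaxhighlight lang="java">
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// A minimal sketch: wrap a third-party stemmer as a Lucene TokenFilter.
// ThirdPartyStemmer is a hypothetical stand-in for the wrapped library.
public final class WrappedStemFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public WrappedStemFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false; // no more tokens in the stream
        }
        // Hand the surface token to the wrapped library; replace it with the stem.
        String stem = ThirdPartyStemmer.stem(termAtt.toString());
        termAtt.setEmpty().append(stem);
        return true;
    }
}
</syntaxhighlight>

For non-Java code, the call to the stemmer would have to go through JNI, a subprocess, or something similar, which is where most of the extra complexity comes in.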

Selection Criteria
With all that in mind, my criteria for consideration for follow-up came down to the following:
 * Code that’s doesn’t seem abandoned—say, 2-3 years old at most—though older code is fine if there’s an obvious active user base (indicating that it still works well).
 * Code that’s in a reasonable programming language—Java; C or C++ with Java integration; or a simple enough algorithm that it could be ported to Java.
 * Code that isn’t in a huge library and doesn’t have massive dependencies.
 * Code that looks to be reasonably mature (e.g., doesn't have a huge to-do list of basic features or other indications that the implementation is incomplete).

Other important criteria for actual development and deployment (which would be assessed in a follow-up task) include:
 * Accuracy of analysis: the linguistic results need to be reasonable.
 * Ability to be integrated: it's possible that the software's API makes the necessary integration with Elasticsearch ridiculously hard.
 * Run-time performance: the code shouldn't need a giant Spark cluster to run, or be twenty times slower than our current analyzers.

The first three languages I looked at—Japanese, Vietnamese, and Korean—had a lot of options and lots of complexity. The other four—Serbian, Malay, Estonian, and Slovak—had no more than a couple of plausible options, if any.

Next Steps
Based on my review of these seven languages, I suggest testing some of the software packages. Fortunately, we don't need to commit to full Elasticsearch integration to perform our standard testing. As long as we can run the analysis and map analyzed tokens back to their original text, we can do most of our standard language analysis to determine whether an analyzer is worth pursuing for integration.
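
As a sketch of what that standalone testing could look like, assuming the candidate analyzer is wrapped as a Lucene Analyzer and reports character offsets correctly, something like this would print each analyzed token next to the original text it came from:

<syntaxhighlight lang="java">
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TokenMappingDump {
    // Print original_text -> analyzed_token pairs, using the character
    // offsets the analyzer reports to map tokens back to the source text.
    public static void dump(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                String original = text.substring(offsets.startOffset(), offsets.endOffset());
                System.out.println(original + "\t->\t" + term);
            }
            stream.end();
        }
    }
}
</syntaxhighlight>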


 * For Japanese, I want to look at MeCab, tinysegmenter, and possibly CaboCha in more detail.
 * For Vietnamese, I want to look at vnTokenizer. It is the same library that the Elasticsearch analyzer I previously looked at was based on; the problems that analyzer had were with Elasticsearch integration, not with tokenization.
 * For Korean, I want to look at the newer of the two modules named mecab-ko-lucene-analyzer (yes, there are two!).
 * For Serbian, I want to test both available stemmers: SerbianStemmer and SCStemmers. SCStemmers, which implements four stemming algorithms, seems to include the algorithm used in SerbianStemmer, but it wouldn’t hurt to compare them. If SerbianStemmer were somehow superior, it would likely be possible to port the improvements to SCStemmers.
 * For Malay, I was only able to find research papers; nothing implemented or implementable turned up.
 * For Estonian, I want to look at Vabamorf.
 * For Slovak, I want to try both of the available stemmers: stemm-sk and Stemmer-sk. The former is in Python but looks easy to port to Java if it turns out to be awesome; the latter is already a Lucene analyzer. (See the porting sketch after this list.)
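
To give a sense of why porting a small rule-based stemmer like stemm-sk to Java seems plausible, here is a purely illustrative sketch of longest-match suffix stripping. The suffix list and minimum-stem-length check below are invented for illustration; they are not the actual stemm-sk rules.

<syntaxhighlight lang="java">
// Purely illustrative rule-based suffix stripping; the suffixes and the
// minimum-stem-length check are invented examples, not the stemm-sk rules.
public final class SuffixStripperSketch {
    // Hypothetical suffixes, ordered longest-first so the longest match wins.
    private static final String[] SUFFIXES = { "ovia", "ami", "och", "ov", "om", "a", "e", "y" };

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip if a plausible stem (here, 3+ characters) remains.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word; // no rule applied
    }
}
</syntaxhighlight>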

I expect some failures. Two of the language analyzers maintained or suggested by Elasticsearch (Japanese and Vietnamese) did not perform as well as we needed them to. However, several others did: those for Polish, Hebrew, Ukrainian, and Chinese (which involved melding two plugins together). Right now, six of the seven languages I investigated yielded something worth following up on. We'll see how many of those turn into something usable; if it's two or three, this is definitely a process worth repeating. If it's zero, then maybe we need to let the language analyzers mature on their own and come to us when they are ready.

Raw Notes
Below is a table with my notes.