User:TJones (WMF)/Notes/Language Analysis Morphological Libraries

October 2017 — See TJones_(WMF)/Notes for other projects. See also T171652. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
After recently testing and implementing several third-party open-source Elasticsearch language analyzers and seeing that some are just simple wrappers around other third-party open-source language analysis software, I decided to go looking for other language analysis software with the potential to be similarly wrapped into Elasticsearch language analyzers that could benefit our wiki communities.

Themes
A few recurring themes emerged:
 * Some code is proprietary or has no licensing information, so even though it might work well, it’s not legally/philosophically available to us.
 * Open-source code gets abandoned or effectively abandoned (i.e., no longer being developed in a form that is useful for us), because:
   * the developer just moved on to other projects.
   * the developer commercialized the project and stopped open-source development.
   * the developer took the project in a direction not compatible with our needs (e.g., focusing on massively parallel cluster-based installations, or pulling in huge external libraries).
 * Lots of code is not well-documented in English; this isn’t a huge surprise, but there may be other awesome software that we could use but we just don’t have a good way to learn that it exists. If anyone knows of such awesome software, tell me about it!
 * Lots of code is not in Java.
   * It’s not strictly required that the code be in Java, but it is the easiest to write a wrapper around. While some code has Java integration, through JNI for example, and some programming languages besides Java use the JVM or have JVM implementations, most non-Java options greatly increase complexity.
   * On the other hand, some algorithms are sufficiently straightforward that re-implementing them in Java (or any programming language) wouldn’t be that hard; so that’s always something to keep in mind.
 * For some languages, I found a fair number of research papers, but not with accompanying software or sufficient algorithmic description to actually implement anything.

Selection Criteria
With all that in mind, my criteria for consideration for follow-up came down to the following (updated May 2018 after working with Serbian and Slovak):
 * Code that has a workable license.
 * Code that’s in Java, or uses a straightforward enough algorithm that it could be ported to Java.
 * Code that isn’t in a huge library and doesn’t have massive dependencies.
 * Code that looks to be reasonably mature (e.g., doesn’t have a huge TO DO list of basic features or other indications that implementation was not complete).

Other important criteria for actual development and deployment (which would be assessed in a follow-up task) include:
 * Accuracy of analysis: the linguistic results need to be reasonable.
 * Ability to be integrated—it’s possible that the API of the software makes it ridiculously hard to do necessary integration with Elasticsearch.
 * Run-time performance—the code shouldn’t need a giant Spark cluster to run, or be twenty times slower than our current analyzers.

The first three languages I looked at—Japanese, Vietnamese, and Korean—had a lot of options and lots of complexity. The other four—Serbian, Malay, Estonian, and Slovak—had no more than a couple of plausible options, if any.

May 2018 update: I'm now less concerned about "abandoned" code. Forking or porting part or all of a repo and folding it into the search/extra or search/extra-analysis project is not a huge problem, and is worth doing for code that provides useful morphological analysis.

June 2018 update: I'm now thinking "just Java" or portable to Java. C or C++ with Java integration is probably a non-starter; JNI integration in particular has a lot of problems. Obviously, a straightforward C or C++ algorithm could be ported to Java.

Next Steps
Based on my review of these seven languages, I suggest testing some of the software packages. Fortunately, we don’t need to commit to full Elasticsearch integration to perform our standard testing. As long as we can run the analysis and map analyzed tokens back to their original text, we can do most of the language analysis analysis to determine whether the analyzer is worth pursuing for integration.


 * For Japanese, I want to look at MeCab, tinysegmenter, and possibly CaboCha in more detail.
 * For Vietnamese, I want to look at vnTokenizer. It is the same library that the previous Elasticsearch analyzer I looked at was based on; the problems it had were with integration with Elasticsearch, not with tokenization.
 * For Korean, I want to look at the newer module named mecab-ko-lucene-analyzer—there are two!
 * For Serbian, I want to test both available stemmers: SerbianStemmer and SCStemmers. SCStemmers, which implements four stemming algorithms, seems to include the algorithm used in SerbianStemmer, but it wouldn’t hurt to compare them. If SerbianStemmer were somehow superior, it would likely be possible to port the improvements to SCStemmers.
 * For Malay, I was only able to find research papers, with nothing implemented or implementable. Update: I've decided to give the existing Elastic Indonesian stemmer a go! (See Malay Update (June 2018) below.)
 * For Estonian, I want to look at Vabamorf.
 * For Slovak, I want to try both of the available stemmers: stemm-sk and Stemmer-sk. The former is in Python but looks to be easily ported to Java if it is awesome, and the latter is already a Lucene analyzer.
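
The kind of offline testing described above (run an analyzer, map the analyzed tokens back to their original text, and compare candidates) can be sketched roughly as follows. This is only an illustration, not the actual test tooling: `toy_stemmer_a` and `toy_stemmer_b` are hypothetical stand-ins for two candidate stemmers for the same language, and their suffix rules are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Yield (start, end) character offsets for each word in the text."""
    for m in re.finditer(r"\w+", text):
        yield m.start(), m.end()

def strip_suffixes(word, suffixes):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical stand-ins for two candidate stemmers (toy English-like rules).
def toy_stemmer_a(word):
    return strip_suffixes(word.lower(), ["ing", "ed", "s"])

def toy_stemmer_b(word):
    return strip_suffixes(word.lower(), ["s"])

def group_by_stem(text, stemmer):
    """Map each analyzed token back to the original forms that produced it."""
    groups = defaultdict(set)
    for start, end in tokenize(text):
        surface = text[start:end]  # offsets recover the original text
        groups[stemmer(surface)].add(surface)
    return dict(groups)

def compare(text, stemmer_a, stemmer_b):
    """Return the original words that the two stemmers analyze differently."""
    diffs = {}
    for start, end in tokenize(text):
        surface = text[start:end]
        a, b = stemmer_a(surface), stemmer_b(surface)
        if a != b:
            diffs[surface] = (a, b)
    return diffs

sample = "Walks walked walking walk"
groups_a = group_by_stem(sample, toy_stemmer_a)
diffs = compare(sample, toy_stemmer_a, toy_stemmer_b)
```

Grouping original forms under each stem makes over-aggressive merges easy to spot, and the per-word diff shows exactly where two candidates disagree, all without any Elasticsearch integration.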

I expect some failures. Two of the language analyzers maintained or suggested by Elasticsearch (Japanese and Vietnamese) did not perform as well as we needed them to. However, several others did: those for Polish, Hebrew, Ukrainian, and Chinese (which involved two plugins being melded together). Right now, six of the seven languages I investigated yielded something worth following up on. We’ll see how many of those turn into something usable; if it’s two or three, this is definitely a process worth repeating. If it is zero, then maybe we need to let the language analyzers mature on their own and come to us when they are ready.

Malay Update (June 2018)
After working on Serbian (T178926/T192395) and Slovak (T178929) and looking at the papers they were based on or translated from, I decided to reconsider what counts as "implementable" for Malay, review the papers on Malay stemming, and compare them to the existing Indonesian analysis.

My understanding of Indonesian and Malay was pretty simple: that they are "more distinct than American and British English, but less distinct than Spanish and Portuguese". Also, Malay and Indonesian didn't interact in my investigation into fallback languages, where each is used as a fallback language for other languages.

However, looking at the wiki page on the matter, and reviewing some other sources, it seems that a lot of the difference is in Dutch-influenced vs English-influenced spelling of certain sounds, Dutch vs English loanwords, other vocabulary differences, and some pronunciation differences—all of which can decrease mutual intelligibility—but the grammar of the two standard forms seems to be essentially the same.

I also compared the Malay stemmer papers with the Lucene Indonesian stemmer implementation, and verified that they are working on similar affixes. There are some discrepancies, but the core affixes are the same, and the differences seem to come down to what affixes to try to account for (some derivational vs inflectional).

While it's possible that spelling differences or vocabulary differences could increase the error rate for Malay vs Indonesian, it seems to be worth testing; if it is successful, all we need to do is configure it. Everything is not only already built, it's already installed, too!
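
To give a sense of how little configuration is involved, here is a rough sketch of index settings that would apply the Indonesian stop words and stemmer to Malay text. It is modeled on the way the Elasticsearch documentation describes rebuilding the built-in `indonesian` analyzer as a custom analyzer; the analyzer name `malay_text` is a made-up label, and the configuration actually used on the wikis lives elsewhere and may well differ.

```python
import json

# Assumption: modeled on the documented custom-analyzer equivalent of
# Elasticsearch's built-in "indonesian" analyzer (keyword marker omitted).
# "malay_text" is a hypothetical analyzer name chosen for this sketch.
malay_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "indonesian_stop": {"type": "stop", "stopwords": "_indonesian_"},
                "indonesian_stemmer": {"type": "stemmer", "language": "indonesian"},
            },
            "analyzer": {
                "malay_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "indonesian_stop", "indonesian_stemmer"],
                }
            },
        }
    }
}

# Serialize to the JSON body that would be sent when creating a test index.
request_body = json.dumps(malay_settings, indent=2)
print(request_body)
```

Whether the Indonesian stop word list is appropriate for Malay is a separate question that testing would need to answer; the stop filter can simply be dropped from the filter chain if it is not.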

Raw Notes
Below is a table with my notes.