User:TJones (WMF)/Notes/extra-analysis Elasticsearch Plugin

From mediawiki.org

February/March 2018 — See TJones_(WMF)/Notes for other projects. See also T183015.

Background

I briefly mentioned the search/extra-analysis plugin in my notes on Serbian analysis, but thought I should flesh it out a bit more after a question came up on Gerrit.

The search/extra-analysis plugin is available on Gerrit and GitHub, with the GitHub docs being the easiest to read.

A Plan: Serbian, et al. and search/extra

Right now, the "extra-analysis" plugin is effectively the "Serbian-analysis" plugin. We didn't name it to reflect that because we hope other stemmers and analysis tools will join the Serbian stemmer, either in the extra-analysis plugin or in its companion, the search/extra plugin.

The search/extra plugin is a repository where we keep a collection of small-but-useful Elasticsearch tools that we've built over the years. When we decided to add an Elasticsearch plugin wrapper around an open-source Serbian stemmer, with plans to try the same for other open-source morphological software in the future, we thought the first draft could be housed in the search/extra plugin. This decreases our ongoing maintenance burden compared to a separate Serbian plugin, and anyone who really wants to use the stemmer still can, with only a little extra overhead from the other tools in the plugin.

A New Plan: Licenses, licenses, and more licenses

Unfortunately, plans often do not survive contact with reality. The Serbian stemmer is licensed under GPLv3, so it most likely could not be bundled into the search/extra plugin (licensed under Apache 2.0) without converting everything to GPLv3, which we didn't want to do.

So, we created a new plugin repository, search/extra-analysis, which is licensed under GPLv3, and which can incorporate future compatibly licensed morphological analysis libraries we want to build on.

This time we did end up with a new plugin repo to support, but going forward we shouldn't need another project/repo for every new Elasticsearch plugin we want to build for search. Anything we build ourselves will be licensed under Apache 2.0 and can go into search/extra. Any open-source work we build on that has a permissive license compatible with Apache 2.0 (MIT, BSD, etc.) will also probably go in search/extra. Open-source works with GPLv3-compatible licenses (GPLv3, GPLv2+, LGPL) will go into search/extra-analysis.
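
As a rough illustration, the routing rule above can be written as a small lookup. This is only a sketch of the rule of thumb, not legal advice and not code from either plugin: the function name `target_repo`, the two license sets, and the license spellings are all hypothetical, and the sets contain only the examples named in the text.

```python
# Hypothetical sketch of the license routing rule described above.
# Names and license spellings are illustrative assumptions.

# Permissive licenses compatible with Apache 2.0 (examples from the text).
PERMISSIVE = {"Apache-2.0", "MIT", "BSD"}

# Licenses the text lists as GPLv3-compatible.
GPLV3_COMPATIBLE = {"GPLv3", "GPLv2+", "LGPL"}


def target_repo(license_name=None):
    """Return the repo a component would probably go into.

    license_name=None stands for code we write ourselves,
    which is licensed under Apache 2.0.
    Returns None when the rule gives no answer (e.g. GPLv2-only).
    """
    if license_name is None or license_name in PERMISSIVE:
        return "search/extra"
    if license_name in GPLV3_COMPATIBLE:
        return "search/extra-analysis"
    return None  # no home yet; needs a closer look


if __name__ == "__main__":
    for lic in (None, "MIT", "GPLv3", "GPLv2"):
        print(lic, "->", target_repo(lic))
```

A license outside both sets deliberately returns `None` rather than guessing, since the text leaves those cases open.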

I am not a lawyer, but it seems to be common wisdom that GPLv2 (as opposed to GPLv2+) is incompatible with GPLv3, so that may be a problem in the future. But, as they say, we can burn that bridge when we get to it.

Future Plans

If a lot of people start using the Serbian stemmer (or some other future analysis tool we incorporate into search/extra or search/extra-analysis), we might consider spinning it off into its own project and trying to do a proper job maintaining it, with releases for every version of Elasticsearch, etc. But for now, that would be time and effort I'd rather spend working on other small-but-useful plugins—doing something helpful for Khmer or Chinese, for example—while minimizing the work needed to support those efforts, so this is our compromise.