User:TJones (WMF)/Notes/Vietnamese Analyzer Analysis

July 2017 — See TJones_(WMF)/Notes for other projects. See also T170423.

Extracting the Corpora
Following my now-standard process, I extracted a reasonably sized corpus of Vietnamese Wikipedia articles to use as test corpora. Vietnamese Wikipedia articles are on average shorter than, say, English or French Wikipedia articles, so I extracted more articles to get a good-sized corpus. However, it turns out that I didn't need it...

Round 1
The plugin author has been great about fixing bugs, so I will do multiple rounds of analysis as long as major fixes occur. My original analysis is now "Round 1".

Building the Plugin
The Vietnamese plugin doesn't have releases for every minor version of Elasticsearch, but the author does have relatively easy-to-follow build instructions for intermediate versions.

I took the code released for 5.3.1 and built my own version for 5.3.2, which is the version of Elasticsearch that Vagrant is using these days. I didn't notice any problems, and I installed it without difficulty.

Given the unexpected behavior of the plugin (see below), I worried I might have had some build problems. I tried updating my Vagrant/MediaWiki instance to ES 5.4.1, but that came with a whole host of other problems, which is why upgrading ES is usually a quarter-long project; I should've known better.

Failing to Re-Index
After installing the plugin, I configured my MediaWiki instance for Vietnamese (but not to use the plugin), and ran my corpora using the default Vietnamese configuration without problems.

I configured the plugin and re-indexed my local wiki, which has fewer than 30 articles in it. They are an eclectic bunch, including text in several languages and scripts. Something in there is causing null pointer exceptions. The on-screen output is condensed JSON with a dozen or so messages like this:

I've put the stack trace on a sub-page, which I will point the plugin author to, as well.

I think the failure may be related to the analysis problem, detailed below.

Failing to Fully Analyze
My analysis tool works by calling the Elasticsearch analyzers directly rather than actually indexing the text. I was able to set up the Vietnamese analyzer under a name that MediaWiki ignored during re-indexing, so I could still run it with the tool.
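Here's a minimal sketch of that kind of direct analyzer call, using the elasticsearch Python client. The index and analyzer names are made up for illustration, and I'm assuming the plugin registers an analyzer type called vi_analyzer (check the plugin's README for the exact name); my actual tool does a lot more bookkeeping than this.

 from elasticsearch import Elasticsearch
 
 es = Elasticsearch('http://localhost:9200')
 
 # Throwaway index whose only job is to hold the analyzer definition;
 # "text_vi" is a hypothetical local name that MediaWiki's config never references.
 es.indices.create(index='vi_analysis_test', body={
     'settings': {
         'analysis': {
             'analyzer': {
                 'text_vi': {'type': 'vi_analyzer'},
             }
         }
     }
 })
 
 # Call _analyze directly; nothing ever gets indexed.
 result = es.indices.analyze(index='vi_analysis_test', body={
     'analyzer': 'text_vi',
     'text': 'Hà Nội là thủ đô của Việt Nam',
 })
 
 for tok in result['tokens']:
     print(tok['token'], tok['start_offset'], tok['end_offset'])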

The tool uses the offset information provided by the analyzer to identify tokens and map un-analyzed tokens to analyzed tokens. I've had minor problems before: in one part of the Chinese analysis chain, high surrogates are treated as separate characters (see also T168427), confusing the offsets.

However, in the Vietnamese plugin, offset info is computed more like a counter than a proper offset. The offsets are computed as though each token is separated by a single space. If there are other ignored characters (extra spaces, parens, punctuation, etc) or no intervening characters (as with CJK text, which is tokenized as unigrams), the offsets are wrong. The longer the text being processed, the more wrong the offsets can get.
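To make the offset problem concrete, here's roughly the kind of sanity check that exposes it: recompute what each token's reported offsets point at in the original text and see whether it matches the token. This is only a sketch; it assumes token dicts as returned by _analyze (as in the snippet above) and that analyzed tokens differ from the surface text only by case.

 def check_offsets(text, tokens):
     """Flag tokens whose reported offsets don't point at matching text.
 
     tokens: a list of dicts with 'token', 'start_offset', and 'end_offset',
     as returned by the _analyze API.
     """
     mismatches = []
     for tok in tokens:
         start, end = tok['start_offset'], tok['end_offset']
         surface = text[start:end]
         # Analyzed tokens are typically lowercased, so compare loosely.
         if surface.lower() != tok['token'].lower():
             mismatches.append((tok['token'], start, end, surface))
     return mismatches

With offsets that act like a counter, the mismatches pile up as the input gets longer, which is part of why processing line by line (below) helps.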

I changed my script to process the text line by line. That's much less efficient, but it produced fewer (though still very many) errors in my pre-/post-analysis token mapping.

I filed a bug with the plugin author to fix the offsets.

Doing Some Actual Analyzer Analysis
Despite the errors, I noticed some patterns in how the default analyzer and the Vietnamese analyzer differ. To further decrease the number of errors in tokenization and pre-/post-analysis mapping, I partially pre-tokenized the text by splitting everything on whitespace, stripping the most common leading and trailing punctuation, parens, quotes, brackets, etc., and, for performance, deduping the list of tokens, leaving one per line. This obviously doesn't allow for multi-word tokens, but it let me focus on processing at the word and character level.
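Here's a minimal sketch of that pre-tokenization step; the actual punctuation set I strip is a bit longer, so treat the details as simplified:

 # Characters stripped from the ends of whitespace-separated tokens.
 PUNCT = '()[]{}"\'«».,;:!?'
 
 def pre_tokenize(text):
     tokens = set()
     for tok in text.split():       # split on whitespace only
         tok = tok.strip(PUNCT)     # strip leading/trailing punctuation, etc.
         if tok:
             tokens.add(tok)        # dedupe for performance
     return sorted(tokens)          # written out one token per line afterwards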
 * There are a number of multi-word tokens. I'm not sure if that's a good thing or not. It seems that many of the extracted multi-word tokens do translate as (monomorphemic) single words in English, which is a good sign. However, I also noticed that sequences of capitalized words (often names of people, places, or organizations, for both Vietnamese and English words) were tokenized as one token. So, International Plant Names Index is one token.

Some more patterns emerged (a small probe script that reproduces several of these cases follows the list):
 * Tokens are split on non-Vietnamese characters. Not just the more "esoteric" characters, like phonetic alphabet characters, but any other diacriticked letter, including ç, ö, č, ğ, ı, ş, and š. I also tested å, ø, and ñ, which are all split on. Gràve and ácute accents on aeiou are not split on, nor are tildes on ã or õ. Umlaüts on aeiouy are split on. This is a potential problem for foreign words with "foreign" diacritics.
 * Non-Latin characters are also split on, so Cyrillic, Greek, Arabic, Hebrew, IPA, Armenian, Devanagari, Thai, Chinese, Japanese, and Korean are all split character by character. This is especially bad for alphabetic scripts that do have spaces (most of these except CJK).
 * This splitting interacts oddly with capitalization (possibly the name-matching multi-word tokenizer). año (Spanish "year") is tokenized as a + ñ + o. The capitalized version, Año is tokenized as añ + o. This is weird enough that I filed another bug with the plugin author.
 * There's even more odd behavior with other splitting characters that would normally be ignored: r=22 → r + 22, but R=22 → r= + 22—with the = retained in the token! Similarly, x&y vs X&Y, x*y vs X*Y, x+y vs X+Y, and others.

 * Tokens are not split on hyphens, slashes, or periods (- / .), unless either the first or last element is one character, or the hyphen, slash, or period is repeated. So xx-yy is unchanged, but x-y → x + y. However, xx-y-zz is not split, but xx--yy is split. Same for slashes and periods.
 * Backslashes (\) always seem to be split on.
 * Single digit numbers divided by hyphens, slashes, or periods are not split, so 1.1, 1/1, and 1-1 are not split.
 * Mixing numbers and letters is weird: x-1 and xx-1 are split, but x-11 is not. Same for slashes and periods.

 * Punctuation is generally ignored, but three periods in a row (...) are tokenized as an ellipsis.

 * Underscores and colons are split on by the Vietnamese analyzer, but not by the default analyzer. No problem, just a change.
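For reference, here's a small probe script in the same spirit, which reproduces several of the cases above; it reuses the hypothetical index and analyzer names from the earlier sketch:

 from elasticsearch import Elasticsearch
 
 es = Elasticsearch('http://localhost:9200')
 
 # Isolated strings that exercise the behaviors noted above.
 probes = ['año', 'Año', 'r=22', 'R=22', 'xx-yy', 'x-y', 'xx-y-zz',
           'xx--yy', '1.1', 'x-11', '...']
 
 for probe in probes:
     result = es.indices.analyze(index='vi_analysis_test', body={
         'analyzer': 'text_vi',
         'text': probe,
     })
     print(probe, '->', [t['token'] for t in result['tokens']])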

Conclusion
Unfortunately, I don't think this plugin is mature enough to use, especially on our projects, where foreign words are both common and important.

We should definitely check back in a few versions and see what improvements have been made.

Round 2
The plugin author fixed the majority of the offset problems and some of the odd behavior related to capitalization. So, I was able to do a much better job of analyzing the analyzer.

More coming soon...