User:TJones (WMF)/Notes/Vietnamese Analyzer Analysis

July / August 2017 — See TJones_(WMF)/Notes for other projects. See also T170423.

Extracting the Corpora
Following my now-standard process, I extracted a reasonable corpus of Vietnamese Wiki articles for test corpora. Vietnamese Wikipedia articles are on average shorter than, say, English or French Wikipedia articles, so I extracted more articles to get a good-sized corpus. However, it turns out that I didn't need it...

Round 1
(July 2017)

The plugin author has been great about fixing bugs, so I will do multiple rounds of analysis as long as major fixes occur. My original analysis is now "Round 1".

Building the Plugin
The Vietnamese plugin doesn't have releases for every minor version of Elasticsearch, but the author does have relatively easy-to-follow build instructions for intermediate versions.

I took the code released for 5.3.1 and built my own version for 5.3.2, which is the version of Elasticsearch that Vagrant is using these days. I didn't notice any problems, and I installed it without difficulty.

Given the unexpected behavior of the plugin (see below), I worried I might have had some build problems. I tried updating my Vagrant/MediaWiki instance to ES 5.4.1, but that came with a whole host of other problems—which is why upgrading ES is usually a quarter-long project; I should've known better.

Failing to Re-Index
After installing the plugin I configured my MediaWiki instance for Vietnamese (but not to use the plugin), and ran my corpora through the default Vietnamese configuration without problems.

I configured the plugin and re-indexed my local wiki, which has fewer than 30 articles in it. They are an eclectic bunch, including text in several languages and scripts. Something in there is causing null pointer exceptions. The on-screen output is condensed JSON with a dozen or so such messages.

I've put the stack trace on a sub-page, which I will point the plugin author to, as well.

I think the failure may be related to the analysis problem, detailed below.

Failing to Fully Analyze
My analysis tool works by calling the Elasticsearch analyzers directly rather than actually indexing the text. I was able to set up the Vietnamese analyzer under a name that MediaWiki ignored while re-indexing, so that I was able to run it using the tool.
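For context, the tool talks to Elasticsearch's _analyze endpoint. A minimal Python sketch of the request and response handling (the analyzer name vi_analyzer matches the plugin's; the helper names and canned response are illustrative, not my actual tool):

```python
def analyze_request(text, analyzer="vi_analyzer"):
    """Build the request body for Elasticsearch's _analyze endpoint
    (POST /{index}/_analyze in ES 5.x)."""
    return {"analyzer": analyzer, "text": text}

def extract_tokens(response):
    """Pull (token, start_offset, end_offset) triples out of an
    _analyze response."""
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in response["tokens"]]

# A canned response in the shape ES returns, for illustration:
canned = {"tokens": [
    {"token": "không gian", "start_offset": 0, "end_offset": 10,
     "type": "word", "position": 0},
]}
print(extract_tokens(canned))  # [('không gian', 0, 10)]
```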

The tool uses the offset information provided by the analyzer to identify tokens and map un-analyzed tokens to analyzed tokens. I've had minor problems before: in one part of the Chinese analysis chain, high surrogates are treated as separate characters (see also T168427), confusing the offsets.

However, in the Vietnamese plugin, offset info is computed more like a counter than a proper offset. The offsets are computed as though each token is separated by a single space. If there are other ignored characters (extra spaces, parens, punctuation, etc) or no intervening characters (as with CJK text, which is tokenized as unigrams), the offsets are wrong. The longer the text being processed, the more wrong the offsets can get.
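To illustrate the failure mode (a simulation, not the plugin's actual code): if offsets are reconstructed by assuming exactly one space between tokens, they drift as soon as the real separators differ:

```python
def naive_offsets(tokens):
    """Reconstruct offsets as if tokens were joined by single spaces —
    the counter-like behavior described above."""
    out, pos = [], 0
    for t in tokens:
        out.append((t, pos, pos + len(t)))
        pos += len(t) + 1  # assume exactly one separating space
    return out

def true_offsets(text, tokens):
    """Find each token's real position by scanning the source text."""
    out, pos = [], 0
    for t in tokens:
        start = text.index(t, pos)
        out.append((t, start, start + len(t)))
        pos = start + len(t)
    return out

text = "foo  (bar)  baz"          # extra spaces and parens
tokens = ["foo", "bar", "baz"]
print(naive_offsets(tokens))      # [('foo', 0, 3), ('bar', 4, 7), ('baz', 8, 11)]
print(true_offsets(text, tokens)) # [('foo', 0, 3), ('bar', 6, 9), ('baz', 12, 15)]
```

The drift is cumulative, which is why longer texts get worse.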

I changed my script to process line by line. That's much less efficient, but produced fewer—but still very many—errors in my pre-/post-analysis token mapping.

I filed a bug with the plugin author to fix the offsets.

Doing Some Actual Analyzer Analysis
Despite the errors, I noticed some patterns between the default analyzer and the Vietnamese analyzer. In order to further decrease the number of errors in tokenization and pre-/post-analysis mapping, I partially pre-tokenized the text by splitting everything on whitespace, stripping the most common leading and trailing punctuation, parens, quotes, brackets, etc., and—for performance—deduping the list of tokens, leaving one per line. This obviously doesn't allow for multi-word tokens, but I was able to focus on processing at the word and character level.
 * There are a number of multi-word tokens. I'm not sure if that's a good thing or not. It seems that many of the extracted multi-word tokens do translate as (monomorphemic) single words in English, which is a good sign. However, I also noticed that sequences of capitalized words (often names of people, places, or organizations, for both Vietnamese and English words) were tokenized as one token. So, International Plant Names Index is one token.
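The pre-tokenization step can be sketched roughly like this (my actual tool differs in details; the punctuation set here is illustrative):

```python
def pre_tokenize(text):
    """Split on whitespace, strip common leading/trailing punctuation,
    and dedupe, keeping first-seen order — one token per line."""
    seen, out = set(), []
    for tok in text.split():
        tok = tok.strip('()[]{}"\'«».,;:!?')
        if tok and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out

print(pre_tokenize("Hà Nội (Hà Nội) là thủ đô, là..."))
# ['Hà', 'Nội', 'là', 'thủ', 'đô']
```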

Some more patterns emerged:
 * Tokens are split on non-Vietnamese characters. Not just the more "esoteric" characters, like phonetic alphabet characters, but any other diacriticked letter, including ç, ö, č, ğ, ı, ş, and š. I also tested å, ø, and ñ, which are all split on. Gràve and ácute accents on aeiou are not split on, and neither are tildes on ã or õ. Umlaüts on aeiouy are split on. This is a potential problem for foreign words with "foreign" diacritics.
 * Non-Latin characters are also split on, so Cyrillic, Greek, Arabic, Hebrew, IPA, Armenian, Devanagari, Thai, Chinese, Japanese, and Korean are all split character by character. This is especially bad for alphabetic scripts that do have spaces (most of these except CJK).
 * This splitting interacts oddly with capitalization (possibly the name-matching multi-word tokenizer). año (Spanish "year") is tokenized as a + ñ + o. The capitalized version, Año is tokenized as añ + o. This is weird enough that I filed another bug with the plugin author.
 * There's even more odd behavior with other splitting characters that would normally be ignored: r=22 → r + 22, but R=22 → r= + 22—with the = retained in the token! Similarly, x&y vs X&Y, x*y vs X*Y, x+y vs X+Y, and others.


 * Tokens are not split on hyphens, slashes, or periods (- / .), unless either the first or last element is one character, or the hyphen, slash, or period is repeated. So xx-yy is unchanged, but x-y → x + y. However, xx-y-zz is not split, but xx--yy is split. Same for slashes and periods.
 * Backslashes (\) seem to always be split on.
 * Single digit numbers divided by hyphens, slashes, or periods are not split, so 1.1, 1/1, and 1-1 are not split.
 * Mixing numbers and letters is weird. x-1 and xx-1 are split, but x-11 is not split. Same for hyphens and slashes.


 * Punctuation is generally ignored, but a sequence of three periods (...) is tokenized as an ellipsis.


 * Underscores and colons are split on by the Vietnamese analyzer, but not the default analyzer. No problem, just a change.

Conclusion
Unfortunately, I don't think this plugin is mature enough to use, especially on our projects, where foreign words are both common and important.

We should definitely check back in a few versions and see what improvements have been made.

Round 2
(August 2017)

The plugin author was very quick about making improvements to the plugin, and he fixed the offset errors that had made it impossible to run my analysis. So, I went back and tried again. Unfortunately, the analyzer still threw some exceptions when I tried to reindex my local documents. I was able to track down the problems and I've opened additional tickets.

Some Problems
There are still some problems with the plugin, including some that throw exceptions, and some that give incorrect or inconsistent results.

Whitespace-related errors
The fatal blockers are exceptions. I found a number of cases related to whitespace that cause indexing to fail:
 * Two newlines in a row can cause an error. Analyzing  a gives a string index out of bounds exception.
 * This analyzer creates some multi-word tokens, for example, không gian. If there's an extra space between those tokens there's an offset=-1 exception:

không gian


 * Similarly, if there's no space between the tokens, as in, there's an offset=-1 exception.

I filed a bug for these.
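One way to catch this class of bug in the future is to run whitespace variants of a known multi-word token through the analyzer. A sketch of the variant generator (the variants are the three cases above; không gian is the example token from the text):

```python
def whitespace_variants(phrase):
    """Generate the spacing variants that triggered exceptions:
    doubled spaces, no spaces, and doubled newlines."""
    return [
        phrase,
        phrase.replace(" ", "  "),    # extra space   -> offset=-1 exception
        phrase.replace(" ", ""),      # no space      -> offset=-1 exception
        phrase.replace(" ", "\n\n"),  # two newlines  -> index out of bounds
    ]

print(whitespace_variants("không gian"))
```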

Very slow run time
While this isn't an error, it is a big concern.

In my tests, processing Vietnamese Wikipedia articles in 100-line batches, the analyzer is 30x to 40x slower than the default analyzer.

Processing 5,000 articles (just running the analysis, not full indexing) took ~0:17 for the default analyzer. For vi_analyzer on the same text, it took ~8:05. For comparison, I ran the English analyzer on the same text, and it also took ~0:17. These are on my laptop running on a virtual machine, so the measurements aren't super precise, but I did run several batches of different sizes (100 articles, 1,000 articles, and 5,000 articles) and the differences are in the same 30x-40x range, with smaller batches being comparatively slower. Somewhere in the 3x-5x range for complex analysis might be bearable, but 30x may be too much.
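Just the arithmetic from those timings (the 5,000-article batch alone works out to just under 30x; the smaller batches pushed the ratio higher):

```python
def seconds(mmss):
    """Convert an mm:ss timing string to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

default_t = seconds("0:17")  # default analyzer, 5,000 articles
vi_t      = seconds("8:05")  # vi_analyzer, same text

print(round(vi_t / default_t, 1))  # 28.5 — for this particular batch
```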

I filed a bug for this, too, suggesting some profiling to find speed improvements.

Repeated tokens get incorrect offsets
This is another potential show stopper, since it would mess up phrase matching.

Repeated tokens get incorrect offsets, especially in the presence of extra whitespace and whitespace-like characters. I found examples with spaces and with a right-to-left mark followed by a space. (I left out the right-to-left mark example because it's invisible). I didn't test any other space-like characters.


 * The string  gets indexed as you'd expect.
 * Add a leading space, and both "b" tokens have the same offsets.
 * Add another leading space and both "a" tokens share offsets, and both "b" tokens share offsets.

" a b a b"

(It's hard to get wikitext to display two spaces properly!)
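A simple check for this bug is to look for distinct tokens in the stream that report identical offsets (a sketch; token tuples here are (term, start, end)):

```python
from collections import Counter

def duplicated_offsets(tokens):
    """Return (start, end) spans reported by more than one token —
    a symptom of the repeated-token bug described above."""
    counts = Counter((start, end) for _term, start, end in tokens)
    return [span for span, n in counts.items() if n > 1]

# Hypothetical token stream where both "b" tokens share offsets:
buggy = [("a", 1, 2), ("b", 3, 4), ("a", 5, 6), ("b", 3, 4)]
print(duplicated_offsets(buggy))  # [(3, 4)]
```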

Below is a real life example from Vietnamese Wikipedia, showing more long-distance duplicates. (Sorry if the text doesn't make sense. I edited out a bit of Arabic script, which also had a right-to-left mark in it, which is not visible.) There are two spaces at the beginning, and two spaces after "Hán". I've bolded the bits that get indexed the same.

" ULY: Lop Nahiyisi, UPNY: Lop Nah̡iyisi ? ; giản thể: 洛浦县; bính âm: Luòpǔ xiàn, Hán  Abraxas friedrichi là một loài bướm đêm trong họ Geometridae. Dữ liệu liên quan tới Abraxas friedrichi tại Wikispecies"

Interestingly, Nah̡iyisi is indexed as three parts: Nah + ̡  + iyisi. The pieces Nah and iyisi are incorrectly indexed into the earlier unbroken Nahiyisi. See the bug below for more detailed JSON output.

GitHub bug.

More capitalization inconsistencies
Below are some more examples (as in Round 1) of strings that differ only by case but are tokenized differently.

a1b2c3 => a 1 b 2 c 3
1a2b3c => 1 a 2 b 3 c

A1B2C3 => a1 b2 c3
1A2B3C => 1 a2 b3 c

aa1bbb24cccc369 => aa 1 bbb 24 cccc 369
AA1BBB24CCCC369 => AA1 BBB2 4 CCCC3 69

a_b => a b
A_b => A_ b
a_ => a
A_ => a_

1000x1000 => 1000 x 1000
1000X1000 => 1000 x1 000

X.Jones => x.j ones
X.jones => x jones
x.Jones => x jones
x.jones => x jones

X.Y.Jones => x.y.j ones
X.Y.jones => x.y jones
X.y.jones => x y jones

XX.YY.JJones => xx.yy.jjones

Names in the "X.Y.Jones" format came up a lot in my sample corpus.
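These examples all break a property a tokenizer should normally satisfy: lowercasing the input first should give the same tokens as tokenizing and then lowercasing. A sketch of the check, with stand-in analyzers (the real plugin is what fails this; mimic just hard-codes the observed outputs for one pair above):

```python
def case_invariant(analyze, text):
    """True if tokenizing the lowercased text matches the
    lowercased tokens of the original text."""
    return analyze(text.lower()) == [t.lower() for t in analyze(text)]

# Trivial stand-in analyzer: whitespace split plus lowercasing.
def toy(s):
    return s.lower().split()

# Stand-in reproducing the observed plugin outputs for one pair:
def mimic(s):
    return ["x.y.j", "ones"] if s == "X.Y.Jones" else ["x", "y", "jones"]

print(case_invariant(toy, "X.Y.Jones rocks"))  # True
print(case_invariant(mimic, "X.Y.Jones"))      # False — the bug
```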

GitHub bug.

Number errors and inconsistencies
There are a bunch of errors and small inconsistencies in the way numbers are parsed:

1) Integers followed by a close paren, and then any character but a space or return, make the paren part of the number token.

(10) => 10
(10), => 10)
(10)e => 10) e

but:

(10.2) => 10.2
(10.2), => 10.2

2) Periods after numbers before spaces and tabs are not stripped:

10.\n10.\t10. 10. => 10  10.   10.   10

but:

2.10.\n2.10.\t2.10. 2.10. => 2.10  2.10   2.10   2.10

3) Period after number is inconsistently stripped:

10. => 10
10._ => 10.
10., => 10.
10.. => 10.

GitHub bug.

Contextual inconsistencies
Seemingly unrelated characters around a token can sometimes change how it is parsed in unexpected ways.

1) Dash-connected-words are split differently depending on other words and punctuation:

xxx-yyy-zzz => xxx-yyy-zzz
w xxx-yyy-zzz => w + xxx + yyy-zzz

but:

. xxx-yyy-zzz => xxx-yyy-zzz
w. xxx-yyy-zzz => w + xxx-yyy-zzz
w, xxx-yyy-zzz => w + xxx-yyy-zzz
w- xxx-yyy-zzz => w + xxx-yyy-zzz

Any of these characters can come after the letter and preserve the dash-connected words.

2) Tabs are treated differently depending on what's around them:

x\ty => x\ty

but:

x\ty z => z + y + z
z x\ty => z + x + y

GitHub bug.

Many patterns need better word boundaries
1) Slashes: In general, slash-separated sequences are treated according to how date-like or fraction-like they look, even in inappropriate contexts. Parsing dates is fine, but not everything that looks like a date is one, and the MM/DD/YYYY format is not parsed. Parsing fractions is fine, too, but both dates and fractions should have proper word boundaries around them:

15/11/1910 => 15/11/1910
11/15/1910 => 11/15 + 1910
10/10/10 => 10/10/10
10/30/10 => 10/30 + 10
10/10/10/10/10/10 => 10/10/10 + 10/10/10
10/10/10/10/40/10 => 10/10/10 + 10/40 + 10

2) Dashes: Again, parsing dates is fine, though not everything that looks like a date is one, and that causes inconsistencies. MM-DD-YYYY is not matched. Better word boundary detection would make more sense.

15-11-1910 => 15-11-1910
11-15-1910 => 11-15 + -1910
10-10-10 => 10-10-10
10-30-10 => 10-30 + -10
10-10-10-10-10-10 => 10-10-10 + -10 + -10 + -10
10-40-10-10-10-10 => 10-40 + -10 + -10 + -10 + -10
10-000-10-10-10-10 => 10-00 + 0 + -10 + -10 + -10 + -10

0-915826-22-4 => 0 + -915826 + -22 + -4
2-915826-22-4 => 2-9158 + 26-2 + 2-4

x-1y => x -1 y
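The fix suggested in the bug is straightforward in principle: date-like patterns should only match when bounded by non-token characters. A hedged regex sketch of the idea (not the plugin's actual pattern):

```python
import re

# D/M/YYYY or DD/MM/YYYY, but only when not glued to more
# digits, slashes, or dashes on either side.
DATE = re.compile(r"(?<![\d/-])\d{1,2}/\d{1,2}/\d{4}(?![\d/-])")

print(bool(DATE.fullmatch("15/11/1910")))      # True — a clean date
print(bool(DATE.search("10/10/10/10/40/10")))  # False — no match inside a longer run
```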

3) URL domain matching is too aggressive: Domain matching doesn't limit itself to word boundaries, and so can break up tokens oddly, especially when there are accented characters:

Daily.ngày => Daily.ng + ày
aaa.eee.iii => aaa.eee.iii
aáa.eee.iii => aáa + eee.iii
aaa.eée.iii => aaa.e + ée + iii
aaa.eee.iíi => aaa.eee.i + íi

GitHub bug.

Better handle combining diacritics
Combining characters (including diacritics and other characters in non-Latin scripts) cause tokens to split. Some examples from various scripts:


 * ضَمَّة — Arabic
 * বাংলা — Bengali
 * Михаи́л — Cyrillic + combining acute
 * दिल्ली — Devanagari
 * ಅಕ್ಷರಮಾಲೆ — Kannada
 * áa — Latin + combining acute (vs áa with precomposed character)
 * ଓଡ଼ିଆ — Oriya
 * தமிழ் — Tamil
 * తెలుగు — Telugu
 * อักษรไทย — Thai
 * g͡b — International Phonetic Alphabet

All of these get split in bad ways. For some scripts, like Telugu, it ends up splitting on every character, because every other character is a combining character.
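Python's unicodedata module shows what's going wrong: a combining mark should attach to the preceding base character, not start a new unit. A minimal sketch of combining-aware clustering (handles combining marks only, nothing else):

```python
import unicodedata

def clusters(text):
    """Group each combining mark with its preceding base character."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the previous base
        else:
            out.append(ch)
    return out

decomposed = "a\u0301a"      # 'áa' with a combining acute
print(clusters(decomposed))  # ['á', 'a'] — two units, not three
```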

I opened a GitHub issue suggesting this could be handled better.

Whitespace and punctuation gets indexed
These whitespace and zero-width characters get indexed, but probably shouldn't be:


 * no break space U+00A0
 * en space U+2002
 * em space U+2003
 * three-per-em space U+2004
 * four-per-em space U+2005
 * six-per-em space U+2006
 * figure space U+2007
 * punctuation space U+2008
 * thin space U+2009
 * hair space U+200A
 * zero width space U+200B
 * zero width non-joiner U+200C
 * zero width joiner U+200D
 * left-to-right mark U+200E
 * right-to-left mark U+200F
 * narrow no-break-space U+202F
 * medium mathematical space U+205F
 * zero width no-break space U+FEFF

These punctuation characters get indexed, too, but probably don't need to be:


 * ‐ (hyphen U+2010, not -, which is "hyphen-minus")
 * – (en dash)
 * — (em dash)
 * ′ (prime)
 * ″ (double prime)
 * （ (fullwidth paren)
 * ） (fullwidth paren)
 * ． (fullwidth period)
 * ： (fullwidth colon)

I opened a GitHub issue suggesting these don't need to be indexed.
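A cheap guard against both lists is to drop any token made up entirely of separator, invisible format, or punctuation characters. A sketch using Unicode general categories (the plugin would presumably need the Java equivalent):

```python
import unicodedata

def is_junk_token(token):
    """True if every character is a separator (Z*), an invisible
    format character (Cf), or punctuation (P*)."""
    return all(
        unicodedata.category(ch)[0] in ("Z", "P")
        or unicodedata.category(ch) == "Cf"
        for ch in token
    )

print(is_junk_token("\u00a0"))  # no-break space   -> True
print(is_junk_token("\u200d"))  # zero width joiner -> True
print(is_junk_token("–"))       # en dash           -> True
print(is_junk_token("việt"))    # real word         -> False
```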

Programmer, debug thyself
So, working with this plugin exposed a few holes in my own tools, including the need for error detection. (It never failed before, really!)

I'll be updating them soon, too.

Some Analysis
I munged my corpus to remove all extra spaces and newlines, and found the few cases where  seemed to be a problem. Then I was able to run an analysis.

Token counts

Total old tokens: 597,408
Total new tokens: 207,404


 * This includes dropping some high-freq tokens that look like stop words (500+ occurrences each, 88,279 tokens in all); the long tail of dropped tokens could include a lot more stop words.
 * Multi-word tokens: there are 9,531 types, and 51,237 tokens in the multi-word tokens list, i.e., >51K one-word tokens are lost, since all multi-word tokens are 2+ words, and many are longer.
 * Repeated words lost. See "Repeated tokens get incorrect offsets" above. My tool doesn't count anything if the same token comes from the same offsets repeatedly, so when indexing the string below it would only count two tokens, not four. I didn't try to quantify this.

" a b a b"

Type counts

Old post-analysis types: 33,587
New post-analysis types: 42,582

There are lots more distinct types. This is the reverse of the more typical case, and is caused by the number of multi-word tokens.

There are no new collisions, and a very small number of new splits (5 total). Changes are caused by ß not mapping to ss anymore, whitespace and zero-width characters being stripped off tokens, and one case of soft hyphen being split on.

Other changes


 * measures like 0,68m 1,5GB 1000x1000 10h20 11'59 12Ma are split at the letter/number boundary
 * words are generally broken on apostrophes (straight and curly), periods, colons, and underscores
 * words are split on combining characters (see above)
 * numbers hold on to +/- signs and % signs: +1 -10 -0,4 10%
 * URLs are not split up

Some Conclusions
The plugin is still not ready for prime time. If the plugin author makes updates, I'll try to circle back for more analysis.

I'm not sure whether detecting dates and fractions and holding on to plus signs, minus signs, and percent signs is good for our wikis. On the other hand, these are relatively uncommon, and even if they caused some problems it could be more than offset by the improvement from Vietnamese-specific processing of the text.

I'm going to see if I can get some more info on Vietnamese and how people search in it, to find out what Vietnamese speakers need, whether this plugin provides it, and whether we can get it some other way.