User:TJones (WMF)/Notes/TextCat Optimization for frwiki eswiki itwiki and dewiki

Background
I’ve previously looked at optimizing the set of languages to use with TextCat for language detection on enwiki, and we have an A/B test in the works.

The next step is to do a similar analysis for other big wikis, based on query volume. The next four wikis are Italian, German, Spanish, and French Wikipedias. Due to technical difficulties and personal preferences, I will be looking at French, Spanish, Italian, and then German.

I’ve also done some preliminary work—corpus creation and initial filtering—on the next four candidate wikis: Russian, Japanese, Portuguese, and Indonesian. This was useful for defining and streamlining the process, especially with non-Latin Russian and Japanese.

Query Corpus Creation
The first step for each wiki is to extract a recent corpus of relevant queries, and then do some initial filtering to remove some of the dreck and other undesirable queries in the corpora.

Random sampling
Select 10K random queries from a recent one-week period (generally in March 2016) that meet the following criteria. (On one occasion, the Hive query got stuck in the Reduce stage; the only work-around I found was to extract the queries from a different week-long time period.)
 * Query came from the search box on .wikipedia.org
 * No more than one query from any given IP for any given day
 * No more than 30 queries per day from that IP
 * Only the _content index was searched (except for wikis that search multiple indexes by default)
 * Query had < 3 results

The Hive query for frwiki is here XXXX.
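As a sketch of what these criteria do, here is a hypothetical post-hoc version in Python. The field names (`ip`, `day`, `index`, `num_results`, `referer_is_search_box`) are invented for illustration and do not match the real Hive schema, and the multiple-index exception is ignored for simplicity:

```python
import random
from collections import defaultdict

def sample_queries(log, k=10000, per_ip_daily_cap=30, seed=0):
    """Hypothetical post-hoc version of the Hive sampling criteria.
    `log` rows are dicts; all field names here are assumptions made
    for illustration, not the real Hive schema."""
    by_ip_day = defaultdict(list)
    for row in log:
        if not row['referer_is_search_box']:
            continue                      # search-box queries only
        if row['index'] != 'content':
            continue                      # only the _content index
        if row['num_results'] >= 3:
            continue                      # poorly performing: < 3 results
        by_ip_day[(row['ip'], row['day'])].append(row['query'])

    rng = random.Random(seed)
    pool = []
    for queries in by_ip_day.values():
        if len(queries) > per_ip_daily_cap:
            continue                      # drop bot-like IPs (>30 queries/day)
        pool.append(rng.choice(queries))  # at most one query per IP per day
    return rng.sample(pool, min(k, len(pool)))
```

The per-IP limits serve two different purposes: the one-query-per-IP-per-day rule reduces the influence of any single user on the sample, while the 30-query cap excludes IPs that are probably bots.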

Initial filtering
There is a lot of junk in our query logs, and some of it is relatively easy to identify reasonably accurately. This filtering cuts down the query pool by 5-25% (median across the 8 languages so far: about 11%). The process is as follows:
 * Extract queries that have the same 1-to-10 character sequence at least three times in a row and manually review—these are mostly junk, but there are some good ones. Put the good ones back.
 * Extract queries that are nothing but consonants and spaces—there are more than you would think, and it works on non-Latin languages (like ru and ja), too! These are all junk, but they are reviewed anyway.
 * Extract queries that have four Latin characters in a row that are not in [aeiouhy]. Again, more than you would think, most are junk. Works in non-Latin languages, too. Works less well in German, but still found a lot of junk.
 * Sort and review the remaining queries:
 * Remove most queries with www, http, @, .com, .org, .net, .mobi, .biz, .xxx, .co.uk, and common TLDs for the language under review.
 * Remove queries that aren’t words—mostly numbers, things that look like serial numbers, ID numbers, phone numbers, addresses, etc.
 * Remove queries that are mostly or completely emoji.
 * Note any obvious “other” languages in use in the sample. Different scripts are really obvious because they are grouped when sorted. Other languages using the same script are hit or miss.
 * Incidentally remove any unwanted queries as they go by: proper names, chemical names, any other obvious gibberish not caught by the gibberish filters above.
 * Sort, uniq, and randomly order remaining queries.
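The three mechanical filters above can be approximated with regular expressions. This is a rough sketch; the patterns, character classes, and function name are my reading of the description, not the exact filters used, and the consonant classes are Latin-only here even though the real filters also work on non-Latin scripts:

```python
import re

# 1. The same 1-to-10 character sequence at least three times in a row.
REPEATS = re.compile(r'(.{1,10})\1\1')

# 2. Nothing but (Latin) consonants and spaces.
CONSONANTS_ONLY = re.compile(r'^[bcdfghjklmnpqrstvwxz ]+$', re.IGNORECASE)

# 3. Four Latin characters in a row, none of them in [aeiouhy].
NO_VOWEL_RUN = re.compile(r'[bcdfgjklmnpqrstvwxz]{4}', re.IGNORECASE)

def looks_like_junk(query):
    """Flag a query for manual review as probable junk."""
    return bool(
        REPEATS.search(query)
        or CONSONANTS_ONLY.match(query)
        or NO_VOWEL_RUN.search(query)
    )
```

As described above, flagged queries are still reviewed by hand, since the repeated-sequence filter in particular catches some good queries that need to be put back.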

Language annotation and further filtering
Once we’ve removed the really obvious junk, it’s time to manually review queries to create a corpus.

* Take the first 1000 queries from the filtered and randomized sample.

* Run current language identification on it, using the language of the wiki, English (which is everywhere), and any other languages noted during filtering. This is far from perfect, but when it works decently, it’s helpful and reduces context switching. Most of the non-junk queries identified from frwiki as French are in fact French.

* Skim language ID results and see if anything is obviously terrible (e.g., most of the “German” queries are obviously French) or obviously missing (oops, there’s a query in Armenian), and run again if necessary.

* Review and manually tag the queries, removing queries that are proper names (people, places, language names, companies, products, fictional characters, etc., etc.), acronyms, more gibberish (there’s always more gibberish), scientific terminology and other words that are extremely ambiguous and not specific to any one language, and anything that’s unidentifiable.

* Queries with typos are often left in, even though they make automatic identification hard.

* Longer queries that include a few “undesirable” words are kept. (e.g., “Le declin du système éducatif haïtien. Quelles en sont les causes fondamentales?” would be kept since it is mostly French, but “Haïti” would not because it’s a name.)

* Proper names that are made up of common nouns are kept. (e.g., names of movies, like “Seeking a Friend for the End of the World”, are often English phrases, and are kept.)

For French, this cut the query pool by about two thirds, leaving only one third (three hundred and something) of the queries. So the process was repeated on the next 1000 queries from the filtered and randomized sample.

The result is a corpus of 682 queries (from 2000 reviewed, after ~15% were previously removed) in French.

Thus, only ~30% of the queries that meet the criteria for possible language detection (< 3 results) are actually in an identifiable language. In production, many of the other 70% would also be labelled as being in a particular language, but those results (on names, acronyms, gibberish, etc.) are unpredictable, and results from another wiki may or may not be helpful. Hence the need for A/B testing after this analysis is done.

Corpus size

While 682 (the size of the French query corpus) is not a huge sample, it’s enough to get a sense of what languages are commonly present in these poor-performing queries, and optimize the choice of what languages to detect. The 95% confidence interval margin of error on a proportion (read more here: https://en.wikipedia.org/wiki/Margin_of_error#Calculations_assuming_random_sampling ; calculator here: http://www.scor.qc.ca/en_calculez.html ) is largest at a proportion of 50%. For a sample of size 500, that’s ±4.38%. For a smaller proportion, the absolute error is smaller, but larger relative to the proportion (e.g., ±0.87% for a proportion of 1% out of 500).
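The margins quoted above come straight from the normal-approximation formula for the margin of error on a proportion (z ≈ 1.96 for 95% confidence); a minimal sketch:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an observed proportion p and sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case is p = 0.5: for n = 500 the margin is ~4.38 points.
print(round(margin_of_error(0.50, 500) * 100, 2))  # 4.38
# A rarer category has a smaller absolute margin, but a larger relative one.
print(round(margin_of_error(0.01, 500) * 100, 2))  # 0.87
```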

Overall, though, that’s good enough for us to say things like, “Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.”—which is enough for us to optimize the languages to be used for language detection for accuracy and run-time performance.

French Results

Other languages searched on Frwiki

Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, about 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.

Below are the results for frwiki, with raw counts, percentages, and 95% margins of error.

In order, those are French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Corsican, Thai, Swahili, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton.

We don’t have query-trained language models for all of the languages represented here, such as Corsican, Swahili, Breton, Icelandic, Latin, or Hungarian. Since these each represent very small slices of our corpus (1-2 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,517 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Hebrew, and Korean queries.

Analysis and Optimization

Using all of the language models available, the performance report (for the 3000-ngram models* we use in enwiki) is below.

* I also ran tests on other model sizes, in increments of 500 up to 10,000. 3000 is still the best model size.
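For context, TextCat implements the classic Cavnar & Trenkle rank-order method: each language model is a list of the most frequent character n-grams from training text, truncated (here at 3000 n-grams), and a query is assigned the language whose model minimizes the total “out-of-place” rank distance. A simplified sketch of the technique (not the exact WMF implementation or its query-trained models):

```python
from collections import Counter

def ngram_profile(text, max_n=5, size=3000):
    """Ranked character n-gram profile, Cavnar & Trenkle style."""
    text = '_' + text.replace(' ', '_') + '_'  # mark word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the `size` most frequent n-grams, mapped to their ranks.
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(size))}

def out_of_place(query_profile, model_profile, size=3000):
    """Total rank displacement; n-grams missing from the model get the
    maximum penalty. Lower distance = better match."""
    return sum(abs(rank - model_profile.get(gram, size))
               for gram, rank in query_profile.items())

def detect(query, models):
    """Pick the language whose model is closest to the query profile."""
    qp = ngram_profile(query)
    return min(models, key=lambda lang: out_of_place(qp, models[lang]))
```

The `size` parameter corresponds to the model sizes tested above: a bigger profile keeps rarer n-grams, which helps up to a point and then adds noise.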

             f0.5      f1      f2  recall    prec  total  hits  misses
TOTAL       83.0%   83.1%   83.1%   83.1%   83.0%    681   566     116
French      95.5%   91.7%   88.1%   85.9%   98.3%    468   402       7
English     80.6%   75.5%   70.9%   68.2%   84.5%     88    60      11
Arabic     100.0%  100.0%  100.0%  100.0%  100.0%     66    66       0
Portuguese  62.5%   66.7%   71.4%   75.0%   60.0%     12     9       6
German      44.3%   50.0%   57.4%   63.6%   41.2%     11     7      10
Spanish     15.4%   20.8%   32.1%   50.0%   13.2%     10     5      33
Russian    100.0%  100.0%  100.0%  100.0%  100.0%      5     5       0
Chinese    100.0%  100.0%  100.0%  100.0%  100.0%      4     4       0
Dutch       21.3%   28.6%   43.5%   66.7%   18.2%      3     2       9
Corsican     0.0%    0.0%    0.0%    0.0%    0.0%      2     0       0
Italian      6.1%    9.1%   17.9%   50.0%    5.0%      2     1      19
Polish      29.4%   40.0%   62.5%  100.0%   25.0%      2     2       6
Armenian   100.0%  100.0%  100.0%  100.0%  100.0%      1     1       0
Breton       0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Hungarian    0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Icelandic    0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Latin        0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Swahili      0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Swedish      7.7%   11.8%   25.0%  100.0%    6.2%      1     1      15
Thai       100.0%  100.0%  100.0%  100.0%  100.0%      1     1       0

Spanish, Dutch, Italian, Polish, and Swedish do very poorly. Each has too few actual instances to get right, and those few correct identifications are heavily outweighed by the false positives these models generate.
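For reference, the scores in these reports are the standard Fβ measures; per the report's columns, "hits" are true positives and "misses" are false positives, so precision = hits / (hits + misses) and recall = hits / total. Reproducing the French row from the first table:

```python
def f_beta(precision, recall, beta):
    """F-measure; beta > 1 weights recall more, beta < 1 precision more."""
    return ((1 + beta**2) * precision * recall
            / (beta**2 * precision + recall))

# French row of the first report: 468 actual French queries,
# 402 hits (true positives), 7 misses (false positives).
total, hits, false_pos = 468, 402, 7
recall = hits / total                  # 85.9%
precision = hits / (hits + false_pos)  # 98.3%
print(round(f_beta(precision, recall, 0.5) * 100, 1))  # 95.5
print(round(f_beta(precision, recall, 1.0) * 100, 1))  # 91.7
print(round(f_beta(precision, recall, 2.0) * 100, 1))  # 88.1
```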

Portuguese and German are not great, either. I reran the analysis without Portuguese and German, and overall performance was better. Adding each back into the mix separately made the results worse in both cases.

As noted above, Greek, Hebrew, and Korean are present in the larger sample, and from earlier work on the balanced query sets ( https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries ), our models for these languages are very high accuracy.

So, I dropped Portuguese and German, added Greek, Hebrew, and Korean, and re-ran the performance report with the 3000-ngram models (to check the performance and double-check that Greek, Hebrew, and Korean aren’t causing problems). The results are below:

             f0.5      f1      f2  recall    prec  total  hits  misses
TOTAL       89.0%   89.1%   89.1%   89.1%   89.0%    681   607      75
French      94.8%   95.1%   95.5%   95.7%   94.5%    468   448      26
English     67.0%   74.9%   84.9%   93.2%   62.6%     88    82      49
Arabic     100.0%  100.0%  100.0%  100.0%  100.0%     66    66       0
Portuguese   0.0%    0.0%    0.0%    0.0%    0.0%     12     0       0
German       0.0%    0.0%    0.0%    0.0%    0.0%     11     0       0
Spanish      0.0%    0.0%    0.0%    0.0%    0.0%     10     0       0
Russian    100.0%  100.0%  100.0%  100.0%  100.0%      5     5       0
Chinese    100.0%  100.0%  100.0%  100.0%  100.0%      4     4       0
Dutch        0.0%    0.0%    0.0%    0.0%    0.0%      3     0       0
Corsican     0.0%    0.0%    0.0%    0.0%    0.0%      2     0       0
Italian      0.0%    0.0%    0.0%    0.0%    0.0%      2     0       0
Polish       0.0%    0.0%    0.0%    0.0%    0.0%      2     0       0
Armenian   100.0%  100.0%  100.0%  100.0%  100.0%      1     1       0
Breton       0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Hungarian    0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Icelandic    0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Latin        0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Swahili      0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Swedish      0.0%    0.0%    0.0%    0.0%    0.0%      1     0       0
Thai       100.0%  100.0%  100.0%  100.0%  100.0%      1     1       0

Recall went up and precision went down for French and English, but overall performance improved. Queries in unrepresented languages were all identified as either French or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

Best Options

The optimal settings for frwiki, based on these experiments, would be to use models for French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean (fr, en, ar, ru, zh, th, el, hy, he, ko), using the default 3000-ngram models.

Next Up

* Spanish

* Italian

* German