User:TJones (WMF)/Notes/Homoglyphs

Background and Examples

I've been wanting us to do some sort of automatic homoglyph correction for a long time, and recently Maryum took on the project and built an Elasticsearch plugin to handle Latin/Cyrillic homoglyphs.

The basic idea is straightforward. If a token (usually a word) has both Latin and Cyrillic characters in it, then:

  • Try to transform all the Latin characters to Cyrillic; if there are no Latin characters left, then add the Latin-less token to the token stream
  • Try to transform all the Cyrillic characters to Latin; if there are no Cyrillic characters left, then add the Cyrillic-less token to the token stream
  • Keep the original token
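
Here's a minimal Python sketch of that logic. The homoglyph tables here are partial, hypothetical ones I've put together for illustration; the actual plugin is a Java token filter with much more complete tables.

```python
import unicodedata

# Partial homoglyph tables for illustration only; the real plugin's
# tables cover many more characters. Note that the pairs are
# case-sensitive: B maps to В, but b has no Cyrillic look-alike.
LATIN_TO_CYRILLIC = {
    "a": "а", "c": "с", "e": "е", "o": "о", "p": "р", "x": "х", "y": "у",
    "A": "А", "B": "В", "C": "С", "E": "Е", "H": "Н", "K": "К",
    "M": "М", "O": "О", "P": "Р", "T": "Т", "X": "Х",
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}

def is_latin(ch):
    return "LATIN" in unicodedata.name(ch, "")

def is_cyrillic(ch):
    return "CYRILLIC" in unicodedata.name(ch, "")

def homoglyph_variants(token):
    """Return the extra tokens to index alongside a mixed-script token."""
    if not (any(is_latin(c) for c in token)
            and any(is_cyrillic(c) for c in token)):
        return []  # not mixed Latin/Cyrillic; nothing to add
    variants = []
    # Try to transform all the Latin characters to Cyrillic.
    all_cyr = "".join(LATIN_TO_CYRILLIC.get(c, c) for c in token)
    if not any(is_latin(c) for c in all_cyr):
        variants.append(all_cyr)
    # Try to transform all the Cyrillic characters to Latin.
    all_lat = "".join(CYRILLIC_TO_LATIN.get(c, c) for c in token)
    if not any(is_cyrillic(c) for c in all_lat):
        variants.append(all_lat)
    return variants  # the original token is always kept as well
```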

As an example, chocоlate (with a Cyrillic о in the middle; this was the first word with homoglyphs I ever noticed) would be converted to chocolate (all Latin), and both would be indexed. There's no all-Cyrillic variant because not all of the Latin letters have Cyrillic homoglyphs.

Meanwhile, aрaсe (with Cyrillic consonants and Latin vowels) would be converted to all-Latin apace and all-Cyrillic арасе and all three would be indexed.

KoЯn (mostly Latin, with Cyrillic Я) would only be indexed as-is, because there is no way to map all of the Latin letters to Cyrillic or all the Cyrillic letters to Latin.
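
Running the sketch above on these three examples gives the expected variants (with the partial tables shown):

```python
for token in ["chocоlate", "aрaсe", "KoЯn"]:
    print(token, "->", homoglyph_variants(token))
# chocоlate -> ['chocolate']      (no all-Cyrillic variant: h, l, t don't map)
# aрaсe     -> ['арасе', 'apace']
# KoЯn      -> []                 (only the original token is indexed)
```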

Data, Implementation, and Results

I sampled 10,000 articles each from English, French, Polish, Russian, and Serbian Wikipedias for testing. On my volunteer account I've found and corrected many homoglyphs on each of those Wikipedias.

Maryum ran before-and-after comparison reports on all five corpora. She added her homoglyph token filter as early as possible in the analysis chain, because homoglyph detection is case-sensitive and has to run before lowercasing (e.g., В and B are homoglyphs, but в and b are not).
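
For illustration, here's roughly what that ordering looks like in an Elasticsearch analyzer definition, written as a Python dict of index settings; the filter name homoglyph_norm is an assumption on my part, not necessarily the plugin's actual name.

```python
# Hypothetical analyzer settings showing the filter ordering;
# "homoglyph_norm" is an assumed name for the plugin's token filter.
index_settings = {
    "analysis": {
        "analyzer": {
            "text_with_homoglyphs": {
                "type": "custom",
                "tokenizer": "standard",
                # The homoglyph filter runs before lowercasing, since
                # В and B are homoglyphs but в and b are not.
                "filter": ["homoglyph_norm", "lowercase"],
            }
        }
    }
}
```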

As Maryum reported on Phabricator:

From the analysis chain analysis comparing the chain with and without the homoglyph token filter on a sample of 10,000 random articles for each language:

  • Russian was the most impacted language during testing, with 1,064 new tokens added by the plugin from a sample of 2,911,553 tokens (0.037%)
  • Serbian had 154 new tokens generated out of a sample of 1,396,669 tokens (0.011%)
  • Polish had 32 new tokens generated out of a sample of 1,559,745 tokens (0.002%)
  • English had 30 new tokens generated out of a sample of 3,165,891 tokens (0.001%)
  • French had 7 new tokens generated out of a sample of 2,711,550 tokens (<0.001%)

I did a little further digging into the new tokens and collisions (i.e., words that are newly grouped with other words), using Maryum's comparison reports:

For English, all the new tokens are either all Cyrillic or all Latin, so that's good. There are only 8 new collisions in this sample, which are all like Frӧbel/Fröbel and Алeксандрович/Александрович, which is exactly what we want.

For French, we have a couple of mixed Cyrillic/Greek tokens generated from a mixed Latin/Cyrillic/Greek token, which is weird but fine. There are no new collisions in this sample, so the impact on the full Wikipedia will be small, but it should be positive.

For Polish, we have one mixed Cyrillic/number token, generated from a mixed Latin/Cyrillic/number token, which is good. There are only 12 new collisions, and, like the English sample, they are all the kind we'd want: Kozerodа/Kozeroda and комiтет/Комітет.

For Russian, we have a comparatively large number of mixed Latin/Cyrillic/number tokens that generate Latin/number or Cyrillic/number tokens, but that's fine. Russian has a lot more collisions—347—but they are all of the expected type: Сhristopher/Christopher and Беларусi/Беларусі.

For Serbian, we have about 30 unexpected mixed-script tokens! Some are homoglyphs and some are not. Because Serbian has both Cyrillic and Latin alphabets, and both are used on the wiki (with automatic transliteration between them available), we convert all Cyrillic text into Latin text as part of the stemming process, because the actual stemming only works on Latin text.

Some of the source tokens for these are non-Serbian, like Belarusian "Блакiтная", which uses Cyrillic і instead of и. Serbian uses Cyrillic и and Latin i, so it's often easier for writers to type the Latin variant, and thus we get a mixed-script input like Блакiтная. However, Serbian doesn't have я, so when Блакiтная is converted to Latin, we get blakitnaя. When we convert the Latin i to Cyrillic і, giving Блакітная, its transliteration is blakіtnaя, with two Cyrillic characters (і and я). This is actually okay, because now Блакiтная and Блакітная will share an underlying token (blakіtnaя) at search time and will be able to find each other, even if their internal representation is a bit weird. That was the goal all along.
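
A sketch of that interaction, assuming a partial Serbian transliteration table that covers just the letters in this example (the real transliteration and stemming are much more involved):

```python
# Partial Serbian Cyrillic-to-Latin table, for illustration only.
# Non-Serbian Cyrillic letters such as і and я have no entry and
# pass through unchanged, producing the mixed tokens described above.
SERBIAN_CYR_TO_LAT = {
    "б": "b", "л": "l", "а": "a", "к": "k", "и": "i", "т": "t", "н": "n",
}

def serbian_translit(token):
    # The real chain lowercases and transliterates as part of stemming.
    return "".join(SERBIAN_CYR_TO_LAT.get(c, c) for c in token.lower())

print(serbian_translit("Блакiтная"))  # blakitnaя (Latin i; Cyrillic я remains)
print(serbian_translit("Блакітная"))  # blakіtnaя (Cyrillic і and я remain)
```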

The Serbian sample had 43 new collisions and, despite the weird tokens, they are all of the desirable type: Соw/Cow and Беларусi/Беларусі.

In general, multi-script languages that use the two scripts that we are testing for homoglyphs may sometimes generate these kinds of weird tokens, but they aren't any worse than existing multi-script tokens, and they are relatively small in number, at least in the Serbian sample.

Next Steps

Now that the plugin has been built, it needs to be deployed; the homoglyph filter needs to be added to all the analysis chain configs and those changes deployed; and then, finally, all wikis will need to be re-indexed.

Maryum will likely be working on these tasks in the near future. (Thanks, Maryum!)