User:TJones (WMF)/Notes/English Homoglyph Before and After Reindexing Report

From mediawiki.org

April 2021 — See TJones_(WMF)/Notes for other projects. See also T274200.

Background

In addition to trying to measure the impact of the homoglyph plugin on English Wikipedia, I'm also experimenting with the tool I created to do the measuring, and trying to get a sense of the kinds of "natural" changes that happen to a wiki over time.

Data

I pulled a 10K sample of English Wikipedia queries from 2021-02-01 to 2021-02-08. I used the usual sampling filters:

  • the sample is across a week, to account for any cyclical effects (weekend queries may be different from weekday queries)
  • sampled queries are limited to one query per IP per day (to reduce bots that slip through other filters, and so power users aren't over represented)
  • we exclude any IPs with more than 100 queries in a day (to reduce the number of bots and other atypical users)
  • we require the session to include a near-match query, which runs before a full-text query and originates in the search box in the upper right (or left) corner of the page (again, to filter in favor of human searchers)
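
As a rough illustration of the per-IP filters above, here is a minimal sketch. It assumes the log is available as hypothetical `(ip, day, query)` tuples, which is not the real log format, and it does not model the near-match session requirement at all.

```python
import random
from collections import defaultdict

def sample_queries(records, max_per_day=100, seed=0):
    """Pick one random query per IP per day, skipping heavy-traffic IPs.

    `records` is an iterable of (ip, day, query) tuples. This shape is an
    assumption made for illustration only.
    """
    rng = random.Random(seed)
    by_ip_day = defaultdict(list)
    for ip, day, query in records:
        by_ip_day[(ip, day)].append(query)

    # Any IP with more than `max_per_day` queries on some day is dropped entirely.
    heavy_ips = {ip for (ip, _), qs in by_ip_day.items() if len(qs) > max_per_day}

    sampled = []
    for (ip, _day), qs in sorted(by_ip_day.items()):
        if ip not in heavy_ips:
            sampled.append(rng.choice(qs))  # one query per IP per day
    return sampled
```

Sampling across the full week then just means feeding in records from all seven days.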

The queries were lightly normalized—lowercased, strings of whitespace converted to a single space, and leading and trailing whitespace trimmed—and deduplicated.
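
The normalization and deduplication described above can be sketched roughly as follows. The function names are illustrative, not the ones used in the actual tooling.

```python
import re

def normalize_query(query: str) -> str:
    """Lowercase, collapse runs of whitespace to one space, and trim the ends."""
    return re.sub(r"\s+", " ", query.lower()).strip()

def dedupe(queries):
    """Normalize each query and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for q in queries:
        norm = normalize_query(q)
        if norm not in seen:
            seen.add(norm)
            unique.append(norm)
    return unique
```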

I also created some filters to remove generally low quality queries. In total, 360 queries (about 3.6%) were filtered:

  • xxx: Variations on xxx, xnxx (a popular porn site), and the word porn accounted for 114 queries.
  • www: Websites and URLs (www, .com, .net, .org, bit.ly) accounted for 54 queries.
  • numbers: Queries consisting entirely of numbers (with an optional + at the beginning) accounted for 17 queries.
  • junk: Queries with the same character 4 times in a row (a fairly reliable sign of junk) accounted for 18 queries.
  • consonants: Queries with 5 consonants in a row (a fairly reliable signal of junk, except in German), or queries that are just 4 consonants accounted for 157 queries.
  • punctuation: Queries with at least 2 characters that were all punctuation and spaces accounted for 0 queries.
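
The filters above could be approximated with regular expressions along these lines. The exact patterns are my guesses from the descriptions, not the ones actually used. Two further assumptions: y is treated as a consonant, and a query is counted under the first filter it matches.

```python
import re
from collections import Counter

CONS = "[b-df-hj-np-tv-z]"  # assumption: a, e, i, o, u are the only vowels

# (name, pattern) pairs, applied in order to already-normalized queries.
FILTERS = [
    ("xxx", re.compile(r"xxx|xnxx|porn")),
    ("www", re.compile(r"www|\.com|\.net|\.org|bit\.ly")),
    ("numbers", re.compile(r"^\+?\d+$")),
    ("junk", re.compile(r"(.)\1{3}")),  # same character 4 times in a row
    ("consonants", re.compile(CONS + "{5}|^" + CONS + "{4}$")),
    ("punctuation", re.compile(r"^[\W_]{2,}$")),
]

def classify(query):
    """Return the name of the first filter that matches, or None to keep the query."""
    for name, pattern in FILTERS:
        if pattern.search(query):
            return name
    return None

def filter_queries(queries):
    """Split queries into the kept list and a per-filter count of removed queries."""
    kept, removed = [], Counter()
    for q in queries:
        name = classify(q)
        if name is None:
            kept.append(q)
        else:
            removed[name] += 1
    return kept, removed
```

Patterns like these have false positives (the 5-consonant rule would catch "strengths", for example), which is one reason to skim the filtered queries by hand.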

I lightly reviewed the filtered queries, and they were generally obviously junk.

9584 unique queries remained. From these I randomly sampled 5000 queries and ran them against the current production API before and after reindexing, about 11¾ hours apart. (I used a sample of 5K instead of 1K because I expect the effect of the homoglyph plugin to be much smaller in scale than the effect of the Khmer plugin.)

Note: This is the same 10K sample that was used as the control for the Khmer report, but with a different sub-sample, and the testing reported on here happened about a week later.

English Homoglyph Stats

I ran a baseline on my 5K sample before reindexing and a comparison 11¾ hours later (reindexing English Wikipedia takes a while).

  • 906 (18.1%) queries originally got zero results
    • 0 (0%) went from 0 results to some results
  • 1210 (24.2%) got a different number of hits
    • from 11 fewer to 176 more hits
    • 270 (5.4%) decreased from non-zero to fewer results
      • from -1.19% (756 to 747) to -0% (123,953 to 123,952)
    • 940 (18.8%) increased from non-zero to more results
      • from +0% (75,048 to 75,049) to 7.14% (14 to 15)
  • 340 (6.8%) changed their top result
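
For reference, stats of this shape can be computed from two runs roughly as sketched below. This is not the actual before-and-after tool. It assumes each run is a dict mapping a query to a hypothetical `(hit_count, top_result_title)` pair, and it counts "different number of hits" only for queries that started with non-zero results, which is how the figures above add up (270 + 940 = 1210).

```python
def pct(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)

def rel_change(before_hits, after_hits):
    """Relative change in hit count, as a percentage of the 'before' count."""
    return round(100.0 * (after_hits - before_hits) / before_hits, 2)

def compare_runs(before, after):
    """Summarize how results changed between two runs over the same queries."""
    total = len(before)
    zero_before = [q for q, (hits, _) in before.items() if hits == 0]
    zero_to_some = [q for q in zero_before if after[q][0] > 0]
    decreased = [q for q, (hits, _) in before.items() if hits > 0 and after[q][0] < hits]
    increased = [q for q, (hits, _) in before.items() if hits > 0 and after[q][0] > hits]
    top_changed = [q for q, (_, top) in before.items() if after[q][1] != top]
    deltas = [after[q][0] - before[q][0] for q in decreased + increased]
    return {
        "zero_before": (len(zero_before), pct(len(zero_before), total)),
        "zero_to_some": len(zero_to_some),
        "changed": (len(deltas), pct(len(deltas), total)),
        "decreased": (len(decreased), pct(len(decreased), total)),
        "increased": (len(increased), pct(len(increased), total)),
        "min_delta": min(deltas, default=0),
        "max_delta": max(deltas, default=0),
        "top_changed": (len(top_changed), pct(len(top_changed), total)),
    }
```

With the figures above, `rel_change(756, 747)` gives -1.19 and `rel_change(14, 15)` gives 7.14.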

English Homoglyph Subsample Stats

In order to test whether there are any significant differences from sample size—even though I would hope and expect that there aren't—I took a 1K subsample of the 5K sample and ran the same stats.

  • 185 (18.5%) queries originally got zero results
    • 0 (0%) went from 0 results to some results
  • 242 (24.2%) got a different number of hits
    • from 9 fewer to 151 more hits
    • 51 (5.1%) decreased from non-zero to fewer results
      • from -1.19% (756 to 747) to -0% (69,792 to 69,791)
    • 191 (19.1%) increased from non-zero to more results
      • from +0% (57,790 to 57,791) to 5% (20 to 21)
  • 61 (6.1%) changed their top result

Except for the changed top result and the max outliers, these are all within ±0.5 percentage points of the 5K figures, so subsampling seems as reasonable as one would expect.
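
The comparison can be checked with a few lines of arithmetic. The counts are copied from the two lists above; the metric labels and the 0.5-point tolerance are just my shorthand for this check.

```python
FULL_N, SUB_N = 5000, 1000

# metric: (count in 5K sample, count in 1K subsample)
COUNTS = {
    "zero results": (906, 185),
    "different number of hits": (1210, 242),
    "decreased": (270, 51),
    "increased": (940, 191),
    "changed top result": (340, 61),
}

def rate_differences(counts, full_n=FULL_N, sub_n=SUB_N):
    """Subsample rate minus full-sample rate, in percentage points."""
    return {
        name: 100.0 * sub / sub_n - 100.0 * full / full_n
        for name, (full, sub) in counts.items()
    }

diffs = rate_differences(COUNTS)
outside = [name for name, d in diffs.items() if abs(d) > 0.5]
print(outside)  # ['changed top result']
```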

English Control Stats

I ran a control comparison on my 5K sample after waiting another 11¾ hours. I used the post-homoglyph "after" as my new "before" so I only had to run one set of queries. We are not 100% comparing apples-to-apples here because there is probably a time-of-day component to how many people are editing and updating Wikipedia, but it is a reasonable next experiment.

  • 906 (18.1%) queries originally got zero results
    • 1 (+0%) went from 0 results to some results
      • It got 2 new results
  • 1511 (30.2%) got a different number of hits
    • from 17 fewer to 961 more hits
    • 363 (7.3%) decreased from non-zero to fewer results
      • from -50% (2 to 1) to -0% (345,113 to 345,111)
    • 1148 (23.0%) increased from non-zero to more results
      • from +0% (156,332 to 156,333) to 50% (2 to 3)
  • 224 (4.5%) changed their top result

Observations

We expected the effect of homoglyphs to be fairly small—I believe that Erik and I have each randomly encountered one instance of homoglyphs in the wild—and there really isn't any obvious detectable impact here; the control stats show bigger changes than the homoglyph stats.

However, I did find one query out of 5K that had homoglyphs in it: tοurbіllοn (a watch part and a high-end watch brand). It actually mixes Greek (both ο's) and Cyrillic (і) with the Latin.
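
One way to spot queries like this in a sample is to look for tokens that mix scripts. The sketch below is not how the homoglyph plugin works. It is a quick heuristic that takes the first word of each letter's Unicode character name (LATIN, GREEK, CYRILLIC, and so on) as a stand-in for its script.

```python
import unicodedata

def scripts_in(token):
    """Approximate the set of scripts in a token from Unicode character names."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "GREEK", "CYRILLIC"
    return scripts

def mixed_script_tokens(query):
    """Return the whitespace-separated tokens that contain more than one script."""
    return [t for t in query.split() if len(scripts_in(t)) > 1]

# The query above, written with explicit escapes for the non-Latin letters:
# U+03BF is Greek omicron, U+0456 is Cyrillic Byelorussian-Ukrainian i.
query = "t\u03bfurb\u0456ll\u03bfn"
print(sorted(scripts_in(query)))  # ['CYRILLIC', 'GREEK', 'LATIN']
```

Legitimately mixed tokens will also be flagged, so the output still needs a manual look.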

I also found one query where the homoglyphs would have a potentially relevant impact on the results. The query Crayes got 14 results before and 15 after, because of the addition of one result with a Cyrillic "С".

I also found one page with only one mention of China or Chinese on it, but it also had a Cyrillic "С". Given how many results there are for a search on China, though, it's unlikely to be a relevant change.

Despite the lack of directly measurable impact, there are still obvious cases where the homoglyph plugin is crucial. I've been able to find the following passages on English Wikipedia, French Wiktionary, and Wikidata. Letters highlighted in yellow (and one each in orange) are Cyrillic. This text is readable, but unfindable without the homoglyph plugin.

  • [Image] From English Wikipedia references
  • [Image] From English Wikipedia text
  • [Image] From a French Wiktionary entry
  • [Image] From a Wikidata title

Usefulness of the Before-and-After Tool

Based on the results here and in the Khmer report, I have a few observations about the before-and-after tool:

  • It can obviously detect big changes, like with Khmer reordering, but has difficulty detecting very small changes, as with homoglyphs, in part because of noise.
  • Longer time periods create more noise in the comparison (Khmer reindexing was very quick, English reindexing was moderately slow).
  • Bigger wikis, like English Wikipedia, probably have more noise because there are more WikiGnomes furiously working on various things (film in 2021 seems to consistently get hundreds of new results per day; high-profile political and news topics can also generate dozens or hundreds of new results in a day).
    • Keep in mind that new results are not necessarily new articles. When a new film comes out, for example, existing articles for dozens of actors, writers, directors, etc. could be updated to now return results for a given query.
  • It's difficult to run a fair control test. It either has to be a different wiki (and possibly a different language—as with Khmer and English) to test over the exact same time period, or a different time period to test the same wiki (as with English homoglyphs).
  • It's not entirely clear what metrics here are the most useful—I partly chose the metrics here because they are easy to gather.
  • There's still a manual component to investigate the likeliest-looking queries to see if they really are affected; it can be hard to tell, though, when there are many, many results.

The next reindexing is likely to be for Spanish; I'm working on unpacking its analyzer now. I expect a somewhat larger impact for unpacking (and adding ICU Folding) than for homoglyphs. I will also experiment with selecting a better wiki to measure. For example, I think Wiktionary may be more affected by unpacking and may reindex more quickly—thereby showing a bigger impact with less noise.

We'll keep refining the process for at least a few more iterations and see how it goes.