User:TJones (WMF)/Notes/Stopwords 2023

May 2023 — See TJones_(WMF)/Notes for other projects. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

[There was a question on the CirrusSearch talk page about updating stopwords, and it's sufficiently complicated that I wrote a small novel in response. I thought I'd put a slightly adapted version here for reference. Things are always changing, so I've labeled this "2023" so when we look back in 3 years we'll know why it's out of date! —T]

Stopword lists

For many languages, we use the stopword filters built into Elasticsearch, which are based on data from Lucene. Elastic has a list with links to the Lucene code base. Note that there are both Portuguese and "Brazilian" lists; we use Portuguese. We don't use CJK except for Japanese (and that may change eventually). Oddly, the CJK stopword list is all English; we use the actual English list, which is only slightly different from the CJK list.
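To make "built into Elasticsearch" concrete, here's a minimal sketch of wiring one of the predefined, Lucene-derived stopword sets into an analyzer, using the official Python client (8.x style). The index, filter, and analyzer names are invented for illustration; "_portuguese_" is one of the predefined set names (there is also a separate "_brazilian_").

  # A minimal sketch, assuming a local Elasticsearch instance and the
  # official Python client. All names here are invented for illustration.
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  es.indices.create(
      index="stopword-demo",
      settings={
          "analysis": {
              "filter": {
                  "pt_stop": {
                      "type": "stop",
                      "stopwords": "_portuguese_",  # predefined, Lucene-derived set
                  }
              },
              "analyzer": {
                  "pt_text": {
                      "tokenizer": "standard",
                      "filter": ["lowercase", "pt_stop"],
                  }
              },
          }
      },
  )

  # Common words like "de" and "que" should drop out of the token stream.
  print(es.indices.analyze(index="stopword-demo", analyzer="pt_text",
                           text="a casa de que falamos"))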

For some historical reason, certain language analyzers in Lucene are kept separate from the rest. I think it's because they were originally developed outside Lucene. They include:

  • Kuromoji (Japanese, which we don't use, yet)—stopwords
  • Ukrainian Morfologik—stopwords... however, for technical reasons, we maintain our own copy—currently they are the same
    • Technical reasons: We wanted to unpack Ukrainian, but it is only available as a monolithic analyzer, so we ended up recreating the components separately in a plugin.
  • Nori (Korean)—which doesn't use stopwords per se, but rather filters part-of-speech tags put on words by the parser. We have a custom list. (See the sketch after this list.)
  • SmartCN (Chinese)—it has a stopword list, but it is only punctuation (for technical reasons)
    • Technical reasons: I'm not sure that list existed when we implemented SmartCN. Internally, SmartCN converts all punctuation to commas and doesn't filter them. That's a disaster for indexing, since all punctuation in the entire corpus is indexed. We filter it out.
  • Stempel (Polish)—stopwords
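Since Nori filters by part-of-speech tag rather than by word, its "stopword" configuration has a different shape than a plain stop filter. Here's a rough sketch, assuming the analysis-nori plugin is installed; the two stoptags shown are standard Nori tags used for illustration, not our actual custom list.

  # Sketch only: Korean "stopwords" are part-of-speech tags, not words.
  # The stoptags below are illustrative (J = particles, E = verbal
  # endings), not the actual CirrusSearch list.
  ko_analysis = {
      "filter": {
          "ko_pos_stop": {
              "type": "nori_part_of_speech",
              "stoptags": ["J", "E"],
          }
      },
      "analyzer": {
          "ko_text": {
              "tokenizer": "nori_tokenizer",
              "filter": ["ko_pos_stop", "lowercase"],
          }
      },
  }

  # This dict would go under "settings.analysis" when creating an index,
  # as in the Portuguese example above.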

Cirrus Custom Stopwords

We have some custom stopword lists in CirrusSearch:

  • For Moroccan Arabic (ary) and Egyptian Arabic (arz) but not Standard Arabic (ar), we add a fair number of additional stop words.
  • For Romanian, we add additional variants for some words because the Lucene list is so old that it uses the incorrect letters (ş & ţ); the correct letters (ș & ț) were not available on computers back then (to be fair, they weren't reliably available until almost 2010).
  • The Mirandese stopword list was provided by a community member, inspired by the Portuguese stopword list.
  • The Polish list is the same as the Stempel list above, except we add "o.o" to go with "o.o."—by the time we get to stopwords, no tokens have final periods, so "o.o." on its own doesn't filter anything. (The Romanian and Polish customizations are sketched after this list.)
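Here's what those customizations look like in the shape of Elasticsearch stop filters. These are illustrative fragments, not the actual CirrusSearch lists, and they would sit in the "analysis.filter" section of the index settings.

  # Illustrative fragments only, not the real CirrusSearch lists.
  ro_stop_with_variants = {
      "type": "stop",
      "stopwords": [
          "și", "şi",  # correct comma-below form plus the legacy cedilla form
          # ...and so on for the other affected words
      ],
  }

  pl_stop_extra = {
      "type": "stop",
      # Tokens lose their final periods before stopword filtering runs,
      # so "o.o." alone never matches anything; "o.o" does the real work.
      "stopwords": ["o.o.", "o.o"],
  }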

We have smaller lists of additional stopwords that are embedded in the code. (Links are to a specific version of the code; otherwise line numbers are unstable. Look at the current code for up-to-date info.)

  • For Armenian, we add two spelling variants. (More details.)
  • For Chinese/SmartCN we have our own punctuation list, which is just a comma (again for technical reasons; see the sketch after this list)
    • Technical reasons: As above, SmartCN converts all punctuation to commas and doesn't filter them, which would index every punctuation mark in the corpus, so we filter the commas out ourselves.
  • We have additional stop word filters for Irish and Polish, but they aren't for proper stopwords, they are just tools for filtering bits and bobs that come up during analysis. (The SmartCN filter is like that, too, I guess.)
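The SmartCN workaround is tiny once you see it: a stop filter whose entire list is a single comma. A sketch, assuming the analysis-smartcn plugin is installed; the filter and analyzer names are invented.

  # SmartCN rewrites every punctuation character as a comma and doesn't
  # filter them, so one stopword entry catches all punctuation.
  zh_analysis = {
      "filter": {
          "smartcn_comma_stop": {
              "type": "stop",
              "stopwords": [","],
          }
      },
      "analyzer": {
          "zh_text": {
              "tokenizer": "smartcn_tokenizer",
              "filter": ["smartcn_comma_stop"],
          }
      },
  }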

Updating Stopwords

The process for updating stopwords depends on the language, where the stopword list comes from, whose list you want to update, and how long you want to wait to see results.

For quicker results for on-wiki search, we can make changes to CirrusSearch. (Tell me about it, open a Phab ticket, or submit a patch!)

If you want to help a wider audience, you could open a ticket or a pull request upstream. Elastic is our immediate source of stopwords for most of these, but its lists are just wrappers around Lucene's, so you can skip Elastic and open the ticket or pull request with Lucene.

For most of the core Lucene stopword lists, there's another source mentioned in the code. The most common sources are Jacques Savoy and Snowball, though there are others. You can try to contact Lucene's upstream source and get them to update their list of stopwords, too, which might reach a wider audience and might eventually trickle down to Lucene (Lucene did update their Snowball-based stemmers and stopword lists 3 years ago—I think it's ad hoc, but they do update from time to time).

Why is it so complicated!

At least, I ask myself this now and then. Lucene tries to be the central repository for lots of open source language analysis because they want to make it available to their users, but they don't have everything. We make modifications and customizations in CirrusSearch in response to things we find in our data, or that community members bring to our attention. We try to push things upstream, but it can take a long time, and it's work when there are other things to do.