User talk:TJones (WMF)/Notes/Esperanto Stemmer Analysis

Next Steps: Implementation and Deployment

TJones (WMF) (talkcontribs)

@Brion Vibber (WMF), @Dominik, and any others who are interested: My plan is to go ahead with the implementation of the Esperanto stemmer as an Elasticsearch plugin, and then deployment on Esperanto-language wikis. The results look reasonable to me, several people have given generally positive feedback, and the questions and concerns people have had don’t indicate that the word groupings are so bad that they wouldn’t be useful, overall.

Based on the feedback people have kindly shared, the stemmer has improved its accuracy and coverage, and the exception list is very much improved. There are still some ambiguous and incorrect stems, but that just makes Esperanto seem more like a natural language! I’m not too concerned about how non-Esperanto words are treated, because that’s a problem all rules-based stemmers have, especially on projects like Wikipedia and Wiktionary, which contain plenty of foreign words.

The next step will be to wrap the stemmer into an Elasticsearch plugin, which will take a small amount of work. I’ll start on it after I finish my current project. If you have any objections to me continuing with this implementation plan, please let me know. Thanks to everyone who has commented and asked questions!

Reply to "Next Steps: Implementation and Deployment"
Brion Vibber (WMF) (talkcontribs)

Looks mostly good! The only stemming example that looks wrong is shortening to "demokr-" where it should stem as "demokrat-". This'll just have to be put in the dictionary I guess, since "-at-" is also the present passive participle suffix. A few others look like foreign names (French, German, Latin) that stem slightly oddly but acceptably.

Will take a quick look over the Java code shortly.

Brion Vibber (WMF) (talkcontribs)

Stemming exceptions include some things like alternate spellings, like both "chio" and "ĉio". I'm not sure I understand what happens to these; are they removed just from stemming processing and left as-is? Would "ĉion" get stemmed but not "ĉio"?

Brion Vibber (WMF) (talkcontribs)

I'd probably recommend three small changes:

  • either ignore the alternate and incorrect spellings (like 'ghi' and 'gi' for 'ĝi') or normalize them before stemming
  • split the stemmingExceptions list into a list of short particles that should not get stemmed at all and a list of stems that should not be broken down further (eg 'demokratojn' should break down as 'demokrat-o-j-n' not 'demokr-at-o-j-n')
  • some of the stemExceptions have a missing-diacritic spelling (like 'kvazau') but not the version with correct diacritics ('kvazaŭ'); these need to be fixed.

I'll be happy to provide pull reqs for diacritic corrections and see if I can find or pull a list of other word stems that break down weird. :)
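
A minimal Python sketch of the second suggestion above, splitting the exceptions into particles that are never stemmed and stems that are never broken down further. This is not the plugin's Java code: the names `UNSTEMMED_PARTICLES`, `UNBREAKABLE_STEMS`, `STRIP_STEPS`, and `stem` are hypothetical, the lists are tiny illustrative samples, and the suffix order is a simplified assumption about how the stemmer works.

```python
# Hypothetical sample lists; the real exception lists are much longer.
UNSTEMMED_PARTICLES = {"kvazaŭ"}          # returned exactly as written
UNBREAKABLE_STEMS = {"demokrat", "ĉio"}   # stripping stops once one is reached

# Assumed, simplified order: accusative -n, plural -j,
# part-of-speech vowel, then the participle suffix -at-.
STRIP_STEPS = [("n",), ("j",), ("o", "a", "e"), ("at",)]


def stem(word: str) -> str:
    """Strip endings step by step, stopping at any protected stem."""
    current = word.lower()
    if current in UNSTEMMED_PARTICLES:
        return current
    for suffixes in STRIP_STEPS:
        if current in UNBREAKABLE_STEMS:
            break  # do not break this stem down any further
        for suffix in suffixes:
            # Keep at least two characters so short words are not emptied.
            if current.endswith(suffix) and len(current) > len(suffix) + 1:
                current = current[: -len(suffix)]
                break
    return current


print(stem("demokratojn"))  # stops at the protected stem "demokrat"
print(stem("ĉion"))         # stops at the protected stem "ĉio"
```

Checking the protected list before every step is what lets "ĉion" lose its -n but keep its -o, and lets "demokratojn" lose -o-j-n but keep -at-.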

TJones (WMF) (talkcontribs)

Thanks, Brion!

Even a nice carefully constructed language has irregularities—especially to a computer! Language is always messy.

I think the list of exceptions came from Wikipedia or Wikibooks, and I don't think the fact that they could be inflected was taken into account (typical English-speaker thinking on my part, at least; we have completely different pronoun forms, for no particularly good reason). The current stemming exceptions are just left unaltered. The goal was to keep ĉio from losing its -o; but ĉion should definitely be treated similarly. If it's easy to say which items in the list can be inflected and which can't, that would be great; otherwise I can try to work it out.

I've been thinking about the stemming options for demokrat-. Keeping in mind that the goal is not necessarily to get a correct stem, but rather a unique stem, maybe it doesn't matter. (Though in this case it picked up Demokrito, too—but stemming names is always a gamble.) It seems like it could get very complex to deal with in the general case. Productive prefixes give related forms like maldemokratia and pseŭdodemokratia—listing them all (either all the related forms or all the acceptable prefixes) would be annoying and prone to problems. On the other hand, while blocking ĉio from having the -o stripped off makes sense, it looks like other words end in -ĉio that are not related, so allowing any arbitrary prefix is ugly, too. Any thoughts on dealing with that? Maybe one way will seem obviously best if you can come up with any other potential problem cases.

Do you have any insight into how often the h-system and x-system forms are used in written text and in searching? If lots of people can't type ĝ and so search for gh or gx, it's probably not something we should ignore. A potential problem is the treatment of foreign words. That said, it doesn't matter if "ghost", "though", and "laugh" are internally represented as "ĝost", "thouĝ", and "lauĝ", as long as they aren't ambiguous and thus collide with other words. I can try that out and see what impact it has on the words in my sample.
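
To make the conversion concrete, here is a rough Python sketch of h-system and x-system normalization. The function names `from_h_system` and `from_x_system` are made up for illustration, only lowercase input is handled, and this is not what the stemmer actually does. In the h-system, ŭ is written as plain u, so there is nothing to convert for it.

```python
import re

# Base letter -> letter with diacritic.
DIACRITIC = {"c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ", "u": "ŭ"}


def from_x_system(word: str) -> str:
    """Convert x-system digraphs (cx, gx, hx, jx, sx, ux) in one pass."""
    return re.sub(r"([cghjsu])x", lambda m: DIACRITIC[m.group(1)], word)


def from_h_system(word: str) -> str:
    """Convert h-system digraphs (ch, gh, hh, jh, sh) in one pass."""
    return re.sub(r"([cghjs])h", lambda m: DIACRITIC[m.group(1)], word)


print(from_x_system("gxi"))    # ĝi
print(from_h_system("chio"))   # ĉio
print(from_h_system("ghost"))  # ĝost (a foreign word gets converted too)
```

The x-system is unlikely to touch foreign words because letter-plus-x digraphs are rare, while h-system digraphs such as "gh", "ch", and "sh" are common in English and other languages.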

Help with the missing diacritical forms would be great, whether a pull request or a list here or elsewhere.

The to do list:

  • make sure all the exceptions have proper diacritics
  • find the exceptions that can be inflected, like ĉio and handle them properly (add to a general list of unbreakable stems, or explicitly map forms to stems)
  • remove h-system and x-system words from the exception list
  • test the impact of automatic h-system and x-system conversion on stemming collisions; if it's small enough, just do it
  • decide how to handle ambiguous stems like demokrat- (accept defective stems with some errors, do something clever to handle prefixes, or something else TBD)

Thanks for all the help!
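
One possible way to test the collision impact from the to-do list, sketched in Python. The helper `find_collisions` and the four-word sample are hypothetical; a real test would run over a large sample of wiki text or queries, and someone would still need to review each merged group to decide whether it is intended (a variant spelling meeting its proper form) or unwanted (two unrelated words).

```python
import re
from collections import defaultdict

DIACRITIC = {"c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ"}


def from_h_system(word: str) -> str:
    """Illustrative h-system conversion (lowercase input only)."""
    return re.sub(r"([cghjs])h", lambda m: DIACRITIC[m.group(1)], word)


def find_collisions(words, convert):
    """Return converted forms that two or more distinct words map to."""
    groups = defaultdict(set)
    for word in words:
        groups[convert(word)].add(word)
    return {form: sorted(ws) for form, ws in groups.items() if len(ws) > 1}


sample = ["ĝi", "ghi", "ghost", "laugh"]  # made-up sample
print(find_collisions(sample, from_h_system))
```

Passing the conversion in as a function means the same check can be rerun with an x-system converter, or with the conversion followed by stemming.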

TJones (WMF) (talkcontribs)

It took a while to get back to this, both for related and unrelated reasons... whew! Updates:

  • The exception list has been updated to have proper diacritics and no x-system or h-system words, some unneeded exceptions were removed, and a few new ones were added. On GitHub.
  • The stemmer also works on all the exceptions with regular and irregular inflections.
  • I looked into automatic h-system and x-system conversions, for both queries and on-wiki text. Details are here, but the summary is that too many non-Esperanto words get caught up by h-system conversion, and x-system conversion has very little impact. If someone thinks x-system conversion is worth it anyway, it's straightforward to implement.
  • Nothing extra has been done yet with the ambiguous stems, but it also isn't clear how big a problem it is.
Reply to "First look / unua lego"
Dominik (talkcontribs)

I would not rely too much on "Random Groups". For example, "akademian: " can be reduced to "akademi: ", "bazarad: " can be reduced to "bazar: " and so on.

TJones (WMF) (talkcontribs)

Thanks, Dominik. Based on the meaning of -an- and -ad- from English Wiktionary, I think these are correct. Stemmers usually only handle inflectional morphology, which doesn't change the meaning of the word; plurals (-j) and case marking (-n) are good examples. Derivational morphology changes the meaning of the word; examples in English include un- and -ness, so happy, happiness, and unhappy are related, but not the same meaning. The boundaries in Esperanto are a lot less clear because the derivations are so regular. This stemmer is only trying to remove inflectional endings.
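
As a toy Python illustration of that distinction (not the actual stemmer): only the accusative -n, the plural -j, and the final part-of-speech vowel are stripped, so derivational suffixes like -an- stay in the stem. The names `INFLECTIONAL_STEPS`, `strip_inflection`, and `group_by_stem` are hypothetical, and verb endings are left out to keep it short.

```python
from collections import defaultdict

# Assumed order: accusative -n, plural -j, then the part-of-speech vowel.
INFLECTIONAL_STEPS = [("n",), ("j",), ("o", "a", "e")]


def strip_inflection(word: str) -> str:
    """Remove inflectional endings only; derivational suffixes survive."""
    current = word.lower()
    for suffixes in INFLECTIONAL_STEPS:
        for suffix in suffixes:
            if current.endswith(suffix) and len(current) > len(suffix) + 1:
                current = current[: -len(suffix)]
                break
    return current


def group_by_stem(words):
    """Group words that share a stem, keeping input order in each group."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_inflection(word)].append(word)
    return dict(groups)


print(group_by_stem(["akademio", "akademion", "akademiano", "akademianoj"]))
```

Here akademio and akademion end up under "akademi", while akademiano and akademianoj end up under "akademian", which matches the separate groups in the random sample.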

Reply to "Random groups"