Topic on Talk:Core Platform Team/Initiatives/Core REST API in MediaWiki

Search Enhancement: Transliteration, Wrong Keyboard, and DWIM

TJones (WMF) (talk | contribs)

(Last one for now... probably!)

In Epic 1.5, item (5) talks about transliteration and DWIM, but those are two very different things.

DWIM, which is currently installed on Russian and Hebrew Wikipedias (and maybe other projects), catches wrong-keyboard mistakes. If I switch to the Russian keyboard and type "dwim" I get "вцшь"; on the Hebrew keyboard it's "ג'ןצ". These are not transliterations, and the output is usually gibberish (but recoverable).
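To make the wrong-keyboard case concrete, here is a minimal Python sketch. It assumes the standard US QWERTY and Russian ЙЦУКЕН layouts and covers only the unshifted letter keys; the function names are made up for illustration, and this is not the code the DWIM gadget actually runs.

```python
# Unshifted keys of a US QWERTY layout and the Russian ЙЦУКЕН characters
# found on the same physical keys (assumption: standard layouts, letter rows only).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"

# The mapping is one-to-one, so it can be inverted exactly.
TO_RUSSIAN_KEYS = str.maketrans(QWERTY, JCUKEN)
TO_LATIN_KEYS = str.maketrans(JCUKEN, QWERTY)


def typed_on_russian_keyboard(text: str) -> str:
    """What comes out when Latin text is typed with the Russian layout active."""
    return text.translate(TO_RUSSIAN_KEYS)


def recover_latin(text: str) -> str:
    """Undo the wrong-keyboard mistake by reversing the key mapping."""
    return text.translate(TO_LATIN_KEYS)


print(typed_on_russian_keyboard("dwim"))  # вцшь
print(recover_latin("вцшь"))              # dwim
```

Because each key maps to exactly one character in each layout, the round trip is lossless. That is why the gibberish is recoverable.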

Transliteration is much more difficult. The wrong-keyboard mapping is one-to-one and exact (as long as you commit to a particular pair of keyboards), but transliteration depends not just on the scripts, but also on the languages you are transliterating to and from.

For example, Щедрин is transliterated as Shchedrin in English, Sxedrín in Catalan, Ščedrin in Czech, Sjtjedrin in Danish, Schtschedrin in German, Chtchedrine in French, etc. This can be true for any name with Щ in it. Чайковский, on the other hand, is Tchaikovsky in English instead of the expected Chaikovsky, because we adopted the French spelling for.. uh.. "historical reasons".
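By contrast, a transliteration lookup needs the target language as part of the key, plus word-level exceptions. The sketch below stores only the spellings quoted above, keyed by ISO 639-1 language code; the table and the function name `romanize` are hypothetical and are not how any existing search feature does it.

```python
# Romanizations quoted in this post, keyed by Cyrillic name and then by
# target language code. Illustration only, not a complete rule set.
ROMANIZATIONS = {
    "Щедрин": {
        "en": "Shchedrin",
        "ca": "Sxedrín",
        "cs": "Ščedrin",
        "da": "Sjtjedrin",
        "de": "Schtschedrin",
        "fr": "Chtchedrine",
    },
    # Word-level exception: the conventional English spelling is not the
    # "expected" Chaikovsky.
    "Чайковский": {"en": "Tchaikovsky"},
}


def romanize(name: str, target_lang: str) -> str:
    """Look up a known romanization; raise KeyError if the pair is not in the table."""
    return ROMANIZATIONS[name][target_lang]


print(romanize("Щедрин", "de"))      # Schtschedrin
print(romanize("Чайковский", "en"))  # Tchaikovsky

# One Cyrillic spelling, six different Latin spellings: there is no single
# inverse mapping to apply to a Latin-script query.
print(len(set(ROMANIZATIONS["Щедрин"].values())))  # 6
```

Going the other way, a search for any of those six Latin spellings would have to land on the same Cyrillic name, which means knowing (or guessing) the language of the query.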

Crimean Tatar transliteration is word-specific (and depends in part on what language the word came into the language from) and full of exception cases. This code is based on the same source as the Crimean Tatar transliteration used on crh.wikipedia.org.

I'm less familiar with the Indic languages (@Santhosh.thottingal knows a ton about them, though), but I believe the transliteration between them is usually/often/sometimes? straightforward. I worry, though, that the transliteration into English or other languages using the Latin alphabet may be variable, as with Cyrillic.

Anyway, it would be good to decide which use case you are supporting (maybe both!)—just don't conflate the two!
