Parsoid/LanguageConverter

From MediaWiki.org

LanguageConverter is MediaWiki functionality to automatically convert the content of a page into a different language variant. A variant is mostly the same language in a different script, although it might involve spelling differences or other changes that can be automated. This allows a single wiki to be shared by different language communities.

Functionality Overview

MediaWiki articles using LanguageConverter are stored in a mix of variants, with the principle being that any individual editor should write in whatever variant they are most comfortable with, while avoiding unnecessary "rewrites" of existing content into their preferred variant. This is similar to the "Retaining the existing variety" principle in the Wikipedia Manual of Style: it avoids "dirty diffs" and rewrite wars and ensures that the changes shown in history correspond to actual changes in content, not variant conversion. However, some wikis have conventions that encourage editors to use a certain variant, for example where most editors are fluent in both writing systems or where variant conversion in a certain direction is more reliable.

See Writing Systems/Syntax for a brief overview of wikitext syntax for LanguageConverter, and language conversion blocks for the Parsoid HTML corresponding to these syntactic constructs.

LanguageConverter takes as input the mixed-variant article text and converts it to a single consistent variant, including the displayed title of wiki links. Before marking redlinks, a missing article title is also converted to each potential variant, and the destination URL is adjusted if one of these variants corresponds to a valid article present on the wiki. (When attempting to load a missing article, a similar process is followed to redirect to a present article if possible.)
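
The redlink lookup described above can be sketched roughly as follows. This is an illustration only, not the MediaWiki implementation: `resolve_title`, `toy_convert`, the two seven-letter character tables, and the in-memory `existing_pages` set are all hypothetical names introduced for the example. The variant codes `sr-ec` and `sr-el` are used purely as labels.

```python
from typing import Callable, Iterable, Optional

# Hypothetical, heavily truncated character tables (illustration only).
LATIN_TO_CYRILLIC = str.maketrans("Beograd", "Београд")
CYRILLIC_TO_LATIN = str.maketrans("Београд", "Beograd")


def toy_convert(text: str, variant: str) -> str:
    """Stand-in for a real variant converter."""
    table = LATIN_TO_CYRILLIC if variant == "sr-ec" else CYRILLIC_TO_LATIN
    return text.translate(table)


def resolve_title(
    title: str,
    variants: Iterable[str],
    convert: Callable[[str, str], str],
    exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the title itself or the first variant form that exists, else None."""
    if exists(title):
        return title
    for variant in variants:
        candidate = convert(title, variant)
        if candidate != title and exists(candidate):
            return candidate  # adjust the link target to this page
    return None  # no variant exists: render as a redlink


existing_pages = {"Београд"}
target = resolve_title(
    "Beograd", ["sr-ec", "sr-el"], toy_convert, existing_pages.__contains__
)
missing = resolve_title(
    "Nowhere", ["sr-ec", "sr-el"], toy_convert, existing_pages.__contains__
)
print(target, missing)
```

Here the Latin-script link resolves to the Cyrillic-script page, while the second title has no existing variant and would remain a redlink.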

In some cases, the fact that the original article text is an unmarked mix of variants can cause issues. Suppose LanguageConverter were used for American and British English, the article text contained the word "lift", and the target variant were American English. We could not know whether to convert this word to "elevator" without knowing whether the original text was written in British or American English. A similar issue appears in languages using Cyrillic script, where Roman numerals have to be recognized and explicitly shielded from conversion, since they are supposed to remain in Latin script even when the target variant is Cyrillic. Some language implementations contain ad hoc heuristics to "guess" the original language of a section of text in order to inform conversion. We discuss this issue further below.
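
The shielding of Roman numerals can be sketched as follows. This is a minimal illustration and not the code of any actual language implementation: the truncated table `LATIN_TO_CYRILLIC`, the regular expression `ROMAN_NUMERAL`, and the function `convert_shielded` are assumptions made for the example.

```python
import re

# Hypothetical, heavily truncated Latin-to-Cyrillic table (illustration only).
LATIN_TO_CYRILLIC = str.maketrans("IVvek", "ИВвек")

# Deliberately crude: any run of Roman-numeral letters standing alone as a word.
# It would also shield ordinary words such as "MIX", which is exactly the kind
# of weakness an ad hoc heuristic like this tends to have.
ROMAN_NUMERAL = re.compile(r"\b[IVXLCDM]+\b")


def convert_shielded(text: str) -> str:
    """Convert to Cyrillic, copying Roman numerals through unchanged."""
    out = []
    pos = 0
    for match in ROMAN_NUMERAL.finditer(text):
        out.append(text[pos:match.start()].translate(LATIN_TO_CYRILLIC))
        out.append(match.group(0))  # shielded segment, left in Latin script
        pos = match.end()
    out.append(text[pos:].translate(LATIN_TO_CYRILLIC))
    return "".join(out)


source = "XIV vek"
naive = source.translate(LATIN_TO_CYRILLIC)  # numeral is partly converted
shielded = convert_shielded(source)          # numeral stays "XIV"
print(naive)
print(shielded)
```

The naive conversion mangles the numeral because "I" and "V" are also ordinary letters, whereas the shielded version converts only the text around it.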

Parsoid Implementation

The original implementation of LanguageConverter in PHP allows the use of arbitrary code to perform the conversion; however, most languages implemented to date rely heavily on the PHP strtr function (with array arguments) to replace all occurrences of a table of from => to pairs. This approach isn't quite powerful enough, especially with regard to recognizing word boundaries, so it is augmented by a set of much slower regular expression matches. Occasionally there is also some ad hoc segmentation done, for example to remove Roman numerals as discussed above.
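
To make the table-replacement step concrete, here is a rough Python approximation of what PHP strtr does with an array argument: at each position the longest matching key wins, and replaced text is not scanned again. The function below is a sketch of that behavior, not a port of the PHP source, and the tables `TABLE` and `EN_TABLE` are hypothetical.

```python
import re
from typing import Mapping


def strtr(text: str, pairs: Mapping[str, str]) -> str:
    """Approximate PHP strtr($text, $pairs): longest key first, no rescanning."""
    keys = sorted((k for k in pairs if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(pairs[key])
                i += len(key)  # skip past the match; output is never re-read
                break
        else:
            out.append(text[i])  # no key matches here: copy one character
            i += 1
    return "".join(out)


# Hypothetical, truncated Latin-to-Cyrillic table containing a digraph.
TABLE = {"k": "к", "o": "о", "n": "н", "j": "ј", "nj": "њ"}
print(strtr("konj", TABLE))  # the digraph "nj" wins over "n" followed by "j"

# The word-boundary weakness, using the English example from earlier:
EN_TABLE = {"lift": "elevator"}
print(strtr("uplifting", EN_TABLE))                      # upelevatoring
print(re.sub(r"\blift\b", "elevator", "uplifting lift"))  # uplifting elevator
```

The last two lines show why plain table replacement has to be supplemented: it has no notion of where a word begins or ends, while the regular expression with `\b` does.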

The Parsoid implementation uses a finite-state transducer (FST) framework in an attempt to unify the segmentation, strtr, and regular expression phases and improve the structure of the code. The goal (not yet completely attained) is to allow specification of new variants in an entirely declarative fashion using FST tools familiar to linguists, without forcing language experts to write custom PHP code. In addition, the use of the FST tools allows additional functionality and error-checking. The "variant A to variant B" and "variant B to variant A" FSTs can be composed to allow the efficient recognition of strings which can be round-tripped losslessly between variants; this can be used to allow native-script editing or to suggest where explicit markup should be added to disambiguate. The FST can also be tested for completeness and ambiguity, allowing the identification of conflicting rules and other problems.
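
The round-trip idea can be illustrated without any FST machinery. The sketch below is a stand-in, not the Parsoid approach: it composes two ordinary longest-match replacement functions and tests one string at a time, whereas composing real FSTs yields a single machine that recognizes the whole set of round-trippable strings. The tables and the names `make_converter` and `round_trips` are hypothetical.

```python
import re
from typing import Callable, Mapping


def make_converter(table: Mapping[str, str]) -> Callable[[str], str]:
    """Build a longest-match-first replacer from a from => to table."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


# Hypothetical, truncated tables (illustration only).
to_latin = make_converter(
    {"и": "i", "н": "n", "ј": "j", "њ": "nj", "к": "k",
     "о": "o", "е": "e", "ц": "c", "а": "a"}
)
to_cyrillic = make_converter(
    {"nj": "њ", "i": "и", "n": "н", "j": "ј", "k": "к",
     "o": "о", "e": "е", "c": "ц", "a": "а"}
)


def round_trips(cyrillic_text: str) -> bool:
    """True if Cyrillic -> Latin -> Cyrillic reproduces the input exactly."""
    return to_cyrillic(to_latin(cyrillic_text)) == cyrillic_text


print(round_trips("коњ"))        # True: the digraph maps back to one letter
print(round_trips("инјекција"))  # False: "н" + "ј" collapses into "њ"
```

A string that fails this check is a candidate for explicit markup, since the Latin form "nj" no longer records whether it came from one Cyrillic letter or two.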

Current Status

The Parsoid LanguageConverter implementation project was split into four phases as described in phab:T43716 to separate languages by the number and type of LanguageConverter features they used.

Working

  • Crimean Tatar (crh): one known issue with \b
  • Kurdish (ku): the underlying PHP converter appears to have issues, including duplicate and unused entries
  • English Pig Latin (en-x-piglatin): not enabled by default on WMF wikis, but useful for testing purposes. Uses consonant cluster subset to manage space.

Will implement after Parsoid/PHP port

  • Chinese (zh): requires pulling additional language converter rules from article space; also issues with size of character space.
    • Technical details: LanguageConverter grabs additional rules from MediaWiki:Conversiontable/<lang> and recursively from articles linked from there in the PHP function LanguageConverter::parseCachedTable(). See these rules on zhwiki.
    • The construction of the FST for zhwiki can take hours because the algorithms used appear to be sensitive to the out-degree of the nodes of the FST. This is mitigated by running the FST directly on the UTF-8 bytes instead of the Unicode code points, which reduces the out-degree to ~64, but the conversion is manual and awkward. We hope to improve the tooling around this process; we may also investigate algorithms that won't be as sensitive to out-degree. The runtime performance is not an issue as it is independent of out-degree; this issue only relates to how expensive it is to compile the FST from the language conversion rules.
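
The effect of working on UTF-8 bytes can be illustrated with a plain trie, used here only as a stand-in for an FST since the actual construction algorithms are not described on this page. The helper `max_out_degree` and the choice of 1000 consecutive CJK code points are assumptions for the example. UTF-8 continuation bytes can take only 64 distinct values (0x80 to 0xBF), which bounds the fan-out of most byte-level nodes.

```python
from typing import Hashable, Iterable


def max_out_degree(keys: Iterable[Iterable[Hashable]]) -> int:
    """Build a trie over the given symbol sequences and return its widest node."""
    root: dict = {}
    for key in keys:
        node = root
        for symbol in key:
            node = node.setdefault(symbol, {})
    # Walk the trie iteratively, tracking the largest number of children.
    widest = 0
    stack = [root]
    while stack:
        node = stack.pop()
        widest = max(widest, len(node))
        stack.extend(node.values())
    return widest


# 1000 consecutive code points starting at U+4E00, each as a one-character key.
chars = [chr(cp) for cp in range(0x4E00, 0x4E00 + 1000)]

by_code_point = max_out_degree(chars)  # a str iterates over code points
by_utf8_byte = max_out_degree(c.encode("utf-8") for c in chars)  # bytes -> ints

print(by_code_point)  # 1000
print(by_utf8_byte)   # 64
```

At the code-point level the root node has one outgoing edge per character, while at the byte level the same keys are spread over three layers whose widest node has 64 children.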

Will deprecate in PHP Parser

  • Serbian (sr): uses LanguageConverter::guessVariant() (code here), an ad hoc heuristic that depends heavily on the exact boundaries of the text chunk passed to the converter. Since Parsoid and PHP emit different markup -- in particular, Parsoid emits a number of "invisible" <span> tags to denote semantic information which is missing from the PHP output -- it is likely impossible to make this heuristic work the same in Parsoid and the PHP parser. Our intent is to deprecate LanguageConverter::guessVariant(), which will require working with srwiki on linting and correcting pages which depend on it.
    • Example: sr:Канизи contains segments (like "GeoNames ИД") which are not converted to consistent Cyrillic because they don't contain any of the special characters шђчћжШЂЧЋЖ or šđčćžŠĐČĆŽ. Putting span tags in different places (for example around template contents) changes the boundaries of the text passed to LanguageConverter, which can change the number of special characters present and thus change the output of LanguageConverter.
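
The boundary sensitivity can be reproduced with a simplified stand-in for the heuristic. The function below is not the actual guessVariant() code: it assumes, for illustration, that the guess is made by comparing counts of the Cyrillic-only and Latin-only marker characters in a chunk, and the Latin set is assumed to be the lowercase and uppercase counterparts of the Cyrillic one. The name `guess_script` and the sample text are hypothetical.

```python
from typing import Optional

CYRILLIC_MARKERS = set("шђчћжШЂЧЋЖ")
LATIN_MARKERS = set("šđčćžŠĐČĆŽ")  # assumed counterparts of the set above


def guess_script(chunk: str) -> Optional[str]:
    """Guess the script of a chunk from marker counts; None if undecidable."""
    cyrillic = sum(ch in CYRILLIC_MARKERS for ch in chunk)
    latin = sum(ch in LATIN_MARKERS for ch in chunk)
    if cyrillic > latin:
        return "cyrillic"
    if latin > cyrillic:
        return "latin"
    return None  # no evidence either way


# The same text, passed as one chunk versus split at a hypothetical span boundary.
whole = "Шифра: GeoNames ИД"
split = ["Шифра: ", "GeoNames ИД"]

print(guess_script(whole))               # cyrillic
print([guess_script(c) for c in split])  # ['cyrillic', None]
```

Passed as one chunk, the text is classified as a whole; once a tag boundary separates the two parts, the second part contains no marker characters and the heuristic has nothing to go on.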

Not yet ported

  • Gan Chinese (gan): reuses Simplified/Traditional tables from zhwiki
  • Inuktitut (iu): no obvious issues, bespoke script
  • Kazakh (kk): Cyrillic, Latin, *and* Arabic scripts
  • Shilha (shi): no obvious issues, bespoke script
  • Tajik (tg): no obvious issues, Cyrillic script
  • Uzbek (uz): no obvious issues, Cyrillic script

Won't fix

None known (yet).

Other issues

Title redirection

Future Work

  • Add a mechanism to select a default variant for a wiki, and for a specific article.
  • Editing support
  • html2wt support
  • Improve tooling; provide clean declarative language
  • Character set size issues with CJK: currently worked around by dispatching on each byte of the UTF-8 encoding; perhaps there is a better solution.