LangConv

LangConv is a Finite-State Transducer (FST)-based implementation of script and language conversion for MediaWiki. It contains implementations in JS and PHP (since it was originally written for Parsoid), although engines to run the FSTs can be easily built in any language.

LangConv aims to be:

  • Opinionated and rigorous. Rather than allowing arbitrary code execution inside language converters, we restrict transformations to those relations expressible using an FST.
  • Bidirectional. The FST framework allows us to run any transformation "in reverse", in order to determine the set of input texts which can result in the given output text. We can also bracket texts where there is a precise 1:1 relationship between input and output (that is, reversing the output will result in exactly the input text and no other texts). This allows us to construct editors in either the output variant or the input variant without corrupting the text.
  • Linguistically grounded. Adding a new conversion pair should be considered foremost a linguistic exercise, using tools familiar to linguists, not a programming challenge. The specification format and dictionaries should be editable without knowledge of the implementation language (PHP or JavaScript).
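The bidirectional property can be illustrated with a toy transducer. The sketch below is not LangConv code: the transition layout, `applyFst` and `invertFst` are hypothetical names chosen for illustration, and labels are restricted to single ASCII characters with no epsilon arcs. Swapping the input and output labels of every arc gives the reverse relation, which returns every input text that could have produced a given output.

```php
<?php
// Toy transducer, for illustration only (not the LangConv data format).
// Each transition is [ source state, input char, output char, target state ].

/** Return every output the transducer can produce for $text, sorted. */
function applyFst( array $transitions, array $finals, string $text, int $start = 0 ): array {
    $found = [];
    $stack = [ [ $start, 0, '' ] ]; // [ state, input position, output so far ]
    while ( $stack !== [] ) {
        [ $state, $pos, $out ] = array_pop( $stack );
        if ( $pos === strlen( $text ) ) {
            // All input consumed: accept only if we stopped in a final state.
            if ( in_array( $state, $finals, true ) ) {
                $found[$out] = true;
            }
            continue;
        }
        foreach ( $transitions as [ $src, $in, $emit, $dst ] ) {
            if ( $src === $state && $in === $text[$pos] ) {
                $stack[] = [ $dst, $pos + 1, $out . $emit ];
            }
        }
    }
    $outputs = array_map( 'strval', array_keys( $found ) );
    sort( $outputs );
    return $outputs;
}

/** Reverse the relation by swapping the input and output label of each arc. */
function invertFst( array $transitions ): array {
    return array_map(
        static fn ( array $t ): array => [ $t[0], $t[2], $t[1], $t[3] ],
        $transitions
    );
}

// Both 'a' and 'b' map to 'x', so 'x' is ambiguous in reverse; 'c' <-> 'y' is 1:1.
$toy = [
    [ 0, 'a', 'x', 0 ],
    [ 0, 'b', 'x', 0 ],
    [ 0, 'c', 'y', 0 ],
];
$finals = [ 0 ];

$forward = applyFst( $toy, $finals, 'abc' );
$ambiguous = applyFst( invertFst( $toy ), $finals, 'xx' );
$exact = applyFst( invertFst( $toy ), $finals, 'yy' );
```

Here `$ambiguous` holds four candidate inputs while `$exact` holds exactly one. Text of the second kind is what can be bracketed as having a precise 1:1 relationship.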

LangConv currently fully supports the following languages:

It also contains partial support for:

  • Pig Latin (en-x-piglatin), as a debugging aid (word length is limited)
  • Chinese (zh) (the generated FSTs can be quite large)

Current limitations of LangConv are:

  • Scalability to languages (like zh) with large character sets. The high fan-out of each node in the FST, combined with the character length of some rule match strings, causes processing time and generated FST size to grow quite large. Alternative encodings of the FST are being explored.
  • Limited "memory". This affects en-x-piglatin, where a word size limit must be imposed. This is not considered a problem in practice. There are extensions of the FST (e.g., flag diacritics) which could help.
  • Limited dynamism. On zh.wikipedia.org the set of language conversion rules is frequently amended, not only on a per-page basis but also via a per-wiki "extra rules" page. This would be easily handled if scalability were solved, since FSTs are efficiently composed.
  • Conversions are specified in the "xfst" language (foma dialect), which is dated and doesn't contain modern software engineering features. There are, however, textbooks available targeted at linguists (ISBN 978-1575864341) describing its use. It may be helpful to clean up the conversion specifications a bit with a few carefully engineered extensions.

The article Parsoid/LanguageConverter has more details on the current status of the library.

Installation

In MediaWiki

LangConv has been available in MediaWiki as a composer dependency of wikimedia/parsoid since MediaWiki 1.35.

Everywhere else

Install the wikimedia/langconv package from Packagist:

composer require wikimedia/langconv

Semantic versioning is used.

The major version number will be incremented for every change that breaks backwards compatibility.

Architecture overview

For full reference documentation, please see the documentation generated from the source (or the source itself).

LangConv executes Finite-State Transducers to perform language conversion. These are specified in the fst/ directory in the source. The primary file is named fst/<language code>.foma, for example fst/crh.foma for Crimean Tatar (crh), and this is compiled first to .att files by foma, then to .pfst files for use by LangConv at runtime.
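To make the intermediate step more concrete, here is a minimal sketch of reading AT&T-style transducer text in PHP. It assumes the common tabular layout (one arc per line as source, target, input and output separated by tabs, and a line holding only a state number for each final state) and assumes that epsilon is written as `@0@`. The function name `parseAtt` is hypothetical, and this is not the converter LangConv itself uses to produce .pfst files.

```php
<?php
/**
 * Parse AT&T-style transducer text into arcs and final states.
 * Assumed layout: "source<TAB>target<TAB>input<TAB>output" for an arc;
 * a line with fewer than four fields marks its first field as a final state.
 */
function parseAtt( string $text ): array {
    $transitions = [];
    $finals = [];
    foreach ( preg_split( '/\R/', trim( $text ) ) as $line ) {
        if ( $line === '' ) {
            continue; // skip blank lines
        }
        $fields = explode( "\t", $line );
        if ( count( $fields ) >= 4 ) {
            [ $src, $dst, $in, $out ] = $fields;
            $transitions[] = [ (int)$src, (int)$dst, $in, $out ];
        } else {
            $finals[] = (int)$fields[0];
        }
    }
    return [ 'transitions' => $transitions, 'finals' => $finals ];
}

// Two arcs and one final state; the second arc emits epsilon (assumed "@0@").
$att = "0\t1\ta\tx\n1\t1\tb\t@0@\n1\n";
$fst = parseAtt( $att );
```

A table like this is easy to inspect and debug, while a compact binary encoding is better suited to fast loading at runtime.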

The following utility files are included in most language definitions, via the source statement:

  • brackets.foma: defines bracketing constructs generally useful in formulating complex transformations.
  • roman.foma: defines rules for roman numerals, commonly used in Cyrillic-Latin conversions.
  • safety.foma: defines functions to transform a pair of conversion functions into a function to bracket "safe" strings (strings which can be losslessly converted in both directions).
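The idea behind safe-string bracketing can be sketched at the character level in PHP. This is only an analogy: safety.foma works on transducers, whereas the hypothetical `bracketSafe` function below takes two plain lookup tables (also invented for this example) and marks a character as safe when converting it forward and then back yields the original character.

```php
<?php
/**
 * Split $text into maximal runs of "safe" and "unsafe" characters.
 * A character is safe if backward( forward( c ) ) === c.
 * Characters missing from a table are assumed to pass through unchanged.
 */
function bracketSafe( string $text, array $forward, array $backward ): array {
    if ( $text === '' ) {
        return [];
    }
    $runs = [];
    foreach ( str_split( $text ) as $ch ) {
        $converted = $forward[$ch] ?? $ch;
        $safe = ( $backward[$converted] ?? $converted ) === $ch;
        $last = count( $runs ) - 1;
        if ( $last >= 0 && $runs[$last]['safe'] === $safe ) {
            $runs[$last]['text'] .= $ch; // extend the current run
        } else {
            $runs[] = [ 'text' => $ch, 'safe' => $safe ];
        }
    }
    return $runs;
}

// 'a' and 'b' both convert to 'x', but 'x' converts back to 'a' only,
// so 'b' cannot be round-tripped.
$forward = [ 'a' => 'x', 'b' => 'x', 'c' => 'y' ];
$backward = [ 'x' => 'a', 'y' => 'c' ];
$runs = bracketSafe( 'acbca', $forward, $backward );
```

In this sketch the runs `ac` and `ca` are marked safe and `b` is not. An editor could then leave the safe runs freely editable in either variant and preserve the original source text for the unsafe run.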

Library code to execute an FST (encoded in a .pfst file) is provided for JavaScript (in lib/ and tests/mocha) and PHP (in src/ and tests/phpunit).

Examples

Transliterate Serbian from Latin to Cyrillic

use DOMDocument;
use Wikimedia\LangConv\FstReplacementMachine;

function convertSr( $input ) {
    // Load the compiled .pfst files for the Serbian variants.
    $machine = new FstReplacementMachine( 'sr', [ 'sr-ec', 'sr-el' ] );
    // The converted fragment is created in, and owned by, this document.
    $doc = new DOMDocument();
    $result = $machine->convert( $doc, $input, 'sr-el', 'sr-ec' );
    // Serialize the fragment to an HTML string.
    $resultHTML = $doc->saveHTML( $result );
    return $resultHTML;
}

echo convertSr( 'abcdefg' );

In the above code sample, we first construct an FST by loading the appropriate .pfst files for Serbian. We then convert the input string (abcdefg in the call shown) from Latin (sr-el) to Cyrillic (sr-ec). (These language codes are unusual: T117845.)

The result of the conversion is an HTML fragment, owned by $doc. This is because it may contain metadata on the converted text, such as the source variant and bracketing information to allow it to be losslessly converted back to the source. We return the HTML string representing the conversion results.

Performance

There are two factors affecting LangConv performance: the size and construction of the FST, and the engine used to execute the FST. Generally FSTs are minimized by foma in order to minimize backtracking. The FST engine implementations in both JS and PHP have also been optimized to minimize the amount of time spent per state.

We also try to minimize the time spent loading the .pfst file. In JavaScript this is done by memory mapping the binary .pfst file. In PHP we read the .pfst as a string and directly execute the FST from that string.

See also

External links