LangConv

Introduction
LangConv is a Finite-State Transducer (FST)-based implementation of script and language conversion for MediaWiki. It contains implementations in JavaScript and PHP, although engines to run the FSTs can easily be built in any language.

LangConv aims to be:

 * Opinionated and rigorous. Rather than allowing arbitrary code execution inside language converters, we restrict transformations to those relations expressible using an FST.
 * Bidirectional. The FST framework allows us to run any transformation "in reverse", in order to determine the set of input texts which can result in the given output text.  We can also bracket texts where there is a precise 1:1 relationship between input and output (that is, reversing the output will result in exactly the input text and no other texts).  This allows us to construct editors in either the output variant or the input variant without corrupting the text.
 * Linguistically grounded. Adding a new conversion pair should be considered foremost a linguistic exercise, using tools familiar to linguists, not a programming challenge.  The specification format and dictionaries should be editable without knowledge of the implementation language (PHP or JavaScript).

LangConv currently fully supports the following languages:
 * crh
 * ku
 * sr

It also contains partial support for:
 * en-x-piglatin, as a debugging aid (word length is limited)
 * zh (the generated FSTs can be quite large)

Current limitations of LangConv are:
 * Scalability to languages (like zh) with large character sets. The high fanout of each node in the FST, combined with the character length of some target matches, makes processing slow and the generated FSTs quite large.  Alternative encodings of the FST are being explored.
 * Limited "memory". This affects en-x-piglatin, where a word-length limit must be imposed.  This is not considered a problem in practice.  There are extensions to the FST formalism (flag diacritics) which could help.
 * Limited dynamism. On zhwiki the set of language conversion rules is frequently amended, not only on a per-page basis but also via a per-wiki "extra rules" page.  This would be easy to handle if the scalability issue were solved, since FSTs can be composed efficiently.
 * Conversions are specified in the "xfst" language, which is dated and lacks modern software engineering features.  There are, however, textbooks aimed at linguists describing its use.  It may be helpful to clean up the conversion specifications with a few carefully engineered extensions to xfst.
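For reference, a replace rule in xfst notation looks like the following. This rule is an illustrative sketch only and does not appear in the LangConv sources:

```
! Rewrite the Latin digraph "dj" to Cyrillic "ђ" in all contexts.
define DjRule [ {dj} -> {ђ} ] ;
```

Rules like this compose: applying a cascade of replace rules in sequence is equivalent to a single composed transducer, which is what makes FST-based conversion specifications tractable to reason about.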

In MediaWiki
LangConv has been available in MediaWiki as a composer library dependency since MediaWiki 1.35.

Everywhere else
Install the wikimedia/langconv package from Packagist:

composer require wikimedia/langconv

Semantic versioning is used.

The major version number will be incremented for every change that breaks backwards compatibility.
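Under semantic versioning, a caret constraint in composer.json accepts releases that are backwards-compatible with the named version (the version number below is only an example, not a recommendation):

```json
{
    "require": {
        "wikimedia/langconv": "^0.4"
    }
}
```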

Architecture overview
For full reference documentation, please see the documentation generated from the source (or the source itself):

 * Generated API documentation

LangConv executes Finite-State Transducers to perform language conversion.

Examples
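As a minimal illustration of how FST-based conversion works, the sketch below runs a toy transducer forward (Latin to Cyrillic) and "in reverse" by swapping the input and output tapes. The machine, its transitions, and the `run` helper are purely illustrative and are not LangConv's actual API:

```javascript
// A toy finite-state transducer: each transition maps an input symbol to an
// output symbol. Running the same machine with the tapes swapped yields the
// inverse relation, which is the basis of bidirectional conversion.
const transitions = [
  // [fromState, inputSymbol, outputSymbol, toState]
  [0, 'a', 'а', 0], // Latin a -> Cyrillic а
  [0, 'b', 'б', 0], // Latin b -> Cyrillic б
  [0, 'v', 'в', 0], // Latin v -> Cyrillic в
];

function run(fst, input, inverted = false) {
  let state = 0;
  let output = '';
  for (const ch of input) {
    // Match on the output symbol instead of the input symbol when inverted.
    const t = fst.find(([from, inSym, outSym]) =>
      from === state && (inverted ? outSym : inSym) === ch);
    if (!t) {
      throw new Error(`no transition from state ${state} on "${ch}"`);
    }
    output += inverted ? t[1] : t[2];
    state = t[3];
  }
  return output;
}

console.log(run(transitions, 'abv'));       // -> 'абв'
console.log(run(transitions, 'абв', true)); // -> 'abv'
```

Real LangConv FSTs are compiled from xfst sources and have many states and non-trivial contexts, but the principle of inverting the relation by swapping the tapes is the same.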

Performance
XXX this is copied from remex-html, fix me XXX

Various options can be enabled which improve performance, potentially at the expense of correctness: