User:Jeblad/Morphology

... rewrite words from one form to another. There are several methods available, and some very radical simplifications can be used in specific situations, such as truncating words with Snowball or rewriting words using Soundex for links. Such radical rewrites are possible if the number of possible link targets is sufficiently small. When it is not, the rewrite rules must be somewhat more advanced.
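As a rough illustration of such a radical rewrite, the sketch below reduces both the written word and every candidate link target to an American Soundex key and accepts a match only when exactly one target shares the key. The function names `soundex` and `resolve_link` are made up for this example, and the "exactly one match" policy is an assumption about how a small set of link targets could be handled, not a description of any existing extension.

```python
def soundex(word: str) -> str:
    """Return the four-character American Soundex key for a word."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in group:
            codes[letter] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    key = letters[0]
    previous = codes.get(letters[0], "")
    for c in letters[1:]:
        digit = codes.get(c, "")
        if digit and digit != previous:
            key += digit
        # H and W do not separate equal codes; vowels (and other letters) do.
        if c not in "HW":
            previous = digit
    return (key + "000")[:4]


def resolve_link(word, targets):
    """Return the single target sharing the word's key, or None if unclear."""
    key = soundex(word)
    matches = [t for t in targets if soundex(t) == key]
    return matches[0] if len(matches) == 1 else None


print(resolve_link("Bergens", ["Bergen", "Oslo", "Robert"]))  # Bergen
print(resolve_link("Robert", ["Robert", "Rupert"]))           # None (ambiguous)
```

The second call shows why this only works for small target sets: as soon as two targets collapse to the same key, the rewrite can no longer pick one.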

Finite state toolkits
Finite state toolkits are a group of applications similar to, or descended from, the proprietary Xerox Finite State Toolkit. The most notable derivatives are perhaps Foma, OpenFST and the metaset Hfst. These toolkits are the basis for building morphologies of natural languages.

There are several FOSS morphologies written in lexc/xfst for the Sámi languages, Cornish, Faroese, Finnish, Komi, Mari, Udmurt, Buriat, Greenlandic and Iñupiaq. These form important baselines for a future morphology extension.

Transducers
Processing natural language involves several kinds of morphological transducers. In wikitext usually only two types are necessary: one for analysis and one for synthesis of words. Analysis takes one word in a given language and produces the base form and a set of tokens describing the actual form. Synthesis takes one word and a set of tokens and produces the corresponding inflected form.
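The two directions can be pictured with a toy lookup table standing in for a real transducer. This is only a sketch: `LEXICON`, `synthesize` and `analyze` are hypothetical names, the table holds two hand-written entries for viessu, and the inflected forms should be checked against a real morphology before being relied on. A compiled lexc/xfst transducer would compute these mappings by rule instead of listing them.

```python
# Toy stand-in for a morphological transducer: (base form, tokens) -> surface form.
# The entries are illustrative only and are not taken from any real lexicon file.
LEXICON = {
    ("viessu", ("N", "Sg", "Nom")): "viessu",
    ("viessu", ("N", "Sg", "Loc")): "viesus",
}


def synthesize(base, tokens):
    """Synthesis: base form plus tokens -> inflected form, or None if unknown."""
    return LEXICON.get((base, tuple(tokens)))


def analyze(surface):
    """Analysis: inflected form -> list of (base form, tokens) readings."""
    return [(base, tokens)
            for (base, tokens), form in LEXICON.items()
            if form == surface]


print(synthesize("viessu", ["N", "Sg", "Loc"]))  # viesus
print(analyze("viesus"))  # [('viessu', ('N', 'Sg', 'Loc'))]
```

Analysis returns a list because a surface form can have more than one reading; synthesis returns at most one form here, which ignores the dialectal variants a real morphology may produce.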

A more complete system for translating natural language will, in addition to morphological transducers, have grammatical and dependency transducers. Because most of the wikitext is "pre-translated", only some words need processing, and the morphological transducers are the only ones necessary. Note that if phrases are processed, grammatical and dependency transducers will be necessary to get good results.

Inside wikitext there are three forms: a preprocessor markup form, an analysis form and a synthesis form. The latter two can each be written either as a tag or as a parser function. The parser function form is better suited when templates are being substituted, while the tag form is better suited when a text is included from a prefill template of some kind.

The preprocessor form is a simplified form for use directly in the running text. Words for preprocessing are recognized because they are directly followed by at least one known combination of an operator and a token. The operators for morphological processing are add token (+), subtract token (-) and use word as example (~). The operators and tokens can be strung together.

viessu+N+Sg+Loc
 * Example

In this example viessu is the word to be processed, +N adds noun to the processing, +Sg adds singular, and +Loc adds locative. Together these produce a final form, but note that there might be some dialectal variation.
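Recognizing such a form in running text could look roughly like the sketch below. It splits a candidate into the word and its operator/operand pairs, and rejects it unless the first pair is a known combination, so that ordinary hyphenated words are left alone. The set `KNOWN_TOKENS`, the function name and the exact character rules are assumptions made for this example; the real token inventory would come from the morphology in use.

```python
import re

# Assumed token inventory for the example; a real one comes from the morphology.
KNOWN_TOKENS = {"N", "Sg", "Pl", "Nom", "Loc"}

# A word followed by one or more operator+operand pairs, with no whitespace.
_FORM = re.compile(r"([^\s+\-~]+)((?:[+\-~][^\s+\-~]+)+)")
_PAIR = re.compile(r"([+\-~])([^\s+\-~]+)")


def parse_preprocessor_form(text):
    """Return (word, [(operator, operand), ...]) or None if not recognized."""
    match = _FORM.fullmatch(text)
    if match is None:
        return None
    word, tail = match.groups()
    pairs = _PAIR.findall(tail)
    first_op, first_operand = pairs[0]
    # "~" is followed by an example word, so only "+" and "-" are checked
    # against the known tokens.
    if first_op != "~" and first_operand not in KNOWN_TOKENS:
        return None
    return word, pairs


print(parse_preprocessor_form("viessu+N+Sg+Loc"))
# ('viessu', [('+', 'N'), ('+', 'Sg'), ('+', 'Loc')])
print(parse_preprocessor_form("well-known"))  # None
```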


 * add token (+) : The token is added unconditionally to the synthesis for the
 * subtract token (-)
 * use word (~) : The following word is analyzed and the resulting tokens are used for further processing of the given word.
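One possible reading of the three operators is sketched below as a function that folds a sequence of (operator, operand) pairs into a token list for synthesis. Several things here are assumptions: the behaviour of subtract is not spelled out above, so it is taken to remove a token collected earlier; `EXAMPLE_ANALYSES` is a hypothetical stand-in for the analysis transducer; and tokens are kept in first-seen order without duplicates.

```python
# Hypothetical analyzer output, standing in for a real analysis transducer.
EXAMPLE_ANALYSES = {"viesus": ["N", "Sg", "Loc"]}


def apply_operators(pairs, analyses=EXAMPLE_ANALYSES):
    """Fold (operator, operand) pairs into an ordered token list."""
    tokens = []
    for op, operand in pairs:
        if op == "+":
            # Add token: the token always ends up in the list.
            if operand not in tokens:
                tokens.append(operand)
        elif op == "-":
            # Subtract token (assumed meaning): drop it if present.
            if operand in tokens:
                tokens.remove(operand)
        elif op == "~":
            # Use word: analyze the example word and take over its tokens.
            for token in analyses.get(operand, []):
                if token not in tokens:
                    tokens.append(token)
        else:
            raise ValueError(f"unknown operator: {op!r}")
    return tokens


# Borrow the tokens of an example word, then swap locative for nominative.
print(apply_operators([("~", "viesus"), ("-", "Loc"), ("+", "Nom")]))
# ['N', 'Sg', 'Nom']
```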

Doppe leat NaN luohkát+N+Pl+Noms Doppe leat luohkát+N+NaN Pls+Nom