User:Jeblad/Morphology

... rewrite words from one form to another. There are several methods available, and some very radical simplifications can be used in specific situations, such as truncating words with Snowball or rewriting words with Soundex for links. Such radical rewrites are possible if the number of possible link targets is sufficiently small. When it is not, the rewrite rules must be somewhat more advanced.
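As an illustration of such a radical simplification, the following is a minimal sketch of Soundex-based link matching. It assumes the classic American Soundex rules, which are tuned for English names; the function names `soundex` and `resolve_link` are hypothetical and not part of any existing extension.

```python
# Minimal sketch: classic American Soundex, used to match a written word
# against a small set of link targets. All names here are illustrative.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def resolve_link(word, targets):
    """Return the link targets that share a Soundex code with the word."""
    index = {}
    for target in targets:
        index.setdefault(soundex(target), []).append(target)
    return index.get(soundex(word), [])


print(resolve_link("Rupert", ["Robert", "Ashcraft"]))
```

Because many distinct words collapse to the same code, a lookup like this only stays unambiguous while the set of targets is small, which is the limitation described above.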

The focus of the text is to make a system that works in a wiki, not to build a full-fledged translation system.

Finite state toolkits
Toolkits for finite-state transducers are a group of applications similar to, or descended from, the proprietary Xerox Finite State Toolkit. The most notable of the derivatives are perhaps Foma, OpenFst and the metaset HFST. These toolkits are the basis for building morphologies of natural languages.

There are several FOSS morphologies written in lexc/xfst for the Sámi languages, Cornish, Faroese, Finnish, Komi, Mari, Udmurt, Buriat, Greenlandic and Iñupiaq. These form important baselines for a future morphology extension.

Transducers
During processing of natural language, several morphological transducers are used. In wikitext usually only two types of morphological transducers are necessary: one for analysis and one for synthesis of words. Analysis takes one word in a given language and produces the base form and a set of tokens describing the actual form. Synthesis takes one word (the base form) and a set of tokens and produces the corresponding inflected form. In wikitext analysis and synthesis are usually integrated, where analysis is the first pass of the run and synthesis is the second pass.
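The two directions can be sketched with a lookup table standing in for real transducers. This is only a toy under stated assumptions: the names `LEXICON`, `analyze` and `synthesize` are hypothetical, and the table holds just two forms of the Northern Sami word viessu. A real system would compile these relations from a lexc/xfst description.

```python
# Toy stand-in for a pair of morphological transducers.
# Maps (base form, tokens) -> inflected form; the entries are illustrative.
LEXICON = {
    ("viessu", ("N", "Sg", "Nom")): "viessu",
    ("viessu", ("N", "Sg", "Loc")): "viesus",
}


def analyze(word):
    """Analysis: inflected form -> list of (base form, tokens).

    A list is returned because a form can be ambiguous in general.
    """
    return [(base, tokens) for (base, tokens), form in LEXICON.items()
            if form == word]


def synthesize(base, tokens):
    """Synthesis: base form plus tokens -> inflected form, or None."""
    return LEXICON.get((base, tuple(tokens)))


# First pass: analysis. Second pass: synthesis with a changed token list.
base, tokens = analyze("viessu")[0]
print(base, tokens)
print(synthesize(base, ["N", "Sg", "Loc"]))
```

In a finite state toolkit both directions usually come from the same compiled network, applied either upwards or downwards, so the table above is a simplification of that idea rather than a description of any toolkit's interface.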

A more complete system for translation of natural language will, in addition to morphological transducers, have grammatical and dependency transducers. Because most of the wikitext is "pre-translated", only some words need processing and the morphological transducers are the only ones necessary. Note that if phrases are processed, grammatical and dependency transducers will be necessary to get good results.

Inside wikitext there are three forms: a preprocessor markup form, a tag form and a parser function form. The parser function form is better suited when templates are being substituted, while the tag form is better suited when a text is included from a prefill template of some kind. (The tag form seems redundant.) Processing of text due to preprocessor markup will not extend beyond the marked word. This is important to keep the load down.

The preprocessor form is a simplified form for use directly in the running text. Words for preprocessing are recognized because they are directly followed by at least one known combination of an operator and a token. The operators for morphological processing are add token (+), subtract token (-) and use word as example (~). The operators and tokens can be strung together.


 * add token (+) : The token is added unconditionally as a processing directive during synthesis. If it already exists it will not be added again, and other preceding processing directives can drop later processing directives silently.
 * subtract token (-) : The token is removed unconditionally as a processing directive during synthesis. If it does not already exist in the list of processing directives it will be dropped silently.
 * use word (~) : The following word is analyzed and the resulting tokens are added unconditionally as processing directives during synthesis.
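The add and subtract rules above can be sketched as operations on an ordered list of processing directives. The helper names `parse_ops` and `apply_ops` are hypothetical, tokens are assumed to consist of word characters only, and the rule that earlier directives may silently drop later ones is not modeled here.

```python
import re

# One operator (+ or -) directly followed by a token name.
_OP_TOKEN = re.compile(r"([+-])(\w+)")


def parse_ops(suffix):
    """Split a string such as '+N+Sg+Loc' into (operator, token) pairs."""
    return _OP_TOKEN.findall(suffix)


def apply_ops(directives, ops):
    """Apply add/subtract operators to a list of processing directives."""
    result = list(directives)
    for op, token in ops:
        if op == "+" and token not in result:
            result.append(token)   # add, but never twice
        elif op == "-" and token in result:
            result.remove(token)   # subtract; a missing token is ignored
    return result


print(apply_ops([], parse_ops("+N+Sg+Loc")))
print(apply_ops(["N", "Sg", "Nom"], parse_ops("-Nom+Loc")))
```

New tokens are simply appended in this sketch. A real transducer may expect its tokens in a fixed order, so an implementation would probably need to sort or slot them before synthesis.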

To rewrite the Northern Sami word viessu (house) into viesus, do something like

viessu+N+Sg+Loc → viesus
 * Example

In this example viessu is the word to be processed, +N is a marker that adds noun to the processing, +Sg adds singular, and +Loc adds locative. Together these produce the final form, but note that there might be some dialectal variations.

Because analysis of viessu will produce viessu+N+Sg+Nom it is also possible to write the following

~viessu-Nom+Loc → viesus
 * Example
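Both examples can be traced end to end with a small sketch that combines analysis, operator handling and synthesis. Everything here is an assumption made for illustration: the two-entry `LEXICON` replaces real transducers, `rewrite` is a hypothetical name, and the regular expression assumes that words and tokens consist of word characters only.

```python
import re

# Toy lexicon: (base form, tokens) -> inflected form. Illustrative only.
LEXICON = {
    ("viessu", ("N", "Sg", "Nom")): "viessu",
    ("viessu", ("N", "Sg", "Loc")): "viesus",
}

# Optional '~', a word, then one or more operator/token pairs.
_MARKUP = re.compile(r"(~?)(\w+)((?:[+-]\w+)+)")


def rewrite(markup):
    """Rewrite one marked word; anything unrecognized is returned unchanged."""
    match = _MARKUP.fullmatch(markup)
    if match is None:
        return markup
    tilde, word, suffix = match.groups()
    base, directives = word, []
    if tilde:
        # First pass: analyze the example word to get its base form and tokens.
        analyses = [key for key, form in LEXICON.items() if form == word]
        if not analyses:
            return markup
        base, directives = analyses[0][0], list(analyses[0][1])
    for op, token in re.findall(r"([+-])(\w+)", suffix):
        if op == "+" and token not in directives:
            directives.append(token)
        elif op == "-" and token in directives:
            directives.remove(token)
    # Second pass: synthesize the requested form.
    form = LEXICON.get((base, tuple(directives)))
    return form if form is not None else markup


print(rewrite("viessu+N+Sg+Loc"))
print(rewrite("~viessu-Nom+Loc"))
```

Returning the input unchanged on any failure is a design choice of this sketch, so that unknown words stay visible in the text instead of disappearing.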

The preprocess form has the additional property that the language of the word is implicit from the content language, and during the initial preprocessing the example will be rewritten as  if the content language is "se".

The parser function form adds a language for the word to be analyzed but is otherwise similar. This form can process several words, and in the final form they will be joined with whitespace.
 * Example

Doppe leat NaN luohkát+N+Pl+Noms Doppe leat luohkát+N+NaN Pls+Nom