User:Jeblad/Morphology

... rewrite words from one form to another. There are several methods available, and some very radical simplifications can be used in specific situations, such as truncating words with Snowball or rewriting words with Soundex for links. Such radical rewrites are possible if the number of possible link targets is sufficiently small. When it is not, the rewrite rules must be somewhat more advanced.
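As a rough illustration of the Soundex approach to links, the sketch below reduces a word to a four-character key and picks the link targets that share that key. It implements the common American Soundex variant; the helper `resolve` and the sample target list are hypothetical and only show why this works when the set of targets is small.

```python
# Letter groups of American Soundex; vowels and other letters get no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex key of a word ('' if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def resolve(word, targets):
    """Hypothetical helper: link targets whose key matches the word's key."""
    key = soundex(word)
    return [t for t in targets if soundex(t) == key]
```

With only a handful of targets a key usually matches at most one of them; as the target set grows, unrelated words start sharing keys, which is why more advanced rules are then needed.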

Finite state toolkits
Toolkits for finite-state transducers are a group of applications similar to, or descended from, the proprietary Xerox Finite State Toolkit. The most notable of the derivatives are perhaps Foma, OpenFst and the meta-toolkit HFST. These toolkits are the basis for building morphologies of natural languages.

There are several FOSS morphologies written in lexc/xfst for the Sámi languages, Cornish, Faroese, Finnish, Komi, Mari, Udmurt, Buriat, Greenlandic and Iñupiaq. These form important baselines for a future morphology extension.

Transducers
During processing of natural language, several transducers are used. In wikitext, usually only two types of transducers are necessary: one for analysis and one for synthesis of words. Analysis takes one word in a given language and produces the base form and a set of tokens describing the actual form. Synthesis takes one word and a set of tokens and produces the inflected form.
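The two directions can be sketched with a toy lookup table standing in for a compiled transducer. The table, the function names `analyze` and `synthesize`, and the three word forms are assumptions made for illustration; a real morphology would be compiled from lexc/xfst sources with one of the toolkits above.

```python
# Toy stand-in for a transducer pair: inflected form -> (base form, tokens).
# The entries are illustrative assumptions, not output of a real morphology.
ANALYSES = {
    "viessu": ("viessu", ("N", "Sg", "Nom")),
    "viesus": ("viessu", ("N", "Sg", "Loc")),
    "viesut": ("viessu", ("N", "Pl", "Nom")),
}

# Synthesis is the same relation read in the opposite direction.
SYNTHESES = {analysis: form for form, analysis in ANALYSES.items()}


def analyze(word):
    """Return (base form, tokens) for an inflected word, or None if unknown."""
    return ANALYSES.get(word)


def synthesize(base, tokens):
    """Return the inflected form for a base form and tokens, or None."""
    return SYNTHESES.get((base, tuple(tokens)))
```

Here `analyze("viesus")` yields the base form `viessu` with the tokens `N`, `Sg`, `Loc`, and `synthesize("viessu", ["N", "Pl", "Nom"])` yields `viesut`. Both directions share one relation, which is the property a finite-state transducer provides.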

Markup in wikitext
Inside wikitext there is a preprocessor markup form, an analysis form and a synthesis form. The latter two forms can each be written both as a tag and as a parser function. The parser function form is better suited when templates are being substituted, while the tag form is better suited when a text is included from a prefill template of some kind.

The preprocessor form is a simplified form for use directly in the running text. Words are recognized because they are directly followed by at least one operator and a known token. The operators are add token (+), subtract token (-) and use word as example (~). The operators and tokens can be strung together.

viessu+N+Sg+Loc
 * Example

In this example viessu is the word to be processed, the ...


 * add token (+) : The token is added unconditionally to the synthesis for the
 * subtract token (-)
 * use word (~) : The following word is analyzed and the resulting tokens are used for further processing of the given word.
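Recognizing the preprocessor form could be sketched roughly as follows. The token set `KNOWN_TOKENS`, the function name `parse_markup` and the exact recognition rule are assumptions; the sketch only splits a marked-up word into the word itself and its chain of operator and operand pairs, and rejects ordinary running text such as hyphenated words.

```python
import re

KNOWN_TOKENS = {"N", "Sg", "Pl", "Nom", "Loc"}  # assumed tag set

WORD = re.compile(r"[^\s+\-~]+")
STEP = re.compile(r"([+\-~])([^\s+\-~]+)")


def parse_markup(text):
    """Split 'word+Tok-Tok~other' into (word, [(operator, operand), ...]).

    Returns None if the text is not in the preprocessor form.
    """
    m = WORD.match(text)
    if not m:
        return None
    word, rest = m.group(0), text[m.end():]
    steps = STEP.findall(rest)
    # The operator chain must be non-empty and cover the whole remainder.
    if not steps or "".join(op + arg for op, arg in steps) != rest:
        return None
    # '+' and '-' take a known token; '~' takes an example word instead.
    for op, arg in steps:
        if op in "+-" and arg not in KNOWN_TOKENS:
            return None
    return word, steps
```

For `viessu+N+Sg+Loc` this returns the word `viessu` with three add-token steps, while `well-known` is rejected because `known` is not a known token.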

Doppe leat NaN luohkát+N+Pl+Noms
Doppe leat luohkát+N+NaN Pls+Nom