User:Jeblad/NLP

Somehow we must avoid structural morphing: inflection and bilingual transfer are OK, but restructuring text is not.

Tag functions for dictionary
A base dictionary will be defined, but it might be incomplete. This base dictionary will be defined as one or several pages, and can be protected against further changes. To make it possible to extend the dictionary for each article, a few tag functions are necessary.

For local dictionaries we use lttoolbox with some adaptations. (Note that some languages might use external transducers.) The local dictionary will be evaluated inside a language-wide environment, and as such should be placed inside a specific section. The short form  defines the outer encapsulation of the local dictionary, with the additional attributes id (typically "main") and type (typically "standard"). Inside this there are additional tags,  ,  ,   and. The tags are full forms of the usual one-character names.

Local dictionaries must extend the global one or they must be set up locally, which creates a load problem. This in turn makes it necessary to parse them out of the article and add them to a global repo if they have changed. Then they become global, and it is unknown where they might apply. A workaround is to keep the local dictionaries in a local context, parse and apply them locally, and cache the result with memcached. It is not clear how this will work with the morph-by-example scheme.
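The caching workaround could be sketched roughly as follows. This is only an illustration under assumptions: the class name LocalDictionaryCache, the page_id key and the compile_fn callback are hypothetical, and a plain Python dict stands in for memcached. The key includes a hash of the dictionary text, so an edited local dictionary is recompiled while an unchanged one is served from the cache.

```python
import hashlib


class LocalDictionaryCache:
    """Cache compiled local dictionaries per page (hypothetical sketch)."""

    def __init__(self, compile_fn):
        # compile_fn turns dictionary source text into whatever compiled
        # form the morphing step needs; it is supplied by the caller.
        self._compile = compile_fn
        # Plain dict used here as a stand-in for a memcached client.
        self._store = {}

    def get(self, page_id, dictionary_text):
        # The content hash makes the key change whenever the text changes.
        digest = hashlib.sha256(dictionary_text.encode("utf-8")).hexdigest()
        key = f"{page_id}:{digest}"
        if key not in self._store:
            self._store[key] = self._compile(dictionary_text)
        return self._store[key]
```

Because the dictionary stays keyed to its page, nothing leaks into the global dictionary, which is the point of the workaround.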

The local dictionary will mostly be used for entity names that otherwise can't be easily generated.

Parser functions for morphing
This form will inherit the language for the word from the site-wide one or from the one specified within the lexical context. |  //can't use this.. This form will override the language for the word with the locally specified one, while the destination language is still given by the site-wide one or by the one specified within the lexical context.

The initial word will always be analyzed.

The word can be a phrase; the positional parameters will then map to each word, or the positional parameters might be replaced with patterns. A pattern will use a best match (?)

Directives
The directives are pairs of operators and part-of-speech tags. The classes are noun (N), verb (vbmod, vbser, vbhaver), adjective (Adj) and so forth. Note that there are some differences between different tools. Part-of-speech tags can also be clustered with parentheses, and this happens implicitly if an example word is used: when this word is analyzed, the resulting tags will be clustered. All words that aren't recognized as part-of-speech tags will be analyzed to produce tags, as will all words enclosed within string delimiters.

The operators ..
 * + tag : Add the following part of speech tags unconditionally to the set during synthesis.
 * - tag : Remove the following part of speech tags unconditionally from the set during synthesis.
 * ~ tag : The following part of speech tags will be preferred during synthesis.

The last observed operator takes precedence, except the fuzzy operator (~), which is sticky: it will remain set on a tag even if the tag is added once more, but the tag can still be removed.
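One possible reading of these rules is sketched below. The names TagSet and apply_directives are hypothetical, and keeping the preferred tags in a separate set is an assumption made to model the stickiness of ~; the notes do not say how the sets are stored.

```python
from dataclasses import dataclass, field


@dataclass
class TagSet:
    """Tags used for synthesis; sticky '~' marks are kept in their own set."""
    required: set = field(default_factory=set)
    preferred: set = field(default_factory=set)

    def apply(self, op, tag):
        if op == "+":
            # Add unconditionally; an earlier '~' mark on the tag is left alone.
            self.required.add(tag)
        elif op == "-":
            # Remove unconditionally; removing the tag also drops its '~' mark.
            self.required.discard(tag)
            self.preferred.discard(tag)
        elif op == "~":
            # Mark the tag as preferred during synthesis.
            self.preferred.add(tag)
        else:
            raise ValueError(f"unknown operator: {op!r}")


def apply_directives(initial_tags, directives):
    """Apply (operator, tag) pairs in order; the last operator seen wins."""
    tags = TagSet(required=set(initial_tags))
    for op, tag in directives:
        tags.apply(op, tag)
    return tags
```

With this reading, applying ~ Loc followed by + Loc leaves Loc both required and preferred, while a later - Loc clears it from both sets.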

Example
If the following is evaluated inside Northern Sami Wikipedia, then the following results should be produced.

If we write a call like → → Aurdalas we simply say "translate the Norwegian Aurdal like the Northern Sami form Alvdalas". The word Alvdalas will then be analyzed to produce a more complete form, and this form will then be used to produce Aurdalas. Sometimes the results from the analysis will be insufficient and we will have to refine them by adding or removing switches. This can be done like this → → Aurdala whereby Aurdal becomes Aurdala and not Aurdalas. In addition there are times when we don't know if we have a complete match. In those circumstances we want to get as close as possible to a given form, which could be given by an example word. We can write this as →  → Aurdalas
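The analyze-then-synthesize flow can be sketched with toy lookup tables. Everything here is a stand-in: the function names are hypothetical, a real setup would call the transducers instead of using dictionaries, and the tag string for Alvdalas is assumed to be the analysis N Prop Plc Sg Loc.

```python
# Toy stand-ins for a real analyser and generator (hypothetical data).
ANALYSES = {
    "Alvdalas": ("N", "Prop", "Plc", "Sg", "Loc"),
}
FORMS = {
    ("Aurdal", frozenset({"N", "Prop", "Plc", "Sg", "Loc"})): "Aurdalas",
}


def analyse(word):
    """Return the part-of-speech tags for an example word."""
    return ANALYSES[word]


def generate(lemma, tags):
    """Return the surface form of a lemma for a given set of tags."""
    return FORMS[(lemma, frozenset(tags))]


def morph_by_example(lemma, example):
    """Inflect the lemma the same way the example word is inflected."""
    return generate(lemma, analyse(example))
```

Calling morph_by_example("Aurdal", "Alvdalas") looks up the tags of the example word and reuses them to pick the form of Aurdal.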

Patterns
The patterns are also pairs of operators and part of speech tags.

The operators ..
 * + tag : The following part of speech tags must match the word.
 * - tag : The following part of speech tags must not match the word.
 * ~ tag : The following part of speech tags will be preferred if it is possible to match the word with them.
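A minimal sketch of how such patterns might be evaluated, assuming that "best match" means the accepting pattern that hits the most preferred tags. That scoring rule, and the names match_score and best_match, are assumptions; the notes leave the best-match rule open.

```python
def match_score(word_tags, pattern):
    """Return None if the pattern rejects the word, else the number of
    preferred ('~') tags that the word carries."""
    word_tags = set(word_tags)
    score = 0
    for op, tag in pattern:
        if op == "+":
            if tag not in word_tags:
                return None  # a required tag is missing
        elif op == "-":
            if tag in word_tags:
                return None  # a forbidden tag is present
        elif op == "~":
            if tag in word_tags:
                score += 1  # a preferred tag is present
        else:
            raise ValueError(f"unknown operator: {op!r}")
    return score


def best_match(word_tags, patterns):
    """Return the index of the accepting pattern with the highest score,
    or None if every pattern rejects the word. Ties go to the first."""
    best_index, best_score = None, -1
    for index, pattern in enumerate(patterns):
        score = match_score(word_tags, pattern)
        if score is not None and score > best_score:
            best_index, best_score = index, score
    return best_index
```

A pattern with only + and - operators scores zero when it accepts, so a pattern that also lists matching ~ tags is chosen ahead of it.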

Misc
→ N Prop Plc Sg Loc → Loc → →  → Loc

→ Alvdalas → Alvdala ...

Not solved
Structural changes will not be available. Gender shift will be a problem, as will time shift, amount, relations, possessions, names, ...