User:Jeblad/NLP

From mediawiki.org

Somehow we must avoid structural morphing: inflection and bilingual transfer are fine, but restructuring the text is not.

Tag functions for dictionary

A base dictionary will be defined, but it might be incomplete. This base dictionary will be defined as one or several pages, and can be protected against further changes. To make it possible to extend the dictionary for each article, a few tag functions are necessary.

For local dictionaries we use lttoolbox with some adaptations. (Note that some languages might use external transducers.) The local dictionary will be evaluated inside a language-wide environment, and as such should be placed inside a specific section.

Local dictionaries must either extend the global one or be set up locally, and setting them up locally creates a load problem. This in turn makes it necessary to parse them out of the article and add them to a global repository whenever they have changed. They then become global, and it is unknown where they might apply. A workaround is to keep the local dictionaries in a local context, parse and apply them locally, and cache the result with memcached. It is not clear how this will work with the morph-by-example scheme.
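
The caching workaround can be sketched as follows. This is a minimal illustration, not a design decision: the names `parse_local_dictionary`, `cache_key` and `load_local_dictionary` are hypothetical, the tab-separated entry format is an assumption made only for the example, and a plain Python dict stands in for memcached. The point is that the key includes a hash of the dictionary source, so an edited dictionary is parsed again while an unchanged one is served from the cache.

```python
import hashlib

# Stand-in for a memcached client (assumption); a real deployment would
# call get/set on a client object instead of using this in-process dict.
_cache = {}


def parse_local_dictionary(source):
    """Hypothetical parser: one 'surface<TAB>analysis' pair per line."""
    entries = {}
    for line in source.splitlines():
        if "\t" in line:
            surface, analysis = line.split("\t", 1)
            entries[surface.strip()] = analysis.strip()
    return entries


def cache_key(page_id, source):
    """Key on the page and on a hash of the dictionary text."""
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return f"localdict:{page_id}:{digest}"


def load_local_dictionary(page_id, source):
    """Parse the local dictionary only when its text has changed."""
    key = cache_key(page_id, source)
    if key not in _cache:
        _cache[key] = parse_local_dictionary(source)
    return _cache[key]
```

Because the dictionary stays keyed to its page, nothing leaks into the global repository.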

The local dictionary will mostly be used for entity names that otherwise can't be easily generated.

Monolingual dictionary

Bilingual dictionary

The short form <dict> defines the outer encapsulation of the local dictionary, with the additional attributes id (typically "main") and type (typically "standard"). Inside it are the additional tags <entry>, <pair>, <left>, <right> and <symbol>. These tags are the spelled-out forms of the usual one-character lttoolbox names (<e>, <p>, <l>, <r> and <s>).

<dict id="main" type="standard">
    <entry>
        <pair>
            <left>beer</left>
            <right>beer<symbol n="noun"/><symbol n="singular"/></right>
        </pair>
    </entry>
</dict>
{{DICTIONARY:type|left|right|symbol1|…|symbolN}}
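
As an illustration of how the spelled-out tags relate to the one-character lttoolbox names, the following sketch renames them with Python's standard `xml.etree.ElementTree`. The function name `to_lttoolbox` is hypothetical, and mapping `<dict>` to `<section>` is an assumption based on the `id` and `type` attributes. The symbol names (`noun`, `singular`) are passed through unchanged.

```python
import xml.etree.ElementTree as ET

# Long tag names used on this page mapped to lttoolbox's short names.
# The dict -> section mapping is an assumption, see the text above.
LONG_TO_SHORT = {
    "dict": "section",
    "entry": "e",
    "pair": "p",
    "left": "l",
    "right": "r",
    "symbol": "s",
}


def to_lttoolbox(xml_text):
    """Rename every known long tag to its short form; keep the rest."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = LONG_TO_SHORT.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")
```

Attributes and text content are left untouched, so the conversion only changes element names.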

Symbols can also be placed on the left side... A possible solution: because these are well-defined pairs connected via interwiki links, something like the following could work

{{LANGDICT:form|symbol1|…|symbolN}}

In this case form specifies how the word is written given the specified symbols, and it is connected to the corresponding word in other languages via the language links.

Parser functions for morphing

This will be a short form of synthesize combined with a language transform(?)

In this form the word inherits its language from the site-wide setting, or from the language specified within the lexical context.

{{!:word|directives}}
{{morph:word|directives}} //can't use this..

This form overrides the language of the word with the locally specified one, while the destination language is still the site-wide one or the one specified within the lexical context.

{{!lang:word|directives}}
{{morph:lang|word|directives}}

The initial word will always be analyzed.

The word can be a phrase, in which case the positional parameters map to the words in order

{{!:word1 word2 … wordN|directives1|directives2|…|directivesN}}

or the positional parameters might be replaced with patterns

{{!:word1 word2 … wordN|patterns1=directives1|patterns2=directives2|…|patternsN=directivesN}}

If a phrase is detected, a structural rewrite is triggered.
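
The call forms above can be taken apart as in the following sketch. It is only an illustration under assumptions: `parse_morph_call` is a hypothetical name, the phrase is split on whitespace, and any parameter containing `=` is treated as a `pattern=directives` pair. Nested templates and escaping are ignored.

```python
def parse_morph_call(call):
    """Split '{{!lang:phrase|param|...}}' into its parts."""
    if not (call.startswith("{{!") and call.endswith("}}")):
        raise ValueError("not a morph call")
    # Everything before the first ':' is the optional source language.
    lang, _, rest = call[3:-2].partition(":")
    phrase, *params = rest.split("|")
    words = phrase.split()
    positional = [p for p in params if "=" not in p]
    patterned = dict(p.split("=", 1) for p in params if "=" in p)
    return {
        "lang": lang or None,                        # None: inherit from context
        "words": words,
        "by_position": list(zip(words, positional)),  # word i gets directives i
        "by_pattern": patterned,                     # pattern -> directives
        "is_phrase": len(words) > 1,                 # would trigger a rewrite
    }
```

For example, `parse_morph_call("{{!no:Aurdal|Alvdalas}}")` yields the language `no`, the single word `Aurdal` and one positional directive.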

Directives

The directives are pairs of operators and part-of-speech tags. The classes are noun (N), verb (vbmod, vbser, vbhaver), adjective (Adj) and so forth.[1] Note that the tags differ somewhat between tools. Part-of-speech tags can also be clustered with parentheses, and this happens implicitly if an example word is used: when the word is analyzed, the resulting tags are clustered. Every word that isn't recognized as a part-of-speech tag is analyzed to produce tags, as is every word enclosed within string delimiters.

The operators are:

+ tag
Add the following part-of-speech tags unconditionally to the set during synthesis. (Default)
- tag
Remove the following part-of-speech tags unconditionally from the set during synthesis.
~ tag
The following part-of-speech tags will preferably be added or removed during synthesis.

The last observed operator takes precedence.
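
The operator rules can be sketched as below. The function names `parse_directives` and `resolve` are hypothetical, and the `analyses` table stands in for a real morphological analyzer. The sketch assumes that `~` is a softening prefix that can be combined with `+` or `-` (as in `~+N`) and defaults to `+` on its own; removed tags are simply dropped from the result.

```python
import re

# An optional '~', an optional sign, then a parenthesised cluster or a word.
TOKEN = re.compile(r"(~?)([+-]?)(?:\(([^)]*)\)|([^\s+\-~()]+))")


def parse_directives(text, analyses=None):
    """Return (soft, sign, tag) triples in the order they were written."""
    analyses = analyses or {}
    triples = []
    for match in TOKEN.finditer(text):
        soft = match.group(1) == "~"
        sign = match.group(2) or "+"          # '+' is the default operator
        if match.group(3) is not None:
            tags = match.group(3).split()     # explicit cluster: (N Prop ...)
        else:
            word = match.group(4)
            # An example word is replaced by its analysis (toy lookup).
            tags = analyses.get(word, [word])
        triples.extend((soft, sign, tag) for tag in tags)
    return triples


def resolve(triples):
    """Last observed operator wins; return (required, preferred) tags."""
    state = {}
    for soft, sign, tag in triples:
        state[tag] = (soft, sign)
    required = [t for t, (soft, sign) in state.items() if sign == "+" and not soft]
    preferred = [t for t, (soft, sign) in state.items() if sign == "+" and soft]
    return required, preferred
```

With `analyses = {"Alvdalas": ["N", "Prop", "Plc", "Sg", "Loc"]}`, resolving `Alvdalas-Loc+Gen` leaves the required tags N, Prop, Plc, Sg and Gen.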

Example

The calls below should produce the results shown when they are evaluated inside the Northern Sami Wikipedia.

If we write a call like

{{!no:Aurdal|Alvdalas}}
  → {{!no:Aurdal|+(N Prop Plc Sg Loc)}}
  → {{!no:Aurdal|+N+Prop+Plc+Sg+Loc}}
  → Aurdalas

we simply say "translate the Norwegian Aurdal like the Northern Sami form Alvdalas". The word Alvdalas will then be analyzed to produce a more complete form, and this form will be used to produce Aurdalas. Sometimes the results from the analysis will be insufficient and we will have to refine them by adding or removing switches. This can be done like this

{{!no:Aurdal|Alvdalas-Loc+Gen}}
  → {{!no:Aurdal|+(N Prop Plc Sg Loc)-Loc+Gen}}
  → {{!no:Aurdal|+N+Prop+Plc+Sg+Gen}}
  → Aurdala

whereby Aurdal becomes Aurdala and not Aurdalas. In addition, there are times when we don't know whether we have a complete match. In those circumstances we want to get as close as possible to a given form, which could be given by an example word. We can write this as

{{!no:Aurdal|~Alvdalas}}
  → {{!no:Aurdal|~(N Prop Plc Sg Loc)}}
  → {{!no:Aurdal|~+N~+Prop~+Plc~+Sg~+Loc}}
  → Aurdalas

Patterns

The patterns are also pairs of operators and part of speech tags.

The operators are:

+ tag
The following part of speech tags must match the word.
- tag
The following part of speech tags must not match the word.
~ tag
The following part of speech tags will be preferred if it is possible to match the word with them.
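
A minimal sketch of how such a pattern could select among alternate tag sets, assuming the pattern has already been split into its `+`, `-` and `~` parts. The names `match_pattern` and `best_match` are hypothetical; here the `~` tags only rank the candidates that survive the hard constraints.

```python
def match_pattern(tagset, must=(), must_not=(), prefer=()):
    """Return a preference score, or None if a hard constraint fails."""
    tags = set(tagset)
    if not set(must) <= tags:        # every '+' tag must be present
        return None
    if tags & set(must_not):         # no '-' tag may be present
        return None
    return len(tags & set(prefer))   # count of satisfied '~' tags


def best_match(tagsets, must=(), must_not=(), prefer=()):
    """Pick the matching tag set with the highest score (first wins ties)."""
    best, best_score = None, -1
    for tagset in tagsets:
        score = match_pattern(tagset, must, must_not, prefer)
        if score is not None and score > best_score:
            best, best_score = tagset, score
    return best
```

Given the two analyses of a form that is ambiguous between Gen and Acc, `best_match(..., must=["Sg"], prefer=["Acc"])` returns the accusative reading.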

Misc

Analyze

This function can be used to test for specific tags or combinations of tags. The initial parameter is the word to be analyzed, and the successive parameters are filters that are matched against the generated tag sets. A tag set is only passed on if it matches a filter.

Unconditionally pass on all of the sets

{{analyze:Alvdalas}} → N+Prop+Plc+Sg+Loc
{{analyze:Alvdala}} → N+Prop+Plc+Sg+Gen / N+Prop+Plc+Sg+Acc

The form Alvdalas will produce one tagset, while Alvdala will produce two alternate tagsets.

Conditionally pass on if Loc exists in the set (this uses positional matching)

{{analyze:Alvdalas|Loc}} → N+Prop+Plc+Sg+Loc
{{analyze:Alvdala|Loc}} →

Only the tagset for Alvdalas matches and will be returned.

Conditionally pass on if Gen exists in the set

{{analyze:Alvdalas|Gen}} →
{{analyze:Alvdala|Gen}} → N+Prop+Plc+Sg+Gen

Only one of the tagsets for Alvdala matches and will be returned.

Conditionally pass on if both Sg and Gen exist in the same set

{{analyze:Alvdalas|Sg Gen}} →
{{analyze:Alvdala|Sg Gen}} → N+Prop+Plc+Sg+Gen

Only one of the tagsets for Alvdala matches and will be returned.

Conditionally pass on if both Sg and Gen exist in the first set and both Sg and Loc exist in the second

{{analyze:Alvdalas girku|Sg Gen|Sg Loc}} →
{{analyze:Alvdala girkus|Sg Gen|Sg Loc}} → N+Prop+Plc+Sg+Gen N+Sg+Loc

A variation is not just to pass the tag set on but to rewrite the matched tags into something else. The left-hand side of a parameter is matched, and all matches are replaced by the right-hand side. The result is then passed on for further processing.

Pass on and rewrite if Sg exists in the set

{{analyze:Alvdalas|Sg=Pl}} → N+Prop+Plc+Pl+Loc

Pass on and rewrite if Sg exists in the set, and filter once more if both Pl and Loc exist in the resulting set

{{analyze:Alvdalas|Sg=Pl|Pl Loc}} → N+Prop+Plc+Pl+Loc

The first filter found that matches, even if it rewrites the tag set, will be returned.
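
The single-word filtering and rewriting behaviour can be sketched as follows. The `ANALYSES` table is a toy stand-in for a real analyzer, and `apply_filter` and `analyze` are hypothetical helper names. The sketch follows the first-match rule stated above and does not cover positional matching for phrases.

```python
# Toy analyses, copied from the examples in this section.
ANALYSES = {
    "Alvdalas": [["N", "Prop", "Plc", "Sg", "Loc"]],
    "Alvdala": [["N", "Prop", "Plc", "Sg", "Gen"],
                ["N", "Prop", "Plc", "Sg", "Acc"]],
}


def apply_filter(tagset, filt):
    """Return the (possibly rewritten) tag set, or None if filt fails."""
    lhs, sep, rhs = filt.partition("=")
    wanted = lhs.split()
    if not all(tag in tagset for tag in wanted):
        return None
    if not sep:
        return list(tagset)               # plain filter: pass on unchanged
    out, inserted = [], False
    for tag in tagset:
        if tag in wanted:
            if not inserted:              # put the replacement in once
                out.extend(rhs.split())
                inserted = True
        else:
            out.append(tag)
    return out


def analyze(word, *filters):
    """Return the tag sets that pass, joined the way the examples show."""
    results = []
    for tagset in ANALYSES.get(word, []):
        if not filters:
            results.append("+".join(tagset))
            continue
        for filt in filters:
            rewritten = apply_filter(tagset, filt)
            if rewritten is not None:     # first matching filter wins
                results.append("+".join(rewritten))
                break
    return " / ".join(results)
```

Calling `analyze("Alvdalas", "Sg=Pl")` returns `N+Prop+Plc+Pl+Loc`, and `analyze("Alvdala", "Loc")` returns an empty string.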

Synthesize

This function can be used to test for specific inflected forms. The initial parameter is the word to be analyzed, and the successive parameters are filters that are matched against the generated tag sets. A synthesized word is only passed on if its tags match a filter.

{{synthesize:Alvdalas}} → Alvdalas
{{synthesize:Alvdalas|Loc}} → Alvdalas
{{synthesize:Alvdalas|Gen}} →

A variation is not just to pass on the initial word but to rewrite the matched tags into something else and then synthesize a new inflected form. The left-hand side of a parameter is matched, and all matches are replaced by the right-hand side. The result is then passed on for further processing.

{{synthesize:Alvdalas}} → Alvdalas
{{synthesize:Alvdalas|Loc=Gen}} → Alvdala
{{synthesize:Alvdalas|Gen=Nom}} →

The first filter found that matches, even if it rewrites the tag set and synthesizes a new inflected form, will be returned.
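
The analyze, filter and regenerate round trip can be sketched with a toy lexicon. Everything here is an assumption made for illustration: the `LEXICON` entries, the lemma `Alvdal`, the function name `synthesize`, and the restriction to single-tag rewrites. A real implementation would call a transducer in both directions.

```python
# Toy lexicon: surface form -> (lemma, tags). Assumed data, one reading each.
LEXICON = {
    "Alvdalas": ("Alvdal", ("N", "Prop", "Plc", "Sg", "Loc")),
    "Alvdala": ("Alvdal", ("N", "Prop", "Plc", "Sg", "Gen")),
}
# Inverse table used for generation: (lemma, tags) -> surface form.
GENERATE = {analysis: surface for surface, analysis in LEXICON.items()}


def synthesize(word, *filters):
    """Analyze word, apply the first matching filter, then regenerate."""
    if word not in LEXICON:
        return ""
    lemma, tags = LEXICON[word]
    if not filters:
        return GENERATE.get((lemma, tags), "")
    for filt in filters:
        lhs, sep, rhs = filt.partition("=")
        wanted = lhs.split()
        if not all(tag in tags for tag in wanted):
            continue                      # this filter does not match
        if sep:                           # rewrite form, e.g. 'Loc=Gen'
            tags = tuple(rhs.strip() if tag in wanted else tag for tag in tags)
        return GENERATE.get((lemma, tags), "")
    return ""                             # no filter matched
```

With this toy data, a filter on a tag the word does not carry returns an empty string, while a rewrite of its case tag returns the other inflected form in the lexicon.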

Not solved

  • Structural changes will not be available...
  • Gender shift will be a problem, as will time shift, amount, relations, possessions and names
  • The tagset must probably be localized
  • How to explicitly specify a transducer
  • How to explicitly specify a language context