User:OrenBochman/Search/NLP Tools/Morphology

Tasks

Data set: <Word,Language,POS>

Generate via XPath or a jawk extractor script
Input: Wiktionary XML Dump
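A minimal extraction sketch in Java, assuming the standard MediaWiki dump layout (<page>/<title>/<revision>/<text>) and the English Wiktionary convention of ==Language== and ===Part of speech=== section headings; the class name and the namespace filter are illustrative assumptions, not part of the original plan.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Streams a Wiktionary XML dump and prints word,language,POS triples as CSV. */
    public class WordLangPosExtractor {

        // ==Language== / ===POS=== wikitext headings (en.wiktionary layout; an assumption).
        private static final Pattern HEADING =
                Pattern.compile("^(==+)([^=].*?)\\1\\s*$", Pattern.MULTILINE);

        public static void main(String[] args) throws Exception {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            StringBuilder title = new StringBuilder(), text = new StringBuilder();
            StringBuilder current = null;                       // buffer being filled right now
            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String tag = xml.getLocalName();
                        if ("title".equals(tag)) { title.setLength(0); current = title; }
                        else if ("text".equals(tag)) { text.setLength(0); current = text; }
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (current != null) current.append(xml.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("title".equals(xml.getLocalName()) || "text".equals(xml.getLocalName()))
                            current = null;
                        else if ("page".equals(xml.getLocalName()) && title.indexOf(":") < 0)
                            emit(title.toString(), text.toString());   // crude non-article filter
                        break;
                }
            }
        }

        /** Every ===POS=== heading is attributed to the most recent ==Language== heading. */
        private static void emit(String word, String wikitext) {
            String language = null;
            Matcher m = HEADING.matcher(wikitext);
            while (m.find()) {
                if (m.group(1).length() == 2) language = m.group(2).trim();
                else if (m.group(1).length() == 3 && language != null)
                    System.out.println(word + "," + language + "," + m.group(2).trim());
            }
        }
    }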

Possible Output Built From This Dataset

Each entry lists the data, the deliverable, the tool used to build it (via), and notes:

  • <Unicode-Letter,Language>: distribution, via unicodeCounterModel (sketched below)
  • <Word,Language>: distribution, via Mallet
  • <token,POS>|language: distribution, via Mallet
  • language|token: discriminator, via Mallet; offers all choices
  • language|<token>: discriminator, via dynamic programming (forward-backward); offers best choice
  • POS|Token: pre-tagger, via Mallet; offers all choices
  • POS|<token,language>: discriminator, via dynamic programming (forward-backward); offers best choice
  • MD5|token: calculator, via a Java class; store in the DB for HTML dump lookup
  • token|{NGram}: distribution, via SRILM or a Lucene extension; all ngrams
  • token|{HNGram}: distribution, via a customized Lucene extension; all hngrams
  • d(X|token): spelling checker, via Lucene ngram distance
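A minimal sketch of the unicodeCounterModel row, assuming the <Word,Language,POS> dataset is a plain word,language,POS CSV; the class layout and the query in main are illustrative assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    /** Builds a per-language distribution over Unicode letters from a word,language,POS CSV. */
    public class UnicodeCounterModel {
        // language -> (code point -> count)
        private final Map<String, Map<Integer, Long>> counts = new HashMap<>();

        public void add(String word, String language) {
            Map<Integer, Long> byLetter = counts.computeIfAbsent(language, k -> new HashMap<>());
            word.codePoints()
                .filter(Character::isLetter)                 // keep letters only
                .map(Character::toLowerCase)
                .forEach(cp -> byLetter.merge(cp, 1L, Long::sum));
        }

        /** P(letter | language), zero for unseen letters or languages. */
        public double probability(int codePoint, String language) {
            Map<Integer, Long> byLetter = counts.getOrDefault(language, Map.of());
            long total = byLetter.values().stream().mapToLong(Long::longValue).sum();
            return total == 0 ? 0.0 : byLetter.getOrDefault(codePoint, 0L) / (double) total;
        }

        public static void main(String[] args) throws Exception {
            UnicodeCounterModel model = new UnicodeCounterModel();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",", 3);      // word,language,POS
                    if (cols.length >= 2) model.add(cols[0], cols[1]);
                }
            }
            System.out.println(model.probability('é', "French"));   // hypothetical query
        }
    }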

Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same is true for models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

Data set: <Language,POS,Word,Forms>

Generate via the <Hyper-Ngram closure,residue> clustering algorithm.
Input: Wiktionary HTML Dump

a lemma is:

  • LEM1: a citation form and a list of inflected forms
  • LEM2: a citation form and a list of <inflected form, morphological state> pairs
  • LEM3: a citation form and a list of <inflected form, morphological state, frequency count> triples
  • a good lemma analyzer is essential for good recall and precision in morphologically rich languages.
  • a single-token lemma analyzer could be built from a lemma database (see the sketch after this list).
  • better results can be achieved by considering adjacent tokens or other linguistic considerations.
  • a learning component would be useful for processing unknown tokens, in one of two ways:
    • the description used in the database could be written so that it accepts rare inflections that follow known regular transformations (template/residue, stem/affix, etc.);
    • it could try to assign unknown lemmas to a subset of forms, eliminating options as more evidence appears (maximum entropy model).
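A minimal sketch of the single-token analyzer in the LEM1 sense (citation form plus inflected forms), assuming the lemma database can be loaded as lemma-to-forms pairs; the class and method names are illustrative. Ambiguity is handled by returning every candidate citation form.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Single-token lemma analyzer: inverts a LEM1-style database (citation form -> inflected forms). */
    public class LemmaAnalyzer {
        // surface form -> candidate citation forms (several lemmas may share an inflected form)
        private final Map<String, Set<String>> formToLemmas = new HashMap<>();

        public void addLemma(String citationForm, List<String> inflectedForms) {
            formToLemmas.computeIfAbsent(citationForm.toLowerCase(), k -> new HashSet<>()).add(citationForm);
            for (String form : inflectedForms)
                formToLemmas.computeIfAbsent(form.toLowerCase(), k -> new HashSet<>()).add(citationForm);
        }

        /** All citation forms the token may belong to; empty set for unknown tokens. */
        public Set<String> analyze(String token) {
            return formToLemmas.getOrDefault(token.toLowerCase(), Set.of());
        }

        public static void main(String[] args) {
            LemmaAnalyzer analyzer = new LemmaAnalyzer();
            analyzer.addLemma("sing", List.of("sings", "sang", "sung", "singing"));
            System.out.println(analyzer.analyze("sang"));   // prints [sing]
        }
    }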

<Hyper-Ngram closure,residue> clustering algorithm

  • the algorithm would adjust its own parameters according to the morphology detected.
  • the algorithm should bootstrap morphological states from a manually tagged template.
  • the algorithm would bootstrap morphological data from tables of morphological forms (supervised learning).
  • the algorithm would be able to generalize unknown terms to their full lemma form (unsupervised learning).
  • it would collect ngrams and hngrams.
  • words would be clustered by (h)ngram similarity (k nearest neighbours); see the sketch after this list.
  • hngram spans would be listed in order of increasing generality, or decreasing specificity
    (cf. local information and global entropy).
  • residues are extracted from the non-specific spans.
  • clustering can now proceed in earnest: two-dimensional clustering by residue and by the span's ngrams (template similarity).
  • to further increase the generality of such a model, one replaces consonants and vowels with equivalence classes (clustering by phonemic type).
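A minimal sketch of the nearest-k-neighbours step, assuming character ngrams over boundary-padded words and Jaccard similarity between ngram sets; the padding characters and the choice of Jaccard as the similarity measure are assumptions, not part of the original description.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Clusters words by character-ngram similarity: for each word, find its k nearest neighbours. */
    public class NgramNeighbours {

        /** Character ngrams of a word, padded with boundary markers so affixes are visible. */
        static Set<String> ngrams(String word, int n) {
            String padded = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= padded.length(); i++) grams.add(padded.substring(i, i + n));
            return grams;
        }

        /** Jaccard similarity between two ngram sets (assumed measure). */
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : inter.size() / (double) union.size();
        }

        /** The k vocabulary words most similar to the query word. */
        static List<String> nearest(String word, List<String> vocabulary, int k, int n) {
            Set<String> grams = ngrams(word, n);
            List<String> others = new ArrayList<>(vocabulary);
            others.remove(word);
            others.sort(Comparator.comparingDouble((String w) -> -jaccard(grams, ngrams(w, n))));
            return others.subList(0, Math.min(k, others.size()));
        }

        public static void main(String[] args) {
            List<String> words = List.of("walk", "walked", "walking", "talked", "banana");
            System.out.println(nearest("walks", words, 3, 3));  // inflection-mates of "walks" rank first
        }
    }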


Possible Output Built From This Data-set

Each entry lists the data, the deliverable, the tool used to build it (via), and notes:

  • <Lemma,Language>: distribution, via Mallet
  • <Lemma,POS>|language: distribution, via Mallet
  • language|lemma: discriminator, via Mallet; offers all choices
  • language|<lemma>: discriminator, via dynamic programming (forward-backward); offers best choice
  • POS|Token: pre-tagger, via Mallet; offers all choices
  • POS|<token,language>: discriminator, via dynamic programming (forward-backward); offers best choice
  • MD5|token: calculator, via a Java class; store in the DB for HTML dump lookup (sketched below)
  • token|{NGram}: distribution, via SRILM or a Lucene extension; all ngrams
  • token|{HNGram}: distribution, via a customized Lucene extension; all hngrams
  • d(X|token): spelling checker, via Lucene ngram distance
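The MD5|token row calls for a Java class; a minimal version using the standard java.security.MessageDigest API is sketched here. The class name and the idea of using the hex digest as the DB lookup key are assumptions.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    /** Computes the MD5 key under which a token's entry can be stored and looked up for the HTML dump. */
    public final class TokenMd5 {

        public static String md5Hex(String token) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(md5Hex("morphology"));   // stable key for the lookup table
        }
    }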

Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same is true for models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

<Word, Language, POS, Count>.csv

  • Tool: XPath or Awk extractor script
  • Input: XX.Wikipedia XML Dump
  • Deliverable models:
    • Word distribution (see the sketch below)
    • Word NGram distribution

Note: It would be better to extract lemma statistics.
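A minimal sketch of the word-distribution deliverable, assuming the CSV columns are word,language,POS,count; the word-ngram counterpart would aggregate counts over word sequences taken directly from the dump rather than from this per-word CSV. The query in main is hypothetical.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    /** Word distribution per language from a word,language,POS,count CSV. */
    public class WordDistribution {
        public static void main(String[] args) throws Exception {
            Map<String, Map<String, Long>> counts = new HashMap<>();   // language -> word -> count
            Map<String, Long> totals = new HashMap<>();                // language -> total tokens
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] c = line.split(",");                      // word,language,POS,count
                    if (c.length < 4) continue;
                    long n = Long.parseLong(c[3].trim());
                    counts.computeIfAbsent(c[1], k -> new HashMap<>()).merge(c[0], n, Long::sum);
                    totals.merge(c[1], n, Long::sum);
                }
            }
            // Relative frequency P(word | language); hypothetical query values:
            String lang = "English", word = "the";
            long n = counts.getOrDefault(lang, Map.of()).getOrDefault(word, 0L);
            System.out.println((double) n / Math.max(1, totals.getOrDefault(lang, 0L)));
        }
    }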

Lang,Word,POS,Translations,Lang

  • Tool: XPath or Awk extractor script
  • Input: Wiktionary XML Dump
  • Language Word and POS Statistics
  • Deliverable: Language Confidence Tagger
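A minimal sketch of the language-confidence tagger deliverable, assuming per-language word counts from the statistics above are already in memory; the add-one smoothing and the implicit uniform prior over languages are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    /** Assigns a confidence P(language | word) from per-language word counts (add-one smoothing, uniform prior). */
    public class LanguageConfidenceTagger {
        private final Map<String, Map<String, Long>> wordCounts;   // language -> word -> count
        private final Map<String, Long> totals = new HashMap<>();  // language -> total count

        public LanguageConfidenceTagger(Map<String, Map<String, Long>> wordCounts) {
            this.wordCounts = wordCounts;
            wordCounts.forEach((lang, words) ->
                    totals.put(lang, words.values().stream().mapToLong(Long::longValue).sum()));
        }

        /** language -> confidence, normalized to sum to 1 over the known languages. */
        public Map<String, Double> tag(String word) {
            Map<String, Double> scores = new HashMap<>();
            double sum = 0.0;
            for (String lang : wordCounts.keySet()) {
                long c = wordCounts.get(lang).getOrDefault(word, 0L);
                double score = (c + 1.0) / (totals.get(lang) + wordCounts.get(lang).size());  // add-one
                scores.put(lang, score);
                sum += score;
            }
            final double total = sum;
            scores.replaceAll((lang, s) -> s / total);
            return scores;
        }

        public static void main(String[] args) {
            Map<String, Map<String, Long>> counts = new HashMap<>();
            counts.put("English", Map.of("house", 50L, "the", 900L));   // toy counts
            counts.put("German", Map.of("haus", 40L, "die", 800L));
            System.out.println(new LanguageConfidenceTagger(counts).tag("house"));
        }
    }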

Hyper-Ngram Analysis

Store normal and staggered ngrams per word. Analyze ngram distribution.
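A minimal sketch of collecting both kinds of ngrams per word, assuming "staggered" means an ngram with a single wildcard gap inside the window (a character skip-gram); that reading, and the '_' gap marker, are assumptions.

    import java.util.HashSet;
    import java.util.Set;

    /** Collects normal character ngrams and staggered (one-gap) ngrams per word. */
    public class HyperNgrams {

        /** Contiguous character ngrams of the boundary-padded word. */
        static Set<String> normal(String word, int n) {
            String w = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= w.length(); i++) grams.add(w.substring(i, i + n));
            return grams;
        }

        /** Staggered ngrams: one character inside an (n+1)-wide window is replaced by a '_' wildcard. */
        static Set<String> staggered(String word, int n) {
            String w = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n + 1 <= w.length(); i++) {
                String window = w.substring(i, i + n + 1);
                for (int gap = 1; gap < n; gap++)                   // keep the edges, drop one inside
                    grams.add(window.substring(0, gap) + "_" + window.substring(gap + 1));
            }
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(normal("sang", 3));     // contains ^sa, san, ang, ng$
            System.out.println(staggered("sang", 3));  // contains s_ng, which generalizes over the vowel (sing/sang/sung)
        }
    }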


Cross-language phonotactics distribution

These per-language ngram distributions should map to a cross-lingual phonotactic distribution.

Native/Loan Language distributions

Since each language uses a reduced phonetic subset, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

  • unavailable phonemes and their combinations
  • native phonemes: phonemes and combinations found in the core language
  • loan phonemes: used exclusively in words of foreign origin
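A minimal sketch of the three-way grouping, assuming we only have per-language symbol (grapheme) counts as a proxy for phonemes, split into counts over native-origin words and counts over words marked as foreign origin (e.g. via etymology sections); the class, the enum, and the toy example are assumptions.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Splits a language's symbol inventory into native, loan-only, and unavailable groups. */
    public class PhonemeGroups {

        enum Group { NATIVE, LOAN, UNAVAILABLE }

        /**
         * @param nativeCounts  symbol counts over words of native origin
         * @param loanCounts    symbol counts over words marked as foreign origin
         * @param alphabet      the full symbol inventory under consideration
         */
        static Map<Character, Group> classify(Map<Character, Long> nativeCounts,
                                              Map<Character, Long> loanCounts,
                                              Iterable<Character> alphabet) {
            Map<Character, Group> groups = new HashMap<>();
            for (Character c : alphabet) {
                long inNative = nativeCounts.getOrDefault(c, 0L);
                long inLoan = loanCounts.getOrDefault(c, 0L);
                if (inNative == 0 && inLoan == 0) groups.put(c, Group.UNAVAILABLE);
                else if (inNative == 0) groups.put(c, Group.LOAN);   // appears only in loanwords
                else groups.put(c, Group.NATIVE);
            }
            return groups;
        }

        public static void main(String[] args) {
            // Toy English-like example: 'k' native, 'ñ' only in loanwords, 'þ' unused.
            Map<Character, Long> nat = Map.of('k', 1200L);
            Map<Character, Long> loan = Map.of('ñ', 8L, 'k', 30L);
            System.out.println(classify(nat, loan, List.of('k', 'ñ', 'þ')));
        }
    }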