User:OrenBochman/Search/NLP Tools/Morphology

Possible Output Built From This Dataset
Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same would be true for models which do not have token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

Data set: 
a lemma is:
 * LEM1 a citation form and a list of inflected forms
 * LEM2 a citation form and a list of 
 * LEM3 a citation form and a list of 


 * A good lemma analyzer is essential for improving recall and precision in morphologically rich languages.

The morphological analyzer is built from a lemma database and a generalizing component.
 * The algorithm would adjust its own parameters according to the morphology detected.
 * The algorithm should bootstrap morphological states via a manually tagged template.
 * The algorithm would bootstrap morphological data from a table of morphological forms (supervised learning).
 * The algorithm would be able to generalize unknown terms to their full lemma form (unsupervised learning).
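
The split between a lemma database and a generalizing component can be sketched roughly as follows. This is only a minimal illustration, not the intended design: the toy `LEMMA_DB`, the function names, and the suffix-stripping heuristic are all assumptions made for the example, and a real generalizer would need to handle prefixes, infixes and stem changes.

```python
from collections import Counter

# Hypothetical toy lemma database: citation form -> list of inflected forms.
LEMMA_DB = {
    "walk": ["walks", "walked", "walking"],
    "jump": ["jumps", "jumped", "jumping"],
    "talk": ["talks", "talked", "talking"],
}

def build_form_index(lemma_db):
    """Map every known surface form (and each citation form) to its lemma."""
    index = {}
    for lemma, forms in lemma_db.items():
        index[lemma] = lemma
        for form in forms:
            index[form] = lemma
    return index

def learn_suffix_rules(lemma_db):
    """Count the suffixes seen on forms that simply extend their lemma."""
    rules = Counter()
    for lemma, forms in lemma_db.items():
        for form in forms:
            if form.startswith(lemma) and len(form) > len(lemma):
                rules[form[len(lemma):]] += 1
    return rules

def lemmatize(word, index, rules):
    """Look the word up first; otherwise strip the longest learned suffix."""
    if word in index:
        return index[word]
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word  # nothing matched: return the word unchanged

index = build_form_index(LEMMA_DB)
rules = learn_suffix_rules(LEMMA_DB)
print(lemmatize("walked", index, rules))   # known form, found in the database
print(lemmatize("kicking", index, rules))  # unknown form, handled by a suffix rule
```

The lookup covers terms present in the database, and the learned suffix rules stand in for the generalizing component on unknown terms.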

.csv

 * Tool: Xpath OR Awk extractor script
 * Input: XX.Wikipedia XML Dump
 * Deliverable Models:
 * Word distribution
 * Word NGram distribution

Note: It would be better to extract lemma statistics.
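
As a rough sketch of the two deliverable models, word and word n-gram distributions can be counted from already extracted plain text. The tokenizer here is a deliberately naive assumption (lowercased `\w+` runs), and the function names are hypothetical. Extraction from the XML dump itself is not shown.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def word_distribution(tokens):
    """Frequency of each token."""
    return Counter(tokens)

def ngram_distribution(tokens, n=2):
    """Frequency of each run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize("The cat saw the cat")
print(word_distribution(tokens).most_common(2))
print(ngram_distribution(tokens, n=2).most_common(1))
```

Swapping the tokens for lemmas before counting would give the lemma statistics mentioned in the note.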

Lang,Word,POS,Translations,Lang

 * Tool: Xpath OR Awk extractor script
 * Input: Wiktionary XML Dump
 * Language Word and POS Statistics
 * Deliverable: Language Confidence Tagger
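
A minimal sketch of how per-language word and POS statistics might be derived from such a CSV, and how a simple language confidence score could be read off them. It assumes the first three columns are language, word and POS, as in the header above. The sample rows, the function names, and the scoring rule (share of entries per language for a given word) are illustrative assumptions only.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows in the Lang,Word,POS,Translations layout.
SAMPLE = """en,chat,noun,
en,chat,verb,
fr,chat,noun,
fr,chien,noun,
"""

def load_rows(handle):
    """Read (lang, word, pos) triples, skipping short or empty rows."""
    return [(r[0], r[1], r[2]) for r in csv.reader(handle) if len(r) >= 3]

def pos_statistics(rows):
    """Count entries per POS for each language."""
    stats = defaultdict(Counter)
    for lang, _word, pos in rows:
        stats[lang][pos] += 1
    return stats

def language_confidence(word, rows):
    """Share of the word's entries that belong to each language."""
    counts = Counter(lang for lang, w, _pos in rows if w == word)
    total = sum(counts.values())
    return {lang: c / total for lang, c in counts.items()} if total else {}

rows = load_rows(io.StringIO(SAMPLE))
print(dict(pos_statistics(rows)["en"]))
print(language_confidence("chat", rows))
```

A tagger built this way would be biased toward languages with more Wiktionary entries, so weighting by corpus frequency would likely be needed.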

Hyper-Ngram Analysis
Store normal and staggered ngrams per word. Analyze ngram distribution.
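
The following sketch shows one way to store both kinds of n-gram per word. "Staggered" is not defined in these notes. The code assumes it means character n-grams that skip a fixed number of characters between positions (skip-grams), which may not be the intended meaning. The `^` and `$` boundary markers and the function names are also assumptions.

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Contiguous character n-grams, with ^ and $ marking word boundaries."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def staggered_ngrams(word, n=2, gap=1):
    """n-grams whose characters are `gap` positions apart (assumed meaning)."""
    padded = f"^{word}$"
    step = gap + 1
    span = (n - 1) * step + 1  # width of the window covered by one n-gram
    return [padded[i:i + span:step] for i in range(len(padded) - span + 1)]

def ngram_distribution(words, extractor):
    """Aggregate n-gram counts over a word list."""
    dist = Counter()
    for word in words:
        dist.update(extractor(word))
    return dist

print(char_ngrams("cat"))
print(staggered_ngrams("cat"))
print(ngram_distribution(["cat", "cab"], char_ngrams).most_common(2))
```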

Cross-language Phonotactics Distribution
Should map to a cross-lingual phonotactic distribution.

Native/Loan Language distributions
Since each language uses a reduced phonetic subset, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

 * unavailable phonemes and their combinations.
 * native phonemes: phonemes and combinations in the core language.
 * loan phonemes (used exclusively in words of foreign origin).
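
The three groups can be expressed as set operations, sketched below for single phonemes only. The sketch assumes each word is already available as a sequence of phoneme symbols and is already tagged as native or loan. Neither assumption is provided by the dataset as described, and the inventory and words used here are toy values, not real language data.

```python
def classify_phonemes(inventory, native_words, loan_words):
    """Split a cross-language phoneme inventory into the three groups."""
    native = {p for word in native_words for p in word}
    # Loan phonemes occur only in words of foreign origin.
    loan = {p for word in loan_words for p in word} - native
    unavailable = set(inventory) - native - loan
    return {"native": native, "loan": loan, "unavailable": unavailable}

# Toy data: abstract symbols standing in for phonemes.
inventory = {"a", "b", "k", "t", "x", "z"}
native_words = [("k", "a", "t"), ("t", "a", "b")]
loan_words = [("z", "a", "k")]

groups = classify_phonemes(inventory, native_words, loan_words)
for name in ("native", "loan", "unavailable"):
    print(name, sorted(groups[name]))
```

Phoneme combinations could be handled the same way by feeding n-grams of phonemes instead of single symbols.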