User:OrenBochman/Search/NLP Tools/Morphology

Possible Output Built From This Dataset
Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same holds for models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

Data set: 
a lemma is:
 * LEM1 a citation form and a list of inflected forms
 * LEM2 a citation form and a list of 
 * LEM3 a citation form and a list of 


 * a good lemma analyzer is essential for good recall and precision in morphologically rich languages.
 * a single-token lemma analyzer could be built from a lemma database.
 * better results can be achieved by considering adjacent tokens or other linguistic considerations.
 * a learning component would be useful for processing unknown tokens, in one of two ways:
 * the description used in the database could be general enough to accept rare inflections that follow a known regular transformation (template/residue, stem/affix, etc.).
 * it could try to assign unknown lemmas to a subset of forms, eliminating options as more evidence appears (maximum entropy model).
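
As an illustration of the single-token analyzer mentioned above, here is a minimal Python sketch. It assumes the LEM1 representation (a citation form plus a list of inflected forms); the toy database and the names `LEMMA_DB`, `build_index` and `analyze` are hypothetical. It only does table lookup, so unknown tokens return no candidates and the learning component is not covered.

```python
from collections import defaultdict

# Hypothetical toy lemma database: citation form -> list of inflected forms.
LEMMA_DB = {
    "walk": ["walk", "walks", "walked", "walking"],
    "leaf": ["leaf", "leaves"],
    "leave": ["leave", "leaves", "left", "leaving"],
}


def build_index(lemma_db):
    """Invert the database: inflected form -> set of candidate lemmas."""
    index = defaultdict(set)
    for lemma, forms in lemma_db.items():
        index[lemma].add(lemma)  # the citation form analyzes to itself
        for form in forms:
            index[form].add(lemma)
    return index


def analyze(token, index):
    """Return the sorted candidate lemmas for one token (empty if unknown)."""
    return sorted(index.get(token.lower(), ()))


INDEX = build_index(LEMMA_DB)
print(analyze("leaves", INDEX))  # ambiguous: two candidate lemmas
print(analyze("Walked", INDEX))
```

The ambiguous case ("leaves") is where adjacent tokens or other linguistic evidence would be needed to pick one lemma.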

Clustering algorithm

 * the algorithm would adjust its own parameters according to the morphology detected.
 * the algorithm should bootstrap morphological states via a manually tagged template.
 * the algorithm would bootstrap morphological data from a table of morphological forms (supervised learning).
 * the algorithm would be able to generalize from unknown terms to their full lemma form (unsupervised learning).
 * it would collect ngrams and hngrams (hyper-ngrams).
 * words would be clustered by (h)ngram similarity (k nearest neighbors).
 * hngram spans would be listed in order of increasing generality, i.e. decreasing specificity (cf. local information and global entropy).
 * residues would be extracted from non-specific spans.
 * clustering can now proceed in earnest: two-dimensional clustering by residue and by the span's ngrams (template similarity).
 * to further increase the generality of such a model, one replaces consonants and vowels with equivalence classes (clustering by phonemic types).
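
The similarity step in the list above can be sketched as follows. This is a minimal sketch, not the algorithm itself: it assumes plain character trigrams with boundary markers, Jaccard overlap as the similarity measure, and a crude orthographic consonant/vowel mapping as the "equivalence classes". All function names are hypothetical, and hngrams, spans and residues are not modelled.

```python
VOWELS = set("aeiou")  # assumption: a crude orthographic vowel set


def char_ngrams(word, n=3):
    """Set of character n-grams of the word, with ^ and $ as boundary marks."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(word, vocabulary, k=2, n=3):
    """Return the k vocabulary words most similar to `word` by n-gram overlap."""
    target = char_ngrams(word, n)
    scored = [
        (jaccard(target, char_ngrams(other, n)), other)
        for other in vocabulary
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def cv_skeleton(word):
    """Replace each letter by C or V to generalize over concrete segments."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower() if ch.isalpha())


VOCAB = ["walked", "talked", "walking", "table"]
print(nearest_neighbors("walked", VOCAB))
print(cv_skeleton("walked"), cv_skeleton("talked"))
```

Words that share a C/V skeleton ("walked", "talked") fall into the same template class even where their concrete ngrams differ.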

Possible Output Built From This Data-set

.csv

 * Tool: Xpath OR Awk extractor script
 * Input: XX.Wikipedia XML Dump
 * Deliverable Models:
 * Word distribution
 * Word NGram distribution
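
A rough Python sketch of the two deliverables, written in Python rather than the XPath or Awk tooling named above. It assumes the dump stores page text in `<text>` elements and that a simple `\w+` tokenizer is acceptable; wiki markup is not stripped, and the sample document and function names are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def iter_page_texts(xml_stream):
    """Yield the content of every <text> element, ignoring any XML namespace."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
            elem.clear()  # free memory; real dumps are large


def distributions(texts, n=2):
    """Count words and word n-grams over an iterable of page texts."""
    words, ngrams = Counter(), Counter()
    for text in texts:
        tokens = [t.lower() for t in TOKEN_RE.findall(text)]
        words.update(tokens)
        # zip over n shifted copies of the token list yields the n-grams
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return words, ngrams


# Hypothetical miniature dump, for illustration only.
SAMPLE = b"<mediawiki><page><revision><text>the cat saw the cat</text></revision></page></mediawiki>"
words, bigrams = distributions(iter_page_texts(io.BytesIO(SAMPLE)))
print(words.most_common(2))
print(bigrams.most_common(1))
```

The counters can then be written out with the `csv` module to produce the .csv deliverable.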

Note: It would be better to extract lemma statistics.

Lang,Word,POS,Translations,Lang

 * Tool: Xpath OR Awk extractor script
 * Input: Wiktionary XML Dump
 * Language Word and POS Statistics
 * Deliverable: Language Confidence Tagger
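
One possible reading of the language confidence tagger is a per-word relative frequency over languages. The sketch below assumes that reading. It uses a reduced, hypothetical header (`Lang,Word,POS`) and invented sample rows, and it counts dictionary entries rather than corpus frequencies.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows; a real file would come from the extractor script.
SAMPLE_CSV = """Lang,Word,POS
en,chat,verb
en,chat,noun
fr,chat,noun
fr,chien,noun
"""


def build_counts(csv_text):
    """Count, for each word, how many entries each language has for it."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["Word"].lower()][row["Lang"]] += 1
    return counts


def language_confidence(word, counts):
    """Return {language: share of entries} for the word ({} if unseen)."""
    by_lang = counts.get(word.lower())
    if not by_lang:
        return {}
    total = sum(by_lang.values())
    return {lang: n / total for lang, n in by_lang.items()}


COUNTS = build_counts(SAMPLE_CSV)
print(language_confidence("chat", COUNTS))
print(language_confidence("chien", COUNTS))
```

With real token frequency data the entry counts would be replaced by corpus counts, which is the reason the notes prefer models backed by a normalized corpus.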

Hyper-Ngram Analysis
Store normal and staggered ngrams per word. Analyze ngram distribution.
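
A minimal sketch of storing both kinds of ngram per word. The notes do not define "staggered", so this assumes it means ngrams whose characters are separated by a fixed gap (skip-grams); that interpretation, and the function names, are assumptions.

```python
from collections import Counter


def normal_ngrams(word, n=2):
    """Contiguous character n-grams of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def staggered_ngrams(word, n=2, gap=1):
    """n-grams that skip `gap` characters between picks (assumed meaning)."""
    step = gap + 1
    span = (n - 1) * step + 1  # number of characters one n-gram covers
    return [word[i:i + span:step] for i in range(len(word) - span + 1)]


def ngram_distribution(words, n=2, gap=1):
    """Aggregate normal and staggered n-gram counts over a word list."""
    normal, staggered = Counter(), Counter()
    for word in words:
        normal.update(normal_ngrams(word, n))
        staggered.update(staggered_ngrams(word, n, gap))
    return normal, staggered


print(normal_ngrams("walked"))
print(staggered_ngrams("walked"))
```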

Cross-language Phonotactics distribution
This should map to a cross-lingual phonotactic distribution.

Native/Loan Language distributions
Since each language uses a reduced phonetic subset, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

 * unavailable phonemes and their combinations.
 * native phonemes: phonemes and combinations in the core language.
 * loan phonemes (used exclusively in words of foreign origin).
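
The three-way split can be sketched with set operations. This assumes words are already transcribed as phoneme sequences and tagged as native or loan, and that a cross-language phoneme inventory is available. It covers single phonemes only, not their combinations, and the sample data is invented.

```python
def phoneme_groups(universal, native_words, loan_words):
    """Split a cross-language inventory into native, loan and unavailable sets.

    universal    -- set of all phonemes seen across languages
    native_words -- phoneme sequences for words of the core language
    loan_words   -- phoneme sequences for words of foreign origin
    """
    native = set().union(*native_words)
    # Loan phonemes occur only in words of foreign origin.
    loan = set().union(*loan_words) - native
    unavailable = set(universal) - native - loan
    return {"native": native, "loan": loan, "unavailable": unavailable}


# Hypothetical toy data.
UNIVERSAL = {"p", "b", "t", "k", "x", "ʒ", "a", "i"}
NATIVE_WORDS = [["p", "a", "t"], ["k", "i", "t"]]
LOAN_WORDS = [["ʒ", "a", "k"]]

GROUPS = phoneme_groups(UNIVERSAL, NATIVE_WORDS, LOAN_WORDS)
for name in ("native", "loan", "unavailable"):
    print(name, sorted(GROUPS[name]))
```

Running the same function over phoneme ngrams instead of single phonemes would extend the split to combinations.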