
Possible Output Built From This Dataset
Each of these models will require a script to build. If a much better solution incorporating currently unavailable feature vectors becomes available down the line, these simplistic models may be skipped to reduce production time. The same applies to models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains entries for extinct languages (e.g. Old English) and for languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

.csv

 * Tool: XPath or AWK extractor script (see the sketch after the note below)
 * Input: XX.wikipedia XML dump (XX = language code)
 * Deliverable Models:
   * Word distribution
   * Word n-gram distribution

Note: It would be better to extract lemma statistics.
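
A minimal sketch of such an extractor in Python, assuming a standard MediaWiki pages-articles dump. The file names, the export namespace version, and the letters-only tokenizer are all assumptions that would need adjusting per wiki:

```python
import csv
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical paths -- adjust to the actual dump and output locations.
DUMP_PATH = "enwiki-latest-pages-articles.xml"
OUT_PATH = "word_distribution.csv"

# MediaWiki export namespace; the exact version varies between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)  # letter sequences only

def iter_page_texts(dump_path):
    """Stream <text> elements from the dump without loading it into memory."""
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "text" and elem.text:
            yield elem.text
        elem.clear()  # free memory as we go

unigrams = Counter()
bigrams = Counter()
for text in iter_page_texts(DUMP_PATH):
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

with open(OUT_PATH, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    for word, count in unigrams.most_common():
        writer.writerow([word, count])
```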

Lang,Word,POS,Translations,Lang

 * Tool: XPath or AWK extractor script
 * Input: Wiktionary XML dump
 * Deliverable Models:
   * Language, word, and POS statistics
   * Language confidence tagger (sketched below)
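
A sketch of how the extracted per-language word counts could feed a language confidence tagger, here as a naive Bayes scorer with add-one smoothing. The (lang, word, count) input shape and the softmax-style confidence value are assumptions, not a specification from these notes:

```python
import math
from collections import defaultdict

def build_model(rows):
    """Turn hypothetical (lang, word, count) rows into per-language log-probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for lang, word, count in rows:
        counts[lang][word.lower()] += count
    model = {}
    for lang, words in counts.items():
        total = sum(words.values())
        vocab = len(words)
        # Add-one smoothing so unseen words do not zero out a language.
        model[lang] = {
            "logp": {w: math.log((c + 1) / (total + vocab)) for w, c in words.items()},
            "unk": math.log(1 / (total + vocab)),
        }
    return model

def tag(model, tokens):
    """Return (language, confidence) for a token sequence."""
    scores = {}
    for lang, m in model.items():
        scores[lang] = sum(m["logp"].get(t.lower(), m["unk"]) for t in tokens)
    best = max(scores, key=scores.get)
    # Softmax over the log-scores gives a rough confidence value.
    z = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / z

rows = [("en", "the", 500), ("en", "house", 40), ("de", "das", 450), ("de", "Haus", 35)]
model = build_model(rows)
print(tag(model, ["the", "house"]))  # expected: ('en', <confidence>)
```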

Hyper-N-gram Analysis
Store normal and staggered n-grams per word, then analyze the n-gram distribution.
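
The notes do not define "staggered" n-grams precisely; the sketch below assumes contiguous character n-grams plus skip-grams taken at a fixed stride, which is one plausible reading:

```python
from collections import Counter

def char_ngrams(word, n):
    """Contiguous character n-grams, e.g. 'cat' -> ['ca', 'at'] for n=2."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def staggered_ngrams(word, n, skip=1):
    """One reading of 'staggered' n-grams: characters sampled every
    (skip+1) positions, e.g. 'water', n=2, skip=1 -> ['wt', 'ae', 'tr']."""
    step = skip + 1
    out = []
    for i in range(len(word)):
        gram = word[i::step][:n]
        if len(gram) == n:
            out.append(gram)
    return out

dist = Counter()
for word in ["water", "wasser", "woda"]:
    dist.update(char_ngrams(word, 2))
    dist.update(staggered_ngrams(word, 2))
print(dist.most_common(5))
```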

Cross-language Phonotactic Distribution
The per-language n-gram distributions should map onto a cross-lingual phonotactic distribution.
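
One way to realize that mapping is to normalize each language's n-gram counts and average them with equal weight per language, so large corpora do not dominate the cross-lingual picture; the equal weighting is an assumption:

```python
from collections import Counter

def normalize(counter):
    """Convert raw counts to a probability distribution."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def cross_lingual(per_lang):
    """Average the per-language distributions with equal weight per language."""
    merged = Counter()
    for dist in per_lang.values():
        for gram, p in normalize(dist).items():
            merged[gram] += p / len(per_lang)
    return dict(merged)

per_lang = {
    "en": Counter({"th": 120, "sh": 40, "ng": 30}),
    "de": Counter({"sch": 90, "ch": 70, "ng": 20}),
}
print(cross_lingual(per_lang))
```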

Native/Loan Language Distributions
Since each language uses a reduced subset of the full phonetic inventory, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

 * Unavailable phonemes and their combinations
 * Native phonemes: phonemes and combinations used in the core language
 * Loan phonemes: used exclusively in words of foreign origin
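
A sketch of how the three groups might be computed for one language, assuming separate n-gram counts from a core (native) lexicon and a loanword lexicon are available; the presence-based thresholding is an assumption, since these notes do not specify how "native" and "loan" are decided:

```python
from collections import Counter

def classify_ngrams(all_grams, native_counts, loan_counts):
    """Partition the global n-gram inventory for one language into the
    three groups described above."""
    groups = {"unavailable": set(), "native": set(), "loan": set()}
    for gram in all_grams:
        if native_counts.get(gram, 0) > 0:
            groups["native"].add(gram)
        elif loan_counts.get(gram, 0) > 0:
            groups["loan"].add(gram)  # appears only in words of foreign origin
        else:
            groups["unavailable"].add(gram)
    return groups

all_grams = {"th", "sh", "pf", "zh"}
native = Counter({"th": 100, "sh": 50})  # e.g. English core vocabulary
loans = Counter({"zh": 5})               # e.g. from borrowed words
print(classify_ngrams(all_grams, native, loans))
# -> {'unavailable': {'pf'}, 'native': {'th', 'sh'}, 'loan': {'zh'}}
```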