User:OrenBochman/Search/NLP Tools/Morphology

Word,Language,POS

 * Outputs built from this data:


 * String-language discriminator via dynamic programming over per-token language probabilities, using the forward-backward algorithm.
 * Multilingual POS pre-tagger (offers all candidate tags).
 * Single-language POS tagger (via the marginal distribution).
 * Word,MD5 index.
 * Token n-gram feature vector (via Mallet).
 * Token hyper-n-gram feature vector (via Mallet).
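The forward-backward language discriminator above can be sketched as follows. This is a minimal illustration, not the page's implementation: the per-token probability table, the language set, and the stickiness parameter are all hypothetical placeholders; the real table would come from the Word,Language,POS data.

```python
# Hidden states are languages; emissions are per-token language
# probabilities looked up in a (hypothetical) table.
TOKEN_LANG_PROB = {
    "the": {"en": 0.9, "fr": 0.1},
    "chat": {"en": 0.3, "fr": 0.7},
    "noir": {"en": 0.1, "fr": 0.9},
}
LANGS = ["en", "fr"]
STAY = 0.8  # assumed probability of staying in the same language

def emission(tok, lang):
    # small floor for out-of-vocabulary tokens
    return TOKEN_LANG_PROB.get(tok, {}).get(lang, 0.05)

def trans(a, b):
    return STAY if a == b else (1 - STAY) / (len(LANGS) - 1)

def forward_backward(tokens):
    n = len(tokens)
    # forward pass: uniform prior over languages at the first token
    fwd = [{l: emission(tokens[0], l) / len(LANGS) for l in LANGS}]
    for t in range(1, n):
        fwd.append({l: emission(tokens[t], l) *
                    sum(fwd[t - 1][p] * trans(p, l) for p in LANGS)
                    for l in LANGS})
    # backward pass
    bwd = [dict.fromkeys(LANGS, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = {l: sum(trans(l, q) * emission(tokens[t + 1], q) * bwd[t + 1][q]
                         for q in LANGS) for l in LANGS}
    # per-token posterior marginals over languages
    post = []
    for t in range(n):
        z = sum(fwd[t][l] * bwd[t][l] for l in LANGS)
        post.append({l: fwd[t][l] * bwd[t][l] / z for l in LANGS})
    return post

posteriors = forward_backward(["the", "chat", "noir"])
```

The per-token posteriors are exactly what the multilingual pre-tagger needs, and arg-maxing each one gives the single-language decision.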

Note: some of these models could be made more powerful by introducing within-language frequencies derived from a normalized corpus. However, it would be even better to do this via

.csv

 * Tool: XPath or Awk extractor script
 * Input: XX.Wikipedia XML dump
 * Deliverable models:
 * Word distribution
 * Word n-gram distribution
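The two deliverable distributions can be sketched as follows, assuming the article text has already been pulled out of the XML dump (the extraction step itself would be the XPath or Awk script above; the tokenization regex here is a simplifying assumption):

```python
from collections import Counter
import re

def word_distribution(text):
    # relative frequency of each word in the text
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def word_ngram_distribution(text, n=2):
    # relative frequency of each word n-gram (contiguous)
    words = re.findall(r"\w+", text.lower())
    grams = Counter(zip(*(words[i:] for i in range(n))))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}
```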

Note: It would be better to extract lemma statistics.

Lang,Word,POS,Translations,Lang

 * Tool: XPath or Awk extractor script
 * Input: Wiktionary XML dump
 * Language, word, and POS statistics
 * Deliverable: Language Confidence Tagger

Hyper-n-gram Analysis
Store normal and staggered n-grams per word, then analyze the n-gram distribution.
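The normal vs. staggered split can be sketched as below. Reading "staggered" as character skip-grams that jump over a fixed number of positions is an assumption; the page does not define the term:

```python
def char_ngrams(word, n=2):
    # contiguous ("normal") character n-grams
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def staggered_ngrams(word, n=2, skip=1):
    # skip-grams: n characters taken every (skip + 1) positions
    step = skip + 1
    return ["".join(word[i + j * step] for j in range(n))
            for i in range(len(word) - (n - 1) * step)]
```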

Cross-language phonotactic distribution
Each language's distribution should map to a cross-lingual phonotactic distribution.

Native/loan language distributions
Since each language uses a reduced phonetic subset, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

 * Unavailable phonemes and their combinations.
 * Native phonemes: phonemes and combinations in the core language.
 * Loan phonemes: used exclusively in words of foreign origin.
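The three-way split above can be sketched as a lookup against per-language phoneme inventories. The inventories here are hypothetical; real ones would be derived from the language-corrected distributions:

```python
# Hypothetical inventories for one language (assumption).
NATIVE = {"a", "e", "i", "o", "u", "k", "t", "s", "n", "m", "r"}
LOAN = {"f", "v"}  # phonemes appearing only in borrowings

def classify_phoneme(p):
    if p in NATIVE:
        return "native"
    if p in LOAN:
        return "loan"
    return "unavailable"

def classify_word(phonemes):
    # a word is loan-marked if any phoneme is loan-only,
    # and impossible if any phoneme is unavailable in the language
    groups = {classify_phoneme(p) for p in phonemes}
    if "unavailable" in groups:
        return "unavailable"
    return "loan" if "loan" in groups else "native"
```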