User:OrenBochman/Search/NLP Tools/Morphology

Tasks

Data set: <Word,Language,POS>

Generate via XPath or a jawk extractor script
Input: Wiktionary XML Dump
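A minimal extraction sketch in Java, assuming the standard MediaWiki dump layout (<page>/<title>/<revision>/<text>) and the English Wiktionary convention of ==Language== and ===Part of speech=== section headings; the class name and the namespace filter are illustrative assumptions, not part of the original plan.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Streams a Wiktionary XML dump and prints word,language,POS triples as CSV. */
    public class WordLangPosExtractor {

        // ==Language== / ===POS=== wikitext headings (en.wiktionary layout; an assumption).
        private static final Pattern HEADING =
                Pattern.compile("^(==+)([^=].*?)\\1\\s*$", Pattern.MULTILINE);

        public static void main(String[] args) throws Exception {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            StringBuilder title = new StringBuilder(), text = new StringBuilder();
            StringBuilder current = null;                       // buffer being filled right now
            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String tag = xml.getLocalName();
                        if ("title".equals(tag)) { title.setLength(0); current = title; }
                        else if ("text".equals(tag)) { text.setLength(0); current = text; }
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (current != null) current.append(xml.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("title".equals(xml.getLocalName()) || "text".equals(xml.getLocalName()))
                            current = null;
                        else if ("page".equals(xml.getLocalName()) && title.indexOf(":") < 0)
                            emit(title.toString(), text.toString());   // crude non-article filter
                        break;
                }
            }
        }

        /** Every ===POS=== heading is attributed to the most recent ==Language== heading. */
        private static void emit(String word, String wikitext) {
            String language = null;
            Matcher m = HEADING.matcher(wikitext);
            while (m.find()) {
                if (m.group(1).length() == 2) language = m.group(2).trim();
                else if (m.group(1).length() == 3 && language != null)
                    System.out.println(word + "," + language + "," + m.group(2).trim());
            }
        }
    }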

Possible Output Built From This Dataset

Each entry lists the data, the deliverable, the tool used to build it (via), and notes:

  • <Unicode-Letter,Language>: distribution, via unicodeCounterModel (sketched below)
  • <Word,Language>: distribution, via Mallet
  • <token,POS>|language: distribution, via Mallet
  • language|token: discriminator, via Mallet; offers all choices
  • language|<token>: discriminator, via dynamic programming (forward-backward); offers best choice
  • POS|Token: pre-tagger, via Mallet; offers all choices
  • POS|<token,language>: discriminator, via dynamic programming (forward-backward); offers best choice
  • MD5|token: calculator, via a Java class; store in the DB for HTML dump lookup
  • token|{NGram}: distribution, via SRILM or a Lucene extension; all ngrams
  • token|{HNGram}: distribution, via a customized Lucene extension; all hngrams
  • d(X|token): spelling checker, via Lucene ngram distance
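A minimal sketch of the unicodeCounterModel row, assuming the <Word,Language,POS> dataset is a plain word,language,POS CSV; the class layout and the query in main are illustrative assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    /** Builds a per-language distribution over Unicode letters from a word,language,POS CSV. */
    public class UnicodeCounterModel {
        // language -> (code point -> count)
        private final Map<String, Map<Integer, Long>> counts = new HashMap<>();

        public void add(String word, String language) {
            Map<Integer, Long> byLetter = counts.computeIfAbsent(language, k -> new HashMap<>());
            word.codePoints()
                .filter(Character::isLetter)                 // keep letters only
                .map(Character::toLowerCase)
                .forEach(cp -> byLetter.merge(cp, 1L, Long::sum));
        }

        /** P(letter | language), zero for unseen letters or languages. */
        public double probability(int codePoint, String language) {
            Map<Integer, Long> byLetter = counts.getOrDefault(language, Map.of());
            long total = byLetter.values().stream().mapToLong(Long::longValue).sum();
            return total == 0 ? 0.0 : byLetter.getOrDefault(codePoint, 0L) / (double) total;
        }

        public static void main(String[] args) throws Exception {
            UnicodeCounterModel model = new UnicodeCounterModel();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",", 3);      // word,language,POS
                    if (cols.length >= 2) model.add(cols[0], cols[1]);
                }
            }
            System.out.println(model.probability('é', "French"));   // hypothetical query
        }
    }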

Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same is true for models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

Data set: <Language,POS,Word,Forms>

Generate via the <Hyper-Ngram closure,residue> clustering algorithm.
Input: Wiktionary HTML Dump

a lemma is:

  • LEM1: a citation form and a list of inflected forms
  • LEM2: a citation form and a list of <inflected form, morphological state> pairs
  • LEM3: a citation form and a list of <inflected form, morphological state, frequency count> triples
  • a good lemma analyzer is essential for good recall and precision in morphologically rich languages.
  • a single-token lemma analyzer could be built from a lemma database (see the sketch after this list).
  • better results can be achieved by considering adjacent tokens or other linguistic considerations.
  • a learning component would be useful for processing unknown tokens, in one of two ways:
    • the description used in the database could be written so that it accepts rare inflections that follow known regular transformations (template/residue, stem/affix, etc.);
    • it could try to assign unknown lemmas to a subset of forms, eliminating options as more evidence appears (maximum entropy model).
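A minimal sketch of the single-token analyzer in the LEM1 sense (citation form plus inflected forms), assuming the lemma database can be loaded as lemma-to-forms pairs; the class and method names are illustrative. Ambiguity is handled by returning every candidate citation form.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Single-token lemma analyzer: inverts a LEM1-style database (citation form -> inflected forms). */
    public class LemmaAnalyzer {
        // surface form -> candidate citation forms (several lemmas may share an inflected form)
        private final Map<String, Set<String>> formToLemmas = new HashMap<>();

        public void addLemma(String citationForm, List<String> inflectedForms) {
            formToLemmas.computeIfAbsent(citationForm.toLowerCase(), k -> new HashSet<>()).add(citationForm);
            for (String form : inflectedForms)
                formToLemmas.computeIfAbsent(form.toLowerCase(), k -> new HashSet<>()).add(citationForm);
        }

        /** All citation forms the token may belong to; empty set for unknown tokens. */
        public Set<String> analyze(String token) {
            return formToLemmas.getOrDefault(token.toLowerCase(), Set.of());
        }

        public static void main(String[] args) {
            LemmaAnalyzer analyzer = new LemmaAnalyzer();
            analyzer.addLemma("sing", List.of("sings", "sang", "sung", "singing"));
            System.out.println(analyzer.analyze("sang"));   // prints [sing]
        }
    }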

<Hyper-Ngram closure,residue> clustering algorithm

  • the algorithm would adjust its own parameters according to the morphology detected.
  • the algorithm should bootstrap morphological states from a manually tagged template.
  • the algorithm would bootstrap morphological data from tables of morphological forms (supervised learning).
  • the algorithm would be able to generalize unknown terms to their full lemma form (unsupervised learning).
  • it would collect ngrams and hngrams.
  • words would be clustered by (h)ngram similarity (k nearest neighbours); see the sketch after this list.
  • hngram spans would be listed in order of increasing generality, or decreasing specificity
    (cf. local information and global entropy).
  • residues are extracted from the non-specific spans.
  • clustering can now proceed in earnest: two-dimensional clustering by residue and by the span's ngrams (template similarity).
  • to further increase the generality of such a model, one replaces consonants and vowels with equivalence classes (clustering by phonemic type).
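A minimal sketch of the nearest-k-neighbours step, assuming character ngrams over boundary-padded words and Jaccard similarity between ngram sets; the padding characters and the choice of Jaccard as the similarity measure are assumptions, not part of the original description.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Clusters words by character-ngram similarity: for each word, find its k nearest neighbours. */
    public class NgramNeighbours {

        /** Character ngrams of a word, padded with boundary markers so affixes are visible. */
        static Set<String> ngrams(String word, int n) {
            String padded = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= padded.length(); i++) grams.add(padded.substring(i, i + n));
            return grams;
        }

        /** Jaccard similarity between two ngram sets (assumed measure). */
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : inter.size() / (double) union.size();
        }

        /** The k vocabulary words most similar to the query word. */
        static List<String> nearest(String word, List<String> vocabulary, int k, int n) {
            Set<String> grams = ngrams(word, n);
            List<String> others = new ArrayList<>(vocabulary);
            others.remove(word);
            others.sort(Comparator.comparingDouble((String w) -> -jaccard(grams, ngrams(w, n))));
            return others.subList(0, Math.min(k, others.size()));
        }

        public static void main(String[] args) {
            List<String> words = List.of("walk", "walked", "walking", "talked", "banana");
            System.out.println(nearest("walks", words, 3, 3));  // inflection-mates of "walks" rank first
        }
    }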


Possible Output Built From This Data-set

Each entry lists the data, the deliverable, the tool used to build it (via), and notes:

  • <Lemma,Language>: distribution, via Mallet
  • <Lemma,POS>|language: distribution, via Mallet
  • language|lemma: discriminator, via Mallet; offers all choices
  • language|<lemma>: discriminator, via dynamic programming (forward-backward); offers best choice
  • POS|Token: pre-tagger, via Mallet; offers all choices
  • POS|<token,language>: discriminator, via dynamic programming (forward-backward); offers best choice
  • MD5|token: calculator, via a Java class; store in the DB for HTML dump lookup (sketched below)
  • token|{NGram}: distribution, via SRILM or a Lucene extension; all ngrams
  • token|{HNGram}: distribution, via a customized Lucene extension; all hngrams
  • d(X|token): spelling checker, via Lucene ngram distance
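The MD5|token row calls for a Java class; a minimal version using the standard java.security.MessageDigest API is sketched here. The class name and the idea of using the hex digest as the DB lookup key are assumptions.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    /** Computes the MD5 key under which a token's entry can be stored and looked up for the HTML dump. */
    public final class TokenMd5 {

        public static String md5Hex(String token) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(md5Hex("morphology"));   // stable key for the lookup table
        }
    }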

Each of these models will require a script to build. If a much better solution becomes available down the line, incorporating feature vectors that are currently unavailable, these simplistic models may be skipped to reduce production time. The same is true for models that lack token/lemma frequency data derived from a normalized corpus.

Wiktionary contains extinct languages (e.g. Old English) and languages with very limited coverage. These entries need to be marked and then excluded from analysis to reduce noise in the models.

<Word, Language, POS, Count>.csv

  • Tool: XPath or Awk extractor script
  • Input: XX.Wikipedia XML Dump
  • Deliverable models:
    • Word distribution (see the sketch below)
    • Word NGram distribution

Note: It would be better to extract lemma statistics.
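A minimal sketch of the word-distribution deliverable, assuming the CSV columns are word,language,POS,count; the word-ngram counterpart would aggregate counts over word sequences taken directly from the dump rather than from this per-word CSV. The query in main is hypothetical.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    /** Word distribution per language from a word,language,POS,count CSV. */
    public class WordDistribution {
        public static void main(String[] args) throws Exception {
            Map<String, Map<String, Long>> counts = new HashMap<>();   // language -> word -> count
            Map<String, Long> totals = new HashMap<>();                // language -> total tokens
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] c = line.split(",");                      // word,language,POS,count
                    if (c.length < 4) continue;
                    long n = Long.parseLong(c[3].trim());
                    counts.computeIfAbsent(c[1], k -> new HashMap<>()).merge(c[0], n, Long::sum);
                    totals.merge(c[1], n, Long::sum);
                }
            }
            // Relative frequency P(word | language); hypothetical query values:
            String lang = "English", word = "the";
            long n = counts.getOrDefault(lang, Map.of()).getOrDefault(word, 0L);
            System.out.println((double) n / Math.max(1, totals.getOrDefault(lang, 0L)));
        }
    }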

Lang,Word,POS,Translations,Lang

  • Tool: XPath or Awk extractor script
  • Input: Wiktionary XML Dump
  • Language Word and POS Statistics
  • Deliverable: Language Confidence Tagger
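A minimal sketch of the language-confidence tagger deliverable, assuming per-language word counts from the statistics above are already in memory; the add-one smoothing and the implicit uniform prior over languages are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    /** Assigns a confidence P(language | word) from per-language word counts (add-one smoothing, uniform prior). */
    public class LanguageConfidenceTagger {
        private final Map<String, Map<String, Long>> wordCounts;   // language -> word -> count
        private final Map<String, Long> totals = new HashMap<>();  // language -> total count

        public LanguageConfidenceTagger(Map<String, Map<String, Long>> wordCounts) {
            this.wordCounts = wordCounts;
            wordCounts.forEach((lang, words) ->
                    totals.put(lang, words.values().stream().mapToLong(Long::longValue).sum()));
        }

        /** language -> confidence, normalized to sum to 1 over the known languages. */
        public Map<String, Double> tag(String word) {
            Map<String, Double> scores = new HashMap<>();
            double sum = 0.0;
            for (String lang : wordCounts.keySet()) {
                long c = wordCounts.get(lang).getOrDefault(word, 0L);
                double score = (c + 1.0) / (totals.get(lang) + wordCounts.get(lang).size());  // add-one
                scores.put(lang, score);
                sum += score;
            }
            final double total = sum;
            scores.replaceAll((lang, s) -> s / total);
            return scores;
        }

        public static void main(String[] args) {
            Map<String, Map<String, Long>> counts = new HashMap<>();
            counts.put("English", Map.of("house", 50L, "the", 900L));   // toy counts
            counts.put("German", Map.of("haus", 40L, "die", 800L));
            System.out.println(new LanguageConfidenceTagger(counts).tag("house"));
        }
    }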

Hyper-Ngram Analysis

Store normal and staggered ngrams per word. Analyze ngram distribution.
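A minimal sketch of collecting both kinds of ngrams per word, assuming "staggered" means an ngram with a single wildcard gap inside the window (a character skip-gram); that reading, and the '_' gap marker, are assumptions.

    import java.util.HashSet;
    import java.util.Set;

    /** Collects normal character ngrams and staggered (one-gap) ngrams per word. */
    public class HyperNgrams {

        /** Contiguous character ngrams of the boundary-padded word. */
        static Set<String> normal(String word, int n) {
            String w = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= w.length(); i++) grams.add(w.substring(i, i + n));
            return grams;
        }

        /** Staggered ngrams: one character inside an (n+1)-wide window is replaced by a '_' wildcard. */
        static Set<String> staggered(String word, int n) {
            String w = "^" + word + "$";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n + 1 <= w.length(); i++) {
                String window = w.substring(i, i + n + 1);
                for (int gap = 1; gap < n; gap++)                   // keep the edges, drop one inside
                    grams.add(window.substring(0, gap) + "_" + window.substring(gap + 1));
            }
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(normal("sang", 3));     // contains ^sa, san, ang, ng$
            System.out.println(staggered("sang", 3));  // contains s_ng, which generalizes over the vowel (sing/sang/sung)
        }
    }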


Cross-language phonotactics distribution

These per-language ngram distributions should map to a cross-lingual phonotactic distribution.

Native/Loan Language distributions

Since each language uses a reduced phonetic subset, the distribution can be corrected on a language-by-language basis.

Three groups would appear:

  • unavailable phonemes and their combinations
  • native phonemes: phonemes and combinations found in the core language
  • loan phonemes: used exclusively in words of foreign origin
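A minimal sketch of the three-way grouping, assuming we only have per-language symbol (grapheme) counts as a proxy for phonemes, split into counts over native-origin words and counts over words marked as foreign origin (e.g. via etymology sections); the class, the enum, and the toy example are assumptions.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Splits a language's symbol inventory into native, loan-only, and unavailable groups. */
    public class PhonemeGroups {

        enum Group { NATIVE, LOAN, UNAVAILABLE }

        /**
         * @param nativeCounts  symbol counts over words of native origin
         * @param loanCounts    symbol counts over words marked as foreign origin
         * @param alphabet      the full symbol inventory under consideration
         */
        static Map<Character, Group> classify(Map<Character, Long> nativeCounts,
                                              Map<Character, Long> loanCounts,
                                              Iterable<Character> alphabet) {
            Map<Character, Group> groups = new HashMap<>();
            for (Character c : alphabet) {
                long inNative = nativeCounts.getOrDefault(c, 0L);
                long inLoan = loanCounts.getOrDefault(c, 0L);
                if (inNative == 0 && inLoan == 0) groups.put(c, Group.UNAVAILABLE);
                else if (inNative == 0) groups.put(c, Group.LOAN);   // appears only in loanwords
                else groups.put(c, Group.NATIVE);
            }
            return groups;
        }

        public static void main(String[] args) {
            // Toy English-like example: 'k' native, 'ñ' only in loanwords, 'þ' unused.
            Map<Character, Long> nat = Map.of('k', 1200L);
            Map<Character, Long> loan = Map.of('ñ', 8L, 'k', 30L);
            System.out.println(classify(nat, loan, List.of('k', 'ñ', 'þ')));
        }
    }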