User:OrenBochman/Search/NLP Tools

= Wiktionary Analyzer =
Goal

Produce a Lucene analyzer based on data extracted from:
 * Wiktionary projects
 * Wikipedia projects
 * a few other selected projects

The analyzer should provide better, faster, smarter search across the Wikimedia projects (suggestions, spellings, corrections, etc.).

The analysis would be CPU intensive. To complete it in a sensible time, development would require:
 * an integration server (Hudson scales nicely)
 * Wikimedia project dumps (openZIM or XML), both source and HTML
 * a Hadoop cluster running advanced Mahout algorithms (SVM, LDA and others)
 * Iterative production of stronger Lucene analyzers, bootstrapped via simple scripts and followed by unsupervised learning cycles (to complete the picture)
 * Since these jobs could easily become intractable (through bugs or bad algorithms):
 * running dev jobs on wiki subsets
 * job length and current progress/cost estimation are design goals.
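
As a rough illustration of the progress/cost estimation goal, here is a minimal sketch that assumes a job processes pages at a roughly constant rate. The class `JobProgress`, the function `extrapolate_from_subset` and the cost parameters are hypothetical names for illustration only; real Hadoop jobs would need a less naive model.

```python
from dataclasses import dataclass


@dataclass
class JobProgress:
    """Linear progress/cost model for a batch job (hypothetical helper)."""
    total_items: int        # e.g. pages in the full dump
    done_items: int         # pages processed so far
    elapsed_seconds: float  # wall-clock time spent so far

    def rate(self) -> float:
        """Items processed per second so far."""
        return self.done_items / self.elapsed_seconds

    def eta_seconds(self) -> float:
        """Remaining time, assuming the observed rate stays constant."""
        return (self.total_items - self.done_items) / self.rate()

    def projected_cost(self, cost_per_cpu_hour: float, cpus: int) -> float:
        """Projected cost of the whole job at the observed rate."""
        total_seconds = self.total_items / self.rate()
        return total_seconds / 3600.0 * cpus * cost_per_cpu_hour


def extrapolate_from_subset(subset_items: int, subset_seconds: float,
                            full_items: int) -> float:
    """Estimate full-job seconds from a dev run on a wiki subset."""
    return subset_seconds * full_items / subset_items
```

A dev run on a subset gives the first estimate; the running job then refines it from its own observed rate.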

Lexical Data

 * 1) Lemma Extraction
 * 2) Scan the English Wiktionary at the POS section level.
 * 3) Extract
 * 4) Lang,
 * 5) POS,
 * 6) Lemma information,
 * 7) Glosses
 * 8) collocations
 * 9) proper names
 * 10) silver bullets
 * 11) Entropy-reducing heuristic -> to induce missing lemmas from free text.
 * 12) Inducing word senses based on Topic Map 'Contexts'.
 * 13) Introducing a disambiguation procedure for Semantic/Lexical/Entities.
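
The scan at the POS section level could look roughly like the sketch below. It assumes entries use `==Language==` headings with POS headings nested beneath them and definition lines starting with a single `#`. The `POS_HEADINGS` set and the function `scan_entry` are illustrative assumptions, not an existing tool.

```python
import re

# Assumed (incomplete) set of POS headings; real entries use many more.
POS_HEADINGS = {"Noun", "Verb", "Adjective", "Adverb", "Proper noun"}

HEADING_RE = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")
GLOSS_RE = re.compile(r"^#[^:*#]")   # '#:' and '#*' lines are examples/quotes


def scan_entry(title, wikitext):
    """Return one record per POS section: language, POS, lemma, glosses."""
    language = None
    current = None
    records = []
    for line in wikitext.splitlines():
        m = HEADING_RE.match(line)
        if m:
            level, name = len(m.group(1)), m.group(2)
            if level == 2:                      # e.g. ==English==
                language, current = name, None
            elif name in POS_HEADINGS and language is not None:
                current = {"lang": language, "pos": name,
                           "lemma": title, "glosses": []}
                records.append(current)
            else:                               # Etymology, Synonyms, ...
                current = None
            continue
        if current is not None and GLOSS_RE.match(line):
            current["glosses"].append(line[1:].strip())
    return records
```

Each record carries the Lang, POS, lemma and gloss fields listed above; collocations and proper names would need further passes.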

Semantic Data

 * 1) Word sense enumeration - via grep
 * 2) Word sense context collection - via Mahout SVD
 * 3) Word sense context description (requires an algorithm)
 * 4) Word nets - synonyms,antonyms,
 * 5) Entity type
 * 6) Categories
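
Word sense enumeration "via grep" can be sketched as a single regular expression over the text of a POS section, assuming each sense is a line starting with `# `. The sense id format `lemma#n` and the function name `enumerate_senses` are assumptions made for illustration.

```python
import re

# Same idea as: grep '^# '  (definition lines only, not '#:' examples)
SENSE_LINE = re.compile(r"^# (.+)$", re.MULTILINE)


def enumerate_senses(lemma, pos_section_text):
    """Return (sense_id, gloss) pairs, numbering senses in page order."""
    glosses = SENSE_LINE.findall(pos_section_text)
    return [(f"{lemma}#{i}", g.strip()) for i, g in enumerate(glosses, start=1)]
```

The resulting ids give later steps (context collection, word nets) a stable handle for each sense.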

Entity Data
The idea is to generate a database of entities referenced in Wikipedia. Entities are

Bootstrapping is done via article headers and categories. Once the most obvious entities are listed, one proceeds to train classifiers to find the remaining entities via NLP.


 * 1) Category-based classification of:
 * 2) People, organizations, companies, bands, nations, imagined, dead, etc.
 * 3) Animals, species, etc.
 * 4) Places, counties, cities
 * 5) Dates, timelines, durations
 * 6) Events, wars, treaties, films, awards
 * 7) Chemicals, medicines, drugs
 * 8) Commodities, stocks, etc.
 * 9) Publications, journals, periodicals, citations, etc.
 * 10) External web locations.
 * 11) Unsupervised acquisition of more entities
 * 12) Train SVM classifiers with Mahout on Hadoop by tagging/parsing low-ambiguity snippets referencing a wide selection of terms.
 * 13) Acquire more entities.
 * 14) Cross-wikify top entities.
 * 15) Cross-wiki links.
 * 16) Run Mahout LDA via Hadoop on articles/sections with correlated entities.
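
The category-based bootstrap step might be sketched as keyword rules over category names, with unmatched pages left for the later classifier pass. The rule table `CATEGORY_RULES`, the type labels and both function names are hypothetical; a real table would be much larger and language specific.

```python
from collections import Counter

# Hypothetical keyword -> entity type rules (illustrative only).
CATEGORY_RULES = [
    ("births", "Person"),
    ("deaths", "Person"),
    ("companies", "Organization"),
    ("musical groups", "Organization"),
    ("cities", "Place"),
    ("wars", "Event"),
    ("treaties", "Event"),
    ("drugs", "Chemical"),
    ("journals", "Publication"),
]


def classify_by_categories(categories):
    """Vote on an entity type from category names; None if nothing matches."""
    votes = Counter()
    for cat in categories:
        lowered = cat.lower()
        for keyword, etype in CATEGORY_RULES:
            if keyword in lowered:
                votes[etype] += 1
    return votes.most_common(1)[0][0] if votes else None


def bootstrap_seed(pages):
    """Split {title: [categories]} into labelled seeds and unlabelled rest."""
    seeds, unlabelled = {}, []
    for title, categories in pages.items():
        etype = classify_by_categories(categories)
        if etype is None:
            unlabelled.append(title)
        else:
            seeds[title] = etype
    return seeds, unlabelled
```

The `seeds` would serve as training data for the classifiers, and `unlabelled` is what the unsupervised acquisition steps would work on.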

Etymology
If we trust etymologies in one language we could suggest them for others.
 * 1) would require a model (loan, analogy, language change).
 * 2) requires/implies a phonological distance, semantic distance, word sense.
 * 3) requires/implies a graph of time, language, location.
 * 4) historical linguistics rules could be used to refine such a model.

MT Translation Data
This is not a high-priority deliverable since its utility is doubtful.
 * Offline bilingual dictionaries may be of interest. c.f.
 * As the Wiktionaries improve, they could become a significant contribution where statistical methods fall short.
 * Wikipedias clearly contain a large volume of text for generating statistical language models.

Filling in the gaps
During analysis it may be possible to do some extra tasks.
 * 1) Multilingual context-sensitive spell-checking of the wikis, both offline and online.
 * 2) Foreign-language learning spelling dictionary
 * 3) Identify "missing templates" (requires a lemma-to-template mapping data structure and a generalization algorithm)
 * 4) Identify "missing pages/sections" in the Wiktionary.

Language Instruction
Language instruction would benefit from a database pertaining to language pairs or groups, i.e. it could help chart an optimal curriculum for teaching an nth language to a speaker of n-1 languages by producing a graph of least resistance.


 * Top Frequency Lexical Charts
 * Topical Word Lists
 * Lexical problem areas
 * Word Order (requires lemma-n-gram frequencies)
 * Verbal Phrase/ Verbal Complement Misalignment.

Compression

 * 1) frequency information together with lexical data can be used to make a text compressor optimized for a specific wiki.
 * 2) this type of compressed text would be faster to search.
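
One way to picture such a compressor is a frequency-ranked word codebook with variable-length integer codes, so that the most frequent words take a single byte. This is a sketch under that assumption (a simple dictionary coder, not a tuned design); all function names are illustrative.

```python
from collections import Counter


def build_codebook(tokens):
    """Rank words by frequency; the most frequent word gets id 0."""
    ranked = [w for w, _ in Counter(tokens).most_common()]
    return {w: i for i, w in enumerate(ranked)}


def varint(n):
    """Encode a non-negative int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress(tokens, codebook):
    """Concatenate the varint code of each token."""
    return b"".join(varint(codebook[t]) for t in tokens)


def decompress(data, codebook):
    """Invert compress() using the same codebook."""
    words = {i: w for w, i in codebook.items()}
    tokens, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            tokens.append(words[n])
            n, shift = 0, 0
    return tokens
```

With such a scheme a query term is encoded once and then compared as an integer id rather than as a string, which is where a search speed-up could come from.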

Tagger/Parser
The lexical data, together with frequency and n-gram frequency, could be used to make a parametric translingual parser.

Machine Translation
see also Wikipedia_Machine_Translation_Project


 * 1) export data into apertium format. cf http://wiki.apertium.org
 * a morphological dictionary for language xx called apertium-sh-en.sh.dix, apertium-sh-en.en.dix etc. which contains the rules of how words in language xx are inflected.
 * Bilingual dictionaries which contain correspondences between words and symbols in the two languages. called: apertium-sh-en.sh-en.dix
 * language xx to language yy transfer rules: this file has rules for how language xx will be changed into language yy. In our example this will be: apertium-sh-en.sh-en.t1x
 * language yy to xx language transfer rules: this file has rules for how language yy will be changed into language xx. In our example this will be: apertium-sh-en.en-sh.t1x


 * Apertium likes FSMs (finite-state machines), so it could be possible to adapt its morphological data into an efficient spell-checking dictionary.
 * its format may be able to support collocations.
 * it does not seem to have a notion of word sense.
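
As a hedged sketch of the export step, the snippet below writes a simplified bilingual dictionary in the style of `apertium-sh-en.sh-en.dix`. The element names (`dictionary`, `sdefs`, `section`, `e`, `p`, `l`, `r`, `s`) follow my understanding of the Apertium bidix layout and should be checked against the Apertium wiki. The function `build_bidix` and the sample pair are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_bidix(pairs):
    """Build a simplified bidix from (source_lemma, target_lemma, pos_tag)."""
    pairs = list(pairs)
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet")
    # Declare every symbol (here: POS tags) used by the entries.
    sdefs = ET.SubElement(root, "sdefs")
    for tag in sorted({pos for _, _, pos in pairs}):
        ET.SubElement(sdefs, "sdef", n=tag)
    section = ET.SubElement(root, "section", id="main", type="standard")
    for src, tgt, pos in pairs:
        # One <e><p><l>..</l><r>..</r></p></e> entry per word pair.
        p = ET.SubElement(ET.SubElement(section, "e"), "p")
        for side, lemma in (("l", src), ("r", tgt)):
            el = ET.SubElement(p, side)
            el.text = lemma
            ET.SubElement(el, "s", n=pos)
    return ET.tostring(root, encoding="unicode")
```

For example, `build_bidix([("kuća", "house", "n")])` yields one entry pairing the two nouns. Monolingual dictionaries and transfer rules are more involved and are not covered here.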

Crowdsourcing Wiktionary
Using existing categories, templates, the semantic wiki extension, and some scripts one could
 * automate generation of morphological information.
 * again it is possible to automate generation of bi-lingual information from wiktionary pages

Topological Translation Algorithm

 * Statistical translation works by matching parts of sentences. (This has many problems)
 * Requires a large parallel corpus of translated texts. (not available)
 * Assumes that words operate within an n-gram of neighbouring words. Some languages have free word order.
 * In reality statistical lexical data is sparse.

My idea is to develop a:
 * topological algorithm for wikis
 * based on related documents and their revisions.
 * can use non-parallel, categorised sets of translated documents.
 * generates seme lattices, i.e. a cross-lingual semantic algorithm
 * (semes N-nets)
 * with (morphological-state)
 * morpho-syntactic Lagrangian for MT.


 * the lattice should converge due to product theorem


 * 1) Translation matrix
 * 2) maps source wordsense to a target wordsense.
 * 3) translate cross language.
 * 4) simplify a single language text.
 * 5) make text clearer via a disambiguation operation

Algorithms

 * 1) Semi-supervised acquisition of morphology.
 * 2) mine lemmas.
 * 3) collect lemmas from templates categories. (Template extraction)
 * 4) map templates to morphological-state via "model".
 * 5) entropy-minimizing lemma induction via heuristics.
 * 6) from existing lemma knowledge gathered from Wiktionary, postulate/induce/prove additional lemmas by induction.
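
Template extraction and the template-to-state "model" could be sketched as follows. The regular expression handles only non-nested `{{...}}` templates, and the `TEMPLATE_MODEL` table is a hypothetical stand-in for the model mentioned above. Argument layouts of real templates vary and are not interpreted here.

```python
import re

# Matches non-nested templates such as {{en-noun}} or {{name|a|b}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")

# Hypothetical "model": template name -> morphological-state features.
TEMPLATE_MODEL = {
    "en-noun": {"lang": "en", "pos": "noun"},
    "en-verb": {"lang": "en", "pos": "verb"},
    "en-adj": {"lang": "en", "pos": "adjective"},
}


def extract_templates(wikitext):
    """Return (name, [args]) for each non-nested template in the text."""
    found = []
    for name, rest in TEMPLATE_RE.findall(wikitext):
        args = rest[1:].split("|") if rest else []
        found.append((name.strip(), args))
    return found


def lemmas_from_page(title, wikitext):
    """Map recognised templates on a page to (lemma, state) pairs."""
    return [(title, TEMPLATE_MODEL[name])
            for name, _ in extract_templates(wikitext)
            if name in TEMPLATE_MODEL]
```

Templates missing from the model are skipped, which gives a natural list of "unknown templates" to review or generalize from.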

Lemma Mining
A couple of themes in searching for lemmas: (1) geometry (Hamming distance for words of equal length); (2) n-grams.
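
Both themes can be sketched in a few lines. The Hamming distance is standard; for the n-gram theme the sketch assumes Jaccard overlap of character trigrams with `#` padding at word edges, which is my choice of measure and not something these notes prescribe.

```python
def hamming(a, b):
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def char_ngrams(word, n=3):
    """Set of character n-grams, with '#' padding marking word edges."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

Hamming distance suits stem-internal changes of equal length (sing/sang), while n-gram overlap tolerates affixes of different lengths (walked/walks).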

Find or select two words, M1 and M2. Other iteration modes may be more efficient after the top lemmas have been found, or together with a deleted n-gram lookup table or other structures.
 * 1) Bootstrap from known knowledge.
 * 2) Using a hand-built table miner, a (template) structured-text miner (Wiktionary) and a free-text miner (Wikipedia), collect:
 * 3) Lemma base.
 * 4) unknown base.
 * 5) Induce and generalize from L ={L1 >> L2 >>,..., >> LN} (where >> indicates more frequent in corpus).
 * 6) parallel multi-pattern matches
 * 7) affix mode
 * 8) iteration mode (top frequency) - for most important lemmas and assuming correlation of frequency with morph-state

other iteration modes:
 * 1) Levenshtein clustering using spheres around/between an existing lemma and its members. Let's say R=d(L1,M1) is the max distance between any two known lemma representatives. Words equidistant from L or M within R/2 are lemma candidates.
 * 2) Generalizing the Levenshtein distance to a complex distance (we could map strings to sort be predictive).
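
The sphere idea can be sketched under one possible reading: R is the largest Levenshtein distance between known members of a lemma, and any unknown word within R/2 of some member is a candidate. The function `lemma_candidates` and this reading of "within R/2" are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def lemma_candidates(members, vocabulary):
    """Unknown words within R/2 of a known member (R = max pairwise distance)."""
    radius = max(levenshtein(x, y) for x in members for y in members) / 2
    known = set(members)
    return sorted(w for w in vocabulary
                  if w not in known
                  and any(levenshtein(w, m) <= radius for m in members))
```

For members `walk`, `walked`, `walking` and a vocabulary containing `walks`, `walker`, `talk` and `table`, this admits `walks` and `walker` but also `talk`. String distance alone is therefore noisy, and candidates still need confirmation from frequency or affix evidence.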

Generalized Morphological State
An enumeration of all possible morphological states (in all languages). Each language has a sparse subset depending on its morphological parameters.

e.g. Hungarian has:

Uses

 * 1) Enumeration. Develop a language-independent view of morphology, with its chief application in MT. In this sense morphology is viewed as a semigroup generated as a Cartesian product of its feature subsets.
 * 2) Compression. In reality feature availability varies across languages, so the matrix is sparse; within a language, feature availability is also codependent (a verb has tense but a noun does not). Therefore a (minimal) sparse matrix can be extracted and used to compress morphological state. To create such a compression scheme it is necessary to collect statistics showing lemma frequency and feature dependency.
 * 3) IR. By supplying a lemma-id and a morphological-state one can provide superior search capabilities in certain languages.
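
The enumeration and compression uses can be illustrated with a toy feature inventory. The features, values and co-dependency rules below are invented for illustration (verbs carry tense, nouns carry case); a real inventory would be collected from the data.

```python
from itertools import product
from math import ceil, log2

# Hypothetical feature inventory; None means "feature not applicable".
FEATURES = {
    "pos": ["noun", "verb"],
    "number": ["sg", "pl"],
    "tense": [None, "past", "present"],
    "case": [None, "nom", "acc"],
}


def all_states():
    """Full Cartesian product of the feature value sets."""
    names = list(FEATURES)
    return [dict(zip(names, values)) for values in product(*FEATURES.values())]


def is_valid(state):
    """Toy co-dependency rules: verbs carry tense, nouns carry case."""
    if state["pos"] == "verb":
        return state["tense"] is not None and state["case"] is None
    return state["tense"] is None and state["case"] is not None


def state_codes():
    """Assign dense integer ids to the valid (sparse) subset only."""
    valid = [s for s in all_states() if is_valid(s)]
    return {tuple(s.values()): i for i, s in enumerate(valid)}


def bits_needed(n):
    """Bits for a fixed-width code over n distinct states."""
    return max(1, ceil(log2(n)))
```

In this toy inventory the full product has 36 states but only 8 are valid, so a dense code needs 3 bits instead of 6. A (lemma-id, state-code) pair is then the kind of key the IR use would search on.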