User:OrenBochman/Search/NLP Tools

=Introduction=

NLP (Natural Language Processing) tools are geared at broadening the scope and increasing the power of search by modeling the languages in which articles are written. They require more sophisticated algorithms and greater processing power. Their capabilities can be disruptive and could revolutionize how people use and edit a wiki. For example, it could be possible to use such a tool to:
 * search names phonetically or using transliteration
 * spell check an article (using a lexical spell checker augmented by a search engine spell checker)
 * search for entities (people, places, institutions, etc.)
 * search and visualise memes in a wiki or an article.
 * search terms across projects and languages (cross-language search)
 * translate an article from another language.
 * understand to what extent an article overlaps with its foreign-language versions.

The tasks required from the NLP tools can be categorised as follows:
 * Pre-processing (sentence chunking)
 * phonology (IPA annotation, transliteration)
 * morphological (lexical to lemma normalization; POS tagging)
 * syntactic (sentence parsing)
 * semantic (word-sense disambiguation, cross language processing)
 * higher level (named entity detection, relation detection, ontological tagging, reasoning, sentiment)

The Corpus Component
Since producing a corpus is the first step for NLP data exploration, some kind of corpus-building tool is needed. The area of corpus linguistics raises the following concerns:
 * corpus size required for the tasks at hand.
 * quality of corpus (noise to signal ratio).
 * format.

Since the NLP workflow will produce great insights into the documents, it will be possible to create a superior Nth-generation corpus. How significant this iteration is remains to be seen; it would be a blessing to harness the benefits while minimizing the reprocessing this entails.

Extracting wiki content would require plenty of cleaning up. Corpus production should be fully automated.

An alternative to corpus production would be to create a filter chain which works directly on the compressed dumps. Going in this direction would make it impossible to integrate other tools such as R, SRILM and Apertium.


 * For lexical processing a list of sentences would be enough.
 * For semantic processing paragraphs and documents would be useful too.
 * For cross-language processing one would like to sample and combine text from different languages.

Simple Corpus format:

# sentence. sentence? sentence. sentence. sentence. sentence. #  sentence sentence sentence #

Corpus Size & Quality
It would be possible to generate a much larger corpus by scanning the dump with old edits. However, in such a case one would be more interested in expanding the corpus with sentences which contribute new lexical information.

Ideally the files should aggregate documents of similar quality and size, so that it would be possible to process the corpus in parallel and to process only the top files or the bottom files.

Corpus Based NLP Work Flow

 * 1) Dump to Plain Text
 * 2) Sentence Splitter (requires a Language dependent Sentence Chunker)
 * 3) Word Normalization (requires a Language dependent Word Chunker - zwnj)
 * 4) Tag Stop Words
 * 5) Tag Named entities using n-gram analysis of titles references disambiguation and inter-wiki-links.
 * 6) Morphology
 * 7) Induce a morphology
 * 8) Tag morphological features
 * 9) *lemma-ID
 * 10) *suffix
 * 11) *lexical category Part of speech or full morphological state
 * 12) second pass for named entities
 * 13) induce collocations & compounds
 * 14) second pass tagging of morphological features
 * 15) Induce a probabilistic grammar or use an existing framework
 * 16) Use Grammar to parse sentences.
 * 17) Resolve cataphorical references.
 * 18) Tag logical aspect of sentences
 * 19) High Level Analysis:
 * 20) Bad grammar, spelling mistakes, bad writing style.
 * 21) Tag N.E. (third pass)
 * 22) Tag N.E. relations.
 * 23) Mereology, Ontology tagging
 * 24) Sentiment Analysis.
 * 25) Reason and Extract high level knowledge into database.
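
The first stages of the workflow above (sentence splitting, word normalization and stop-word tagging) can be sketched as follows. This is only a minimal illustration: the regular expressions, the `STOP_WORDS` set and the function names are assumptions made for the example, and a real pipeline would need language-dependent chunkers as noted in steps 2 and 3.

```python
import re

# Hypothetical stop-word list; a real one would be language specific.
STOP_WORDS = {"the", "a", "an", "of", "is", "in"}

def split_sentences(text):
    # Naive rule: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def normalize(sentence):
    # Lowercase and keep word characters only.
    return re.findall(r"\w+", sentence.lower())

def tag_stop_words(tokens):
    # Attach a coarse tag to every token.
    return [(t, "STOP" if t in STOP_WORDS else "WORD") for t in tokens]

def pipeline(text):
    # Steps 2 to 4 of the workflow, applied sentence by sentence.
    return [tag_stop_words(normalize(s)) for s in split_sentences(text)]

tagged = pipeline("Paris is a city. Is it big?")
```

Each later stage (named entities, morphology, parsing) would consume and enrich the tagged token lists produced here.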

For cross-language processing, documents from different wikis would have to be united using their inter-wiki information.

Corpus Analysis Packages
R now provides tools for doing corpus linguistics. Its main advantage is its strong built-in statistical capabilities. A suitable tool for converting a Wikipedia dump to a corpus should be found or developed. From this, three immediate deliverables could be produced:


 * Frequency lists - Words, Shingles, Phrases.
 * Collocations - phrases that are more than the sum of their words.
 * Concordances - KWIC Databases.
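
The three deliverables above can be illustrated on a toy text. The sketch below uses Python rather than R for brevity; the tokenizer, the sample text and the choice of pointwise mutual information (PMI) as the collocation score are assumptions for illustration, not a decision made on this page.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; a real tokenizer would be language dependent.
    return re.findall(r"\w+", text.lower())

def shingles(tokens, n=2):
    # All contiguous n-token windows (word shingles).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pmi(pair, word_freq, pair_freq):
    # Pointwise mutual information of a bigram, one common collocation score.
    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    p_xy = pair_freq[pair] / n_pairs
    p_x = word_freq[pair[0]] / n_words
    p_y = word_freq[pair[1]] / n_words
    return math.log2(p_xy / (p_x * p_y))

def kwic(tokens, keyword, width=2):
    # Keyword-in-context rows: (left context, keyword, right context).
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

tokens = tokenize("The cat sat on the mat. The dog sat on the log.")
word_freq = Counter(tokens)                # frequency list of words
bigram_freq = Counter(shingles(tokens))    # frequency list of 2-shingles
score = pmi(("sat", "on"), word_freq, bigram_freq)
rows = kwic(tokens, "sat")
```

On a real dump the counters would be built per file so that the corpus can be processed in parallel and merged afterwards.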

What else is of interest?

= Wiktionary Analyzer =

The Lexical workflow described above should produce data which would be loaded into a thread-safe Lucene analyzer.


 * Specialized MediaWiki projects &mdash; (named entity)
 * The Wiktionary projects &mdash; (cross lingual Wordnet) (One wordnet to rule them all)
 * The Wikipedia projects &mdash; (inter-wiki cross language)
 * Hunspell Project &mdash; (spelling,thesaurus,hyphenation,grammar)
 * apertium (monolingual dictionaries,bilingual dictionaries,morphologies)
 * finite state morphology http://sourceforge.net/projects/hfst/

Provide a better, faster, smarter search across the Wikimedia projects (suggestions, spellings, corrections, etc.).

The analysis would be deeper. To be done in sensible time, development would require:
 * Wikimedia project dumps (OpenZIM or XML) source and HTML
 * A Hadoop cluster running advanced Mahout algorithms (SVM, LDA and others)
 * Iterative production of a more powerful Lucene analyzer, bootstrapped via simple scripts followed by unsupervised learning cycles (to complete the picture)
 * Since these jobs could easily become intractable (through bugs or bad algorithms):
 * running dev jobs on wiki subsets
 * job length and current progress/cost estimation are design goals.

Lexical Data

 * 1) Lemma Extraction
 * 2) Scan English Wiktionary at the POS section level.
 * 3) Extract
 * 4) Lang,
 * 5) POS,
 * 6) Lemma information,
 * 7) Glosses
 * 8) co-location
 * 9) proper names
 * 10) silver bullets
 * 11) Entropy reducing heuristic -> to induce each missing lemma from free text.
 * 12) Inducing word-sense based on Topic Map 'Contexts'.
 * 13) Introducing a disambiguation procedure to Semantic/Lexical/Entities.


 * Collatinus, an open-source lemmatiser for Latin, in OCaml
 * Lemmatizer.org, an open-source lemmatizer for English and Russian, in C/C++
 * MorphAdorner, an open-source lemmatiser for English, in Java
 * A lemmatizer for the Spanish language

Semantic Data

 * 1) Word Sense enumeration - Via Grep
 * 2) Word Sense context collection via Mahout SVD
 * 3) Word Sense context description (requires an algorithm)
 * 4) Word nets - synonyms, antonyms
 * 5) Entity type
 * 6) Categories

Entity Data
The idea is to generate a database of entities referenced in Wikipedia. Entities are

Bootstrapping proceeds via article headers and categories. Once the most obvious entities are listed, one proceeds to train classifiers to find the remaining entities via NLP.


 * 1) category Based Classification of:
 * 2) People, organizations, companies, bands, nations, imagined, dead etc
 * 3) Animals, Species, etc
 * 4) Places, counties, cities,
 * 5) Dates, Time Lines, Duration
 * 6) Events, Wars, Treaties, films, awards,
 * 7) Chemicals, medicine, drugs
 * 8) commodities, Stocks etc
 * 9) Publications, Journals, periodicals, Citations etc
 * 10) External Web locations.
 * 11) Unsupervised acquisition of More entities
 * 12) Train SVM classifiers with Mahout on Hadoop, using tagged/parsed low-ambiguity snippets referencing a wide choice of terms.
 * 13) acquire more entities.
 * 14) Crosswikify Top entities.
 * 15) Cross wiki links.
 * 16) Run Mahout LDA on Hadoop on Articles/Sections with correlated entities.

Etymology
Etymologies are not directly important for indexing. They can provide some extra information relating lexemes across languages, and this information might be useful as a secondary source for confirming a lexical hypothesis.

It should be possible to order lexemes by a time-based hierarchy. This could be useful to compress the cross-language word tree and offer insights into how irregularities are introduced into languages over time.
 * Loan Words
 * Assimilation
 * Sound laws such as Grimm's law (vowel and consonant shifts)

For IR this type of analysis is of little practical use due to
 * rapid phonological divergence
 * semantic divergence (e.g. English knight, a mounted warrior, is cognate with German Knecht, which means servant).

If we find an etymological relation in one language it should imply a pattern for other languages:
 * 1) would require a model (loan, analogy, emphasis, assimilation, Grimm's law).
 * 2) requires/implies a phonological distance, semantic distance, word sense.
 * 3) requires/implies a graph of time, language, location.
 * 4) historical linguistics rules could be used to refine such a model.

MT translation Data
This is not a high-priority deliverable since its utility is doubtful.
 * Offline bilingual dictionaries may be of interest, cf. wikt:en:User:Matthias Buchmeier/trans-en-es.awk
 * As wiktionaries improve they could become a significant contribution where statistical methods fall short.
 * Wikipedias clearly contain a large volume of text for generating statistical language models.

Filling in the gaps
During analysis it may be possible to do some extra tasks.
 * 1) Multilingual context-sensitive spell-checking of the wikis, both offline and online.
 * 2) Foreign language learning spelling dictionary
 * 3) Identify "missing templates"; requires a lemma-to-template mapping data structure and a generalization algorithm
 * 4) Identify "missing pages/sections" in the Wiktionary.

Language Instruction
Language instruction would benefit from a database about language pairs or groups, i.e. it could help chart an optimal curriculum for teaching an nth language to a speaker of n-1 languages by producing a graph of least resistance.


 * Top Frequency Lexical Charts
 * Topical Word Lists
 * Lexical problem areas
 * Word Order (requires lemma-n-gram frequencies)
 * Verbal Phrase/ Verbal Complement Misalignment.

Compression

 * 1) frequency information together with lexical data can be used to make a text compressor optimized for a specific wiki.
 * 2) this type of compressed text would be faster to search.

Tagger/Parser
the lexical data + frequency + n-gram frequency could be used to make a parametric cross-lingual parser.

Machine Translation
see also Wikipedia_Machine_Translation_Project


 * 1) export data into apertium format. cf http://wiki.apertium.org
 * a morphological dictionary for language xx called apertium-sh-en.sh.dix, apertium-sh-en.en.dix etc. which contains the rules of how words in language xx are inflected.
 * Bilingual dictionaries which contain correspondences between words and symbols in the two languages. called: apertium-sh-en.sh-en.dix
 * language xx to language yy transfer rules: this file has rules for how language xx will be changed into language yy. In our example this will be: apertium-sh-en.sh-en.t1x
 * language yy to xx language transfer rules: this file has rules for how language yy will be changed into language xx. In our example this will be: apertium-sh-en.en-sh.t1x


 * Apertium likes FSMs, so it could be possible to adapt its morphological data into an efficient spell-checking dictionary.
 * Its format may be able to support collocations.
 * It does not seem to have a word-sense notion.
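
The file naming scheme listed above for the sh-en example can be generated mechanically for any language pair. The helper below is a hypothetical convenience function (its name and dictionary keys are invented for this sketch); it only reproduces the five file names given in the list and does not call any Apertium tool.

```python
def apertium_pair_files(xx, yy):
    # Build the file names of a language pair following the pattern shown
    # above for sh-en. Keys are illustrative labels, not Apertium terms.
    pair = f"apertium-{xx}-{yy}"
    return {
        "monodix_xx": f"{pair}.{xx}.dix",            # morphology of language xx
        "monodix_yy": f"{pair}.{yy}.dix",            # morphology of language yy
        "bidix": f"{pair}.{xx}-{yy}.dix",            # bilingual dictionary
        "transfer_xx_to_yy": f"{pair}.{xx}-{yy}.t1x",
        "transfer_yy_to_xx": f"{pair}.{yy}-{xx}.t1x",
    }

files = apertium_pair_files("sh", "en")
```

An export job could iterate over this mapping to decide which data set to write into which file.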

Crowdsourcing Wiktionary
Using existing categories, templates, the semantic wiki extension, and some scripts one could
 * automate generation of morphological information.
 * again it is possible to automate generation of bi-lingual information from wiktionary pages

Topological Translation Algorithm

 * Statistical translation works by matching parts of sentences. (This has many problems)
 * Requires a large parallel corpus of translated texts. (not available)
 * Assumes that words operate within an n-gram of neighbouring words, whereas some languages have free word order.
 * In reality statistical lexical data is sparse.

My idea is to develop a:
 * topological algorithm for wikis
 * based on related documents and their revisions.
 * can use non-parallel, categorised sets of translated documents.
 * generates seme lattices, i.e. a cross-lingual semantic algorithm
 * (semes N-nets)
 * with (morphological-state)
 * morpho-syntactic Lagrangian for MT.


 * the lattice should converge due to product theorem


 * 1) Translation matrix
 * 2) maps source word-sense to a target word-sense.
 * 3) translate cross language.
 * 4) simplify a single language text.
 * 5) make text clearer via a disambiguate operation

Unsupervised Morphology Induction
A parametric version of unsupervised induction of morphology, based on [Goldsmith]

Parameters
comma separates equivalents, dot separates entries
 * 1) ALPHA_FULL: [a,A.b,B. ... σ,sz,SZ. ... z,Z]
 * 2) ALPHA_VOWL: [a.e.i.o.u]
 * 3) Derive:
 * 4) * ALPHA_SIZE
 * 5) * CONS_COUNT
 * 6) * VOWL_COUNT
 * 7) BOOTSTRP_MIN_STEM_LEN = 5
 * 8) BOOTSTRP_SUCCESOR_SIGNICICANCE = 25
 * 9) BOOTSTRP_SIGN_SEED = (2,2) two suffixes of length two
 * 10) BOOTSTRP_NULL_LEN = 2
 * 11) INDUCT_PREFIX
 * 12) SPLIT_COMPOUNDS
 * 13) assigns stems n-grams signature
 * 14) use the signature to detect compounds

Algorithm

 * 1) Pre-process: recode the corpus. This can be done automatically via min-entropy binary codes as in arithmetic coding, but these rules can be overridden for debuggability, e.g.
 * 2) * [sz<>σ] Hungarian: encode bi-gram as foreign uni-gram
 * 3) * [ssz<>σσ] Hungarian: long consonant as two foreign uni-grams
 * 4) Store words in a pat-trie
 * 5) bootstrap morphology using heuristics.
 * 6) heuristic 1: split at Li where successorFrequency(i)>= 25 and successorFrequency(i+1) == successorFrequency(i-1) == 1
 * 7) heuristic 2: seed signatures with stems having 2 suffixes of length 2
 * 8) repeat;
 * 9) evaluate signature entropy, and morphology entropy recode signatures for min entropy and max robustness (same thing).
 * 10) generalize for unknown words.
 * 11) generalize for unknown stems.
 * 12) generalize for unknown suffixes.
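
Heuristic 1 above can be sketched as follows. For simplicity the sketch stores successor sets in a plain dictionary keyed by prefix instead of a pat-trie, and it counts an end-of-word marker as a successor; both choices, and all function names, are assumptions of this example. The defaults mirror BOOTSTRP_SUCCESOR_SIGNICICANCE = 25 and BOOTSTRP_MIN_STEM_LEN = 5 from the parameter list.

```python
from collections import defaultdict

END = "#"  # end-of-word marker, counted as a possible successor

def successor_table(words):
    # Map every prefix to the set of characters that can follow it.
    succ = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            succ[w[:i]].add(w[i])
        succ[w].add(END)
    return succ

def split_points(word, succ, threshold=25, min_stem_len=5):
    # Positions i where the successor frequency peaks: SF(i) >= threshold
    # while SF(i-1) == SF(i+1) == 1.
    def sf(i):
        return len(succ.get(word[:i], ()))
    return [i for i in range(max(1, min_stem_len), len(word))
            if sf(i) >= threshold and sf(i - 1) == 1 and sf(i + 1) == 1]

words = ["jumps", "jumped", "jumping", "jumper", "talks", "talked", "talking"]
succ = successor_table(words)
# Toy corpus, so the thresholds are lowered for the demonstration.
cuts = split_points("jumping", succ, threshold=3, min_stem_len=4)
```

With this toy list "jumping" is cut after "jump", while "jumped" is not cut because "jumpe" has two successors (d and r). This shows how strict the heuristic is, which is the point of using it only for bootstrapping.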

Data

 * 1) trie word list (or AWG)
 * 2) atoms sorted by frequency
 * 3) stems
 * 4) suffixes
 * 5) prefixes
 * 6) signatures listed by robustness

improvements

 * 1) solve aliphony using Wiktionary
 * 2) solve aliphony using PLSA or TOPIC map variant
 * 3) collect lemmas database from templates categories. (Template extraction)
 * 4) Using hand-built table miner, SED (template) structured text (Wiktionary) and free-text miner (Wikipedia) collect
 * 5) map templates to morphological-state via a "model".
 * 6) MapReduce via Mahout

Lemma Mining
A couple of themes in searching for lemmas. 1. geometry. (Hamming for equal length) 2. N-gram.


 * 1) A bootstrap known knowledge.
 * 2) Lemma base.
 * 3) unknown base.
 * 4) parallel multi-pattern matches
 * 5) affix mode

other iteration modes:
 * 1) Levenshtein clustering using spheres round/between existing lemma and lemma members. Let us say R=d(L1,M1) is the max distance between any two known lemma representatives. Words equidistant from L or M within R/2 are lemma candidates.
 * 2) generalizing the Levenshtein distance to a complex distance (we could map strings to sort be predictive).
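
The Levenshtein clustering idea can be sketched as below. The candidate rule is one possible reading of the description above (a word is a candidate when it lies within R/2 of at least one known member); that interpretation and the function names are assumptions of this example.

```python
from itertools import combinations

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lemma_candidates(members, vocabulary):
    # R is the largest distance between two known members of the lemma.
    radius = max(levenshtein(x, y) for x, y in combinations(members, 2))
    return sorted(
        w for w in vocabulary
        if w not in members
        and any(levenshtein(w, m) <= radius / 2 for m in members)
    )

members = ["walk", "walked", "walking"]
found = lemma_candidates(members, ["walks", "talk", "walker", "banana", "walked"])
```

Here "talk" is returned next to "walks" and "walker", which illustrates why a purely geometric criterion needs a second filter such as the signatures or the n-gram theme mentioned above.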

Lemmatiser + Generalized Morphological Analyzer
Based on the above morphology data structures annotate a given token with the following token_type: The lemma analysis should go into payloads to be used in downstream analysis tasks.
 * UNRECOGNISED
 * PROPER or NAMED_ENTITY - IPA transliteration, entity_type
 * LEMMA - lemmaId or stem or Ngram signature, part of speech, morphological state, Word Sense disambiguation, cross language seme id.

Cross-Lingual Morphological Annotation
Can come in different flavours.
 * Numeric - little effort is made to organize the lemma members, only to enumerate them in a consistent manner. They may be arbitrarily clustered to provide a minimum description length.
 * Linguistic - bootstrapping the numeric mode with annotation enumerating morphological features. An order can also be used to arrange the states into clusters.
 * Cross-Lingual - same as before, but the annotation is a state generalised across many languages. Each language will use a subset of the Cross-Lingual states depending on its morphological parameters.

Note: this annotation format should be structured to allow simple reg-ex extraction.

e.g. Hungarian has:

Data-Structures
Once such a description is available it can be used to define states and signatures.
 * A state is the full morphological description of an inflected word.
 * A signature, as explained above, is a relation associating a POS with its permitted states
 * Transitive Verb 

Uses

 * 1) IR. By supplying a lemma-id and a morphological-state one can provide superior search capabilities in certain languages.
 * 2) A cross-language morphological description facilitates analysis from a more generalised form (generic parametrized algorithms).
 * 3) Cross-language annotations would be useful for machine translation. In this sense morphology is viewed as a semigroup generated as a Cartesian product of its feature subsets.
 * 4) Compression. In reality feature availability varies across languages, so the feature matrix is sparse; within a language, feature availability is also co-dependent (a verb has tense but a noun does not). Therefore a minimal sparse matrix can be extracted and used to compress morphological state. To create such a compression scheme it is necessary to collect statistics showing lemma frequency and feature dependency.
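
The view of morphological state as a Cartesian product of feature subsets, restricted per part of speech, can be sketched like this. The `FEATURES` inventory is entirely hypothetical and much smaller than any real language would need; the point is only that enumerating the permitted states per POS yields small integer ids.

```python
from itertools import product

# Hypothetical feature inventory: each POS only lists the features it can carry.
FEATURES = {
    "NOUN": {"number": ["sg", "pl"], "case": ["nom", "acc", "dat"]},
    "VERB": {"tense": ["past", "pres"], "person": ["1", "2", "3"],
             "number": ["sg", "pl"]},
}

def states(pos):
    # All permitted states of a POS: the product of its own feature values.
    names = list(FEATURES[pos])
    value_lists = [FEATURES[pos][n] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

def state_id(pos, state):
    # Compact numeric id of a state within its POS.
    return states(pos).index(state)

noun_states = states("NOUN")
verb_states = states("VERB")
sid = state_id("NOUN", {"number": "pl", "case": "nom"})
```

A noun here needs 6 ids and a verb 12, instead of the 36 combinations of the full product over number, case, tense and person. Frequency statistics could then be used to assign the shortest codes to the most common states.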

=Phonological Indexing=

Phonological indexing is the notion of also indexing the sounds of words. Naively speaking, if all the text is indexed phonologically, as is the query, it should be possible to search using voice. In reality, however, there are many complexities and one would prefer to restrict sound indexing to where it can make the greatest impact. This boils down to words whose sound representation is most important: generally names or, in English, proper nouns.

Using Wiktionary data and the information in Illustrations of the IPA, it could be possible to provide phonological transcription for many languages. These could be used in several ways:


 * (cross language) sound based search of proper nouns.
 * search via voice query for mobile devices
 * generation of audio version of articles in various language using synthetic voices.

Transcription
Writing systems are categorized into two main types: logographic and phonetic. Logographic systems generally provide a glyph per word. In contrast, phonetic systems use a much smaller set of symbols which generally reflects the phonology of the language at a certain time. In some cases language reform keeps spelling in line with developments in phonology. In other cases the spelling contains historical artifacts from bygone eras (English).

Using a phonological description of a language it is possible to provide a set of context-sensitive production rules which can be used to provide a transcription of the language. Once a phonological description of the corpus/lexicon is available it can be used to improve search. These phonological rules, lexicon, and morphological productions can be used to make a finite state model of the language. Such models are highly compact and efficient descriptions of the morphological lexicon which can both analyze a word and generate a given state from the citation form. For such languages generating an IPA transcription should be possible. While developing FST morphologies is an expert task, Apertium has such models and if integrated they could be used in search.

To provide an IPA Transcription

IPA has broad and narrow transcription styles. The narrow allows a higher degree of accuracy in the description.
 * 1) create a text to IPA transcription grammar
 * 2) create an interpreter
 * 3) allow overriding the IPA from a database
 * 4) create test sets based on Illustrations of the IPA
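
Steps 1 to 3 above can be sketched as a rule table plus a tiny longest-match interpreter. The table holds only four Hungarian consonant spellings (sz, zs, cs, s) as an illustration; all other letters pass through unchanged, so the output is not a complete IPA transcription. The `overrides` dictionary stands in for the database lookup and its contents are hypothetical.

```python
# Tiny illustrative grammar: Hungarian digraphs must win over single letters.
RULES = {"sz": "s", "zs": "ʒ", "cs": "tʃ", "s": "ʃ"}

def transcribe(word, rules=RULES, overrides=None):
    word = word.lower()
    # Step 3: a per-word override (e.g. from a database) beats the grammar.
    if overrides and word in overrides:
        return overrides[word]
    longest = max(len(k) for k in rules)
    out, i = [], 0
    while i < len(word):
        # Step 2: try the longest grapheme first, then shorter ones.
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])  # no rule applies: copy the letter
            i += 1
    return "".join(out)

examples = [transcribe(w) for w in ("szusz", "sas", "csak")]
```

Test sets for step 4 would then be lists of (spelling, expected transcription) pairs checked against `transcribe`.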


 * 1) allow deletion of certain sound patterns based on the user's preferences.
 * 2) * Sound search preferences will be set using a sequence of questions based on a minimal set
 * 3) ** do you pronounce cot and caught in the same way?
 * 4) ** do you pronounce the r in New York?
 * 5) plug these values into a similarity measure for phonology
Phonological Similarity Algorithm
The algorithmic challenge would be to design an IPA-based similarity and ranking algorithm (compare Soundex and Double Metaphone) that:
 * is robust to small changes in input (spelling)
 * respects the sound differences between input and index languages.
 * respects input transliteration from different languages.
 * gracefully degrades precision (e.g. prefer a narrow match, then a broad match, then a clustered phoneme match, then edit distance for typos).
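
As a point of comparison, the classic Soundex code mentioned above can be written in a few lines. This is the standard American Soundex for English names, shown only as the baseline that an IPA-based similarity would have to improve on; it is not the algorithm proposed here.

```python
# Letter groups of American Soundex; vowels, y, h and w carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code   # a vowel resets prev, so a repeated code is kept
    return "".join(out)[:4].ljust(4, "0")

same = soundex("Robert") == soundex("Rupert")
```

Soundex is robust to some spelling variation but is tied to English letter values, which is exactly the limitation that the cross-language requirements in the list above are meant to remove.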

MDL methodology
Collect and encode information based on its ability to reduce description length.


 * (sparse) phonotactic matrices. Two models -
 * boolean binary - generate a table with + indicating pair availability
 * probabilistic model
 * lexicon norming
 * corpus norming via power law
 * foreign influence via cumulative statistics based on 10 partitions of the most frequent 50,0000 words.
 * (Subject to scaling by corpus size.)


 * General phonotactics matrix.
 * lexeme initial, medial, final position and pairs.
 * Cluster phonotactics matrix.
 * morpheme initial, medial, final position and pairs.


 * Sound Assimilation rules (also useful for FSM creation).
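
The two phonotactic models named above (boolean and probabilistic) can be sketched from pair counts. In this example single characters stand in for phonemes and the word list is a toy sample; both are assumptions, as real input would be transcribed lexicon entries.

```python
from collections import Counter

def pair_counts(words):
    # Count adjacent symbol pairs over a word list (a sparse matrix).
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))
    return counts

def boolean_model(counts):
    # The set of attested pairs, i.e. the '+' cells of the table.
    return {pair for pair, n in counts.items() if n > 0}

def probabilistic_model(counts):
    # P(second | first): each row of the matrix normalised to sum to 1.
    totals = Counter()
    for (first, _), n in counts.items():
        totals[first] += n
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}

counts = pair_counts(["stop", "spot", "tops"])
allowed = boolean_model(counts)
probs = probabilistic_model(counts)
```

Position-specific matrices (initial, medial, final) would use the same counting with the pairs filtered by their offset in the lexeme or morpheme.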

=See Also=

Dump Related
Some Options from apertium
 * Tagger training - Creating_a_corpus
 * Building dictionaries - Wikipedia dumps
 * Calculating Coverage

Some Directories

 * http://www.linguistics.ucsb.edu/faculty/stgries/other/links.html
 * http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp_tools.html

Corpora

 * EuroParl &mdash; http://www.statmt.org/europarl/ &mdash; EU12 languages up to 44 million words per language
 * JRC-Acquis &mdash; http://langtech.jrc.it/JRC-Acquis.html &mdash; EU22 languages
 * Southeast European Times &mdash; http://xixona.dlsi.ua.es/~fran/setimes/ &mdash; English,Turkish,Bulgarian,Macedonian,Serbo-Croatian,Albanian,Greek,Romanian &mdash; 9,000 approx. paragraph aligned, 90,000&mdash;120,000 words.
 * South African Government Services &mdash; http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt &mdash; English&mdash;Afrikaans &mdash; 2,500 approx. sentence aligned, 49,375 words.
 * IJS-ELAN &mdash; http://nl.ijs.si/elan/ &mdash; English-Slovenian
 * OPUS &mdash; http://urd.let.rug.nl/tiedeman/OPUS/index.php &mdash; Open Source multilingual corpora
 * Open-Tran &mdash; http://www.open-tran.eu &mdash; single point of access to translations of open-source software in many languages (downloadable as SQLite databases)
 * Tatoeba Project &mdash; http://tatoeba.org/ &mdash; Database of example sentences translated into several languages.

Corpus Tools

 * The openNLP R package &mdash; http://cran.r-project.org/web/packages/openNLP/index.html
 * Corpus Catcher &mdash; http://translate.sourceforge.net/wiki/corpuscatcher/index - Bootstrap corpora from the web
 * BootCaT &mdash; http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
 * Bitextor &mdash; http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web


 * Software Tools for NLP

Concordancing Software

 * http://www.antlab.sci.waseda.ac.jp/antconc_index.html
 * http://www.edict.com.hk/pub/concapp
 * http://rd.vector.co.jp/soft/dl/win95/util/se027330.html
 * http://www.kwicfinder.com/KWiCFinder.html
 * http://corpussearch.sourceforge.net/
 * http://www.textworld.com/scp/

=References=