User:OrenBochman

Stuff

 * Google's Panda to Wikimania
 * Cooperate with Google on NLP
 * Academia
 * Apertium
 * HFST

Summer of Code

 * 1) corpus tools => convert a full wiki dump into a corpus.
 * 2) markup removal (we will provide these filters)
 * 3) only use the "good" edits.
 * 4) make a corpus of bad edits. (these exist)
 * 5) make an edit classifier
 * 6) sentence boundary detection, based on http://nlp.stanford.edu/courses/cs224n/2005/agarwal_herndon_shneider_final.pdf
 * 7) goals: precision > 98% and recall > 80%; better than 95% is excellent. (how do we test these on unsupervised data?)
 * 8) maxent classifier
 * 9) feature extraction (trigram of Prev, Current, Next)
 * 10) Prev/Next is uppercase
 * 11) Prev is all uppercase
 * 12) Prev/Next length
 * 13) Current is ':' '--' '...'
 * 14) Prev is '.' '?' '!'
 * 15) Next is all digits
 * 16) Prev is an abbreviation
 * 17) Current is '.' '?' '!' and Next is '--' or '"'
 * 18) Current is '.' '?' '!' and Next is not '"'
 * 19) data sets (train on a corpus with sentences joined into one line, with every word annotated EOS_Y or EOS_N)
 * 20) allow adding training data based on errors found by (manual) inspection.
 * 21) ~ bonus: classifying non-sentence-breaking punctuation.
 * 22) maxent model
 * 23) convert to the XML format used by most other people
 * 24) table
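Step 2 (markup removal) could start from something like the sketch below: a minimal, regex-based wiki-markup stripper in pure Python. It is illustrative only and does not handle nested templates, tables, or refs; the names are my own, not part of the filters mentioned above.

```python
import re

def strip_wiki_markup(text: str) -> str:
    """Minimal wiki-markup stripper (illustrative, not exhaustive)."""
    # Remove templates {{...}} (non-nested only in this sketch)
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # External links [url label] -> label
    text = re.sub(r"\[\S+ ([^\]]*)\]", r"\1", text)
    # Bold/italic quote runs
    text = re.sub(r"'{2,}", "", text)
    # Headings == Title == -> Title
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)
    return text.strip()
```

A real pipeline would more likely use a dedicated parser (e.g. a wikitext parsing library) than regexes, since wikitext is not regular.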
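The feature set in items 9-18 can be sketched as a single extraction function over a (Prev, Current, Next) token window; the resulting dicts would then be vectorized and fed to the maxent classifier of item 8 (e.g. logistic regression). The abbreviation list here is a toy placeholder, and all names are assumptions for illustration.

```python
# Toy abbreviation list (assumption; a real system would use a larger lexicon)
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "vs"}

def sbd_features(prev: str, cur: str, nxt: str) -> dict:
    """Features for one candidate sentence boundary, per items 9-18 above."""
    terminal = {".", "?", "!"}
    return {
        "trigram": f"{prev}|{cur}|{nxt}",              # item 9
        "next_upper": nxt[:1].isupper(),               # item 10
        "prev_all_upper": prev.isupper(),              # item 11
        "prev_len": len(prev),                         # item 12
        "next_len": len(nxt),
        "cur_is_sep": cur in {":", "--", "..."},       # item 13
        "prev_is_terminal": prev in terminal,          # item 14
        "next_all_digits": nxt.isdigit(),              # item 15
        "prev_is_abbrev": prev.lower().rstrip(".") in ABBREVIATIONS,  # item 16
        "terminal_then_quote": cur in terminal and nxt in {"--", '"'},  # item 17
        "terminal_no_quote": cur in terminal and nxt != '"',            # item 18
    }
```

These dicts are deliberately classifier-agnostic; any maxent implementation that accepts sparse feature dicts can consume them.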


 * 1) Lemmas to word senses
 * 2) existing work
 * 3) semantic frames - the verb "think" (about) takes a noun complement XXX. In Hungarian this is more explicit. Semantic frames can be a powerful format for representing the knowledge in sentences, and could be used to convert text to relations (go, go to XXX, go from XXX to YYY); not many relations are needed: verbs of motion, events, etc.
 * 4) logic frames - map simple sentences to a Prolog-like logic structure
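Item 4 above can be illustrated with a toy mapper from pattern-matched sentences to Prolog-like terms for the "go" relation mentioned in item 3. The grammar here is a deliberately tiny assumption (two fixed patterns), not a proposed implementation.

```python
def to_logic_frame(sentence: str):
    """Map simple motion sentences to a Prolog-like term (toy patterns only)."""
    words = sentence.rstrip(".").split()
    # Pattern: <subject> goes from <source> to <destination>
    if len(words) == 6 and words[1:3] == ["goes", "from"] and words[4] == "to":
        return f"go({words[0].lower()}, {words[3].lower()}, {words[5].lower()})"
    # Pattern: <subject> goes to <destination>
    if len(words) == 4 and words[1:3] == ["goes", "to"]:
        return f"go({words[0].lower()}, {words[3].lower()})"
    return None  # sentence does not match any known frame
```

Even this toy version shows the appeal of the idea: a small inventory of relations covers many concrete sentences.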