User:OrenBochman

Stuff

 * Google's Panda to Wikimania
 * Cooperate with Google on NLP
 * Academia
 * Apertium
 * HFST

Summer of Code

 * 1) corpus tools => convert a full wiki dump to a corpus.
 * 2) use only the "good" edits.
 * 3) sentence boundary detection based on http://nlp.stanford.edu/courses/cs224n/2005/agarwal_herndon_shneider_final.pdf
 * 4) goals: precision > 98% and recall > 80%; better than 95% is excellent. (how to test these on unsupervised data?)
 * 5) maxent classifier
 * 6) feature extraction (trigram of Prev, Current, Next)
 * 7) Prev/Next is uppercase
 * 8) Prev is all uppercase
 * 9) Prev/Next length
 * 10) Current is ':' '--' '...'
 * 11) Prev is '.' '?' '!'
 * 12) Next is all digits
 * 13) Prev is an abbreviation
 * 14) Current is '.' '?' '!' and Next is '--' or '"'
 * 15) Current is '.' '?' '!' and Next is not '"'
 * 16) data sets (train on a corpus with sentences joined into one line and every word annotated EOS_Y or EOS_N)
 * 17) allow adding training data based on errors found by (manual) inspection.
 * 18) ~ bonus: classifying non-sentence-breaking punctuation.
 * 19) maxent model
 * 20) markup removal (we will provide this filter)
 * 21) convert to the XML format that most other people use
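The trigram features listed above (items 6–15) can be sketched as a single extraction function. This is a minimal illustration, not the project's actual implementation: the feature names, the seed abbreviation list, and the exact boolean tests are assumptions. The resulting dicts would feed a maxent classifier (e.g. via a logistic-regression trainer) labeled EOS_Y / EOS_N.

```python
# Assumed seed list of abbreviations for feature 13; a real run would
# mine these from the corpus instead of hard-coding them.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}

def extract_features(prev, current, next_):
    """Feature dict for one candidate boundary token (Prev, Current, Next)."""
    return {
        # raw trigram (feature 6)
        "prev_token": prev,
        "current_token": current,
        "next_token": next_,
        # 7) Prev/Next starts with an uppercase letter
        "prev_capitalized": prev[:1].isupper(),
        "next_capitalized": next_[:1].isupper(),
        # 8) Prev is all uppercase (e.g. an acronym)
        "prev_all_upper": prev.isupper(),
        # 9) Prev/Next token length
        "prev_len": len(prev),
        "next_len": len(next_),
        # 10) Current is a typically non-breaking mark
        "current_nonbreaking": current in {":", "--", "..."},
        # 11) Prev is itself sentence-final punctuation
        "prev_is_final_punct": prev in {".", "?", "!"},
        # 12) Next is all digits
        "next_is_digits": next_.isdigit(),
        # 13) Prev is a known abbreviation
        "prev_is_abbrev": prev.lower().rstrip(".") in ABBREVIATIONS,
        # 14) Current is final punctuation and Next is '--' or '"'
        "final_then_quote_or_dash": current in {".", "?", "!"}
                                    and next_ in {"--", '"'},
        # 15) Current is final punctuation and Next is not '"'
        "final_not_quote": current in {".", "?", "!"} and next_ != '"',
    }
```

For example, `extract_features("Dr", ".", "Smith")` fires `prev_is_abbrev` and `next_capitalized` together, which is exactly the pattern a maxent model can learn to weigh as EOS_N despite the period.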