User:OrenBochman

Top MediaWiki IA Flaws

 * UI elements aren't role-dependent. (Non-admins can click on hundreds of things that won't work.)
 * No UI widget format, which means that extensions either are
 * tag based
 * single page based
 * have no UI
 * or modify the existing UI in complicated ways (steeper learning curve).
 * Parser
 * There is direct access to the parser.
 * The parser is not really a parser but a set of transformations.
 * Watch is limited. (It does not support time-based follow-up actions.)
 * Most policy should be built into the software, not implemented in hundreds of unscalable bots or enforced by armies of administrators and editors.
 * Talk pages are primitive and lacking basic social features for interpersonal communication. (So people roll their own inferior features)
 * Signatures should be automatic.
 * Discussions should be threaded (this actually exists but it is built on top of talk pages)
 * No formal relations - friends/collaboration groups.
 * No avatars - identities are highly non-social.
 * No private/alternative communications network (IM, email, messages, VoIP).
 * No blogging, social bookmarking, or social games. (These are not considered part of Wikipedia's role, but they would be worth integrating to increase editor engagement by developing personal spaces.)
 * No browsing history widget
 * No editing history widget (only a special page)
 * Support for quiz pages (kind of works).
 * History - all consecutive edits by a single editor should be merged into one.
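
The last item above (merging runs of edits by one editor in the history view) could be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the `Revision` tuple and the `merge_consecutive` function are hypothetical names, and the history is assumed to be a list ordered from oldest to newest.

```python
from itertools import groupby
from typing import NamedTuple


class Revision(NamedTuple):
    """Hypothetical, simplified revision record (not the MediaWiki schema)."""
    rev_id: int
    editor: str
    summary: str


def merge_consecutive(history):
    """Collapse each run of back-to-back revisions by the same editor.

    Returns one (editor, [rev_ids], [summaries]) tuple per run, in order.
    """
    merged = []
    # groupby only groups adjacent items, which is exactly what is wanted here:
    # edits by the same editor separated by someone else's edit stay separate.
    for editor, run in groupby(history, key=lambda r: r.editor):
        run = list(run)
        merged.append((editor, [r.rev_id for r in run], [r.summary for r in run]))
    return merged
```

A history of four revisions by editors A, A, B, A would collapse into three entries, since only the first two edits are adjacent.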

SOLR
security:

Stuff

 * Google's Panda to Wikimania
 * Cooperate with
 * Google on NLP
 * Academia
 * Apertium
 * HFST

Summer Of Code

 * 1) corpus tools => convert full wikidump to corpus.
 * 2) markup removal (we will provide these filters)
 * 3) only use the "good" edits.
 * 4) make a corpus of bad edits. (these exist)
 * 5) make an edit classifier
 * 6) sentence boundary detection based on http://nlp.stanford.edu/courses/cs224n/2005/agarwal_herndon_shneider_final.pdf
 * 7) goals: precision > 98% and recall > 80%; better than 95% is excellent. (How to test these on unsupervised data?)
 * 8) maxent classifier
 * 9) feature extraction (trigram of Prev, Current, Next)
 * 10) Prev/Next is uppercase
 * 11) Prev is all uppercase
 * 12) Prev/Next length
 * 13) Current is ':' '--' '...'
 * 14) Prev is '.' '?' '!'
 * 15) Next is all digits
 * 16) Prev is an abbreviation
 * 17) Current is '.' '?' '!' next is '--' or '"'
 * 18) Current is '.' '?' '!' next not '"'
 * 19) data sets (train on a corpus with sentences joined in one line and all words annotated EOA_Y or EOS_N)
 * 20) allow adding training data based on errors found by (manual) inspection.
 * 21) ~ bonus: classifying non-sentence-breaking punctuation.
 * 22) maxent model
 * 23) convert to xml format that is used by most other people
 * 24) table
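
The feature list in items 9-18 and the precision/recall goal in item 7 could be sketched as below. This is a minimal illustration under stated assumptions, not the method from the linked paper: the function names, the feature names, and the tiny `ABBREVIATIONS` set are all hypothetical, input is assumed to be already tokenized with punctuation as separate tokens, and "is uppercase" in item 10 is read here as "starts with an uppercase letter". The resulting feature dictionaries would be the input to whatever maxent (multinomial logistic regression) trainer is chosen.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be larger.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc"}
TERMINAL = {".", "?", "!"}


def extract_features(tokens, i):
    """Features for the candidate boundary at tokens[i], using Prev/Current/Next."""
    prev = tokens[i - 1] if i > 0 else ""
    cur = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "trigram": (prev, cur, nxt),                       # item 9
        "prev_capitalized": prev[:1].isupper(),            # item 10 (assumed reading)
        "next_capitalized": nxt[:1].isupper(),             # item 10 (assumed reading)
        "prev_all_upper": prev.isupper(),                  # item 11
        "prev_len": len(prev),                             # item 12
        "next_len": len(nxt),                              # item 12
        "cur_is_soft_punct": cur in {":", "--", "..."},    # item 13
        "prev_is_terminal": prev in TERMINAL,              # item 14
        "next_all_digits": nxt.isdigit(),                  # item 15
        "prev_is_abbreviation": prev.lower() in ABBREVIATIONS,           # item 16
        "cur_terminal_next_dash_or_quote": cur in TERMINAL and nxt in {"--", '"'},  # item 17
        "cur_terminal_next_not_quote": cur in TERMINAL and nxt != '"',   # item 18
    }


def precision_recall(gold, predicted):
    """Precision and recall of predicted boundaries against gold labels (booleans)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For the tokens `["Dr", ".", "Smith", "arrived", "."]`, the period at index 1 gets `prev_is_abbreviation` set, which is the kind of signal the classifier needs in order not to split after "Dr.". Note that `precision_recall` needs gold labels, so it does not by itself answer the open question in item 7 about testing on unsupervised data.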


 * 1) Lemmas to word sense
 * 2) existing work
 * 3) semantic frames - the verb "think" (about) takes a noun complement XXX. In Hungarian this is more explicit. Can be a powerful format for representing knowledge in sentences. Could be used to convert text to relations (go, go to XXX, go from XXX to YYY). Not many relations are needed. Verbs of motion, events,
 * 4) logic frames - map simple sentences to a Prolog-like logic structure
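
As a rough illustration of items 3 and 4, the "go" frames mentioned above (go, go to XXX, go from XXX to YYY) could be mapped to Prolog-like terms with a few patterns. This is a toy sketch only: the regular expressions, the `GO_FRAMES` table, the term layout, and the `sentence_to_term` function are assumptions made for the example, and a real system would work from lemmas and a parser rather than surface patterns.

```python
import re

# Hypothetical frame patterns for the verb "go", most specific first.
GO_FRAMES = [
    (re.compile(r"^(\w+) (?:goes|went) from (\w+) to (\w+)\.?$"),
     "go({0}, from({1}), to({2}))"),
    (re.compile(r"^(\w+) (?:goes|went) to (\w+)\.?$"),
     "go({0}, to({1}))"),
    (re.compile(r"^(\w+) (?:goes|went)\.?$"),
     "go({0})"),
]


def sentence_to_term(sentence):
    """Return a Prolog-like term for a simple "go" sentence, or None if no frame fits."""
    text = sentence.strip()
    for pattern, template in GO_FRAMES:
        match = pattern.match(text)
        if match:
            # Lowercase the arguments so they read as Prolog atoms.
            args = [arg.lower() for arg in match.groups()]
            return template.format(*args)
    return None
```

With these patterns, "Alice went from Paris to Rome." becomes `go(alice, from(paris), to(rome))`, while a sentence with an uncovered verb returns `None`.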