User:OrenBochman

From MediaWiki.org
  • Name: Oren Bochman
  • Main Project title: "Wikipedia Search"
  • Contact information:
  • my page on Wiktionary
This user is a proud MediaWiki extension developer and participant in WikiProject Extensions.

These Are A Few Of My Favourite Things

Ant [1] ANTLR [2] ApacheBench [3] Apertium [4] Bugzilla [5] Code Review [6]
carrot2 [7] DAWG [8] Etherpad Lite [9] Jenkins [10] Lucene [11] Maven [12]
Nutch [13] Open Relevance [14] R [15] Subversion [16] SOLR [17] Tika [18]
Translate Wiki [19] Vogella On Java [20] Wikilabs [21] UIMA [22] Solarium [23]

Quick IRC Channel Links

mediawiki mediawiki-dev mediawiki-ops wikimedia-tech wiktionary Openzim
Kiwix Lucene Solr Hadoop Nutch Semantic Media Wiki
Search NG Project
Todo List Operational Plan Test Plan Risk Assessment
NG Search Spec Search NG Analytics NLP Tools Search Tools
Search Labs Configuration Lucene-search Spec Old Code Review
Q&A



Extension Ideas

  • LaTeX Diagram Builder (LaTeX to SVG script)
    • Takes a LaTeX diagram inside a <latexD></latexD> tag.
    • Outputs an SVG of the diagram.
    • Easy to do, since LaTeX can run as a command-line application.
  • Gambit extension
    • Takes an extensive-form game.
    • Generates a diagram.
    • Generates solutions.
    • Easy to do, since Gambit works as a command-line application.
    • Cannot produce ESS reports.
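The LaTeX Diagram Builder idea above could be sketched roughly as follows. This is only an illustration in Python (a real MediaWiki extension would be written in PHP), and it assumes a TeX distribution that provides the `latex` and `dvisvgm` command-line tools. The function names `build_commands` and `latex_to_svg` and the file name `diagram.tex` are made up for this sketch.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_commands(tex_path: Path, svg_path: Path) -> List[List[str]]:
    """Return the two command lines: LaTeX -> DVI, then DVI -> SVG."""
    dvi_path = tex_path.with_suffix(".dvi")
    return [
        # -no-shell-escape: wiki text is untrusted input, so do not let it run programs.
        ["latex", "-interaction=nonstopmode", "-no-shell-escape",
         "-output-directory={}".format(tex_path.parent), str(tex_path)],
        # --no-fonts draws glyphs as paths so the SVG needs no font files.
        ["dvisvgm", "--no-fonts", "-o", str(svg_path), str(dvi_path)],
    ]


def latex_to_svg(latex_source: str) -> str:
    """Compile the body of a <latexD> tag and return the SVG markup (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = Path(tmp) / "diagram.tex"
        svg_path = Path(tmp) / "diagram.svg"
        tex_path.write_text(latex_source, encoding="utf-8")
        for command in build_commands(tex_path, svg_path):
            # check=True raises CalledProcessError if a tool fails.
            subprocess.run(command, check=True, cwd=tmp, capture_output=True)
        return svg_path.read_text(encoding="utf-8")
```

A tag hook would pass the tag body to `latex_to_svg` and embed the returned markup. Caching by a hash of the source would probably be needed to keep page rendering fast.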

Top MediaWiki IA Flaws

  • 95% of policy should be built into the software and the rest should be optional; any other alternative won't scale. (Is this possible?)
  • No clickstream analytics - dev teams are working blind as to how users actually work.
  • The standard method of extending or automating should be built in. Today it is done via JavaScript or bots.
  • Missing role-based logic - UI elements and behaviour should be visible only if they are actionable (a non-admin can click on hundreds of things that won't work).
  • The UI and extensions are closely coupled to the parser.
  • No UI widget format - this means that extensions either
    • are tag based,
    • are single-page based,
    • have no UI, or
    • modify the existing UI in complicated ways (steeper learning curve).
  • Parser
    • The parser is not really a parser but a set of transformations. (This is getting fixed!)
    • Extensions get direct access to the parser via hooks instead of an abstracted mechanism that protects it from bad extensions.
  • Watch is limited (it does not support time-based follow-up actions).
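The role-based logic point above can be illustrated with a small sketch: each UI action declares the right it needs, and only actions the current user can actually perform are rendered. The `UiAction` type, the `visible_actions` function, and the right names are assumptions made for this illustration, not MediaWiki's actual API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class UiAction:
    label: str           # text shown on the button or link
    required_right: str  # right the user must hold for the action to work


def visible_actions(actions: List[UiAction], user_rights: FrozenSet[str]) -> List[UiAction]:
    """Keep only the actions the user is allowed to perform."""
    return [action for action in actions if action.required_right in user_rights]


# Illustrative data: three page actions and an ordinary (non-admin) editor.
PAGE_ACTIONS = [
    UiAction("Edit", "edit"),
    UiAction("Delete", "delete"),
    UiAction("Protect", "protect"),
]
editor_rights = frozenset({"edit"})
editor_view = visible_actions(PAGE_ACTIONS, editor_rights)
```

With this data, `editor_view` contains only the "Edit" action, so the editor never sees controls that would fail on click.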

editors.

  • Talk pages are primitive and lack basic social features for interpersonal communication. (So people roll their own inferior features.)
    • Signatures should be automatic, and viewing their details should be a UI option.
    • Discussions should be threaded (this actually exists, but it is built on top of talk pages).
    • No formal relations - friends/collaboration groups.
    • No avatars - identities are highly non-social.
    • No private/alternative communications network (IM, email, messages, VoIP).
    • No blogging, social bookmarking, or social games. (These are not considered part of Wikipedia's role, but they would be worth integrating to increase editor engagement by developing personal spaces.)
    • No browsing history widget.
    • No editing history widget (only a special page).
  • No support for persistent quiz pages (kind of works).
  • History - all consecutive edits by a single editor should be merged into one.
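The history item above can be sketched as a display-time transformation, assuming that "merged" means collapsing each run of back-to-back revisions by the same editor into a single entry. The `Revision` and `MergedEntry` types and the `merge_consecutive` function are hypothetical and exist only for this sketch. Nothing is deleted from the stored history.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Revision:
    rev_id: int
    editor: str


@dataclass
class MergedEntry:
    editor: str
    first_rev_id: int
    last_rev_id: int
    edit_count: int


def merge_consecutive(revisions: List[Revision]) -> List[MergedEntry]:
    """Collapse each run of adjacent revisions by one editor into one entry.

    Expects revisions in chronological order. groupby only groups adjacent
    items, so an editor who returns after someone else gets a new entry.
    """
    merged = []
    for editor, run in groupby(revisions, key=lambda rev: rev.editor):
        run = list(run)
        merged.append(MergedEntry(editor, run[0].rev_id, run[-1].rev_id, len(run)))
    return merged


history = [Revision(1, "Alice"), Revision(2, "Alice"), Revision(3, "Bob"), Revision(4, "Alice")]
entries = merge_consecutive(history)
```

Here `entries` has three items: Alice (revisions 1 to 2), Bob (revision 3), and Alice again (revision 4). The diff for a merged entry would be taken between the revision before `first_rev_id` and `last_rev_id`.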

Conferences

SOLR

security: [1]

Stuff

  • Cooperate with
    • Google on NLP
    • Academia
    • Apertium
    • HFST

Summer Of Code

Lucene Lemma Analyzers based on Morphology Extraction from Wikipedia Text

  • Part 1: use and expand induction software to process existing languages.
  1. Lemmas to word sense:
    1. existing works
    2. semantic frames - the verb "think" (about) takes a noun complement XXX. In Hungarian this is more explicit. This can be a powerful format for representing knowledge in sentences, and could be used to convert text to relations (go, go to XXX, go from XXX to YYY). Not many relations are needed. Verbs of motion, events,
    3. logic frames - map simple sentences to a Prolog-like logic structure
  • Part 2: extract semantic frames from a (part-of-speech tagged) corpus.
  • Deliverables:
  1. semantic networks used in Wikipedia
  2. search and retrieve sample sentences for semantic frame patterns

Lucene - Automatic Query Expansion System

Use SVD or other methods to build cross-language word nets.

User Fingerprinting

  1. Anonymous fingerprinting for:
    • free unregistered editor contributions
    • sock puppet detection
  • Probably not a good GSoC concept.

Lucene - NG Wiki Parser Filter

Integrate the cutting-edge parser as a Lucene filter to allow offline indexing of wiki source.

  • Deliverable: an up-to-date Wikipedia parser.
  • Problem: no specs.
  • Problem: templates.

This will probably be one of my own projects if I get to work full time.

UIMA Content Extraction From Talk Pages

Use UIMA to automate content extraction from talk and user talk pages. This is to facilitate tracking of action on various policies. Produce a Q&A system.

This is on the fringe of content analytics.


Corpus Stuff

Footnotes

  1. Ant
  2. Grammars
  3. Benchmark
  4. Machine Translation
  5. QA
  6. MediaWiki's
  7. clustering
  8. data structure
  9. real time collaboration
  10. CI
  11. search lib
  12. language detection
  13. checking external links
  14. testing search
  15. Statistics & data mining
  16. source control
  17. search engine
  18. language detection
  19. translation memory
  20. tutorials
  21. testing
  22. content analytics framework
  23. SOLR PHP integration