Parsoid/Wikimania 2014

From mediawiki.org

Let's summarize our talk plans for Wikimania.

Talk abstract

Parsoid is changing the way we can work with wiki content by representing it as equivalent and editable semantic HTML+RDFa markup. It powers the VisualEditor, but is also used by a growing number of innovative projects including the Flow discussion system, the Kiwix offline reader, and the new ContentTranslation and PDF rendering systems. In the longer term, it is on track to provide the default content representation and Wikitext user interface for MediaWiki.

In this presentation, we will illustrate some of the problems we faced while building the bi-directional conversion between Wikitext and HTML. We will show how we addressed some of them, and which limitations remain. Addressing the remaining limitations will mostly involve cleaning up broken wikitext. We will show some examples, and point out the few cases where limitations actually impact non-broken wikitext. We will also describe how we systematically test the quality of the conversion to catch issues like 'dirty diffs' early before they break pages in production, and where this testing has failed in the past.
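The idea of checking a round trip for 'dirty diffs' can be sketched in a few lines. This is only an illustration and not Parsoid's actual test harness: `toy_round_trip` is a hypothetical, deliberately lossy stand-in for the real wikitext to HTML to wikitext conversion, and `dirty_diff` is an assumed helper name that reports the lines a round trip changed.

```python
import difflib


def dirty_diff(original, round_tripped):
    """Return the '-' and '+' lines that differ between two wikitext strings."""
    lines = list(
        difflib.unified_diff(
            original.splitlines(),
            round_tripped.splitlines(),
            lineterm="",
            n=0,  # no context lines: only the changed lines are reported
        )
    )
    # The first two entries are the '---' / '+++' file headers;
    # '@@' entries are hunk markers. Everything else is a changed line.
    return [line for line in lines[2:] if not line.startswith("@@")]


def toy_round_trip(wikitext):
    """Hypothetical stand-in for a real converter: it normalizes heading
    spacing, which introduces a dirty diff on purpose."""
    out = []
    for line in wikitext.splitlines():
        if line.startswith("==") and line.endswith("==") and len(line) > 4:
            out.append("== " + line.strip("=").strip() + " ==")
        else:
            out.append(line)
    return "\n".join(out)


PAGE = "==History==\nSome '''bold''' text.\n"

if __name__ == "__main__":
    for line in dirty_diff(PAGE, toy_round_trip(PAGE)):
        print(line)
```

An empty result means the round trip left the page untouched. Any reported line is a change the user did not make, which is exactly what should be caught before it reaches production.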

The second part of the presentation will focus on how the HTML+RDFa format and the Parsoid API can help you write more powerful gadgets, bots, and editing or data extraction tools. We will illustrate this using examples from existing projects (see the list of current users). Semi-automated content translation, including template adaptation, is a good example of a problem that was very hard to solve at the wikitext level but becomes tractable in HTML. It is also an example of users taking an API and building innovative tools around it. As a more hands-on example, we will demonstrate how easy it is to build a small editing gadget for micro-contributions.
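To give a flavour of why data extraction gets easier on the HTML side, here is a minimal sketch that collects the targets of internal wiki links using only the Python standard library. It assumes that internal links are emitted as `<a>` elements carrying `rel="mw:WikiLink"`. The sample markup is hand-written for illustration rather than real Parsoid output, and the names `WikiLinkCollector` and `wiki_link_targets` are made up for this example.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect href values of <a> elements whose rel includes mw:WikiLink."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attr.get("rel") or "").split()
        if "mw:WikiLink" in rel_tokens and attr.get("href"):
            self.targets.append(attr["href"])


def wiki_link_targets(html):
    """Return the href of every internal wiki link found in the HTML string."""
    collector = WikiLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.targets


# Hand-written sample in the assumed output style (not fetched from a server).
SAMPLE = (
    '<p>See <a rel="mw:WikiLink" href="./Kiwix">Kiwix</a> and '
    '<a rel="mw:ExtLink" href="https://example.org/">an external site</a>, '
    'or <a rel="mw:WikiLink" href="./VisualEditor">VisualEditor</a>.</p>'
)

if __name__ == "__main__":
    print(wiki_link_targets(SAMPLE))
```

Doing the same on raw wikitext would mean handling piped links, templates that expand to links, and nowiki sections by hand. On the HTML side it is an ordinary attribute filter.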

Finally, we will close our presentation by talking about future plans for Parsoid and MediaWiki's content representation in general. These include directly storing HTML+RDFa for pages to speed up the site for editors, and visual diffing for a more intuitive comparison of article versions. We will also show prototypes of new ways to structure the content itself using HTML-based templating and data-driven widgets for data tables and other common page elements.

Questions to answer

  • What is the problem we are trying to solve?
  • Why is this hard? (examples!)
  • How do we address those problems?
  • How does having HTML+RDFa enable new features?
    • Kiwix, PDF rendering, LintTrap, Google, translations, Flow, ...
  • How can I use this in my gadget / bot / whatever?
    • How to use the API (hopefully including a save API by then)
  • What are the future plans regarding content, templating, etc.?
    • Rashomon, fast page loads for logged-in users, HTML templating, ?

Outline

Here's one possible outline: (cscott)

  1. Introduction / motivating example
  2. The difficulties of wikitext
    • wikitext tarpits
    • parser codebase
    • practical issues: hard to write bots, etc.
  3. Vision
    • A more standard representation
    • Editable with existing CE tools
  4. The HTML+RDFa promised land
    • Some examples: it's just HTML!
      • can use jQuery to find all links, etc.
    • RDFa semantic data
  5. Current applications
    • VisualEditor
    • Kiwix
    • PDF rendering
  6. Future applications
    • LintTrap
    • better templating
    • easier bots
    • unified storage?
  7. Community
    • How can I use Parsoid? (Parsoid API service, etc.)