Parsoid/Roadmap/2014 15

From mediawiki.org

This roadmap was outlined early in FY2014-2015, and a lot has changed since then. This document is outdated and mostly of historical interest. For the most up-to-date information, see the current roadmap wiki page.

High level goals

  • Q1 Jul - Sep 2014:
    • Must: Likely spillover from 2013 goals: Basic language variant support
    • Must: Stable element ids (supports content translation, WT/HTML switching in VE, other features)
    • Must: Prepare to use Parsoid HTML for page view (finish up visual diffs, set up RT testing, a prototype for all page views)
    • Should: Native gallery impl
    • Should/Could: Start experimentation with content widgets: Cross-team effort (Parsoid, VE, Services, Community Engagement, ...)
  • Q2 Oct - Dec 2014: See main page for Q2 tasks.
    • Must: Stable element ids (supports content translation, WT/HTML switching in VE, other features)
    • Must: Prepare to use Parsoid HTML for page view (beta for mobile?, all page views?)
    • Must/Should: Wikilint/LintTrap
    • Should/Could: Sister projects/extension (other extensions beyond gallery)
    • Should/Could: HTML content templating: Cross-team effort (Parsoid, VE, Services, Community Engagement, ...)
    • Should/Could: Experimentation with content widgets: Cross-team effort (Parsoid, VE, Services, Community Engagement, ...)
  • Q3 Jan - Mar 2015:
    • Must: Beta/post-beta for Parsoid HTML for page view
    • Must/Should: Wikilint/LintTrap
    • Must/Should: Sister projects/extension (other extensions beyond gallery)
    • Should/Could: HTML content templating: Cross-team effort (Parsoid, VE, Services, Community Engagement, ...)
  • Q4 Apr - Jun 2015:
    • Ongoing work to support HTML-only wikis
      • HTML content templating
      • Content widgets

Interdependencies (with other projects):

  • Rashomon / Content API is needed for html page views, stable element ids, efficient template updates.
  • VE, Flow, Mobile depend on Parsoid output
  • i18n consultation for language variant support
  • Content translation group on stable ids
  • Collaboration / interaction with editors community for wikilint/linttrap and possibly content widgets
  • Wikidata for content widgets

Constraint:

  • ~30-40% of time (a somewhat arbitrary number) on maintenance and non-roadmap tasks (bug fixing, deploys, GSoC, presentations, etc.)

Details on High-Level goals

Start using Parsoid HTML for page views

  • Eliminate user-specific rendering aspects from rendered HTML and make generated HTML user-login-state agnostic
    • Parsoid team is responsible for removing the network-heavy metadata (data-parsoid) from rendered output, storing it in metadata storage, and maintaining a map of element ids to Parsoid-specific metadata. This metadata is required for accurate serialization after edits, not for regular page views.
    • Services / platform teams will provide services / API for getting information about redlinks, user-preferences.
    • This has the side-effect of eliminating cache/storage fragmentation. Logged-in page views will still require front-end JS to fetch information about redlinks, user-preferences, etc. and update the view (this could be done by the Parsoid team or services / platform teams).
  • Services team has the high-level goal of building the infrastructure for this (Rashomon, API with redlinks etc).
  • Requires parser tests to be tidy enabled.
    • Will provide better insight into rendering differences on wikipedias and most wikis where tidy is almost always enabled.
  • Requires more QA on rendering accuracy (visual diffing).
    • Will provide better insight into (in)compatibilities with current rendering and scale of work.
  • Requires identifying template uses (ex: mixed part-style-content templates) that we currently don't handle/support.
  • Will likely have a long tail of rendering diffs.
  • Enforce nesting, <domparse> tag, wikitext linting.
  • PDF rendering from Parsoid HTML will be the first test of this.
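
As a rough illustration of the data-parsoid split described above, the following is a minimal Python sketch, not Parsoid's actual implementation. It assumes the rendered fragment is well-formed XML and that every element carrying data-parsoid also has an id; the function name split_metadata, the ids, and the dsr payloads are made up for the example.

```python
import json
import xml.etree.ElementTree as ET


def split_metadata(html):
    """Strip data-parsoid attributes and return (clean_html, metadata_by_id).

    Assumes the fragment is well-formed XML and that every element carrying
    data-parsoid also has an id attribute (both are simplifying assumptions).
    """
    root = ET.fromstring(html)
    metadata_by_id = {}
    for element in root.iter():
        raw = element.attrib.pop("data-parsoid", None)
        if raw is not None:
            # Key the private metadata by element id so it can be looked up
            # again at serialization time.
            metadata_by_id[element.attrib["id"]] = json.loads(raw)
    return ET.tostring(root, encoding="unicode"), metadata_by_id


# Illustrative input only; real Parsoid output is far richer.
page = (
    '<div id="mwAA" data-parsoid=\'{"dsr":[0,12]}\'>'
    '<p id="mwAB" data-parsoid=\'{"dsr":[0,5]}\'>Hello</p>'
    "</div>"
)
clean_html, metadata = split_metadata(page)
print(clean_html)  # HTML for page views, without data-parsoid
print(metadata)    # side map that would go to metadata storage
```

Only the clean HTML would be sent for page views; the side map would be fetched when an edit has to be serialized back to wikitext.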

Stable element ids in HTML

  • Challenge: Preserve element ids across wikitext edits
    • We have ideas on this that should also improve performance for switching from wikitext to HTML
  • Supports content translation, authorship maps, possibly efficient diffing, document part retrieval
  • Important for save performance (see bug 64171)
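
The roadmap does not say how ids would be preserved. Purely as one possible approach (an assumption, not the team's design), ids could be carried over by aligning the blocks of the old and new revisions and reusing the id wherever a block is unchanged. The sketch below does this with Python's difflib; carry_over_ids and the id values are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import count


def carry_over_ids(old_blocks, new_texts, fresh_ids):
    """Reuse ids for blocks whose text is unchanged; mint ids for the rest.

    old_blocks: list of (element_id, text) from the previous revision.
    new_texts:  list of block texts from the edited revision.
    fresh_ids:  iterator yielding ids that have not been used before.
    """
    old_texts = [text for _, text in old_blocks]
    matcher = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
    ids = [None] * len(new_texts)
    for match in matcher.get_matching_blocks():
        for offset in range(match.size):
            ids[match.b + offset] = old_blocks[match.a + offset][0]
    return [
        (element_id if element_id is not None else next(fresh_ids), text)
        for element_id, text in zip(ids, new_texts)
    ]


old = [("mwAA", "Intro"), ("mwAB", "Body"), ("mwAC", "Footer")]
new = ["Intro", "Inserted", "Body edited", "Footer"]
fresh = (f"new{n}" for n in count(1))
result = carry_over_ids(old, new, fresh)
print(result)
```

In this toy run the unchanged "Intro" and "Footer" blocks keep their ids, while the inserted and edited blocks get fresh ones. A real implementation would also need to decide when an edited block should keep its id.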

Switching between wikitext & HTML

  • Naive implementation (maintain a lot of state in VE and pass to/from Parsoid) will not be very performant, and also conflicts with the path of stripping data-parsoid from HTML.
  • WT --> HTML switch puts Parsoid parse pipeline on the critical path.
  • HTML --> WT switch can introduce dirty diffs if data-parsoid is stripped and element ids aren't preserved across previous wt -> html switches.
  • This will build on stable-element id work to ensure that dirty diffs are not introduced across wikitext edits.
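
To make the dirty-diff concern above concrete, here is a toy Python sketch under heavy simplifying assumptions: every node is a bold run, the serializer always emits the apostrophe form, and a plain dict of source ranges stands in for the stored data-parsoid. None of the names come from Parsoid.

```python
def serialize_bold(text):
    """Normalizing serializer: always emits the apostrophe form."""
    return f"'''{text}'''"


def html_to_wikitext(nodes, original_src, src_ranges, original_text):
    """Serialize nodes, reusing original source for unmodified ones.

    nodes:         list of (element_id, current_text).
    src_ranges:    element_id -> (start, end) offsets into original_src;
                   a stand-in for metadata keyed by stable element ids.
    original_text: element_id -> text at parse time.
    """
    lines = []
    for element_id, text in nodes:
        source_range = src_ranges.get(element_id)
        if source_range is not None and original_text.get(element_id) == text:
            start, end = source_range
            lines.append(original_src[start:end])  # untouched: copy verbatim
        else:
            lines.append(serialize_bold(text))     # edited or unknown: re-serialize
    return "\n".join(lines)


original_src = "<b>one</b>\n'''two'''"
ranges = {"mwAA": (0, 10), "mwAB": (11, 20)}
texts = {"mwAA": "one", "mwAB": "two"}
edited_nodes = [("mwAA", "one"), ("mwAB", "TWO")]  # only the second node was edited

with_metadata = html_to_wikitext(edited_nodes, original_src, ranges, texts)
without_metadata = html_to_wikitext(edited_nodes, original_src, {}, texts)
print(with_metadata)
print(without_metadata)
```

With the id-keyed metadata available, the untouched first line keeps its original `<b>` markup. Without it, that line is normalized to the apostrophe form even though the user never edited it, which is exactly a dirty diff.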

Content widgets

  • Provide alternatives for data tables (football, discographies etc)
  • Research new ways to mark up / integrate specific classes of content
  • Cross-team effort: Parsoid, Services, VE, Community liaisons at the very least.
    • For existing wikis, this requires editor community buy-in.
    • For new/HTML-only wikis, this is nevertheless a fruitful direction to experiment with.

HTML content templating

  • Cross-team effort: Parsoid, Services, VE, Community liaisons at the very least.
    • For existing wikis, this requires editor community buy-in.
    • For new/HTML-only wikis, this is nevertheless a fruitful direction to experiment with.
  • Goals: performance, possibly client-side rendering / preview
  • Can build on knockoff
  • Need to investigate if this can be represented transparently using existing transclusion syntax

HTML-only wiki support

  • Cross-team effort: Parsoid, Services, Platform, Features, VE
  • requires everything that's needed for Parsoid-HTML page views & HTML-based templating, and possibly content widgets
  • HTML diffs, abuse filter support, others?

"Sister projects" / extensions

  • section support / LST
  • language variants
  • Native gallery port
  • One other extension relevant to a sister project? Quiz?

Notes: A laundry list of tasks

This is more of a laundry list of tasks not all of which show up in the earlier sections. This can be fleshed out more and also used to figure out how much time / resources we will spend on these tasks. This need not be part of the final roadmap, but should be there somewhere for us to have an overview of all that needs to get done. This could even be folded into the previous section, if need be.

Functionality

  • Support for language variants
  • Support for wikitext
    • Scope range of transclusions. Two notions to deal with:
      • Well-formedness: Unbalanced tags, partial DOMs, etc. This concern basically sets up the scope of tree-builder fixup. (<domparse> tag, for ex.)
      • Content-model constraints: Even if well-formed, you cannot transclude A-links inside another A-link. Basically, the overriding concern is: what is required to simply "drop-in" the DOM output of a transclusion into a DOM-tree? One way is to enforce constraints on what a template can produce in all possible expansions for all possible inputs ("static typing" and automatic type coercion). (See Parsoid/DOM_notes)
    • LintTrap/WikiLint: GSoC project
  • Support for authorship maps
    • Requires stable element ids
  • Editing support:
    • Support for switching between HTML/Wikitext in the editor. The naive approach is not too difficult to support, but will likely not be very performant. To be investigated.
    • Support for HTML editing of transclusion parameters (in progress).
    • Possibly support content widgets for common tasks (for which a combination of templates is currently used; infoboxes, football tables, discographies, etc.)
  • Support for any common but unsupported extensions including porting
    • Native gallery port
    • LST
    • Other extensions in non-wikipedia projects (wikisource, etc.)
  • Support for HTML wikis
    • HTML-based templating
    • Content widgets
    • HTML diffs
    • Abuse Filter
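
The two transclusion concerns listed above (well-formedness and content-model constraints) can be illustrated with a small checker. This is a minimal Python sketch using the standard library's HTMLParser, not how Parsoid or LintTrap work internally. It only detects nested A-links, unbalanced close tags, and unclosed elements; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "br", "col", "hr", "img", "input", "link", "meta", "wbr"}


class TransclusionChecker(HTMLParser):
    """Collects problems that would prevent dropping a fragment into a DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Content-model constraint: an A-link may not contain another A-link.
        if tag == "a" and "a" in self.stack:
            self.problems.append("nested <a>")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Well-formedness: a close tag must match the innermost open element.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unbalanced </{tag}>")


def check_fragment(html):
    """Return a list of problems found in a template's expanded output."""
    checker = TransclusionChecker()
    checker.feed(html)
    checker.close()
    problems = list(checker.problems)
    problems.extend(f"unclosed <{tag}>" for tag in checker.stack)
    return problems


print(check_fragment('<a href="/x">outer <a href="/y">inner</a></a>'))
print(check_fragment("<td>cell"))
print(check_fragment("<p>ok</p>"))
```

A check like this only covers a single expansion. The "static typing" idea mentioned above would require reasoning about all possible expansions of a template, which this sketch does not attempt.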

Testing

  • Parser tests
    • Selser-testing is still pretty painful. As selser is getting more refined, and as our accuracy in general improves, it is getting harder and harder to trust both "green"/"red" results from parser test runs. We may need to consider more controlled edit generation where we can construct an oracle to give us authoritative edited wikitext to compare selser against.
    • Porting PHP preprocessor and eliminating our native full expansion pipeline.
    • Enable test mode with HTML tidy enabled
  • RT-testing
    • Fix our mysql-based rt testing or move over to cassandra.
    • Upgrades to selser testing. (in progress)
    • Automated diffs against PHP rendering to detect problems with rendering (for HTML page views).
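
The "controlled edit generation" idea above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the "DOM" is a plain list of paragraph strings, the only generated edit is appending a paragraph (so the expected wikitext is known by construction), and the parse and serialize functions are toy stand-ins for the real pipeline.

```python
import difflib


def append_paragraph_case(original, new_paragraph):
    """Build a controlled edit plus the wikitext it must produce (the oracle)."""
    expected = original.rstrip("\n") + "\n\n" + new_paragraph + "\n"

    def edit(paragraphs):
        return paragraphs + [new_paragraph]

    return edit, expected


def check_against_oracle(original, new_paragraph, parse, serialize):
    """Apply the edit, serialize, and compare with the oracle's answer."""
    edit, expected = append_paragraph_case(original, new_paragraph)
    actual = serialize(edit(parse(original)))
    diff = list(
        difflib.unified_diff(expected.splitlines(), actual.splitlines(), lineterm="")
    )
    return actual == expected, diff


# Toy stand-ins for the parser and serializer under test.
def toy_parse(wikitext):
    return [p for p in wikitext.strip().split("\n\n") if p]


def toy_serialize(paragraphs):
    return "\n\n".join(paragraphs) + "\n"


def buggy_serialize(paragraphs):
    return "\n".join(paragraphs) + "\n"  # drops the blank lines between paragraphs


good_ok, good_diff = check_against_oracle("First.\n\nSecond.\n", "Third.", toy_parse, toy_serialize)
bad_ok, bad_diff = check_against_oracle("First.\n\nSecond.\n", "Third.", toy_parse, buggy_serialize)
print(good_ok, bad_ok)
```

Because the expected output is constructed rather than produced by the code under test, a failing comparison points at the serializer, not at the test expectation.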

Performance (ongoing)

  • More efficient re-rendering of pages after edits / template changes.
  • Ongoing identification of bottlenecks.

Maintenance (ongoing)

  • Regular deployments and monitoring.
  • Hooking up our logging infrastructure with logstash/bunyan.
  • Ongoing bug fixes.
  • node upgrades (from 0.8 to 0.10 and onwards).
  • code cleanup and rewrite as we upgrade node versions.

Mentoring / documentation / talks (ongoing)

  • GSoC, OPW, others
  • Maintaining our documentation
    • We should probably maintain a docs/ repo that outlines strategies or, preferably, add broad outlines of algorithms at the top of files.
  • Writing blog posts
  • Presentations (tech talks, elsewhere)