Wikimedia Technical Conference/2018/Session notes/Identifying the requirements and goals for the parser

Questions to answer during this session

For each question below, the significance notes explain why the question is important and what is blocked by it remaining unanswered.

  • Q: What is the product vision for visual editing and editing on mobile? If edits become majority visual vs source, how does this impact parser design? What other product goals are likely to impact the design of the parser, and how will they do that?
    Significance: If products are heading towards a WYSIWYG or micro-edit experience for the majority of users, it makes sense to evaluate the needs of the parser in that light. Answers to this question could guide how wikitext might evolve, in what ways, and what kind of tools the parser might need to support.
  • Q: What are the impacts of parser speed on our technical infrastructure (specifically regarding storage)? What is a good goal for parser speed? What does it mean to be fast (returning HTML from storage is fast, but does it need to be fast when generating the HTML)? Should we only be concerned with balanced templates, so that we do not have to regenerate a whole page when content changes?
    Significance: Speed of the parser has been mentioned in several contexts, but it isn't clear what is meant by this. Are engineers concerned with processor load when regenerating pages, or are client engineers and PMs concerned with response time? Are we concerned with worst-case or median times? Is this a user concern or an infrastructure concern? Is the parser already fast enough? Unbalanced templates are a known issue here as well, since they can modify the rest of the page.
  • Q: Should wikitext be the canonical storage format for content in MediaWiki? What are the trade-offs between storing HTML vs wikitext?
    Significance: Does it make sense to store content as wikitext if we are returning HTML to clients 99% of the time? Storing HTML seems to remove some of the burden from the parser, since we would only need to support converting to wikitext when a user wants to edit in wikitext.
  • Q: What are the trade-offs between unifying the parsers on a Node.js implementation vs a PHP implementation?
    Significance: Prior to unifying the parser into PHP, we should ensure there are no use cases or reasons to keep the parser in JS, such as clients parsing in the browser or in apps. Additionally, we should make sure any future needs for VE are accounted for before making this move.
  • Q: Should having a deterministic/repeatable parser be a goal? Is it useful to have a concept of static vs dynamic templates? What are the advantages of doing this? What are the roadblocks? (Specifically discuss wikitext, templates, and Lua modules.)
    Significance: Not having a deterministic parser has been identified as one of the major reasons to store edits for VE on the server. Does being able to guarantee that most of the page stays the same actually get us any benefits? We know that dynamic content is possible in templates, but if we close them off and contain that logic, does it provide benefits?
  • Q: Do we want to evolve wikitext? If so, what aspects or shortcomings do we want to target? What are possible solutions for addressing them? What considerations should we factor into any such evolution path?
    Significance: A number of challenges we now face in the parser and in our products are an outgrowth of wikitext and how it is processed. Certain editing, technology, and usability goals might be advanced or enabled by suitably updating wikitext. But since this directly impacts editor workflows, it needs to be addressed carefully.

Attendees list

  • Subbu, Michael, Nick, Kate, Tim, Mark, Alexia, Josh, Santhosh, Timo, Adam, Corey, Leszek

Structured notes

There are five sections to the notes:

  1. Questions and answers: Answers the questions of the session
  2. Features and goals: What we should do based on the answers to the questions of this session
  3. Important decisions to make: Decisions which block progress in this area
  4. Action items: Next actions to take from this session
  5. New questions: New questions revealed during this session

Session Introduction slides

https://commons.wikimedia.org/wiki/File:Techconf2018.parsing.session.intro.pdf

Questions and answers

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer, instead add them to the new questions section)

Q: What is the product vision for visual editing and editing on mobile? If edits become majority visual vs source, how does this impact parser design? What other product goals are likely to impact the design of the parser and how will they do that?
Editing product considerations
  • Wikitext editing is not going away
  • Sub-page editing functionality being considered
    • Useful to know what granularities Parsoid can support (e.g., lists or list items? tables, table rows, or table cells?); see the sketch after this list.
  • Mobile editing, wikitext or VE-esque.
  • Fully offline editing is not on the horizon
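
As a concrete sketch of the granularity point above: Parsoid wraps top-level sections in <section data-mw-section-id="N"> elements, so a client could in principle fetch the page HTML, cut out one section, let the user edit it, and splice only that fragment back before serialization. A minimal illustration in Python, assuming those section wrappers are present and using the third-party beautifulsoup4 package; the helper names are illustrative, not an existing API:

    # Sketch: pull one editable section out of Parsoid HTML and splice an
    # edited version back in. Assumes <section data-mw-section-id="N"> wrappers.
    from bs4 import BeautifulSoup

    def extract_section(parsoid_html: str, section_id: int) -> str:
        doc = BeautifulSoup(parsoid_html, "html.parser")
        node = doc.find("section", attrs={"data-mw-section-id": str(section_id)})
        if node is None:
            raise KeyError(f"no section with data-mw-section-id={section_id}")
        return str(node)

    def replace_section(parsoid_html: str, section_id: int, new_fragment: str) -> str:
        doc = BeautifulSoup(parsoid_html, "html.parser")
        node = doc.find("section", attrs={"data-mw-section-id": str(section_id)})
        node.replace_with(BeautifulSoup(new_fragment, "html.parser"))
        return str(doc)
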
Q: What are the impacts of parser speed on our technical infrastructure (specifically regarding storage)? What is a good goal for speed of the parser? What does it mean to be fast (returning HTML from storage is fast, but does it need to be fast when generating the HTML?)? Should we only be concerned with balanced templates so that we do not have to regenerate a whole page when content changes?
We outlined how to answer this question, but it requires follow-up.

In the abstract, parsing speed impacts user features; in the concrete, we are concerned with how a unified parser will behave performance-wise. In the session, we worked with the assumption that Parsoid-PHP might be 2x slower than Parsoid-JS and 3-5x slower than the legacy PHP parser. Given this, for Parsoid-PHP to be used for all use cases, a performance plan is needed that determines how we will tackle slowdowns relative to the PHP parser. This is a blocker for the unified parser being used for everything.
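
For orientation, the 3-5x figure follows from chaining the assumed factors (Parsoid-JS at roughly 1.5-2.5x the legacy parser, Parsoid-PHP at roughly 2x Parsoid-JS). A back-of-envelope check in Python; every number is an assumption from the session, not a measurement:

    # Back-of-envelope: how the assumed slowdown factors compose.
    legacy_parse_s = 1.0                  # placeholder baseline: 1 s to parse some page
    parsoid_js_vs_legacy = (1.5, 2.5)     # assumed range: Parsoid-JS vs legacy parser
    parsoid_php_vs_js = 2.0               # assumed: Parsoid-PHP ~2x Parsoid-JS

    for factor in parsoid_js_vs_legacy:
        total = legacy_parse_s * factor * parsoid_php_vs_js
        print(f"Parsoid-JS at {factor}x -> Parsoid-PHP at ~{total:.1f}x the legacy time")
    # Prints roughly 3x and 5x, matching the working assumption above.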

Relatedly, we need to determine which kinds of experiences require a low-latency parse and which can tolerate higher latencies. Are there latency trade-offs here with respect to page size? Are there paths like incremental parsing and other strategies that could or should be explored actively?
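
Of the strategies mentioned above, incremental parsing essentially means memoizing the expensive parts of a parse and redoing only what an edit invalidates. A minimal sketch of the idea in Python, assuming sections can be treated as independent units (something unbalanced templates can break); nothing here reflects an actual Parsoid design:

    # Sketch of incremental reparsing: cache per-section HTML keyed by a hash of
    # the section's wikitext, and reparse only the sections whose source changed.
    import hashlib

    html_cache: dict[str, str] = {}

    def parse_section(wikitext: str) -> str:
        return f"<section>{wikitext}</section>"   # stand-in for the real parser

    def parse_page(sections: list[str]) -> str:
        out = []
        for src in sections:
            key = hashlib.sha256(src.encode("utf-8")).hexdigest()
            if key not in html_cache:             # unchanged sections are reused
                html_cache[key] = parse_section(src)
            out.append(html_cache[key])
        return "\n".join(out)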

Q: Should wikitext be the canonical storage for content in MediaWiki? What are the trade-offs between storing HTML vs Wikitext?
Partially answered. We didn't get around to discussing this beyond updating a poster; it needs follow-up.

Broadly, there are a number of benefits to Parsoid HTML being an archival format and being used as canonical storage for content. But it is unclear whether Parsoid HTML is actually an archival format at this point. It is also roughly 7x the size of wikitext, which raises the question of what storing it would mean for storage capacity. This needs further input from SRE to determine whether it is actually an option.
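
To make the 7x figure concrete, a back-of-envelope estimate in Python; the revision count, average revision size, and compression ratio below are assumed placeholders, not real numbers:

    # Rough storage estimate for keeping Parsoid HTML for all revisions.
    revisions          = 1_000_000_000   # assumed total revisions on a large wiki
    avg_wikitext_bytes = 20_000          # assumed average revision size as wikitext
    html_expansion     = 7               # Parsoid HTML ~7x wikitext (from the session)
    compression_ratio  = 0.15            # assumed ratio after compression at rest

    wikitext_tb = revisions * avg_wikitext_bytes / 1e12
    html_tb = wikitext_tb * html_expansion
    print(f"wikitext: ~{wikitext_tb:.0f} TB raw, ~{wikitext_tb * compression_ratio:.0f} TB compressed")
    print(f"HTML: ~{html_tb:.0f} TB raw, ~{html_tb * compression_ratio:.0f} TB compressed")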

See detailed notes for pros / cons.

Q: What are the trade offs between unifying the parsers to a Node.js implementation vs a PHP implementation?
This question had a strong gravitational pull among participants and was reasonably addressed.

Broadly, while there were some concerns, the current path of porting Parsoid to PHP met no objections. Performance was flagged as a big concern: using Parsoid-PHP for all uses requires a plan to address slowdowns relative to the current parser (see the first question and the detailed notes). One of the flagged cons of the PHP path was the absence of concurrency primitives at this point in time, but some participants had ideas for how to handle that (to be followed up when the time is right).

See detailed notes for pros / cons.

Q: Should having a deterministic/repeatable parser be a goal? Is it useful to have a concept of static vs dynamic templates? What are the advantages to doing this? What are the roadblocks to this? (Specifically discuss Wikitext, Templates, Lua modules)
Not actively discussed given the time constraint. See post-it note comments in the detailed notes section.
Q: Do we want to evolve wikitext? If so, what aspects / shortcomings do we want to target? What are possible solutions for addressing them? What are the considerations we should factor into any such evolution path?
Partially answered and shelved for later follow up.

Wikitext evolution was broadly considered a good thing, but which aspects of wikitext to change, and what should drive those changes, still need to be determined. Some changes, like moving to balanced template semantics, would benefit parsing technology with respect to performance (among other things); other changes, like tweaks to wikitext syntax, might be driven by human editor considerations. All such changes should be tied to user stories. Some of these changes might only happen in a Parsoid-only future (either to avoid duplicate work or because they are not possible in the legacy PHP parser).

Important decisions to make

What are the most important decisions that need to be made regarding this topic?
1.
Why is this important? What is it blocking? Who is responsible?
2.
Why is this important? What is it blocking? Who is responsible?

Action items

What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
1. We need a performance plan for addressing Parsoid-PHP performance post-port.
Why is this important?

We don’t want a performance regression and user-visible impacts when we switch parsers.

What is it blocking?

Blocks using Parsoid for everything

Who is responsible?

Parsing team (with help from Platform team maybe?)

New Questions

What new questions did you uncover while discussing this topic?
Do we want to pursue Parsoid HTML as an archival format / canonical storage for wiki pages? If yes, what needs to happen to Parsoid output with respect to archivability, versioning, and size?
Why is this important?

RESTBase currently stores Parsoid HTML for the latest revisions, which helps with read latencies. If we could store Parsoid HTML for all revisions, it could support future wikitext evolution.

What is it blocking?

Nothing immediate.

Who is responsible?

Parsing team

What kind of wikitext evolution directions do we want to pursue? What timelines?
Why is this important?

This impacts user features and parsing performance.

What is it blocking? Who is responsible?

Parsing team w/ product

Detailed notes

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.

Slides:

  • Discussion, what to do about 2 parsers. How to go about unifying them?
  • The PHP parser generates HTML. Since 2012, we have also had Parsoid, which is bidirectional (wikitext <-> HTML); a sketch of the transform endpoints appears after these detailed notes.
  • Mostly compatible output, but some differences. Don't want two forever.
  • When we started, wasn’t clear we could go both ways.
  • Parsoid does a lot more work to support this functionality
  • Josh: Clarify "dirty diffs"?  -- [...]
  • Corey: not deterministic; if you do the same thing 10 minutes later, the results won't be the same.
  • Not just show, they shouldn’t make the change at all.
  • Timo: Whitespace is actually preserved by default, and note that it’s also expressible in HTML
  • Right now it keeps track of how we got there, to enable us to go back.
  • Implication basically is that Parsoid does a lot more work to support the same features, so parsing is slower.
  • 4 pieces in Parsoid: tokenizer; templates, extensions, etc.; build DOM; transform DOM
  • 3 steps to parser unification (refer slides)
  • [slide of photo of drawing diagramming the Parsoid/Restbase/Cassandra stack]
  • Do we want to store HTML forever, is it just caching, implications of storing HTML vs wikitext.
  • 2 parsers are chatting a lot. Not a clean arch. Also at this time trying to bridge that interface, move everything inside php, so that everything that’s now an API call is moved to the Parsoid internals
  • Broadly multistep process to unify: After step 1, still 2 parsers in core. Clients hit restbase, restbase hits parsoid-php.
  • Key difference with PHP parser: it exposes parser internals in its API, Parsoid won’t (by design)
  • All clients hit parsoid-php
  • Final step: Remove legacy php parser from MW
  • Q (Adam): Does the “fix feature + output diffs” item include replicating the PHP parser behavior around user login state?
    • To reduce the work needed to migrate, to preserve functionality as required, and then cleanup. We want to narrowly scope this, to avoid extraneous work.
      • So user-dependent stuff will remain, with [annotations?] piled on top.
  • Q (Corey): We want to unify parsers because it’s too chatty, but is that a feature or a bug?
    • Subbu: That's part of the PHP/Node.js tradeoffs question. :-)
    • Josh: Chatty between 2 libraries instead of 2 systems.
    • Timo: Cost of chattiness is a lot lower within the same instance.  Note that PHP initialization cost (per request) is fairly high.
  • In terms of implementation, performance is a big question.
  • [slide of rough performance profiles]
  • One of the questions is: What do we mean by "having a fast parser"
  • Timo: In Parsoid-PHP land, can parsing functionality that is only used for editing be turned off, to speed up parsing? Subbu: not sure, to be investigated.
  • Tim (Q): Is it possible to have some level of media query batching in the new PHP Parsoid?
    • Subbu: Yes, will be, Arlo is almost done with it in Parsoid and will be ported.
  • From a product POV, how do you see the API evolving?
  • Josh: editability at smaller increments than fullpage. E.g. editing infoboxes, sections.
  • Josh (cont.): There’s no product goal to remove or replace wikitext editing, we'll continue to have VE and Wikitext to consider. But we might have VE-esque experiences where we're just editing a single template.
  • Timo: This isn’t really a blocker, but there’s various feature ideas that could be done in the current implementation, but I wonder if PHP or JS question plays into it.
    • E.g., section editing. Would it be viable to apply this mechanism to something like individual nodes of the DOM tree?
  • Subbu: the question is the granularity of editing. Is that a product question?
  • Josh: If we know that certain DOM node types are editable, does that help us do VE locally, or something like “VE-lite,” e.g., for offline? If we could get to the point where specific things are editable at granular level, - e.g. lists yes, but list-items no - that'd be helpful.
  • Timo: IIUC, all of these ideas depend on things that Parsoid is able to offer, but the PHP parser isn’t; that would get easier
  • Corey: if you have a page, represents [???]. Could send just that section, that node, back to the parser, which would [?] -- Subbu: yes, with caveats.
  • Josh: [repeating point about offline use case] -- there’s no plan to be able to fully edit when offline.
  • Corey: Question:  is there an implication for you, on intermittent vs offline ? - at what point does that become the same thing - is 5 mins and 10 seconds the same thing?
  • [group broke apart into looking posters + adding sticky notes]
  • Tim: Question: storage format: is Parsoid HTML an archival format, to be retained indefinitely?
  • Subbu: Is anyone concerned about the port to PHP?
  • AdamB: I'm concerned about cost, but in terms of person-coding hours it might be a wash. If we keep it in Node, there are a lot of extra things we'd have to write to make it work; but it's similar in PHP.
  • Subbu: we are already on the path to take it to PHP. Are there any other questions to answer before we go all the way?
  • Tim: I don’t think a 2x [or even 1.5x] speed reduction [as calculated in slides] is OK.  Need something closer to parity with the current parser. Unless we’re a lot smarter.
  • Subbu: the Parsoid-PHP 2x figure is relative to Parsoid-JS, so it could even be 3-4x slower than the PHP parser.
  • Timo: so conversion from wikitext to HTML is done ahead of time and cached. That could get slower, but it's behind RESTBase and done on demand.
  • Subbu: We're not getting rid of php parser immediately. Once we port, it’s the speed it is, then we deal with performance.
  • Tim: 3.5x slower is a no-go from me, unless we have a plan to get around it.
  • Subbu: so you mean we cannot get rid of the legacy PHP parser, but we can still port Parsoid to PHP?
  • Tim: perhaps, yes.
  • Question / Decision: Need a plan with go/no-go point for switching over completely to parsoid-php.
  • Timo: I think part of the problem space is that Parsoid is doing more work for the same use case than the legacy parser.  But perhaps some of those [which?] will fall away once we’re ported.
  • Question: What is the perf plan for finishing the parser unification?
  • Corey: why exactly is 3x slower unacceptable? Can we specify the problems that would entail?
  • Josh: If some things get better and some things get worse, acceptability might depend on what those things are.  E.g., switching between VE and wikitext editing is a slow but rare operation compared to something like reading and hitting edit.  
  • Tim: even visual edits have to go through Parsoid.
  • Timo: Ah, right, Has to be parsed twice (?)
  • Subbu: so, fair to say we need more discussion and planning about when we can do steps 2 and 3. But, is this a blocker for step 1?
  • Corey: No, we just need to know what the perf implications are, As it might not actually matter in all cases.
  • Adam: Toby talks about the 100ms parser, that’s probably not realistic, but performance really shouldn’t get worse.
  • Josh: use case: what causes the scaling to be slow? E.g., pages like thousand-line tables; if they're slow, that's a potential tradeoff. What does it scale with? E.g., number of nodes, etc.
  • Corey: There’s a mitigating factor, actually a possible game changer, that we know full pages parses will take longer, but maybe we’ll no longer need to parse the whole page but only parts?
  • Subbu: Incremental parsing, yes.
  • Suggestion: pursue [parallel? incremental?] parsing
  • New Q (Subbu): Are you happy with how wikitext is today?
  • Alexia: Special symbols used by wikitext today
  • Tim: There are some annoying details we can fix and I know C. Scott has some ideas
  • C. Scott has ideas about converting wikitext 1 to wikitext 2 on some flag day. Or, if we pursue the idea of storing Parsoid HTML, the details of wikitext become almost irrelevant, if we don't mind changing the user's intention a little bit.
  • Q. What aspects of wikitext do you want to change. Syntax, templates, semantics, .. ?
  • Josh: There's no impetus to evolve wikitext just to make it more clean. That'd be counterproductive at this point (after ~15 years). But there are known problems downstream, like html insertion, unbalanced templates. Could invest in changing them only if there are good specific outcomes.
  • Timo: There may need to be changes to Wikitext, but they should be connected to user stories.  Right now there can be markers that have no visible impact on output, but affect stuff like metadata.  We can use APIs for the non-user-visible stuff.
  • Corey: does that blend over with Structured Data stuff. If we pull out the categories etc…
  • Subbu: [?]
  • Tim: We could improve wikitext on its own design terms (rather than, say, in reference to visual editing); the point of it is to be easy for humans (not computers) to understand. We could improve it, e.g., by allowing vertical whitespace between list items, or line breaks. The only two choices are Parsoid HTML or wikitext.
  • Josh: we agree that source editing is always going to be a thing, and that's a good thing. But could it become easier for editors? There are things that have come along since which are easier, like Markdown, and perhaps we should consider that, but there hasn't been a groundswell of editors/board/etc. requesting it so far. The most egregious template examples would take a huge amount of effort.
  • Subbu: Let’s reframe this to the questions that need answering. What are the considerations to factor in here?
  • Josh: need to articulate spotwise, e.g. "deterministic parsing is hard because of this part of wikitext"
  • Tim: I semi-agree that it's not a priority, but once we're in a parsoid-only parser future, we can more easily potentially change wikitext. We can't really do it right now.
  • (Things we might want: e.g., semantics of balanced templates, multiline list items, non-deterministic Lua modules, enabling [?] caching, incremental parsing.)
  • Timo: We could reduce the overhead of working with Wikitext by things like removing pipe tricks, … , to reduce ambiguity and number of ways things can be written.
  • Josh: for any of these things, we should be asking PMs and the community; what are the communication pipelines, so that technical teams aren't doing it by themselves?
  • Corey: Talk to Quim about more CRS support.
  • Subbu: [soliciting ideas from Alexia as a third-party user]
  • Alexia: I love the idea of putting everything back into one tech stack rather than having separate services for it
  • In steps 2 and 3 [?]
  • AdamB: be clear we're going to cut support for the thing. - Standard deprecation practices.
  • Josh: [?]
  • Timo: Parser extensions, new functions. Session coming up about that.
  • Timo: Q: please describe why we need to change the hooks.
  • Subbu: Because the PHP parser hooks depend on parser internals (hooks into specific parts of the parsing pipeline).
  • Timo: is it just a different way of doing it, or different requirements? We still have to provide info about how to get back to wikitext.
  • Subbu: We want to make it more semantic and less about how it's parsed. To/from DOM. Or for subpage editing. [?]
  • Timo: Keep in mind that this is a new requirement.
  • Tim: Parsoid is just bundling wikitext in data-mw attributes for [...]
  • Subbu: there's a spectrum there
  • Timo: requires plugins to parser service.
  • Subbu: Q: what happens to PHP extensions?
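
For reference on the bidirectional conversion mentioned at the top of these notes: at the time, clients reached Parsoid through RESTBase's transform endpoints. A hedged sketch of calling them with the third-party requests library; the endpoint paths and form fields below are my reading of that era's REST API and should be checked against the current documentation:

    # Sketch: round-tripping content through the RESTBase transform endpoints
    # that fronted Parsoid (paths and field names to be verified).
    import requests

    API = "https://en.wikipedia.org/api/rest_v1"

    def wikitext_to_html(wikitext: str) -> str:
        r = requests.post(f"{API}/transform/wikitext/to/html", data={"wikitext": wikitext})
        r.raise_for_status()
        return r.text

    def html_to_wikitext(html: str) -> str:
        r = requests.post(f"{API}/transform/html/to/wikitext", data={"html": html})
        r.raise_for_status()
        return r.text

    if __name__ == "__main__":
        html = wikitext_to_html("'''Hello''', [[world]]!")
        print(html_to_wikitext(html))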

Photos

Post-it notes

Parser performance question

  • Parity with existing parser
  • (Page?) save time
  • Save times and published round trip times perceived to be good on median page
  • Performance scales badly only with perverse case
  • Reduce intermediary caching needs
  • “Fast” = Fast enough to support live traffic
  • Balanced templates: “Please”!

Parser in Node.js vs PHP pros/cons

Node.js

  • Performance
  • “Modern” language
  • Diverse job applicants
  • Reimpl / port all parser extensions
  • What happens to 3rd party wikis & their PHP extensions?
  • Unclear scope: what else will need to get ported from Core?
  • Figure out site configuration (??)

PHP

  • Consolidation in a single language / core
  • Code simplifications without async paths in codebase
  • Limited scope: Port all of Parsoid to PHP
  • Hosting simplification
  • Performance?
  • No async / concurrency (responses: could be solved in a number of ways)

Should wikitext or (Parsoid) HTML be canonical storage?

Wikitext

  • Efficient storage
  • Wikitext diffs
  • Not a standard / No spec
  • Cannot render old revisions faithfully

Parsoid HTML

  • Quick retrieval (for VE, etc)
  • Render old revisions “faithfully”
  • HTML dumps
  • Analyze content without parsing wikitext
  • Standardized
  • Evolve wikitext without compat concerns / Format changes without b/c
  • Slow retrieval for source edits
  • Storage space

Qn: Is Parsoid HTML an archival format?

Deterministic parsing

  • Yes, so we don’t have to parse the whole page
  • This seems to depend more on the inputs - should this impact the design of the parser itself?
  • If it helps with blame or accurate page history
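
As a sketch of why the static-vs-dynamic distinction matters: if a template is known to be deterministic (no time-dependent magic words, no nondeterministic Lua), its expansion can be memoized by template name and arguments and reused across reparses. The Python below is illustrative only, with invented names; it does not reflect an actual MediaWiki design:

    # Sketch: cache expansions only for templates flagged as deterministic.
    from functools import lru_cache

    DETERMINISTIC = {"Infobox person", "Convert"}   # assumed: flagged as static somehow

    def expand_template(name: str, args: tuple) -> str:
        return f"<!-- expansion of {name} {dict(args)} -->"   # stand-in for real expansion

    @lru_cache(maxsize=100_000)
    def _expand_cached(name: str, args: tuple) -> str:
        return expand_template(name, args)

    def expand(name: str, args: dict) -> str:
        key = tuple(sorted(args.items()))        # hashable, order-independent key
        if name in DETERMINISTIC:
            return _expand_cached(name, key)     # safe to memoize
        return expand_template(name, key)        # dynamic: always re-expand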