Wikimedia Technical Conference/2018/Session notes/Identifying the requirements and goals for the parser

= Questions to answer during this session =

= Attendees list =


 * Subbu, Michael, Nick, Kate, Tim, Mark, Alexia, Josh, Santhosh, Timo, Adam, Corey, Leszek

= Structured notes =

There are five sections to the notes:


 * 1) Questions and answers: Answers the questions of the session
 * 2) Features and goals: What we should do based on the answers to the questions of this session
 * 3) Important decisions to make: Decisions which block progress in this area
 * 4) Action items: Next actions to take from this session
 * 5) New questions: New questions revealed during this session

Session Introduction slides
https://commons.wikimedia.org/wiki/File:Techconf2018.parsing.session.intro.pdf

= Questions and answers =

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer; instead, add them to the new questions section.)

= Important decisions to make =

= Action items =

= New Questions =

= Detailed notes =

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and on writing the highlights into the structured sections above. This allows the topic leader/facilitator to check on missing items/answers, and thus steer the discussion.

Slides:


 * Discussion: what to do about the two parsers, and how to go about unifying them?
 * The PHP parser generates HTML. Since 2012, we have also had Parsoid, which is bidirectional (wikitext <-> HTML).
 * Mostly compatible output, but some differences. We don't want two parsers forever.
 * When we started, it wasn't clear we could go both ways.
 * Parsoid does a lot more work to support this functionality.
 * Josh: Clarify "dirty diffs"?  -- [...]
 * Corey: It's not deterministic; if you do the same edit 10 minutes later, the diffs won't be the same.
 * Not just about not showing them; the parser shouldn't make those changes at all.
 * Timo: Whitespace is actually preserved by default, and note that it’s also expressible in HTML
 * Right now it keeps track of how we got there, to enable us to go back.
 * Implication basically is that Parsoid does a lot more work to support the same features, so parsing is slower.
 * Four pieces in Parsoid: tokenizer; template/extension expansion; building the DOM; transforming the DOM.
 * Three steps to parser unification (see slides).
 * [slide of photo of drawing diagramming the Parsoid/Restbase/Cassandra stack]
 * Do we want to store HTML forever, or is it just caching? What are the implications of storing HTML vs. wikitext?
 * The two parsers chat with each other a lot; not a clean architecture. Also at this time we are trying to bridge that interface and move everything inside PHP, so that everything that's now an API call becomes a call into the Parsoid internals.
 * Broadly a multi-step process to unify. After step 1, there are still two parsers in core; clients hit RESTBase, and RESTBase hits Parsoid-PHP.
 * Key difference with the PHP parser: it exposes parser internals in its API; Parsoid won't (by design).
 * All clients hit Parsoid-PHP.
 * Final step: remove the legacy PHP parser from MediaWiki.
 * Q (Adam): Does the “fix feature + output diffs” item include replicating the PHP parser behavior around user login state?
 * To reduce the work needed to migrate, preserve functionality as required, and then clean up. We want to scope this narrowly, to avoid extraneous work.
 * So user-dependent stuff will remain, with [annotations?] piled on top.
 * Q (Corey): We want to unify parsers because it’s too chatty, but is that a feature or a bug?
 * Subbu: That's part of the PHP/Node.js tradeoffs question. :-)
 * Josh: Chatty between 2 libraries instead of 2 systems.
 * Timo: Cost of chattiness is a lot lower within the same instance.  Note that PHP initialization cost (per request) is fairly high.
 * In terms of implementation, performance is a big question.
 * [slide of rough performance profiles]
 * One of the questions is: What do we mean by "having a fast parser"
 * Timo: In Parsoid-PHP land, can functionality that is only used for editing be turned off to speed up parsing? Subbu: Not sure; to be investigated.
 * Tim (Q): Is it possible to have some level of media query batching in the new PHP Parsoid?
 * Subbu: Yes, will be, Arlo is almost done with it in Parsoid and will be ported.
 * Q: From a product POV, how do you see the API evolving?
 * Josh: Editability at smaller increments than the full page, e.g. editing infoboxes or sections.
 * Josh (cont.): There’s no product goal to remove or replace wikitext editing, we'll continue to have VE and Wikitext to consider. But we might have VE-esque experiences where we're just editing a single template.
 * Timo: This isn't really a blocker, but there are various feature ideas that could be done in the current implementation, and I wonder if the PHP-vs-JS question plays into it.
 * E.g., section editing. Would it be viable to apply this mechanism to something like individual nodes of the DOM tree?
 * Subbu: The question is the granularity of editing. Is that a product question?
 * Josh: If we know that certain DOM node types are editable, does that help us do VE locally, or something like “VE-lite,” e.g., for offline? If we could get to the point where specific things are editable at a granular level (e.g. lists yes, but list items no), that'd be helpful.
 * Timo: IIUC, all of these ideas depend on things that Parsoid is able to offer, but the PHP parser isn’t; that would get easier
 * Corey: if you have a page, represents[???] Could send just that section, that node, back to the parser, which would [?] - -- Subbu: yes, with caveats.
 * Josh: [repeating point about offline use case] -- there’s no plan to be able to fully edit when offline.
 * Corey: Question: is there an implication for you of intermittent vs. offline? At what point do those become the same thing? Are 5 minutes and 10 seconds the same thing?
 * [group broke apart into looking at posters + adding sticky notes]
 * Tim: Question about storage format: is Parsoid HTML an archival format, to be retained indefinitely?
 * Subbu: Is anyone concerned about the port to PHP?
 * AdamB: I’m concerned about cost, but in terms of person-coding hours, it might be a wash. If we keep it in Node, there are a lot of extra things we'd have to write to make it work; but it's similar in PHP.
 * Subbu: We're already on the path to porting it to PHP. Are there any other questions to answer before we go all the way?
 * Tim: I don’t think a 2x [or even 1.5x] speed reduction [as calculated in slides] is OK.  Need something closer to parity with the current parser. Unless we’re a lot smarter.
 * Subbu: The Parsoid-PHP 2x figure is relative to Parsoid, so it could even be 3-4x slower than the PHP parser.
 * Timo: So conversion from wikitext to HTML is done ahead of time and cached. That could get slower, but it's behind RESTBase and done on demand.
 * Subbu: We're not getting rid of php parser immediately. Once we port, it’s the speed it is, then we deal with performance.
 * Tim: 3.5x slower is a no-go for me, unless we have a plan to get around it.
 * Subbu: So you mean we cannot get rid of the legacy PHP parser, but we can still port Parsoid to PHP?
 * Tim: perhaps, yes.
 * Question / Decision: Need a plan with go/no-go point for switching over completely to parsoid-php.
 * Timo: I think part of the problem space is that Parsoid is doing more work for the same use case than the legacy parser.  But perhaps some of those [which?] will fall away once we’re ported.
 * Question: What is the perf plan for finishing the parser unification?
 * Corey: why exactly is 3x slower unacceptable? Can we specify the problems that would entail?
 * Josh: If some things get better and some things get worse, acceptability might depend on what those things are.  E.g., switching between VE and wikitext editing is a slow but rare operation compared to something like reading and hitting edit.
 * Tim: Even visual edits have to go through Parsoid.
 * Timo: Ah, right; it has to be parsed twice (?)
 * Subbu: So, fair to say we need more discussion and planning about when we can do steps 2 and 3. But is this a blocker for step 1?
 * Corey: No, we just need to know what the perf implications are, as it might not actually matter in all cases.
 * Adam: Toby talks about the 100ms parser, that’s probably not realistic, but performance really shouldn’t get worse.
 * Josh: Use cases: what causes the scaling to be slow? E.g. pages like thousand-line tables; if they're slow, that's a potential tradeoff. What does it scale with, e.g. the number of nodes?
 * Corey: There's a mitigating factor, actually a possible game-changer: we know full-page parses will take longer, but maybe we'll no longer need to parse the whole page, only parts?
 * Subbu: Incremental parsing, yes.
 * Suggestion: pursue [parallel? incremental?] parsing
 * New Q (Subbu): Are you happy with how wikitext is today?
 * Alexia: Special symbols used by wikitext today
 * Tim: There are some annoying details we can fix and I know C. Scott has some ideas
 * C. Scott has ideas about converting wt1 to wt2 on some flag day. Or, if we pursue the idea of storing Parsoid HTML, the details of wikitext become almost irrelevant, if we don't mind changing the user's intention a little bit.
 * Q: What aspects of wikitext do you want to change? Syntax, templates, semantics, ...?
 * Josh: There's no impetus to evolve wikitext just to make it cleaner. That'd be counterproductive at this point (after ~15 years). But there are known problems downstream, like HTML insertion and unbalanced templates. We could invest in changing them only if there are good, specific outcomes.
 * Timo: There may need to be changes to Wikitext, but they should be connected to user stories.  Right now there can be markers that have no visible impact on output, but affect stuff like metadata.  We can use APIs for the non-user-visible stuff.
 * Corey: Does that blend over with the Structured Data work, if we pull out the categories etc.?
 * Subbu: [?]
 * Tim: We could improve wikitext on its own design terms (rather than, say, in reference to visual editing); the point of it is to be easy for humans (not computers) to understand. We could improve it, e.g., by allowing vertical whitespace between list items, or line breaks. The only two choices are Parsoid HTML or wikitext.
 * Josh: We agree that source editing is always going to be a thing, and is a good thing. But could it become easier for editors? There are easier things that have come along since, like Markdown, and perhaps we should consider that, but there hasn't been a groundswell of editors/board/etc. requesting it so far. The most egregious template examples would take a huge amount of effort.
 * Subbu: Let’s reframe this to the questions that need answering. What are the considerations to factor in here?
 * Josh: We need to articulate it spot by spot, e.g. "deterministic parsing is hard because of this part of wikitext".
 * Tim: I semi-agree that it's not a priority, but once we're in a parsoid-only parser future, we can more easily potentially change wikitext. We can't really do it right now.
 * (Things we might want: e.g. semantics of balanced templates, multiline list items, nondeterministic Lua modules, enabling [?] caching, incremental parsing.)
 * Timo: We could reduce the overhead of working with Wikitext by things like removing pipe tricks, …, to reduce ambiguity and number of ways things can be written.
 * Josh: For any of these things, we should be asking PMs and the community; what are the communication pipelines, so that technical teams aren't doing it by themselves?
 * Corey: Talk to Quim about more CRS support.
 * Subbu: [soliciting ideas from Alexia as a third-party user]
 * Alexia: I love the idea of putting everything back into one tech stack rather than having separate services for it
 * In steps 2 and 3 [?]
 * AdamB: Be clear that we're going to cut support for the thing; standard deprecation practices.
 * Josh: [?]
 * Timo: Parser extensions, new functions. Session coming up about that.
 * Timo: Q: Please describe why we need to change the hooks.
 * Subbu: Because the PHP parser hooks depend on parser internals (hooks into specific parts of the parsing pipeline).
 * Timo: Is it just a different way of doing it, or different requirements? We still have to provide info about how to get back to wikitext.
 * Subbu: We want to make it more semantic and less about how it's parsed. To/from DOM. Or for subpage editing. [?]
 * Timo: Keep in mind that this is a new requirement.
 * Tim: Parsoid is just bundling wikitext in data-mw attributes for [...]
 * Subbu: there's a spectrum there
 * Timo: It requires plugins to the parser service.
 * Subbu: Q: What happens to PHP extensions?
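The four-stage pipeline mentioned in the discussion (tokenize, expand templates/extensions, build DOM, transform DOM) can be sketched roughly as follows. This is a toy illustration, not Parsoid's actual code; all function names and the miniature grammar (only `== heading ==` lines and `{{name}}` templates) are invented for this sketch.

```python
# Toy sketch of a four-stage wikitext pipeline:
# tokenize -> expand templates -> build DOM -> transform DOM.

def tokenize(wikitext):
    """Split wikitext into heading and text tokens (toy grammar)."""
    tokens = []
    for line in wikitext.splitlines():
        if line.startswith("== ") and line.endswith(" =="):
            tokens.append(("heading", line[3:-3]))
        else:
            tokens.append(("text", line))
    return tokens

def expand(tokens, templates):
    """Replace {{name}} template calls with their expansions (toy)."""
    out = []
    for kind, value in tokens:
        for name, body in templates.items():
            value = value.replace("{{%s}}" % name, body)
        out.append((kind, value))
    return out

def build_dom(tokens):
    """Build a minimal DOM: a list of (tag, text) nodes."""
    return [("h2" if kind == "heading" else "p", value) for kind, value in tokens]

def transform_dom(dom):
    """A DOM-transform pass, here just dropping empty paragraphs."""
    return [node for node in dom if node[1].strip()]

def parse(wikitext, templates=None):
    dom = build_dom(expand(tokenize(wikitext), templates or {}))
    return "".join("<%s>%s</%s>" % (tag, text, tag) for tag, text in transform_dom(dom))
```

For example, `parse("== Intro ==\n{{greet}}", {"greet": "Hello"})` runs all four stages and yields `<h2>Intro</h2><p>Hello</p>`.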

Photos
= Post-it notes =

Parser performance question

 * Parity with existing parser
 * (Page?) save time
 * Save times and published round trip times perceived to be good on median page
 * Performance scales badly only with perverse case
 * Reduce intermediary caching needs
 * “Fast” = Fast enough to support live traffic
 * Balanced templates: “Please”!

Parser in Node.js vs PHP pros/cons
Node.js


Pros:

 * Performance
 * “Modern” language
 * Diverse job applicants


Cons:

 * Reimplement / port all parser extensions
 * What happens to 3rd-party wikis & their PHP extensions?
 * Unclear scope: what else will need to get ported from core?
 * Figure out site configuration (??)

PHP


Pros:

 * Consolidation in a single language / core
 * Code simplification without async paths in the codebase
 * Limited scope: port all of Parsoid to PHP
 * Hosting simplification


Cons:

 * Performance?
 * No async / concurrency (responses: could be solved in a number of ways)
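The "no async / concurrency" concern is often answered with batching, as with the media-query batching mentioned earlier in the session: collect all lookups first, then resolve them in a single round trip, which needs no concurrency at all. A hypothetical sketch of that pattern (`fetch_media_info_batch` and the data shapes are invented stand-ins for a batched API/DB lookup):

```python
# Sketch of media-query batching without async: instead of one backend
# lookup per image while parsing, collect all titles, resolve them in one
# batched call, then substitute the results.

def fetch_media_info_batch(titles):
    # Pretend backend: one round trip for any number of titles.
    return {t: {"width": 800, "url": "//example/%s" % t} for t in titles}

def render_images(image_titles):
    info = fetch_media_info_batch(set(image_titles))   # single round trip
    return ["<img src='%s' width='%d'>" % (info[t]["url"], info[t]["width"])
            for t in image_titles]
```

A page with N images then costs one batched lookup rather than N sequential ones, recovering most of the latency win that concurrency would buy.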

Should wikitext or (Parsoid) HTML be canonical storage?
Wikitext


Pros:

 * Efficient storage
 * Wikitext diffs


Cons:

 * Not a standard / no spec
 * Cannot render old revisions faithfully

Parsoid HTML


Pros:

 * Quick retrieval (for VE, etc.)
 * Render old revisions “faithfully”
 * HTML dumps
 * Analyze content without parsing wikitext
 * Standardized
 * Evolve wikitext without compat concerns / format changes without b/c


Cons:

 * Slow retrieval for source edits
 * Storage space

Qn: Is Parsoid HTML an archival format?
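One reason storing Parsoid HTML can still round-trip to wikitext is the "keeps track of how we got there" point from the discussion: unmodified parts are emitted from their recorded original source (preserving the author's exact formatting), and only edited nodes are re-serialized; Parsoid calls this selective serialization. A toy sketch of the idea; the node record layout here is invented for illustration, whereas Parsoid itself keeps this origin data in `data-parsoid` attributes:

```python
# Sketch of "selective serialization": nodes carry the original wikitext
# they came from; untouched nodes round-trip that source verbatim, and
# only modified nodes are re-serialized (which may normalize formatting).

def serialize_node(node):
    """Re-serialize a node from scratch (toy rules)."""
    if node["tag"] == "b":
        return "'''%s'''" % node["text"]
    return node["text"]

def selective_serialize(nodes):
    out = []
    for node in nodes:
        if not node.get("modified") and "orig_src" in node:
            out.append(node["orig_src"])   # emit untouched source exactly
        else:
            out.append(serialize_node(node))
    return "".join(out)
```

Note how the unmodified bold node below keeps its original odd spacing (`''' bold '''`) instead of being normalized, which is exactly what avoids dirty diffs.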

Deterministic parsing

 * Yes, so we don’t have to parse the whole page
 * This seems to depend more on the inputs - should this impact the design of the parser itself?
 * If it helps with blame or accurate page history