Requests for comment/A Spec For Wikitext

Maybe this should have been in my user space since this is somewhat personal opinion, not a parsing team collective opinion, but I am leaving this in the Parsing Team Notes space since I think it fits here better and also because I think there is some degree of agreement on some aspects of this (but definitely not all of it). --SSastry (WMF) (talk) 05:45, 6 August 2016 (UTC)

In this wikitech-l thread from August 2, 2016, I (Subbu Sastry) outlined how one might get to a formal spec for wikitext, but whether we want a spec for wikitext and whether we ought to work towards that is a different question. Rob Lanphier, in a followup, proposed an ArchCom discussion on this question, i.e. whether there ought to be a formal spec for wikitext. A spec for wikitext is also one of the proposals for how to deal with the 2-systems problem we have with wikitext parsing.

With the goal of guiding that ArchCom discussion, I want to address this question a bit in this note, i.e. whether there ought to be a formal spec for wikitext, and whether it is a worthwhile goal to work towards. The most important argument I want to make here, and that I want you, the reader, to take away, is that the question of whether we want a spec cannot be separated from the question of what the goals of wanting a spec are (or put another way, what might a formal spec give us?). While that is probably a somewhat obvious observation for most of you, I want to make it explicit so we don't lose track of it in our discussions.

Complexities of "parsing" wikitext: The Parsoid Experience
As I noted in passing in the preamble to that wikitech-l thread, when talking about wikitext and the difficulty of parsing wikitext, there is often a focus on how there doesn't exist a BNF grammar for wikitext and how it is context-sensitive, etc. However, I think that is not as relevant as what I consider to be other sources of complexity in wikitext parsing in Mediawiki. I alluded to them in that previously referenced email, but I am organizing them here a bit more coherently than in the email. These come from our ongoing experience developing Parsoid while trying to achieve parity between the output of Parsoid and the core parser, as well as supporting its editing clients. You could replace wikitext with markdown and none of the problems outlined here would be solved.
 * String-based model of processing: Wikitext parsing semantics are primarily string-based. It is string-based in terms of how input is processed (macro-expansion model), and what kind of output is produced (HTML strings, not DOM structures). Here are some more specific ramifications of this processing model.
 * Macro-expansion model of templating: Parsing a page requires all transclusions (and extension tags) to be fully expanded before the page can be parsed. Bold-quoted wikitext such as foo bar that also involves a transclusion cannot be parsed to a bold string without knowing what the transclusion returns, even though it looks like the entire string could be wrapped in a bold tag. That said, this is not a problem for a spec per se, since a wikitext spec could formalize the notion of a preprocessing phase. We'll get to this later in this document.
 * Inability to bound effects of markup errors: Unbalanced tags and other errors in markup can cause "unpredictable" rendering effects on the page. More importantly, the rendering depends on how the errors are fixed up. For example, Mediawiki has so far primarily used Tidy as the error fixup tool. But the current Tidy bundle is based on HTML4 semantics and also introduces its own unnecessary "cleanup" of markup, and the rendering would change if we used a different cleanup tool. So, a spec will have to deal with this issue.
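
To make the macro-expansion point concrete, here is a minimal Python sketch. It is a toy model, not the Mediawiki preprocessor: the template names (plain, breaker), the expand helper, and the bold-only render_bold function are all made up for illustration, and real quote handling is considerably more involved. It only shows that the same outer markup can yield a different bold structure depending on what the transclusion returns.

```python
import re

# Hypothetical template store; names and contents are made up for illustration.
TEMPLATES = {
    "plain": "and",
    "breaker": "x''' y '''z",  # returns text that contains its own bold quotes
}

def expand(wikitext: str) -> str:
    """Toy preprocessing phase: replace each {{name}} with the template text."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: TEMPLATES[m.group(1)], wikitext)

def render_bold(wikitext: str) -> str:
    """Toy quote handling: every ''' toggles between <b> and </b>."""
    parts = wikitext.split("'''")
    out = []
    for i, part in enumerate(parts):
        if i > 0:
            # Odd-numbered quote runs open bold, even-numbered ones close it.
            out.append("<b>" if i % 2 == 1 else "</b>")
        out.append(part)
    return "".join(out)

def parse(wikitext: str) -> str:
    """Expand first, then parse: bold structure is only known after expansion."""
    return render_bold(expand(wikitext))

print(parse("'''foo {{plain}} bar'''"))
print(parse("'''foo {{breaker}} bar'''"))
```

With the plain template the whole string ends up inside one bold tag; with the breaker template the expanded text closes and reopens bold, so wrapping the unexpanded source in a single bold tag would have been wrong.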


 * Entanglement with parser internals: Mediawiki exposes parser hooks through which extensions can tap into the parsing pipeline (ex: before/after something happens). This is a problem for a new parser implementation since those hooks may not exist in an alternative implementation, and when they do, the event ordering may not be identical.
 * Dependence on user state and database state: Mediawiki output can be changed by user preferences as well as state of the database (red links, bad images, for example). So, mediawiki output cannot be generated without querying the database. An alternative way of modelling this is to treat these as post-processing transformations of the parser's output. A spec would have to deal with this issue. If I were to push this line of reasoning further, a related question to ask is whether the parser needs to know about media resources beyond knowing the type of resources they are (image, audio, video). As it exists, the output requires knowing a lot more about media resources than is available from the wikitext markup. An alternative way of handling this might be to use a post-processing transformation to further tweak the parser's output.
 * Lack of awareness of the HTML -> wikitext conversion pathway: The current parsing model is unaware of the need to convert the parsed HTML back to source wikitext (and with very strict requirements on the form of this HTML -> wikitext conversion). How might a wikitext spec look if it factored in these requirements?
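
The post-processing alternative mentioned under database state can be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the data-title attribute, the "new" class used to mark missing pages, and the parse_links and mark_missing helpers are hypothetical and do not describe how Mediawiki or Parsoid actually work. The point is only that the parsing step needs no database access, while a separate pass consults page existence.

```python
import re
from typing import Callable

def parse_links(wikitext: str) -> str:
    """Toy parser: turn [[Title]] into a link without touching any database."""
    return re.sub(
        r"\[\[([^\[\]|]+)\]\]",
        lambda m: f'<a data-title="{m.group(1)}">{m.group(1)}</a>',
        wikitext,
    )

def mark_missing(html: str, page_exists: Callable[[str], bool]) -> str:
    """Post-processing pass: add a class to links whose target does not exist."""
    def fix(m: re.Match) -> str:
        title = m.group(1)
        if page_exists(title):
            return m.group(0)  # existing page: leave the link untouched
        return f'<a class="new" data-title="{title}">'
    return re.sub(r'<a data-title="([^"]+)">', fix, html)

# Stand-in for a database lookup; a real pass would query page existence.
existing_pages = {"Paris"}

html = parse_links("[[Paris]] and [[Atlantis]]")
print(mark_missing(html, existing_pages.__contains__))
```

In this arrangement the output of parse_links depends only on the wikitext, so it could be specified and cached independently of database state, and the same idea could in principle be extended to user preferences or media metadata.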

Is it a parser? Is it a runtime? Is it a transformer?
Some of the confusion and imprecise discussion around a spec is also a problem of nomenclature. It is easier to see this by looking at a traditional language compiler. A traditional language compiler has many very distinct architectural phases. There is the parsing phase (traditionally comprising a lexer and a parser), there is a semantic analysis phase (type checking, etc.), there is a code optimization phase, and there is a code generation phase. The parser is just one part of the pipeline and later passes build on its output. Parsing is a very limited part of the pipeline that takes a source level program to executable code.

Based on the previous section it should be fairly clear that a well-specified BNF grammar (for wikitext), or using Markdown with a clean grammar, does not really help in building an alternative wikitext "runtime" (as the blurb on the Parsoid page says).

In addition, in today's Mediawiki incarnation as it is used in the Wikimedia universe, we need to be able to transform wikitext to HTML and transform HTML to wikitext (sometimes with additional constraints when the HTML is derived from an edit).

Given both of these, "parser" is a fairly loose and inaccurate term for what we might try to develop a spec for. Maybe "transformation" is a better word for what is going on. And, maybe a "wikitext runtime" is a slightly better term than a parser?

... to be continued ...

What kind of specs can we develop?
... ramblings below ...
 * A formal language spec (like Java)
 * An executable spec (like the HTML5 tree building spec)
 * A test-based spec (the more common kind for languages without a formal spec. Ex: RubySpec, an informal test-based spec for Ruby)

We need specs for converting wikitext to HTML, HTML to wikitext, and a spec for HTML as a format. Even Parsoid, which has a versioned spec for the HTML output it generates, does so in only one direction, i.e. it specifies what the HTML output it generates means. It does not specify what kind of HTML it expects as input when that HTML is to be converted to wikitext. For sure, it will accept the output that comes from its wikitext to HTML pipeline, but it doesn't specify constraints on what it can and cannot accept, and does not specify error behavior, which is why Parsoid's response on arbitrary HTML can be quite ugly and comical.
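
As a rough illustration of a test-based spec that covers both directions and also pins down error behavior, here is a minimal Python sketch. The cases, the UnsupportedHTML error, and the toy lookup-table converters are all hypothetical; they are not taken from Parsoid's test suite or its HTML spec, and only stand in for real implementations.

```python
from typing import Callable, List, NamedTuple

class Case(NamedTuple):
    """One spec entry: wikitext and the HTML it must convert to and from."""
    wikitext: str
    html: str

# Hypothetical cases for illustration; a real suite would be far larger.
SPEC_CASES = [
    Case("''a''", "<i>a</i>"),
    Case("'''a'''", "<b>a</b>"),
]

class UnsupportedHTML(ValueError):
    """Raised when HTML falls outside what the spec says is convertible."""

def check(to_html: Callable[[str], str],
          to_wikitext: Callable[[str], str]) -> List[str]:
    """Run every case in both directions and return failure descriptions."""
    failures = []
    for case in SPEC_CASES:
        if to_html(case.wikitext) != case.html:
            failures.append(f"wikitext->HTML failed for {case.wikitext!r}")
        if to_wikitext(case.html) != case.wikitext:
            failures.append(f"HTML->wikitext failed for {case.html!r}")
    return failures

# Toy lookup-table implementation, only so the checker has something to run.
_FORWARD = {c.wikitext: c.html for c in SPEC_CASES}
_REVERSE = {c.html: c.wikitext for c in SPEC_CASES}

def toy_to_html(wikitext: str) -> str:
    return _FORWARD[wikitext]

def toy_to_wikitext(html: str) -> str:
    # Specified error behavior: reject input outside the accepted set.
    if html not in _REVERSE:
        raise UnsupportedHTML(html)
    return _REVERSE[html]

print(check(toy_to_html, toy_to_wikitext))
```

Any alternative implementation could be run through the same check function, and the explicit UnsupportedHTML error is one possible way for a spec to state what happens on HTML it does not accept, instead of leaving that behavior undefined.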

... to be continued ...

What are the expectations of a formal spec?
With all that song and dance in the previous sections, we now arrive at another essential question: what do we hope to get out of a formal spec? One way to look at this is to enable a move to the next generation wikitext runtime with an eye towards supporting newer applications. Anyway .. blah blah ..
 * Goals from a well-specified wikitext model
 * Enable document composability
 * Enable editing tools that can minimize edit conflicts by enabling fine-grained editing (beyond section level edits)
 * Enable real time collaboration (this is a fallout of the previous bullet)
 * Enable pluggable third party implementations of the runtime for different resource constraints. I think a shared hosting wiki and a wikimedia wiki are entirely different beasts and there ought to be options for what kind of "wikitext runtime" is suitable for each of those scenarios without imposing the entire development burden on WMF. Maybe naively, I think a spec could provide a somewhat elegant solution to the vexing problem of 3rd party wikis, although admittedly it only solves part of the problem.

... to be continued ...

Credits
While I am the initial author of this document, much of this comes from experience from collective work on the Parsoid project, and lots of discussions within and outside the parsing team.

Arlo, Rob, and Kunal encouraged the idea of thinking about a spec in the context of the 2-systems problem.

Rob especially emphasized that the 2-systems problem should be seen not as a problem, but as an opportunity for making some forward progress on a long-standing problem, i.e. the 2-systems problem can clarify the inherent and accidental complexities of the runtime.