Requests for comment/A Spec For Wikitext

In this wikitech-l thread from August 2, 2016, I (Subbu Sastry) outlined how one might get to a formal spec for wikitext, but whether we want a spec for wikitext, and whether we ought to work towards that, is a different question. Rob Lanphier, in a follow-up, proposed an ArchCom discussion on this question, i.e. whether there ought to be a formal spec for wikitext. A spec for wikitext is also one of the proposals for how to deal with the 2-systems problem we have with wikitext parsing.

With the goal of guiding that ArchCom discussion, in this note, I want to address this question a bit, i.e. whether there ought to be a formal spec for wikitext, and whether it is a worthwhile goal to work towards. The most important argument I want to make here is that the question of whether we want a spec cannot be separated from the question of what the goals of having a spec are (or put another way, what might a formal spec give us?). While that is probably a somewhat obvious observation for most of you, I want to make it explicit so we don't lose track of it in our discussions.

Complexities of "parsing" wikitext: The Parsoid Experience
As I noted in passing in the preamble to that wikitech-l thread, when talking about wikitext and the difficulty of parsing wikitext, there is often a focus on how there doesn't exist a BNF grammar for wikitext and how it is context-sensitive, etc. However, I think that is less relevant than what I see as the other sources of complexity in wikitext parsing. I alluded to them in that previously referenced email, but I am organizing them here a bit more coherently than in the email. These come from our ongoing experience developing Parsoid to support its editing clients and achieve parity with the output of the core parser.

You could replace wikitext with markdown and none of the problems outlined below would be solved. That is the reason I think that the focus ought to be on these other issues more than syntax.
 * String-based model of processing: Wikitext parsing semantics are primarily string-based. It is string-based in terms of how input is processed (macro-expansion model), and what kind of output is produced (HTML strings, not DOM structures). Here are some more specific ramifications of this processing model.
 * Macro-expansion model of templating: A page cannot be parsed until all transclusions (and extension tags) are fully expanded. Wikitext like   foo bar   cannot be parsed to a bold string without knowing what the transclusion returns, even though it looks like the entire string could be wrapped in a bold tag. That said, this is not a problem for a spec per se, since a wikitext spec could formalize the notion of a preprocessing phase. We'll get to this later in this document.
 * Inability to bound effects of markup errors: Unbalanced tags and other errors in markup can cause "unpredictable" rendering effects on the page. But, more importantly, the rendering is dependent on how the errors are fixed up. For example, so far, MediaWiki has primarily used Tidy as the error fixup tool. But, the current Tidy bundle is based on HTML4 semantics and also introduces its own unnecessary "cleanup" of markup. The rendering would change if we used a different cleanup tool. So, a spec will have to deal with this issue.


 * Entanglement with parser internals: MediaWiki exposes hooks that let extensions tap into the parsing pipeline (e.g. before or after some step happens). This is a problem for a new parser implementation since those hooks may not exist in an alternative implementation, and when they do, the event ordering may not be identical.
 * Dependence on user state and database state: MediaWiki output can be changed by user preferences as well as state of the database (red links, bad images, for example). So, MediaWiki output cannot be generated without querying the database. An alternative way of modelling this is to treat these as post-processing transformations of the parser's output. A spec would have to deal with this issue. If I were to push this line of reasoning further, a related question to ask is whether the parser needs to know about media resources beyond knowing the type of resources they are (image, audio, video). As it exists, the output requires knowing a lot more about media resources than is available from the wikitext markup. An alternative way of handling this might be to use a post-processing transformation to further tweak the parser's output.
 * Lack of awareness of the HTML -> wikitext conversion pathway: The current parsing model is unaware of the need to convert the parsed HTML back to source wikitext (and with very strict requirements on the form of this HTML -> wikitext conversion). How might a wikitext spec look if it factored in these requirements?
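
The macro-expansion point in the list above can be made concrete with a toy model. This is a minimal sketch, not how the core parser or Parsoid work: the template name `t`, the `expand` and `render_bold` helpers, and the simplified "each triple quote toggles bold" rule are all assumptions made for illustration.

```python
import re


def expand(wikitext: str, templates: dict[str, str]) -> str:
    """Replace each {{name}} with its template text (toy: no arguments, no nesting)."""
    return re.sub(
        r"\{\{([^{}|]+)\}\}",
        lambda m: templates[m.group(1).strip()],
        wikitext,
    )


def render_bold(wikitext: str) -> str:
    """Toy rule: every ''' toggles bold; a bold left open is closed at end of input."""
    parts = wikitext.split("'''")
    out = []
    for i, part in enumerate(parts):
        if i > 0:
            # Odd-numbered separators open bold, even-numbered ones close it.
            out.append("<b>" if i % 2 == 1 else "</b>")
        out.append(part)
    if len(parts) % 2 == 0:
        # An odd number of toggles means the last bold was never closed.
        out.append("</b>")
    return "".join(out)


source = "'''foo {{t}} bar'''"

# Hypothetical template bodies: the same source yields different structures.
print(render_bold(expand(source, {"t": "baz"})))
print(render_bold(expand(source, {"t": "x''' y"})))
```

With the first template body the whole string ends up inside one bold element. With the second, the template closes the bold early and the trailing quotes open a new, empty one. The source text alone does not say which structure results, which is why the string cannot be parsed before expansion in this model.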

Is it a parser? Is it a runtime? Is it a transformer?
Some of the confusion and imprecise discussion around a spec is also a problem of nomenclature. It is easier to see this by looking at a traditional language compiler. A traditional language compiler has many very distinct architectural phases. There is the parsing phase (traditionally comprising a lexer and a parser), there is a semantic analysis phase (type checking, etc.), there is a code optimization phase, and there is a code generation phase. The parser is just one part of the pipeline and later passes build on its output. Parsing is a very limited part of the pipeline that takes a source level program to executable code.

Based on the previous section, it should be fairly clear that a well-specified BNF grammar (for wikitext), or using Markdown with a clean grammar, does not really help in building an alternative wikitext "runtime" (the blurb on the Parsoid page refers to Parsoid as a bidirectional wikitext "runtime"). However, it is not really a runtime.

In today's MediaWiki incarnation as it is used in the Wikimedia universe, we need to be able to transform wikitext to HTML and transform HTML to wikitext (sometimes with additional constraints when the HTML is derived from an edit).

Given these observations, "parser" is a fairly loose and inaccurate term for what we might try to develop a spec for. In any case, this discussion of nomenclature is once again to expand our focus beyond syntactic details and variations to the full process of transforming wikitext to HTML and vice versa.

Why come up with a spec
Some goals for writing a spec could be:
 * Improved documentation for editors
 * Enable interoperability with tools and applications that interact with MediaWiki
 * Ease implementation of tools and libraries that need to operate with wikitext directly (vs. the HTML)
 * Enable pluggable alternative implementations of the wikitext -> HTML and HTML -> wikitext transformations for different resource constraints

But, while we could try to spec the wikitext -> HTML and HTML -> wikitext behavior as it exists today in the core parser and Parsoid, that gets us little beyond some documentation. As described in the earlier section, the processing model continues to be complicated enough to discourage any forays into alternative implementations.

But, if one approaches a spec with the goal of identifying unnecessary and accidental complexity that has crept in over the years and evolving a newer processing model, it seems far more useful to me. For example, one approach would be to come up with a spec that aims to enable document composition from smaller fragments and thus provides a structure to the document in terms of its nested fragments. This enables a bunch of things in turn:
 * High performance edits via incremental parsing
 * Editing tools that can minimize edit conflicts by enabling fine-grained editing (beyond section level edits) and associated benefits for real-time collaboration
 * Ability to bound effects of markup errors (and hence improved ability to reason about markup)

The resulting simplification may also enable pluggable third party implementations of the bidirectional transformations for different resource constraints. I think a shared hosting wiki and a Wikimedia wiki are entirely different beasts and there ought to be different options for what kind of "wikitext runtime" is suitable for each of those scenarios without imposing the entire development burden on WMF. Maybe naively, I think a spec could provide a somewhat elegant solution to the vexing problem of 3rd party wikis, but, for sure, it only solves part of that problem.
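
The link between fragment-based composition and incremental parsing can be sketched as follows. This is a hypothetical illustration only: the `FragmentCache` class and `toy_render` function are invented names, and the sketch assumes what such a spec would have to guarantee, namely that each fragment renders independently of its neighbours.

```python
import hashlib


class FragmentCache:
    """Caches rendered output per fragment, keyed by a hash of the fragment source."""

    def __init__(self, render):
        self.render = render      # function: fragment source -> HTML string
        self.cache = {}           # content hash -> rendered HTML
        self.render_calls = 0     # counts how often the renderer actually ran

    def render_document(self, fragments):
        out = []
        for src in fragments:
            key = hashlib.sha256(src.encode("utf-8")).hexdigest()
            if key not in self.cache:
                # Only fragments not seen before are rendered.
                self.cache[key] = self.render(src)
                self.render_calls += 1
            out.append(self.cache[key])
        return "".join(out)


def toy_render(src):
    # Stand-in for a real wikitext -> HTML transformation.
    return "<p>" + src + "</p>"


doc = ["Intro", "Body", "End"]
cache = FragmentCache(toy_render)
cache.render_document(doc)       # renders all three fragments
doc[1] = "Body, edited"
cache.render_document(doc)       # re-renders only the edited fragment
print(cache.render_calls)        # 4, not 6
```

The same independence assumption is what would bound the effects of markup errors: an unbalanced tag inside one fragment could not change how the others render.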

Separately, it may enable tools that require wikitext parsing that don't have to be called mwparserfromhell.

What kind of specs can we develop?
There are different kinds of specs that exist out there.
 * A formal language spec (like Java)
 * An executable spec (like the HTML5 tree building spec) or a reference implementation of some or all parts of a language (like bytecode interpreters for the JVM) or BNF grammars for syntax.
 * A test-based spec (the more common kind for languages without a formal spec, e.g. RubySpec, an informal test-based spec for Ruby)

In the course of developing Parsoid, test coverage has expanded to spec wikitext usage as seen in various wikis as well as to specify the behavior of the conversion process from edited HTML to wikitext. So, what we now have for wikitext is a spec based on test coverage.
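
In its simplest form, a test-based spec is a table of input/output pairs that any implementation can be checked against. The sketch below assumes that shape. The cases, the `SpecCase` type, and `check_conformance` are invented for illustration and are not taken from the parser tests used by the core parser or Parsoid.

```python
from typing import Callable, NamedTuple


class SpecCase(NamedTuple):
    name: str
    wikitext: str
    html: str


# Illustrative cases only; a real suite would be far larger.
CASES = [
    SpecCase("plain paragraph", "hello", "<p>hello</p>"),
    SpecCase("bold", "'''hi'''", "<p><b>hi</b></p>"),
]


def check_conformance(impl: Callable[[str], str], cases=CASES) -> list[str]:
    """Return the names of the cases that the implementation fails."""
    return [c.name for c in cases if impl(c.wikitext) != c.html]


def toy_impl(wikitext: str) -> str:
    # A deliberately incomplete implementation: paragraphs only, no bold.
    return "<p>" + wikitext + "</p>"


print(check_conformance(toy_impl))   # the bold case fails
```

Such a table says what the output must be, but nothing about the processing model that produces it, which is the limitation of a purely test-based spec for anyone writing an alternative implementation.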

The goals from the previous section could dictate the form that a spec could take.
 * If the primary emphasis is interoperability with MediaWiki, then a spec like the Parsoid DOM Spec is a good start. But, as long as wikitext continues to be the underlying primary document format for a MediaWiki document, a spec would have to be developed for the HTML -> wikitext interface. For example, Parsoid right now does not give you a guarantee of what kind of wikitext will be generated when fed different A-tags. In some cases, it produces url links, sometimes external links, sometimes interwiki links, sometimes wikilinks (with many different variations). A spec would better nail down this process without having to wade through tests. That gap would have to be addressed.
 * If the goal is to enable alternative implementations or tools and libraries for wikitext processing, the spec would have to get into more details beyond the input/output spec. The processing model would have to be described. Templating and extension models would have to be described without tying them to an implementation. Separately, some formal description of the syntax would also be useful, which could take one of many forms: an executable spec in the form of a reference implementation of a parser / tokenizer for syntax or, if possible, a BNF grammar for whatever parts of the language lend themselves to it.
 * If the goal is to enable an internal reimplementation with some gradual evolution of the wikitext processing model, then a test-based specification might be a lot more useful.

Of course, all these different forms above are not mutually exclusive. They target different audiences.
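
For the HTML -> wikitext direction mentioned in the list above, a spec would replace case-by-case behavior with explicit, ordered rules. The sketch below shows what such a rule set might look like for A-tags. It is purely hypothetical and does not describe Parsoid's actual behavior: the `link_to_wikitext` function, the `wiki_base` URL, and the rule order are assumptions, and interwiki links and URL encoding are ignored.

```python
def link_to_wikitext(href: str, text: str,
                     wiki_base: str = "https://example.org/wiki/") -> str:
    """Serialize one link (href plus link text) to wikitext using fixed, ordered rules."""
    # Rule 1: links into this wiki become wikilinks.
    if href.startswith(wiki_base):
        title = href[len(wiki_base):].replace("_", " ")
        if title == text:
            return f"[[{title}]]"
        return f"[[{title}|{text}]]"
    # Rule 2: a link whose text is its own URL becomes a bare URL link.
    if href == text:
        return href
    # Rule 3: everything else becomes a bracketed external link.
    return f"[{href} {text}]"


print(link_to_wikitext("https://example.org/wiki/Main_Page", "Main Page"))
print(link_to_wikitext("https://example.org/wiki/Main_Page", "home"))
print(link_to_wikitext("https://other.example/x", "https://other.example/x"))
print(link_to_wikitext("https://other.example/x", "another site"))
```

The point is not these particular rules but that a written rule order gives tool authors a guarantee they can rely on without reading through tests.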

Other relevant documents:
 * Parsing/Notes/Wikitext 2.0
 * Parsing/Notes/Wikitext 2.0/Strawman Spec
 * Parsing/Notes/Two Systems Problem

These are likely topics at the Wikimedia Parsing Team's offsite in October 2016.

Credits
While I am the initial author of this document, much of this comes from experience from collective work on the Parsoid project, and lots of discussions within and outside the parsing team.

Arlo, Rob, and Kunal encouraged the idea of thinking about a spec in the context of the 2-systems problem.

Rob especially emphasized that the 2-systems problem should be seen not as a problem, but as an opportunity for making some forward progress on a long-standing problem, i.e. the 2-systems problem can clarify the inherent and accidental complexities of the current wikitext processing model.