Berlin Hackathon 2011/Notes/Saturday/Parser

Parser discussion in Berlin, mid-May 2011
General notes from that day: http://etherpad.wikimedia.org/mwhack11Sat

Notes distilled down & linked from http://mediawiki.org/wiki/Future:
 * http://www.mediawiki.org/wiki/Future/Parser_plan
 * http://www.mediawiki.org/wiki/Future/Parser_test_cases
 * http://www.mediawiki.org/wiki/Future/AST

Two parts of the discussion:
 * Questions
 * Volunteers

Questions
Q: JanPaul: why not formalism only, rather than formalism+annotation?
A: The grammar is ambiguous - but that's usually OK as long as you have precedence rules. Context is more of a difficulty -> ways to handle odd nesting, or the funky apostrophes, must be done as annotations to the formal parser. The only way to deal with weird syntax (other than breaking XML-style) is to annotate.
Q: Does it make sense to convert everything into a new grammar that is 99% the same except for the apostrophes?
A: We need to keep those, but we can simplify the lives of editors by encouraging use of the less ambiguous syntaxes.

Q: Hannes: have you considered making templates like programming languages? A: We are planning to have different types for templates, so for example, having some be wikitext, others maybe Javascript, etc.  (Brion describes how this might work to use multiple templates to generate a single table)
 * example: If you have only pipe-minus in a template, it is actually a table row (command: start table row)

Hannes: if the output of templates is typed, then you can merge different types of templates in the same document

Neil: the trouble I had at first was figuring out how the annotations were going to avoid being too loose and too dependent on the implementation

Brion: I think it can be more formal than that, and reference the syntax tree rather than the implementation.


 * Syntactic phase -- formal grammar parses source to first-stage tree, can be implemented as easily as possible.
 * example: parse text into a series of list item nodes
 * Semantic phase -- minimize to the smallest possible amount of additional rules that we can get away with. Spec can describe these rules in pseudocode (or actual code) -> future reference implementation
 * example: having found adjacent list item nodes in the expanded tree, fold them into a list.
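
The list-item example above can be sketched as a tiny second-stage pass (a hedged illustration; the node shapes and names are invented, not a proposed format):

```python
# Semantic phase sketch: the syntactic phase has already produced a flat
# node stream; this pass folds runs of adjacent list-item nodes into a
# single list node. Node shapes are invented for illustration.

def fold_list_items(nodes):
    """Fold runs of adjacent ('listitem', ...) nodes into ('list', [...])."""
    out, run = [], []
    for node in nodes:
        if node[0] == 'listitem':
            run.append(node)
            continue
        if run:
            out.append(('list', run))
            run = []
        out.append(node)
    if run:
        out.append(('list', run))
    return out

first_stage = [('para', 'intro'), ('listitem', 'a'), ('listitem', 'b'), ('para', 'outro')]
print(fold_list_items(first_stage))
# [('para', 'intro'), ('list', [('listitem', 'a'), ('listitem', 'b')]), ('para', 'outro')]
```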

Tim: there should be an event-based interface between the first and the second stage. This reduces need to hold multiple full trees in memory; tree structures are much more expensive than strings.
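
Tim's event-based hand-off might look like this (a sketch with an invented toy grammar, not the real preprocessor interface):

```python
# Event-based interface sketch: the first stage yields events instead of
# building a tree, and the second stage streams output directly, so no
# intermediate tree is ever held in memory. The toy grammar (lines
# starting with '*' are list items) is invented for illustration.

def syntactic_events(source):
    for line in source.splitlines():
        if line.startswith('*'):
            yield ('listitem', line[1:].strip())
        elif line.strip():
            yield ('para', line)

def render(events):
    in_list = False
    for kind, text in events:
        if kind == 'listitem' and not in_list:
            yield '<ul>'
            in_list = True
        elif kind != 'listitem' and in_list:
            yield '</ul>'
            in_list = False
        yield f'<li>{text}</li>' if kind == 'listitem' else f'<p>{text}</p>'
    if in_list:
        yield '</ul>'

print(''.join(render(syntactic_events('intro\n* a\n* b'))))
# <p>intro</p><ul><li>a</li><li>b</li></ul>
```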

Brion: I'm worried that we'd need to backtrack to the beginning of the doc in some cases

Tim: we don't do that in our current preprocessor

[ok now they're getting scary]

Q: Purodha: i18n aspects - 1. concerning messages, 2. concerning templates(?)
Q: Neil: Why must we have parser support for this?
Purodha: need to consider directionality and other things [also language tagging is important for screen readers -> accessibility]

Tim: I think we can and should completely change the way we do parameters in messages. Currently we use $1, $2, etc. We should use template syntax. (Brion concurs)
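
As a hedged illustration of the change Tim suggests, assuming $N would map onto the existing triple-brace parameter syntax {{{N}}}:

```python
import re

# Illustrative sketch only: a real migration would go through the
# parser, not a regex over message text.
def dollar_to_template(msg):
    return re.sub(r'\$(\d+)', r'{{{\1}}}', msg)

print(dollar_to_template('You have $1 new messages from $2.'))
# You have {{{1}}} new messages from {{{2}}}.
```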

With cooperation from the calling code, this would allow substitution to happen in a node tree object that's annotated with the source language of each string (is it from the UI language or the content language? a page might be tagged as being in a specific language, etc.)

Achim: where should we collect all of these questions? Talk page for Future? Brion: I will include these notes into the Future (later note: "should be more or less merged some time sunday")


Maciej: we have a couple of issues with how whitespace is handled. Diffs are particularly difficult to preserve. We would like to have more features, such as better template editing. The problem is that templates are black boxes.

[from preprocessor you can get some vague lists of used parameters, but you know nothing about them. it's very sad and you can't produce a very nice dialog from just that info! Parser hooks can be worse -- you don't have *any* info on the inside of them.]

Brion: we've tossed around ideas about doing better template editing.

Trevor: we created a template info extension, so that you can optionally add some inline documentation of template parameters. We then have an API call which can return the template parameters. Template authors would be motivated to create sweet user interfaces for their templates.
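
What such an API response could look like, sketched with invented field names (not the actual extension's schema):

```python
# Invented sketch of a template-parameter description and how an editor
# UI might turn it into a form skeleton. All field names are
# illustrative.
template_info = {
    'template': 'Infobox person',
    'params': {
        'name': {'type': 'string', 'required': True,  'doc': 'Full name'},
        'born': {'type': 'date',   'required': False, 'doc': 'Date of birth'},
    },
}

def form_fields(info):
    for pname, meta in info['params'].items():
        flag = 'required' if meta['required'] else 'optional'
        yield f"{pname} ({meta['type']}, {flag}): {meta['doc']}"

for field in form_fields(template_info):
    print(field)
# name (string, required): Full name
# born (date, optional): Date of birth
```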

Would probably need similar for parser functions & tag hooks -- but where do we define it?

Trevor: the problem with Parser Hooks is that they're even more of a black box than templates, so parser hook authors need to write the UI for the input.

Parser hooks & functions also will need to have compatible implementations for many users of the parser -- keep this in mind! [The extension implementer should be able to provide the helper info for the editor UIs.]

Tim: need to be careful about scope creep

Mike: but there are some common cases

Trevor: example: ref tag

"Core spec" for wikitext itself, plus addenda for popular extensions that will need to be implemented by most Wikipedia reusers: ref, math etc. Core spec must define what context gets passed to extensions.

Note that parser function syntax is better defined than tag hook syntax.

Opaque vs non-opaque tags may need to be handled...

Maciej: some block level indenting using dl and dt really messes things up

Brion: answer to that is really LiquidThreads

Wladyslaw Bodzek: what are the essential requirements? There are some people who want to parse in other environments. What would be the best solution: create a parser, or create some kind of mechanism or well-defined process to get wikitext into some structured data? I think some sort of formal grammar is best, but I'm worried about the last "magic" step.

Brion: I think what makes sense...first we have source to initial tree and then to canonical tree, and then a final transformation to the output format. That makes it possible to have many different output formats. If we define canonical form and output format as separate steps, then we can keep it sane.

Trevor: need to define a basic API between the parsing environment and the parser: context passing
 * current time
 * page name
 * language
 * context API for extensions...

(context, source) --parser-> (context, AST) --outputter-> (context, HTML/PDF/etc)... remember that the parser & the outputter need to communicate with the context!
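
A minimal sketch of that pipeline, with illustrative names throughout (nothing here is the real API):

```python
# Both stages read the shared context; the "parser" here is a toy that
# makes one paragraph node per non-empty line, tagged with the page
# language from the context.

def parse(context, source):
    return [{'type': 'para', 'text': line, 'lang': context['language']}
            for line in source.splitlines() if line]

def output_html(context, ast):
    return ''.join(f'<p lang="{n["lang"]}">{n["text"]}</p>' for n in ast)

context = {'page': 'Sandbox', 'language': 'en'}
print(output_html(context, parse(context, 'Hello\nWorld')))
# <p lang="en">Hello</p><p lang="en">World</p>
```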

Tim: there are a lot of features that people count on that rely on the current regime

Late static binding! [much scary stuff]

Later steps in parse -> walk the tree to find titles that need looking up & therefore additional transformative action (such as link coloring & template expansion... right now we do link color late but template expansion in the preprocessor)
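
Such a late walk might be sketched like this (node shape invented), collecting every link target in one pass so the titles can be looked up in a single batch query:

```python
# Collect link targets from an invented node-dict tree so link colouring
# can happen late, after the parse, as a batched lookup.

def collect_titles(node, titles=None):
    titles = set() if titles is None else titles
    if node.get('type') == 'link':
        titles.add(node['target'])
    for child in node.get('children', []):
        collect_titles(child, titles)
    return titles

tree = {'type': 'root', 'children': [
    {'type': 'link', 'target': 'Berlin'},
    {'type': 'bold', 'children': [{'type': 'link', 'target': 'Parser'}]},
]}
print(sorted(collect_titles(tree)))
# ['Berlin', 'Parser']
```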

Parse everything into an intermediate language. Then interpret variables and parser hooks

Maciej: is the intermediate representation reversible to original source?

For intermediate -- yes definitely!!!

For HTML output... that's harder, but needs whatever info for the way the output is used (editor needs fully reversible; view w/ inline editor might just need to find the origin node.)
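
A toy sketch of that round-trip property (node shape invented): each node keeps the exact original text, markup prefix included, so serialising reproduces the source byte-for-byte.

```python
# Reversible intermediate form sketch: nodes store (kind, prefix, body)
# so that whitespace and markup survive the round trip unchanged.

def to_nodes(source):
    nodes = []
    for line in source.split('\n'):
        if line.startswith('* '):
            nodes.append(('listitem', '* ', line[2:]))
        else:
            nodes.append(('text', '', line))
    return nodes

def to_source(nodes):
    return '\n'.join(prefix + body for kind, prefix, body in nodes)

src = 'intro\n* item one\n*  oddly  spaced '
assert to_source(to_nodes(src)) == src
print('round-trip ok')
```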

Neil: a simpler parser seems possible, but doesn't it become impossibly distant when we insist on not breaking anything?

Brion: that's a slippery slope to complexity, and it sounds like a nightmare for users.

Jan-Paul: you can gather metrics on what conversions are working and which aren't

(long scary discussion)

TERMINOLOGY FIXES :D ... per Tim -- we need to narrow down what we mean by Parser, Preprocessor, etc. to avoid confusion

Neil: are we going to be a cautionary tale like Perl 6? What are the goals? When I think of simplifying, I think of simplifying wikitext.

Brion: we've spent the past 9 years not replacing the syntax. We want to tone it back somewhat, and define what we have today. That way, we've got something easier to work with. That's our step 0 toward a multi-step process to simplifying the syntax. We know that these pages will always be in the history.

Tim: what about a parser for UseMod so that we can view the really early (circa 2001-2002) history of Wikipedia?

Brion: that would be AWESOME!

Purodha: one-time format conversions are dangerous if anything got missed; better to start slipping in more versioning info for consistent reads, etc.

Actions!

 * merge these notes to http://www.mediawiki.org/wiki/Future [brion -> this weekend]
 * long-term goals: build "in 2012, in 2017, in 2027" lists of long-term goals
 * coarse list of the goals, constraints

http://www.mediawiki.org/wiki/Future/Parser_test_cases
 * collect test cases
 * Maciej: we have over 300 test cases
 * Mike: our parser ends up being a great test-case finder, because it bails when it hits something it can't handle.
 * Mike: we can provide a list of documents that fail
 * any existing test cases used by parsers or editors
 * MediaWiki's own parser test cases
 * corpus of real-world Wikipedia pages

Step 0: gather up test cases
Step 1: define AST
Step 2: convert parser tests so that the expected output is AST rather than HTML
Step 3: write a parser that passes the tests

http://www.mediawiki.org/wiki/Future/AST
 * collect & survey the existing AST/DOM models
 * see if we can get a first-order approximation and go with it for now
 * Neil: AST should be expressible in JSON
 * XML does have some ups, but we're pretty sure we want to try JSON
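
A toy illustration of a JSON-expressible AST (node names are placeholders, not a proposal):

```python
import json

# Invented AST for the wikitext "'''bold''' text"; round-tripping
# through JSON is the property Neil is after.
ast = {
    'type': 'root',
    'children': [
        {'type': 'bold', 'children': [{'type': 'text', 'value': 'bold'}]},
        {'type': 'text', 'value': ' text'},
    ],
}

assert json.loads(json.dumps(ast)) == ast
print('JSON round-trip ok')
```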

http://lists.wikimedia.org/pipermail/wikitext-l/
 * Revive wikitext-l mailing list