Talk:Parsing/Notes/Wikitext 2.0


Nice idea but still missing the root problem

11 (talkcontribs)

The root problem with wikitext that this spec doesn't address is the long tail of wikitext misfeatures including (not an exhaustive list):

  • linktrail - This is incredibly unintuitive, doesn't work consistently across different Wikipedias, is the source of many bugs and of pointless arguments over what the true representation is, and was ultimately a lazy solution to editors' need for reduced markup.
  • pipe trick - This is a beloved editor feature, but it is another source of messy issues due to the nature of parser precedence. More importantly, it can never round-trip, and it is completely confusing to novice editors because what they typed isn't what will later exist in the page source.
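For readers unfamiliar with these two misfeatures, here is a minimal sketch of their behavior. The regexes are illustrative simplifications, not MediaWiki's actual rules, which vary per wiki and per language:

```python
import re

def expand_pipe_trick(wikitext):
    """Expand [[Namespace:Page (dab)|]] to [[Namespace:Page (dab)|Page]].
    MediaWiki does this once, at save time, so the saved source no longer
    matches what the editor typed -- the edit cannot round-trip."""
    def repl(m):
        target = m.group(1)
        label = target.split(":")[-1]                 # drop namespace prefix
        label = re.sub(r"\s*\(.*?\)$", "", label)     # drop disambiguator
        return "[[%s|%s]]" % (target, label)
    return re.sub(r"\[\[([^|\]]+)\|\]\]", repl, wikitext)

def render_linktrail(wikitext):
    """Fold trailing letters into the preceding link: [[bus]]es becomes the
    label 'buses'.  The trail regex differs between wikis, which is the
    inconsistency complained about above (this sketch uses ASCII letters)."""
    return re.sub(r"\[\[([^|\]]+)\]\]([a-z]+)", r"[[\1|\1\2]]", wikitext)

print(expand_pipe_trick("[[Help:Pipe trick|]]"))  # [[Help:Pipe trick|Pipe trick]]
print(render_linktrail("[[bus]]es"))              # [[bus|buses]]
```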

Problematic core features:

  • {{tag}} parser function - This was the worst hack of them all and a source of many bugs, and it should just be killed ASAP, because it allows infinitely many variations of broken and "surprising" markup.
  • noinclude, onlyinclude, and includeonly - This is a big mess; in fact, despite heavy usage, one often tries either onlyinclude or includeonly and tests to see which renders "properly".

Permanently disallowing metadata within wikitext

Metadata should never have been added to wikitext markup.


Category markup is basically almost impossible to remove due to the incredible dependence on it. A first step would be to create a GUI (e.g. like HotCat), a second step would be to disallow it in wikitext on content pages, and a third step would be to create a core function that deliberately ignores specific categories added by templates. Possibly those categories, too, could later be replaced using some rule-based GUI tool.

The same applies to interwiki links, and all other parser functions / magic words that set metadata.

Completely eliminate inline styling

It causes a considerable number of issues for both novice and experienced editors, causes problems on mobile devices, and makes it possible to place content outside the content area, e.g. via z-index.

Eliminate misfeatures by announcing their removal months beforehand, slowly killing them one by one, and avoiding past mistakes, e.g. reinventing the wheel.

SSastry (WMF) (talkcontribs)

Thanks for reading and for the helpful feedback. I think there are two related but somewhat independent aspects to this. At this time, I am focused on the semantics and the underlying processing model for wikitext. @Cscott has been exploring the syntactic issues with wikitext and might be able to incorporate some of your input. I think it would be difficult to tackle both of these at the same time since it would be fairly disruptive.

(talkcontribs)

Well, I'd say that any good standard or spec starts by first identifying the shortcomings and problems of the prior approach. It might be a good idea to start a document concretely identifying them and their potential solutions.

A concrete problem that does affect processing is the <tag> extension. If one planned to improve wikitext parsing for tools like VisualEditor, one would probably try to parse all of a page's separate templates and sub-templates in parallel batches. This is currently a problem because the tag extension was specifically designed to change the parsing order and output.

Balanced templates are certainly an interesting idea. In fact, Wikia seems to have designed a new page component (portable infobox) to force the infobox into a single template, and yet the tag parser function can easily break it by splitting it across a hundred templates.

Finally, one interesting part of the spec is template output. It seems that Wikia pursued a similar idea in template types, which mostly focuses on identifying and altering output for mobile devices. Perhaps this proposal would benefit from an abstraction of that one, as it currently focuses too much on user-defined types. Also, none of the specs I've seen seem to deal with template arguments, which can cause a good number of issues.

The point is that MediaWiki developers have been conservative with wikitext for a decade, and it is time to be disruptive.

SSastry (WMF) (talkcontribs)

> The point is that MediaWiki developers have been conservative with wikitext for a decade, and it is time to be disruptive.

Except that the Wikimedia wikis have a huge corpus of wikitext which we cannot break.

(talkcontribs)

Indeed, very true, that's yet another shortcoming of the current wikitext specification. In fact, that proves the point I was making above (evaluate shortcomings before developing solutions). Wikitext as the canonical storage format has serious drawbacks.

One solution is to use the proposed linter extension or a mass differ to measure the rate of differences on a database dump, develop solutions to clean up the issues, and deploy the new version once the diffs fall below XX percent.
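The gating step could be sketched like this, where parse_old and parse_new are hypothetical stand-ins for the current parser and its candidate replacement:

```python
def diff_rate(pages, parse_old, parse_new):
    """Fraction of pages whose rendered output changes under the new parser."""
    changed = sum(1 for p in pages if parse_old(p) != parse_new(p))
    return changed / len(pages)

def safe_to_deploy(pages, parse_old, parse_new, threshold=0.01):
    """Deploy only if fewer than `threshold` of pages differ
    (the 'XX percent' above; 1% here is an arbitrary placeholder)."""
    return diff_rate(pages, parse_old, parse_new) < threshold

# Toy example with stub parsers that disagree on one page out of four:
pages = ["a", "b", "c", "d"]
old = lambda p: p.upper()
new = lambda p: "X" if p == "d" else p.upper()
print(diff_rate(pages, old, new))            # 0.25
print(safe_to_deploy(pages, old, new, 0.5))  # True
```

In practice the "pages" would come from a full database dump, and the comparison would likely normalize insignificant differences (whitespace, attribute order) before counting a page as changed.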

The long term solution however, is moving away from wikitext as the canonical storage format or setting up versioning in the wikitext content model (if that doesn't already exist). Perhaps developing the technology to allow versioning is the first step before any new spec.

That still doesn't take anything away from the idea of being bold: planning and making breaking changes rather than continuing to add massive hacks to the MediaWiki parser, Parsoid, and VisualEditor.

SSastry (WMF) (talkcontribs)

These ideas of canonical storage formats and migrating wikitext after analyses have been around for a long time now.

As for wikitext versioning, yes, that is indeed the idea behind some of these proposals. Versioning support already exists in MediaWiki in the form of the content handler abstraction.

There is some discussion in this wikitech-l thread ([1] and responses to it, [2] and this followup spec RFC page) if you want to read more about ongoing discussions.

SSastry (WMF) (talkcontribs)

So, my understanding is that there can be no "breaking changes" for wikitext at this point without a way of transitioning the corpus and the editor workflows that exist. Gradual evolution seems the only reasonable way forward, rather than trying to do everything at once. But, yes, some pieces will have to go together. For example, wikitext versioning is essential for any significant changes. And if/when wikitext evolves significantly and we want to retire the older parser, we will need to either store HTML versions of those revisions or store a transitioned wikitext version of all revisions. In our context, disruption doesn't seem like a good thing.

However, if done properly, we can enable possibilities in MediaWiki for newer non-Wikimedia wikis to use markup formats other than wikitext (for example, a cleaned-up wikitext syntax, or Markdown, or even no wikitext at all). That seems like a path worth pursuing on its own. Some, but not necessarily all, of this will get enabled as part of our work (see Parsing team long term directions).

(talkcontribs)

Storage costs aren't insurmountable, with clever algorithms. I do also agree that defining the old wikitext "spec" is a waste of time, and will lead to pointless arguments over what is the one true Markup and output.

>So, my understanding is that there can be no "breaking changes" for wikitext at this point without a way of transitioning the corpus and editor workflows that exist

Partly. A good amount of the corpus is already broken anyway, as many old revisions don't render correctly due to a decade of changes to wikitext, templates, JavaScript, CSS, and so on. I'd say there is a degree of breaking change that is acceptable, e.g. [<!-- a comment -->[link]]; breaking this non-feature doesn't need any tooling at all. A general rule of thumb for improvements would be:

  • Is it officially documented? If not, break it.
  • Is it purely aesthetic? If yes, break it.
  • Does it affect content pages / templates? If not, break it.
  • Does it affect more than XXX% of content pages (e.g. fewer than 1000 pages out of a 5-million-page corpus)? If not, kill it.
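Those rules of thumb could be encoded roughly as follows; the flag names and the 0.1% threshold are placeholders of my own, since the XXX% figure is deliberately left open above:

```python
def ok_to_break(feature, corpus_size, affected_pages, threshold=0.001):
    """Apply the four rules of thumb; True means breaking the feature is
    acceptable.  `feature` is a dict with 'documented', 'aesthetic_only',
    and 'affects_content' flags -- placeholder names, not a real MediaWiki
    API.  `threshold` is the open 'XXX%' (0.1% here)."""
    if not feature["documented"]:
        return True                      # undocumented behaviour: break it
    if feature["aesthetic_only"]:
        return True                      # purely aesthetic: break it
    if not feature["affects_content"]:
        return True                      # user pages only: break it
    # documented, non-aesthetic, content-affecting: break only if rare
    return affected_pages / corpus_size < threshold

# e.g. 1000 affected pages in a 5-million-page corpus = 0.02% -> breakable
print(ok_to_break({"documented": True, "aesthetic_only": False,
                   "affects_content": True}, 5_000_000, 1000))  # True
```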

Breaking user pages is hardly a problem; for example, many current revisions already contain a lot of broken and deprecated markup.

> In our context, disruption doesn't seem like a good thing.

Wikidata caused massive disruption when interwiki links were moved there, yet it was a very good thing. With a disruption like deprecating and removing wikitext-based metadata (e.g. categories), it might be easy to sell the innovation as "you can remove a category from 1 million pages without 1 million edits or even a bot", and without huge performance costs (https://phabricator.wikimedia.org/T151196).

> However, if done properly, we can enable possibilities in MediaWiki for newer non-Wikimedia wikis to use other markup formats other than wikitext

Indeed, a pragmatic view would be: how would MediaWiki developers improve their platform if Wikimedia were merely a user/consumer and not actually responsible for, or paying for, development of the software?

One would expect that they would first make the wikitext markup as clean and performant as possible, and then later develop migration tools.

Wikimedia developers' current position is both a blessing and a curse: they get to decide how the markup will evolve, yet they are shackled to the enormous "unbreakable" corpus.

(talkcontribs)

Flow, for example, will keep all the markup we've added to this discussion regardless of changes in the wikitext spec.

Cscott (talkcontribs)

As @SSastry (WMF) mentioned, I've begun to look at the syntax issues, with an eye to coming up with a small set of regular forms instead of the ad-hoc and non-orthogonal bunch we've grown. I think transitioning existing content is the key.

(talkcontribs)

Great, the ideas seem rather good; my only caution is that the developers should try to avoid tunnel vision caused by the possibly unique problems of their primary consumer.

The linter extension seems to be a very good tool for the job, especially if possible markup deprecations can be surfaced through it and pushed to editors automatically via something like Echo notifications.

One future idea would be surfacing markup problems within the editor while editing, much like a JSLint-style tool.

HTML DOM trees

Duesentrieb (talkcontribs)

The use of "DOM tree" is a bit unclear to me. Does it refer to a wikitext DOM or an HTML DOM? Is that DOM then processed to generate HTML, or is it already HTML?

I think we will actually want both - constructs that return a wikitext DOM for further processing, and constructs that return an HTML DOM that will be used as-is. The latter would be useful for special page transclusion, parser tag extensions, and the transclusion of non-wikitext content in general. -- Duesentrieb 22:25, 24 August 2016 (UTC)

SSastry (WMF) (talkcontribs)

I use DOM tree to refer to the HTML DOM. When the DOM is serialized, you get the final output HTML. Our experience with Parsoid and VisualEditor and other tools has been that the HTML DOM is rich enough to represent wikitext semantics without needing a specialized format.
