Topic on Talk:Developer Portal

Cscott

This is a bit provincial, as I've been working on Parsoid for almost ten years now, but I'd like to see links to Parsoid HTML incorporated somehow. "Working with MediaWiki data" is one of the key points here, and "turning wikitext into a semantically meaningful HTML document" seems like a key part of that for half of the folks who want to work with our content. (The other half are probably machine-learning folks who want to strip *all* the markup out and get raw text they can feed to a language model, but believe it or not, Parsoid HTML can help with that too.) It seems like this should be mentioned in three places:

  1. There's a link to "MediaWiki and Extensions" in the "PHP language" section, but MediaWiki is increasingly composed of numerous independent libraries (like Parsoid), which should probably be called out. Many of these libraries are significantly easier to contribute to than core is.
  2. The MediaWiki dumps page doesn't mention Parsoid-format HTML dumps at all. I know this has a somewhat twisted history at the org, but my understanding is that the Kiwix project is using Parsoid HTML from the API, and https://dumps.wikimedia.org/kiwix/zim/wikipedia/ is one way to get at that. There are projects like openZIM's mwoffliner that work with this format. T182351 is a newish bug (2017!) for this; T17017 (2008) is the earliest I could find in a short search; and T302237 is a task from this year that uses the Wikimedia Enterprise HTML dumps, which are indeed in Parsoid HTML.
  3. There's a detailed specification for the HTML generated by MediaWiki at Specs/HTML/. I don't know exactly where that should live in the developer site, but it seems like "working with HTML" belongs somewhere. That's what gadget authors will see, and it will be increasingly visible to readers as Parsoid read views roll out.

I'm sure some of the responsibility here falls on the Content Transform Team for not making 'excellent' user-friendly documentation for some of these tools and formats. So perhaps the question is: how do we fix this? What's the bar for inclusion, and how do we get the work of creating the necessary documentation resourced?
