Reading/Web/Projects/A frontend powered by Parsoid

The Reading Web team would like to get to a stage where our frontend is completely API driven and powered by Parsoid. For pages in the main namespace only (to begin with), we plan to serve lead sections only and render the rest of each article via JavaScript. We believe this will speed up access to our sites and dramatically increase the number of readers coming from areas with 2G connections.

We will do this by:
 * Running experiments to assess impact of changes - both to users and to our servers
 * Building an HTML prototype to get Parsoid to feature parity with our existing PHP parser for our readers
 * Exploring the use of Service Workers to render more on the client and reduce load on our servers
 * Eventually rolling these changes out to our mobile beta users, a small audience that will give us more confidence in our changes
 * Rolling these changes out to all mobile users for even more confidence and a faster reading experience for everyone
 * Adapting this for the desktop site to give a faster reading experience there

This work is captured in Phabricator.

The ability to do DOM transformations cheaply
On mobile, recent research has shown that many of our pages are image heavy. This has a knock-on impact on first paint, a measurement of the time from a user entering an address in the URL bar to seeing content on the page, because image downloads delay the download of page-critical content such as styles. Because Parsoid is built on Node.js, it lets us cheaply transform image URLs: either removing images to support users who want to disable them on pages, or sending low-bandwidth copies that the client can enhance via JavaScript.
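The two image transformations described above can be sketched on a parsed tree. This is a minimal illustration, not Parsoid's actual code: it uses Python's standard `xml.etree.ElementTree` on a well-formed fragment, and the function names, the `data-full-src` attribute, and the `?lowres` URL suffix are all hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_images(root):
    """Strip every <img> in place, keeping the text that followed it."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != "img":
                previous = child
                continue
            # ElementTree stores trailing text on the removed element,
            # so hand it to the previous sibling (or the parent).
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


def use_low_res(root, to_low_res):
    """Point each <img> at a low-bandwidth copy and remember the original.

    `to_low_res` is a caller-supplied (hypothetical) URL mapping; the
    original URL is kept in `data-full-src` so a client script can swap
    the full image back in later.
    """
    for img in root.iter("img"):
        src = img.get("src")
        if src:
            img.set("data-full-src", src)
            img.set("src", to_low_res(src))


page = ET.fromstring('<div><p>Intro <img src="/a.jpg"/> text</p></div>')
use_low_res(page, lambda src: src + "?lowres")  # made-up URL scheme
```

Both functions walk the tree once, which is the sense in which such transformations are cheap compared with re-parsing or pattern-matching the serialized HTML.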

This also becomes extremely important in Wikipedia Zero. Some networks provide Wikipedia for free, but only without images, to save on bandwidth and serve their users more. We currently do this in a hacky way with bespoke code, which has broken. We need to serve these users better, without breaking things and confusing users or operators unnecessarily.

Sections as first class citizens
This is tracked in Phabricator.

The mobile website is currently powered by the default MediaWiki PHP parser. The mobile site differs from the desktop site in that it has a concept of sections at the HTML level. It achieves this via a piece of code called the MobileFormatter that runs after parsing. This code is brittle and uses regular expressions. Sectioning should really be done at the parser level, ideally in a language that understands the concept of a DOM tree. This is one of the reasons we believe we need a frontend powered by Parsoid.

The mobile skin treats sections as first class citizens. The desktop site allows editing of individual sections but does not easily allow rendering of individual sections. Unintuitively, without MobileFrontend or resorting to some kind of JavaScript-based parsing, you cannot request the sections of a page on the mobile site.

Sections become important when you think about performance, because they let us delay the rendering of much of our content. Users who do not read beyond the lead section could retrieve pages in the region of two minutes quicker if they were served just the lead section.

Lead sections
Any content before the highest level heading in the content is referred to as a lead section. The highest level heading is used to mark up sections. For instance if a page features h2s (two equal signs in wikitext), these become the highest level headings. Content in between these headings is wrapped in a div to support section collapsing.

Example lead section:
Lead section content.
big heading

More unusual example of lead section:
Lead section content.
I am also part of the lead section as I am a smaller heading than "big heading"
This is also part of the lead section.
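The lead-section rule above can be expressed as a tree operation rather than a regular expression. The following is a minimal sketch under stated assumptions, not the MobileFormatter or Parsoid implementation: it uses Python's `xml.etree.ElementTree`, only looks at headings that are direct children of the content element, ignores bare text directly inside that element, and uses made-up class names (`lead-section`, `section-body`).

```python
import xml.etree.ElementTree as ET

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


def split_sections(content):
    """Build a new <div> holding the lead section and wrapped sections.

    The highest-level heading present (smallest number) marks section
    boundaries. Everything before its first occurrence, including any
    smaller headings, goes into the lead section.
    """
    children = list(content)
    levels = [int(c.tag[1]) for c in children if c.tag in HEADINGS]
    top = "h%d" % min(levels) if levels else None

    result = ET.Element("div")
    current = ET.SubElement(result, "div", {"class": "lead-section"})
    for child in children:
        if child.tag == top:
            # The heading stays outside the wrapper so that it can act
            # as the toggle for the collapsible body that follows it.
            result.append(child)
            current = ET.SubElement(result, "div", {"class": "section-body"})
        else:
            current.append(child)
    return result


article = ET.fromstring(
    "<div><p>Lead section content.</p>"
    "<h3>smaller heading</h3><p>Also lead.</p>"
    "<h2>big heading</h2><p>Body.</p></div>"
)
lead = split_sections(article)[0]  # the first child is the lead section
```

With this structure, serving only the lead section amounts to serializing the first child of the result and leaving the remaining wrappers to be fetched later.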

Mobile view API
The mobile view API surfaces sections as first class citizens. Again, it resorts to regular expressions that run after parsing.
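As an illustration, a client asking for just the lead section might build a request like the one below. This is a sketch that assumes the `action=mobileview` module provided by MobileFrontend accepts `page`, `sections`, and `prop` parameters, with section `0` denoting the lead section; the helper name and the example endpoint are illustrative, and no request is actually sent.

```python
from urllib.parse import urlencode


def mobileview_url(api_base, title, sections="0"):
    """Return a mobile view API URL for the given page and sections.

    Assumed parameters: `sections` is a section selector ("0" for the
    lead section) and `prop=text` asks for the section HTML.
    """
    query = urlencode({
        "action": "mobileview",
        "format": "json",
        "page": title,
        "sections": sections,
        "prop": "text",
    })
    return f"{api_base}?{query}"


url = mobileview_url("https://en.wikipedia.org/w/api.php", "San Francisco")
```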

A client that can be built with JavaScript powered by the API (Service Workers)
MediaWiki is written in PHP. We believe Service Workers are an important part of the future of MediaWiki. They will allow us to minimise HTTP requests to our servers by supporting an offline reading experience and by allowing us to fully render the mobile site using JavaScript. Right now our API is not complete. Building standalone HTML web apps and native applications, such as our iOS and Android apps, on Parsoid allows us to fill these gaps in our API and break out many important interfaces in our code that might need better abstraction. For example, certain pages require certain ResourceLoader JavaScript modules to fully function: how does a client access these?

More non-MediaWiki based clients
We want to be able to support native apps and Node.js applications built on top of MediaWiki. So far we have worked hard on our APIs to support editing, but the same cannot be said for people building interfaces for reading. Currently our API serves page content that developers must sanitize for their own needs, whether that is section wrapping to support section collapsing or removal of non-mobile-friendly or inconsistently designed navigation elements, e.g. the navbox template on Wikipedia.org.

Faster editing
When a user clicks edit on any MediaWiki page, VisualEditor must do a round trip to the server to ask Parsoid what the content of the page is. Theoretically, if the page content were already built via Parsoid, it could skip this altogether and use the content in the DOM. This would give editors a faster editing experience.

Platform consistency
Our apps and web experience need to be more closely aligned. Many of the hacks in place in MobileFrontend and the apps are currently bespoke and would be better done at the same level: Parsoid.

Related

 * https://phabricator.wikimedia.org/T111588
 * Reading/Web/Projects/A frontend powered by Parsoid/Q2 MVP
 * Reading/Web/Projects/A frontend powered by Parsoid/Parsoid html size initial report