Reading/Web/Projects/A frontend powered by Parsoid/HTML content research

As part of the preparation for the MediaWiki Developer Summit, and to wrap up this quarter's research, we are taking a deep dive into our HTML content: what it is composed of, and how those parts affect size and rendering time for our readers.

Data to compare

 * restbase article
 * api.php (action=parse)
 * api.php (mobileview)
 * loot transformations (individually)
 * No transformations
 * No ambox
 * No navbox
 * No references
 * No images
 * No superficial markup
 * Empty nodes
 * Reference spans
 * → ]
 * Transform mw:Entity to their contents
 * No data-mw
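As an illustration, the "No data-mw" transformation above can be sketched as a crude regex pass (the attribute pattern here is an assumption; a real transform would walk the DOM rather than use regexes):

```python
import re

def strip_data_mw(html: str) -> str:
    # Remove data-mw="..." attributes. Assumes double-quoted
    # attribute values; a production transform should use an
    # HTML parser instead of a regex.
    return re.sub(r'\sdata-mw="[^"]*"', '', html)

print(strip_data_mw('<p data-mw="{}" id="x">Hi</p>'))  # → <p id="x">Hi</p>
```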

What we want to measure

 * HTML size
 * Webpagetest speed
 * Device experience? (Timeline on devtools?)
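For HTML size it is worth recording both raw and gzipped byte counts, since the compressed wire size is what readers actually pay for. A minimal sketch:

```python
import gzip

def html_sizes(html: str) -> tuple:
    # Return (raw_bytes, gzipped_bytes) for an HTML string.
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

raw, compressed = html_sizes("<p>hello</p>" * 1000)
print(raw, compressed)  # repetitive markup compresses very well
```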

How

 * A script takes in a list of titles, queries those endpoints, and stores the output in a folder.
 * HTML size analysis is run over the stored responses.
 * After that, the responses are cached on the reading-web-research server; WebPageTest is then run against those URLs.
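A minimal sketch of such a fetch script using only the Python standard library (the endpoint URLs and output layout are assumptions; the actual script may differ):

```python
import pathlib
import urllib.parse
import urllib.request

# Hypothetical endpoint templates; the real script would take the
# wiki domain and the list of titles as input.
ENDPOINT_TEMPLATES = {
    "restbase": "https://en.wikipedia.org/api/rest_v1/page/html/{title}",
    "parse": ("https://en.wikipedia.org/w/api.php"
              "?action=parse&format=json&page={title}"),
    "mobileview": ("https://en.wikipedia.org/w/api.php"
                   "?action=mobileview&format=json&sections=all&page={title}"),
}

def build_urls(title: str) -> dict:
    # Build one URL per endpoint for a given page title.
    quoted = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return {name: tmpl.format(title=quoted)
            for name, tmpl in ENDPOINT_TEMPLATES.items()}

def fetch_and_store(titles, outdir="responses"):
    # Fetch every endpoint for every title and store the raw bytes.
    out = pathlib.Path(outdir)
    out.mkdir(exist_ok=True)
    for title in titles:
        for name, url in build_urls(title).items():
            body = urllib.request.urlopen(url).read()
            (out / f"{title}.{name}.html").write_bytes(body)
```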

HTML size report
The initial report is here, with the sample of pages decided on in T120504.

This report compares HTML sizes of wiki content served from different endpoints. First it gives a general overview comparing Parsoid+RESTBase, MediaWiki action=parse, MediaWiki action=mobileview, and the loot transformations (which remove and clean pieces of the content that can be loaded on demand or automatically after page load).

The report highlights a few things:
 * Parsoid+RESTBase output is always bigger than that of the MediaWiki endpoints.
 * Parsoid+RESTBase enables performant, cacheable transformations that let loot restructure the content down to a fraction of its size, while the endpoints stay performant thanks to caching.
 * This allows loading different parts of the content separately from the main page content.
 * data-mw attributes are an important fraction of the content served by RESTBase. Work is being done on serving that information separately from a different endpoint, making the HTML leaner.
 * References consistently take a tremendous percentage of the total size after stripping data-mw attributes. It seems fair to assume that not loading references in the initial payload would be a net win. Different strategies could be applied depending on further research (loading them automatically after the content has loaded, or on demand when the user wants to check them).
 * This seems to only be an option when using restbase+parsoid as the api endpoint for fetching content, enabled by the cache infrastructure and the better parsing and transforming capabilities.
 * The not-mobile friendly navboxes also take a considerable percentage of the content. Seems like not serving them or serving them on demand is fair under constrained devices.
 * "Extraneous markup" (Parsoid-generated IDs and other Parsoid-specific attributes and wrapper elements) accounts for roughly 10% of the article weight and is only used when transforming the article, never when rendering it.
 * Of the 6753 ID attributes present on the Parsoid-generated Barack Obama article, roughly 1254 are generated by the Cite extension. These add 33 KB of the total 86 KB that ID attributes contribute to the payload. It might be worth reconsidering the Cite extension's ID scheme.
The report also opens a few questions:
 * Why exactly is Parsoid+RESTBase content that much bigger than the MediaWiki API content?
 * What percentage do navboxes and references amount to when using the mediawiki apis?
 * What do these metrics look like with a much bigger sample of articles?
 * What percentage of content does superficial markup (see How section for definition) amount to? Is it worth cleaning it up?
 * What percentage of views make use of navboxes and references?
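To help answer the attribute-weight questions on a bigger sample, a small parser-based counter can approximate how many bytes a given attribute contributes (a sketch; a real measurement should also look at the gzipped delta):

```python
from html.parser import HTMLParser

class AttrWeight(HTMLParser):
    # Tally the approximate byte cost of one attribute across a document.
    def __init__(self, attr):
        super().__init__()
        self.attr = attr
        self.count = 0
        self.nbytes = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == self.attr:
                self.count += 1
                # name="value" plus the preceding space
                self.nbytes += len(f' {name}="{value or ""}"')

def attr_weight(html, attr="id"):
    parser = AttrWeight(attr)
    parser.feed(html)
    return parser.count, parser.nbytes
```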

Notes from conversations

 * separation is a big win and is being worked on
 * Unique IDs per element would be interesting to measure.
 * Parsoid seems to reduce the number of tags in the DOM, which is also beneficial for performance in constrained browsers. It would be interesting to measure this.
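That tag-count comparison can be measured with a trivial counter (a sketch using Python's stdlib parser, run once per endpoint's output):

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    # Count every opening tag in a document.
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

def count_tags(html: str) -> int:
    counter = TagCounter()
    counter.feed(html)
    return counter.count
```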

Article composition on 2G
Several WebPageTest runs were made on Barack Obama over a 2G connection. You can see that images barely impact first paint on 2G. That said, there is **no stylesheet** involved: these test pages do not use stylesheets, so a stylesheet download is not competing with images for first paint, and in the real world images would have more of an impact. However, if you are optimising for total bytes and fully-loaded time, images are significant.

Current content as served by Parsoid:
 * with data-mw stripped: result, test page
 * with data-mw and images stripped: result, |nodatamw test page
 * with data-mw and references/navbox stripped: result, |noreferences|nonavbox test page
 * with data-mw, images and references/navbox stripped: result, |nodatamw|noreferences|nonavbox test page

(Note: about ten seconds of each of these timings are time to first byte, since the tests run on a Labs instance.)

Conclusions and next steps:
 * Although unsurprising, reducing the size of the HTML shows promise of significantly improving first paint and time to interactive (based on the lower fully-loaded times).
 * We need to test on a much larger sample to get a sense of the overall impact of these changes and the median improvement we can expect to see.
 * The entire article should still be available.
 * We'll need to analyse where savings can be made by identifying content that can be lazy loaded.
 * We'll need to measure whether loading only the lead section satisfies a certain percentage of users (and what that percentage is).
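For the last point, a rough way to approximate "lead section only" is to cut the article at the first section heading (this sketch assumes the first `<h2>` marks the end of the lead, which is an assumption about the markup, not a guarantee of Parsoid's section structure):

```python
def lead_section(html: str) -> str:
    # Everything before the first <h2> heading; returns the whole
    # document if no <h2> is present.
    cut = html.find("<h2")
    return html if cut == -1 else html[:cut]

article = '<p>Lead text.</p><h2>History</h2><p>Body.</p>'
print(lead_section(article))  # → <p>Lead text.</p>
```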