Reading/Web/Projects/A frontend powered by Parsoid/HTML content research

As part of the preparation for the MediaWiki Developer Summit, and to wrap up this quarter's research, we are taking a deep dive into our HTML content: what it is composed of, and how those parts affect size and rendering time for our readers.

Data to compare

 * restbase article
 * api.php (action=parse)
 * api.php (mobileview)
 * loot transformations (individually)
   * No transformations
   * No ambox
   * No navbox
   * No references
   * No images
   * No superficial markup
     * Empty nodes
     * Reference spans
     * → ]
   * Transform mw:Entity to their contents
   * No data-mw

What we want to measure

 * HTML size
 * WebPageTest speed
 * Device experience? (Timeline on devtools?)

How



 * Script that takes in a list of titles and queries those endpoints and stores the output in a folder.
 * HTML size analysis
 * Once the steps above have run, the responses are cached on the reading-web-research server. Run WebPageTest against those URLs.
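
The first step above could be sketched roughly as follows. This is a minimal illustration, not the script actually used: the wiki host, the endpoint labels, the output layout and the `endpoint_urls` / `fetch_all` names are all assumptions, and the loot endpoints are left out because their URLs are not given here.

```python
import pathlib
import urllib.parse
import urllib.request

# Assumption: English Wikipedia is the wiki under test; swap in another host as needed.
WIKI = "https://en.wikipedia.org"


def _api_url(**params):
    """Build an api.php URL that asks for a JSON response."""
    return f"{WIKI}/w/api.php?" + urllib.parse.urlencode({**params, "format": "json"})


def endpoint_urls(title):
    """Map a (hypothetical) endpoint label to the URL serving `title`."""
    path_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return {
        "restbase": f"{WIKI}/api/rest_v1/page/html/{path_title}",
        "parse": _api_url(action="parse", page=title, prop="text"),
        "mobileview": _api_url(action="mobileview", page=title, sections="all", prop="text"),
    }


def fetch_all(titles, out_dir="responses"):
    """Store each raw response at <out_dir>/<endpoint label>/<quoted title>."""
    for title in titles:
        for name, url in endpoint_urls(title).items():
            target = pathlib.Path(out_dir, name, urllib.parse.quote(title, safe=""))
            target.parent.mkdir(parents=True, exist_ok=True)
            request = urllib.request.Request(url, headers={"User-Agent": "html-size-sketch/0.1"})
            with urllib.request.urlopen(request) as response:
                target.write_bytes(response.read())
```

Note that the two api.php endpoints wrap the HTML in JSON, so a fair size comparison would need to extract the HTML from those responses before measuring.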

HTML size report
The initial report is here, using the sample of pages decided in T120504.

This report compares the HTML sizes of the wiki content served from different endpoints. It opens with a general overview comparing Parsoid+RESTBase, MediaWiki action=parse, MediaWiki action=mobileview, and the loot transformations (which remove or clean up pieces of the content that can be loaded on demand or automatically after page load).

The report highlights a few things:
 * Parsoid+RESTBase output is always bigger than the output from the MediaWiki endpoints.
 * Parsoid+RESTBase enables performant, cacheable transformations: loot can transform and restructure the content down to a fraction of its size, and the endpoints stay performant because the results are cached.
 * This allows loading different parts of the content separately from the main page content.
 * data-mw attributes are an important fraction of the content served by RESTBase. Work is under way to serve that information separately from a different endpoint, making the HTML leaner.
 * References consistently take up a very large percentage of the total size after stripping data-mw attributes. It seems fair to assume that not loading references in the initial payload would be a net win. Different strategies could be applied depending on further research (loading them automatically after the content has loaded, or loading them on demand when the user wants to check them).
 * This only seems to be an option when using RESTBase+Parsoid as the API endpoint for fetching content, since it relies on the cache infrastructure and the better parsing and transforming capabilities.
 * The navboxes, which are not mobile friendly, also take up a considerable percentage of the content. Not serving them, or serving them on demand, seems fair on constrained devices.

It also opens a few questions:
 * Why exactly is Parsoid+RESTBase content that much bigger than the MediaWiki API content?
 * What percentage do navboxes and references amount to when using the MediaWiki APIs?
 * What do these metrics look like with a much bigger sample of articles?
 * What percentage of the content does superficial markup (see How section for definition) amount to? Is it worth cleaning up?
 * What percentage of views make use of navboxes and references?
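
One way to approach the size questions above is to re-serialize a page with and without a given part and compare byte counts. The sketch below is only an illustration under stated assumptions: it is not how loot performs its transformations, the `strip_html` / `size_share` names are hypothetical, the class names used to spot navboxes and reference lists (`navbox`, `references`) are assumed, and it expects well-formed input where every non-void element has an explicit end tag.

```python
import html
from html.parser import HTMLParser

VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class _Stripper(HTMLParser):
    """Re-serializes HTML, dropping chosen attributes and class-matched subtrees."""

    def __init__(self, drop_attrs, drop_classes):
        super().__init__(convert_charrefs=True)
        self.drop_attrs = set(drop_attrs)
        self.drop_classes = set(drop_classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        is_void = tag in VOID_TAGS
        if self.skip_depth:
            if not is_void:
                self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.drop_classes:
            if not is_void:
                self.skip_depth = 1
            return
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{html.escape(v, quote=True)}"'
            for k, v in attrs if k not in self.drop_attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # void elements never open a level, so nothing to close
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def strip_html(source, drop_attrs=(), drop_classes=()):
    parser = _Stripper(drop_attrs, drop_classes)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


def size_share(source, **drop):
    """Fraction of the re-serialized UTF-8 bytes removed by the given drops."""
    full = len(strip_html(source).encode("utf-8"))
    reduced = len(strip_html(source, **drop).encode("utf-8"))
    return (full - reduced) / full if full else 0.0
```

For example, `size_share(page, drop_attrs={"data-mw"})` estimates the weight of the data-mw attributes, and `size_share(page, drop_classes={"references"})` the weight of the reference lists. Because both sides of the comparison are re-serialized the same way, the share is an approximation of, not an exact match for, the bytes on the wire.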

Notes from conversations

 * data-mw separation is a big win and is being worked on
 * Unique ids per element and  would be interesting to measure
 * Parsoid seems to reduce the number of tags in the DOM, which is also beneficial for performance in constrained browsers. It would be interesting to measure.
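
As a starting point for the measurements suggested in these notes, element and id counts can be collected with the standard library alone. This is a rough sketch under assumptions: the `dom_stats` name and the shape of its result are hypothetical, and it counts start tags in the markup rather than nodes in a browser-built DOM.

```python
from collections import Counter
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts start tags and how many of them carry an id attribute."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.with_id = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also land here via handle_startendtag.
        self.tags[tag] += 1
        if any(name == "id" for name, _ in attrs):
            self.with_id += 1


def dom_stats(source):
    counter = _TagCounter()
    counter.feed(source)
    counter.close()
    return {
        "elements": sum(counter.tags.values()),
        "with_id": counter.with_id,
        "by_tag": dict(counter.tags),
    }
```

Running `dom_stats` over the same title fetched from each endpoint would give a first comparison of tag counts between Parsoid and the MediaWiki APIs.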