Page Previews/API Specification

This article outlines the specification for a new Node.js-based API that generates summaries for MediaWiki-based wikis and replaces the existing TextExtracts API.

Background & Motivation
Up until now, we've mostly gotten away with using the  MediaWiki API provided by TextExtracts and RESTBase to allow us to scale out Page Previews to a couple of large Wikipedias without issue.

However, the requirement that certain classes of pages should be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the  API and the new Page Preview API rather than integrating them but this is not a goal of this work.

To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.

Intros
The API returns well-formed HTML representing the introductory elements of a page. Herein we'll refer to these elements as an "intro". They are defined as follows:
 * The first paragraph from the introductory section.
 * The first ordered, unordered, or definition list that is the next sibling of the first paragraph.
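
The definition above can be sketched roughly as follows. This is a minimal illustration in Python, not the service's implementation: the function name `extract_intro` is hypothetical, and it assumes the lead section arrives as a well-formed, XML-compatible HTML fragment whose paragraphs and lists are top-level siblings.

```python
import xml.etree.ElementTree as ET

# Ordered, unordered, and definition lists.
LIST_TAGS = {"ol", "ul", "dl"}


def _serialize(element):
    # Drop trailing text between siblings so that only the element is emitted.
    element.tail = None
    return ET.tostring(element, encoding="unicode", method="html")


def extract_intro(lead_section_html):
    """Return the first paragraph plus a list that directly follows it, if any."""
    # Assumption: the fragment is well-formed XML (no named HTML entities).
    # Wrapping it in a root element lets it parse as a single tree.
    root = ET.fromstring("<section>" + lead_section_html + "</section>")
    children = list(root)
    for index, child in enumerate(children):
        if child.tag != "p":
            continue
        parts = [child]
        # Only a list that is the next sibling of the first paragraph counts.
        if index + 1 < len(children) and children[index + 1].tag in LIST_TAGS:
            parts.append(children[index + 1])
        return "".join(_serialize(part) for part in parts)
    # No paragraph at all: there is no intro.
    return ""
```

A list that follows a later paragraph is ignored, because only the next sibling of the first paragraph is considered.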

Plaintext intros
Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.

Empty intros
After the HTML intro has been processed (see below), it may not contain text content but still contain HTML, e.g. . Any processed intro that doesn't contain text content must be considered empty.

✅
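
One way to express the emptiness check is to collect the text nodes of the processed intro and see whether anything other than whitespace remains. The sketch below uses Python's standard `html.parser`; the name `is_empty_intro` is hypothetical, and treating whitespace-only text as empty is an assumption rather than something the specification states.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def is_empty_intro(processed_html):
    """Return True if the processed intro has no text content."""
    collector = _TextCollector()
    collector.feed(processed_html)
    collector.close()
    # Assumption: an intro containing only whitespace counts as empty.
    return "".join(collector.chunks).strip() == ""
```

Under these assumptions, a fragment such as `<p><span></span></p>` is empty even though it still contains HTML.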

Markup allowed in an intro
By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.

Emphasis
The API must retain any bolded or italicised text in the intro, i.e. the Page Preview API must not remove,  , and   tags.

✅

Formulae/MathML
In order to support browsers that don't support MathML, the API: ✅
 * 1) Must remove   tags; and
 * 2) Must not remove either the inline or block layout fallback images generated by Math while parsing the page.

Super- and subscript
The API must retain all  and   tags that are not generated by Cite, i.e.   elements.

✅

Stripping of parenthetical statements
The API must remove all content enclosed within balanced parentheses. Parentheses will be defined as the following characters: ( ) and （ ）

✅
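
A stack-based scan is one way to find balanced pairs. The sketch below is illustrative only: it assumes the ASCII pair ( ) and the fullwidth pair （ ）, works on plain text rather than HTML, leaves unbalanced parentheses untouched, and does not tidy up the whitespace left behind. The specification does not settle any of these points, and the name `strip_parentheticals` is hypothetical.

```python
# Assumed parenthesis pairs: ASCII and fullwidth.
OPEN_TO_CLOSE = {"(": ")", "\uff08": "\uff09"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def strip_parentheticals(text):
    """Remove every balanced parenthetical, including the parentheses themselves."""
    stack = []  # (index, opening character) of parentheses still open
    spans = []  # (start, end) index pairs of balanced parentheticals
    for index, char in enumerate(text):
        if char in OPEN_TO_CLOSE:
            stack.append((index, char))
        elif char in CLOSERS and stack and OPEN_TO_CLOSE[stack[-1][1]] == char:
            start, _ = stack.pop()
            spans.append((start, index))
        # Any other closing parenthesis is unbalanced and is kept as-is.

    # Mark every character inside a balanced span for removal.
    remove = [False] * len(text)
    for start, end in spans:
        for position in range(start, end + 1):
            remove[position] = True
    return "".join(char for char, drop in zip(text, remove) if not drop)
```

Nested parentheticals are removed along with their enclosing pair, while an opening parenthesis with no matching close is preserved.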

Flattening inline elements
The API must replace all  and   tags with their text content, e.g.   should be flattened to   and   would be flattened to.

✅
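
Flattening can be sketched as unwrapping: the opening and closing tags are dropped and whatever they enclosed is kept. In the sketch below the tag names `a` and `span` are assumptions chosen for illustration, and `flatten_inline` is a hypothetical name. A regular expression is used for brevity; a real implementation would more likely operate on a parsed DOM. Note also that this keeps any markup nested inside the flattened element, which matches "text content" only when the element contains plain text.

```python
import re


def flatten_inline(html, tag_names=("a", "span")):
    """Drop the opening and closing tags of the given elements, keeping their contents."""
    names = "|".join(re.escape(name) for name in tag_names)
    # Matches <a ...>, </a>, <span ...>, </span>, but not longer names such as <abbr>.
    pattern = re.compile(r"</?(?:" + names + r")\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)
```

For example, `flatten_inline('<a href="/wiki/Cat">cats</a>')` yields `cats` under these assumptions.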

The API must remove any element with the  class to replicate the current behaviour of TextExtracts.

✅

Line breaks
It is assumed that any line breaks in the summary are necessary for the display of the content. We thus do not remove any instance of a line break that appears in the lead paragraph of a summary.

Parameters
✅

Responses
A successful response from the Page Preview API must, like those of all existing endpoints, have the following properties:

The new summary endpoint will hydrate these properties with the additional fields specific to summaries:

✅

Where an  type property must have the following properties: And a  type property must have the following properties:

For a page in the wiki's content namespace(s)
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

If the page has a corresponding Wikidata item, then the  property must be set to the item's description.

✅

For a page outside of the wiki's content namespaces
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property must be set to.

The  property must be set to.

✅

For a page that doesn't use the wikitext, wikibase-item, or wikibase-property content model
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property must be set to.

The  property must be set to.

✅

For a disambiguation page
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property of the response must be set to the first N links from the disambiguation page.

The  property of the response should be set to the intro of the page so that the client may display it if appropriate.

For a page that doesn't exist
The Page Preview API must respond with 404 Not Found.

The response body must be empty.

✅
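
The difference between this case and the 200 OK cases can be sketched as below. The helper `build_http_response` is hypothetical, and it assumes the summary has already been computed as a dictionary, with `None` standing for a page that doesn't exist. The redirect case is deliberately not covered here.

```python
import json


def build_http_response(summary):
    """Return a (status code, body) pair for a computed summary."""
    # Assumption: None means the requested page doesn't exist.
    if summary is None:
        # 404 Not Found with an empty response body.
        return 404, ""
    # Every other case covered here responds with 200 OK and a JSON body.
    return 200, json.dumps(summary)
```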

For a page that doesn't have a lead section
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property of the response must be set to.

Examples
✅
 * 1) https://en.wikipedia.org/wiki/Wikipedia:Dashboard

For a page that has an empty intro
The response must be the same as the "For a page that doesn't have a lead section" case.

✅

For a page that redirects to another page
The Page Preview API must respond with 302 Found.

The  HTTP header must be set to the URL that will get the intro for the target page.

Note: RESTBase handles redirects transparently to the underlying service (see T176517#3634838).

The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property must be set to.

The  property must be set to.

For a Wikidata item
This overrides the "For a page in the wiki's content namespace" case above.

The  property of the response must be set to "wikidata_preview".

The  property of the response must be set to the item's label.

If the item has the image property set (to I):
 * The  property of the response must be set to the   object that represents the Wikimedia Commons file referenced by I.
 * The  property of the response must be set to the   object that represents the corresponding thumbnail.

For a Wikidata item with no description
The response should be the same as the "For a Wikidata item" case apart from the following:

The  property of the response must be set to.