Requests for comment/Text extraction

Currently, Wikimedia sites have an API that can be used when someone wants a text-only or limited-HTML extract of page content. On November 5, 2014, this API was used over 11 million times; more than half of the requests were for plain-text extracts. This RFC discusses the future of this API.
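
For illustration, a plain-text extract of a page's lead section can be requested from this API roughly as follows (a hedged example assuming the prop=extracts module; the exact parameters accepted may vary):

 api.php?action=query&prop=extracts&exintro=1&explaintext=1&titles=Albert%20Einstein&format=json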

Core integration vs. separate extension
Initially, the extract functionality was located in MobileFrontend for practical reasons - it already had an HTML manipulation framework. However, now that this framework has been cleaned up and integrated into core (includes/HtmlFormatter.php), there's no reason the extraction code shouldn't be moved to a more appropriate location.

Arguments for integration into core:
 * This is very basic functionality that is useful for almost every wiki.
 * Core already has HtmlFormatter.

Arguments for creating a separate extension:
 * Keep everything modular.
 * Easier to develop; no need to depend on the pace of core changes.
 * Can easily contain code for WMF-specific extraction process (see below).

WMF-specific extraction
Currently, text extraction consists of two steps (sketched in code after this list):
 * 1) Manipulate the DOM to remove some tags, along with their contents, based on their name, id or class. This is needed, for example, to remove infoboxes or navboxes.
 * 2) Remove some tags but keep their content ("flattening"). If a plain-text extract is needed, all tags are flattened; otherwise only some tags, such as <a>, are.
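
A rough sketch of these two steps using core's HtmlFormatter; the selectors and tag lists below are illustrative only, not the actual production configuration, and $html / $plainText are assumed inputs:

 // Step 1: remove unwanted elements together with their contents.
 $formatter = new HtmlFormatter( HtmlFormatter::wrapHTML( $html ) );
 $formatter->remove( array( 'table', '.navbox', '.infobox', '#coordinates' ) );

 // Step 2: flatten tags while keeping their contents.
 if ( $plainText ) {
     $formatter->flattenAllTags();                // plain text: flatten everything
 } else {
     $formatter->flatten( array( 'a', 'span' ) ); // limited HTML: flatten only some tags
 }

 $formatter->filterContent();
 $extract = $formatter->getText();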

This is adequate for Wikipedia and many other use cases that involve mostly large chunks of text; however, it breaks for some sites like Wiktionary that need more elaborate formatting.

If extracts are integrated into core, custom extraction classes could go into a separate extension (e.g. WikimediaTextExtraction); otherwise, they could be part of the main extraction extension.
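
One possible shape for such site-specific code (all names here are hypothetical, purely to illustrate how per-wiki extraction classes could be plugged in): the generic extractor would delegate to a registered implementation that knows, say, about Wiktionary's language sections and definition lists.

 // Hypothetical interface for per-wiki extraction classes; an extension
 // such as WikimediaTextExtraction could register implementations of it.
 interface TextExtractor {
     /**
      * @param string $html Parsed page HTML
      * @param bool $plainText Whether a plain-text extract is wanted
      * @return string The extract
      */
     public function getExtract( $html, $plainText );
 }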

Extract storage
Currently, extracts are generated on demand and cached in memcached; however, this results in bad worst-case behaviour when many extracts are needed at once, such as for queries over several pages or action=opensearch, which returns 10 results by default. Text extraction involves DOM manipulation and text processing (tens of milliseconds) and, on cache miss, potentially a wikitext parse (which can easily take seconds or even tens of seconds).
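
For example, a single request along the lines of the following (an illustrative query combining the search generator, which returns 10 pages by default, with extracts) may need ten extracts at once, and every cache miss among them can require a fresh parse:

 api.php?action=query&generator=search&gsrsearch=einstein&prop=extracts&exintro=1&explaintext=1&format=json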

Such timing is less than optimal, so I propose extracting text during LinksUpdate and storing it in a separate table. This would allow efficient batch retrieval and immediate availability for every page.
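
A minimal sketch of that approach, assuming a hypothetical text_extracts table, field names and helper class; LinksUpdateComplete is shown as one possible attachment point:

 // Hypothetical handler: regenerate the extract whenever the page is
 // (re)parsed and store it in its own table for cheap batch lookup.
 $wgHooks['LinksUpdateComplete'][] = function ( LinksUpdate $linksUpdate ) {
     $title = $linksUpdate->getTitle();
     $extract = ExtractGenerator::generate( $title ); // hypothetical helper

     wfGetDB( DB_MASTER )->replace(
         'text_extracts',       // hypothetical table name
         array( 'te_page' ),    // unique key: page ID
         array(
             'te_page' => $title->getArticleID(),
             'te_extract' => $extract,
         ),
         __METHOD__
     );
     return true;
 };

Batch retrieval then becomes a single indexed select on te_page for all requested page IDs.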

The new table can live on an extension store, as it doesn't need to be strictly in the same database as the wiki's other tables. To decrease storage requirements, extracts should be generated only for pages in certain namespaces (which also makes sense because the current extraction algorithm was tailored for the Wikipedia main namespace anyway).
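
For instance, a setting along these lines (the variable name is hypothetical) could restrict extract generation to the namespaces the algorithm actually handles well:

 // Hypothetical configuration: only generate and store extracts for
 // pages in these namespaces.
 $wgTextExtractStorageNamespaces = array( NS_MAIN );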