Requests for comment/Text extraction

Request for comment (RFC): Text extraction

  • Component: General
  • Creation date:
  • Author(s): Max Semenik (talk)
  • Document status: accepted (Tim Starling in IRC)

See Phabricator.

Currently, Wikimedia sites have an API module, action=query&prop=extracts, that can be used when someone wants a text-only or limited-HTML extract of page content. On November 5, 2014, this API was used over 11 million times; more than half of the requests were for plain-text extracts. This RFC discusses the future of this API.
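
For example, a plain-text extract of a page's lead section can be requested with something like the following (explaintext and exintro are existing parameters of the module; the page title here is purely illustrative):

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=&exintro=&titles=Albert%20Einstein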

Core integration vs. separate extension

This part of the RFC has been withdrawn as no longer relevant.

Initially, the extract functionality was located in MobileFrontend for practical reasons: it already had an HTML manipulation framework. However, now that this framework has been cleaned up and integrated into core (includes/HtmlFormatter.php), there's no reason the extract code shouldn't be moved to some more appropriate location.

Arguments for integration into core:

  • This is very basic functionality, useful for almost every wiki.
  • Core already has HtmlFormatter.

Arguments for creating a separate extension:

  • Keep everything modular.
  • Easier to develop, no need to depend on the pace of core changes.
  • Can easily contain code for the WMF-specific extraction process (see below).

WMF-specific extraction

Currently, text extraction consists of two steps:

  1. Manipulate the DOM to remove some tags, along with their contents, based on their name, id or class. This is needed, for example, to remove infoboxes or navboxes.
  2. Remove some tags but keep their content ("flattening"). If a plain-text extract is needed, all tags are flattened; otherwise only some tags, like <a>, are.
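
As a rough sketch, assuming core's HtmlFormatter API, the two steps look roughly like this (the selectors and tag lists below are illustrative examples, not the actual configured lists):

// Step 1: remove selected elements together with their contents,
// e.g. infoboxes and navigation boxes.
$formatter = new HtmlFormatter( $html );
$formatter->remove( [ 'table', '.infobox', '.navbox', '#toc' ] );
$formatter->filterContent();

// Step 2: flatten tags but keep their content.
if ( $plainText ) {
    // Plain-text extract: strip all tags.
    $formatter->flattenAllTags();
} else {
    // Limited-HTML extract: only flatten some tags, e.g. links.
    $formatter->flatten( [ 'a', 'span' ] );
}

$extract = $formatter->getText();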

This is adequate for Wikipedia and many other uses that consist mostly of large chunks of text; however, it breaks for some sites, like Wiktionary, that need more elaborate formatting.

# We already have this:
class ExtractFormatter extends HtmlFormatter {
    ...
}

# But how about this:
class WiktionaryExtractFormatter extends ExtractFormatter {
    ...
}

If extracts are integrated into core, custom extraction classes could go into a separate extension (e.g. WikimediaTextExtraction); otherwise, they could be part of the main extraction extension.

Extract storage

Currently, extracts are generated on demand and cached in memcached. However, this results in bad worst-case behaviour when many extracts are needed at once, for example for queries over several pages or for action=opensearch, which returns 10 results by default. Text extraction involves DOM manipulation and text processing (tens of milliseconds) and, on a cache miss, potentially a wikitext parse (which can easily take seconds or even tens of seconds).

Such timing is less than optimal, so I propose extracting text during LinksUpdate and storing it in a separate table. This will allow efficient batch retrieval and 100% immediate availability.
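
A minimal sketch of how the update could look, assuming a handler for the LinksUpdateComplete hook and the table proposed below (the extraction details and error handling are omitted, and te_page is assumed to be the table's unique key):

public static function onLinksUpdateComplete( LinksUpdate $linksUpdate ) {
    // Reuse the HTML that was just parsed for this page.
    $html = $linksUpdate->getParserOutput()->getText();
    // ... run the two extraction passes described above to produce
    // $htmlExtract and $plainExtract ...

    $dbw = wfGetDB( DB_MASTER );
    $dbw->upsert(
        'text_extracts',
        [
            'te_page' => $linksUpdate->getTitle()->getArticleID(),
            'te_html' => $htmlExtract,
            'te_plain' => $plainExtract,
            'te_touched' => $dbw->timestamp(),
        ],
        [ 'te_page' ],
        [
            'te_html' => $htmlExtract,
            'te_plain' => $plainExtract,
            'te_touched' => $dbw->timestamp(),
        ],
        __METHOD__
    );
    return true;
}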

CREATE TABLE text_extracts (
    -- key to page_id
    te_page INT NOT NULL,
    -- Limited-HTML extract
    te_html MEDIUMBLOB NOT NULL,
    -- Plain text extract
    te_plain MEDIUMBLOB NOT NULL,
    -- Timestamp for looking up rows needing an update due to code or configuration change
    te_touched BINARY(14) NOT NULL
);
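
With the extracts precomputed, a whole batch of pages (e.g. the 10 opensearch results) can then be served with a single query; a sketch:

// $pageIds is the list of page IDs the API query is serving.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'text_extracts',
    [ 'te_page', 'te_plain' ],
    [ 'te_page' => $pageIds ],
    __METHOD__
);
$extracts = [];
foreach ( $res as $row ) {
    $extracts[$row->te_page] = $row->te_plain;
}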

The new table can live on an extension store, as it doesn't need to be strictly in the same DB as the wiki's other tables. To decrease storage requirements, extracts should be generated only for pages in certain namespaces (the current extraction algorithm was tailored for the Wikipedia main namespace anyway).
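
The namespace restriction could be a simple configuration setting, for example (the variable name here is purely hypothetical):

// Only generate and store extracts for pages in these namespaces.
$wgTextExtractsNamespaces = [ NS_MAIN ];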

See also