ORES/Components

Warning: The ORES infrastructure is being deprecated by the Machine Learning team; please check wikitech:ORES for more information.

DRAFT description of the ORES system and components. For an overview of the network architecture as deployed to WMF servers, see wikitech [1].

Overview of data flow

Change propagation

Data flow detail diagram explaining how ORES learns of new revisions.

When a new revision is encountered in the mediawiki.revision_create Kafka change propagation stream, we tickle the revscoring API to pre-cache scores for that revision, once for each revscoring model we support on that wiki:

https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/blob/master/scap/templates/config.yaml.j2#L327

The new revision trigger hits URLs of this pattern:

   https://ores.wikimedia.org/v2/scores/<wiki>/<model>/<revision_id>/?precache=true

for example,

   https://ores.wikimedia.org/v2/scores/enwiki/damaging/745065890/?precache=true
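
As a rough sketch of what that trigger amounts to (this is not the actual change-propagation code, and the wiki-to-model mapping below is invented for illustration), a single revision event could be handled like this in Python:

   import requests

   # Hypothetical mapping of wikis to the revscoring models deployed for them.
   MODELS_BY_WIKI = {"enwiki": ["damaging", "goodfaith"]}

   def precache_revision(wiki, rev_id):
       """Ask ORES to score (and therefore cache) a newly created revision."""
       for model in MODELS_BY_WIKI.get(wiki, []):
           url = f"https://ores.wikimedia.org/v2/scores/{wiki}/{model}/{rev_id}/"
           # precache=true marks this as a cache-warming request.
           requests.get(url, params={"precache": "true"}, timeout=10)

   # e.g. in response to a mediawiki.revision_create event:
   precache_revision("enwiki", 745065890)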

We're currently having some discussion about how to simplify the configuration for this system.

ORES API

See the online documentation: https://ores.wikimedia.org/. The ORES API is a container for multiple machine learning models that take article revisions as their input. URLs are RESTful, and a given URL will always return the same result until the model is updated. The 'precache' parameter is passed through to metrics collection and has no other side effects. All requests intended to pre-populate ORES' cache should include precache=<some identifying string>.
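
As an illustration only (the real response body is richer than shown here), a client can fetch a single score with a plain HTTP GET against the v2 URL pattern described above:

   import requests

   def get_score(wiki, model, rev_id):
       """Fetch one revision score from the public ORES API."""
       url = f"https://ores.wikimedia.org/v2/scores/{wiki}/{model}/{rev_id}/"
       resp = requests.get(url, timeout=10)
       resp.raise_for_status()
       return resp.json()

   print(get_score("enwiki", "damaging", 745065890))

Because a given URL keeps returning the same result until the model version changes, responses are safe for clients to cache aggressively.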

Celery

Celery is used to manage concurrency among ORES API workers. We're guaranteed to never run more than CELERYD_CONCURRENCY workers at the same time (per machine), and the web frontends can be decoupled from scoring workers.

The service refuses to create new jobs or serve requests once the queue size goes above the configured queue_maxsize.
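
The exact overload check lives in the ORES configuration; as a minimal sketch of the idea, assuming a Redis broker where a queue's backlog is the length of a Redis list, it might look like:

   import redis

   QUEUE_MAXSIZE = 100  # illustrative value; the real limit comes from configuration

   broker = redis.Redis(host="localhost", port=6379)

   def can_accept_request(queue_name="celery"):
       """Refuse new work once the pending-task backlog is too long."""
       return broker.llen(queue_name) < QUEUE_MAXSIZE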

ORES uses Celery's task ID naming system to avoid recalculating scores when (nearly) simultaneous requests for the same score arrive. Instead, all such requests read from the same task ID once computation of the score has completed.
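
A minimal sketch of that deduplication idea (not the actual ORES implementation, and ignoring race conditions between the check and the submit):

   from celery import Celery
   from celery.result import AsyncResult

   app = Celery("ores", broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")
   app.conf.worker_concurrency = 16  # corresponds to CELERYD_CONCURRENCY

   @app.task
   def score_revision(wiki, model, rev_id):
       return {"score": ...}  # run the revscoring model here (omitted)

   def request_score(wiki, model, rev_id, model_version):
       # Deterministic task ID: simultaneous requests for the same score
       # attach to the same task instead of recomputing it.
       task_id = f"{wiki}:{model}:{rev_id}:{model_version}"
       result = AsyncResult(task_id, app=app)
       if result.state == "PENDING":  # no task with this ID has been submitted yet
           result = score_revision.apply_async(
               args=(wiki, model, rev_id), task_id=task_id)
       return result.get(timeout=30)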

Revscoring engine

Read more about the various machine learning models on the Meta-Wiki ORES page.

When a model is updated to a new version, all cached scores for that model are invalidated. It is up to clients that cache scores to invalidate based on version numbers as well.

Worker

MediaWiki Extension:ORES

This frontend displays revscoring data on the Special:Contributions and Special:RecentChanges pages.

We create a FetchScoreJob in response to the RecentChange_save event, which fetches scores from the ORES API and caches them in the local MediaWiki database for efficient access.

Caching

Varnish

Unlike most WMF services, we don't use the Varnish front-end cache.

Redis

The ORES backend stores scores in Redis as they are calculated, and will serve scores from that cache. Each score is saved under a key like ores:<wiki>:<model>:<revision_id>:<model_version>.
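
A minimal sketch of how such a cache lookup could work, assuming the redis-py client and JSON-serialized scores (details such as TTLs and error handling are omitted):

   import json
   import redis

   r = redis.Redis(host="localhost", port=6379)

   def cache_key(wiki, model, rev_id, model_version):
       # Matches the key pattern described above; bumping the model version
       # produces new keys, so stale scores simply stop being read.
       return f"ores:{wiki}:{model}:{rev_id}:{model_version}"

   def get_or_score(wiki, model, rev_id, model_version, score_fn):
       key = cache_key(wiki, model, rev_id, model_version)
       cached = r.get(key)
       if cached is not None:
           return json.loads(cached)
       score = score_fn(wiki, model, rev_id)  # compute via revscoring
       r.set(key, json.dumps(score))
       return score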

MediaWiki database

As Extension:ORES pulls scores from the ORES API, they are stored in the ores_classification table.

The CheckModelVersions job checks for updates to the models, causing us to purge cached scores from previous versions of the model.

Wikilabels

Humans build the training and validation sets used to train the models by answering questions about sets of revisions in the Wikilabels interface.