Topic on Talk:ORES/Paper

ORES system: Open, transparent process

1
EpochFail (talkcontribs)

Our goals in the development of ORES and the deployment of models is to keep the process -- the flow of data from random samples to model training and evaluation open for review, critique, and iteration. In this section, we'll describe how we implemented transparent replay-ability in our model development process and how ORES outputs a wealth of useful and nuanced information for users. By making this detailed information available to users and developers, we hope to enable flexibility and power in the evaluation and use of ORES predictions for novel purposes.

Gathering labeled data

There are two primary strategies for gathering labeled data for ORES' models: found traces and manual labels.

Found traces. For many models, there are already a rich set of digital traces that can be assumed to reflect a useful human judgement. For example, in Wikipedia, it's very common that damaging edits will be reverted and that good edits will not be reverted. Thus the revert action (and remaining traces) can be used to assume that the reverted edit is damaging. We have developed a re-usable script[1] that when given a sample of edits, will label the edits as "reverted_for_damage" or not based on a set of constraints: edit was reverted within 48 hours, the reverting editor was not the same person, and the edit was not restored by another editor.

However, this "reverted_for_damage" label is problematic in that many edits are reverted not because they are damaging but because they are involved in some content dispute. Also, the label does not differentiate damage that is a good-faith mistake from damage that is intentional vandalism. So in the case of damage prediction models, we'll only make use of the "reverted_for_damage" label when manually labeled data is not available.

Another case of found traces is article quality assessments -- named "wp10" after the Wikipedia 1.0 assessment process originated the article quality assessment scale[2]. We follow the process developed by Warncke-Wang et al.[3] to extract the revision of an article that was current at the time of an assessment. Many other wikis employ a similar process of article quality labeling (e.g. French Wikipedia and Russian Wikipedia), so we can use the same script to extract their assessments with some localization[4]. However other wikis either do not apply the same labeling scheme consistently or at all and manual labeling is our only option.

The Wiki labels interface embedded in Wikipedia

Manual labeling. We hold manual labels for the purposes of training a model to replicate a specific human judgement as a gold standard. This contrasts with found data that is much easier to come by when it is available. Manual labeling is expensive upfront from a human labor hours perspective. In order to minimize the investment of time among our collaborators (mostly volunteer Wikipedians), we've developed a system called "Wiki labels"[5]. Wiki labels allows Wikipedians to submit judgments of specific samples of Wiki content using a convenient interface and logging in via their Wikipedia account.

To supplement our models of edit quality, we replace the models based on found "reverted_for_damage" traces with manual judgments where we specifically ask labelers to distinguish "damaging"/good from "good-faith"/vandalism. Using these labels we can build two separate models of that can allow users to filter for edits that are likely to be good-faith mistakes[6], to just focus on vandalism, or to focus on all damaging edits broadly.

We've managed to complete manual labeling campaigns article quality for Turkish and Arabic Wikipedia (wp10) as well as item quality in Wikidata. We've found that, when working with manually labeled data, we can attain relatively high levels of fitness with 150 observations per quality class.

Explicit pipelines

One of our openness goals with regards to how prediction models are trained and deployed in ORES involves making the whole data flow process clear. Consider the following code that represents a common pattern from our model-building Makefiles:

datasets/enwiki.human_labeled_revisions.20k_2015.json:
        ./utility fetch_labels \
                https://labels.wmflabs.org/campaigns/enwiki/4/ > $@

datasets/enwiki.labeled_revisions.w_cache.20k_2015.json: \
                datasets/enwiki.labeled_revisions.20k_2015.json
        cat $< | \
        revscoring extract \
                editquality.feature_lists.enwiki.damaging \
                --host https://en.wikipedia.org \
                --extractor $(max_extractors) \
                --verbose > $@

models/enwiki.damaging.gradient_boosting.model: \
                datasets/enwiki.labeled_revisions.w_cache.20k_2015.json
        cat $^ | \
        revscoring cv_train \
                revscoring.scoring.models.GradientBoosting \
                editquality.feature_lists.enwiki.damaging \
                damaging \
                --version=$(damaging_major_minor).0 \
                (... model parameters ...)
                --center --scale > $@

Essentially, this code helps someone determine where the labeled data comes from (manually labeled via the Wiki Labels system). It makes it clear how features are extracted (using the utility revscoring extract and the enwiki.damaging feature set). Finally, this dataset with extracted features is used to cross-validate and train a model predicting the damaging label and a serialized version of that model is written to a file. A user could clone this repository, install the set of requirements, and run "make enwiki_models" and expect that all of the data-pipeline would be reproduced.

By explicitly using public resources and releasing our utilities and Makefile source code under an open license (MIT), we have essential implemented a turn-key process for replicating our model building and evaluation pipeline. A developer can review this pipeline for issues knowing that they are not missing a step of the process because all steps are captured in the Makefile. They can also build on the process (e.g. add new features) incrementally and restart the pipeline. In our own experience, this explicit pipeline is extremely useful for identifying the origin of our own model building bugs and for making incremental improvements to ORES' models.

At the very base of our Makefile, a user can run "make models" to rebuild all of the models of a certain type. We regularly perform this process ourselves to ensure that the Makefile is an accurate representation of the data flow pipeline. Performing complete rebuild is essential when a breaking change is made to one of our libraries. The resulting serialized models are saved to the source code repository so that a developer can review the history of any specific model and even experiment with generating scores using old model versions.

Model information

In order to use a model effectively in practice, a user needs to know what to expect from model performance. E.g. how often is it that when an edit is predicted to be "damaging" it actually is? (precision) or what proportion of damaging edits should I expect will be caught by the model? (recall) The target metric of an operational concern depends strongly on the intended use of the model. Given that our goal with ORES is to allow people to experiment with the use and reflection of prediction models in novel ways, we sought to build an general model information strategy.

https://ores.wikimedia.org/v3/scores/enwiki/?model_info&models=damaging returns:

      "damaging": {
        "type": "GradientBoosting",
        "version": "0.4.0",
        "environment": {"machine": "x86_64", ...},
        "params": {center": true, "init": null, "label_weights": {"true": 10},
                   "labels": [true, false], "learning_rate": 0.01, "min_samples_leaf": 1,
                   ...},
        "statistics": {
          "counts": {"labels": {"false": 18702, "true": 743},
                     "n": 19445,
                     "predictions": {"false": {"false": 17989, "true": 713},
                                     "true": {"false": 331, "true": 412}}},
          "precision": {"labels": {"false": 0.984, "true": 0.34},
                        "macro": 0.662, "micro": 0.962},
          "recall": {"labels": {"false": 0.962, "true": 0.555},
                     "macro": 0.758, "micro": 0.948},
          "pr_auc": {"labels": {"false": 0.997, "true": 0.445},
                     "macro": 0.721, "micro": 0.978},
          "roc_auc": {"labels": {"false": 0.923, "true": 0.923},
                      "macro": 0.923, "micro": 0.923},
          ...
        }
      }

The output captured in Figure ?? shows a heavily trimmed JSON (human- and machine-readable) output of model_info for the "damaging" model in English Wikipedia. Note that many fields have been trimmed in the interest of space with an ellipsis ("..."). What remains gives a taste of what information is available. Specifically, there's structured data about what kind of model is being used, how it is parameterized, the computing environment used for training, the size of the train/test set, the basic set of fitness metrics, and a version number so that secondary caches know when to invalidate old scores. A developer using an ORES model in their tools can use these fitness metrics to make decisions about whether or not a model is appropriate and to report to users what fitness they might expect at a given confidence threshold.

The scores

The predictions made by through ORES are also, of course, human- and machine-readable. In general, our classifiers will report a specific prediction along with a set of probability (likelihood) for each class. Consider article quality (wp10) prediction output in figure ??.

https://ores.wikimedia.org/v3/scores/enwiki/34234210/wp10 returns

        "wp10": {
          "score": {
            "prediction": "Start",
            "probability": {
              "FA": 0.0032931301528326693,
              "GA": 0.005852955431273448,
              "B": 0.060623380484537165,
              "C": 0.01991363271632328,
              "Start": 0.7543301344435299,
              "Stub": 0.15598676677150375
            }
          }
        }

A developer making use of a prediction like this may choose to present the raw prediction "Start" (one of the lower quality classes) to users or to implement some visualization of the probability distribution across predicted classed (75% Start, 16% Stub, etc.). They might even choose to build an aggregate metric that weights the quality classes by their prediction weight (e.g. Ross's student support interface[7] or the weighted_sum metric from [8]).

Threshold optimization

(import from other thread/essay)

  1. see "autolabel" in https://github.com/wiki-ai/editquality
  2. en:WP:WP10
  3. Warncke-Wang, Morten (2017): English Wikipedia Quality Asssessment Dataset. figshare. Fileset. https://doi.org/10.6084/m9.figshare.1375406.v2
  4. see the "extract_labelings" utility in https://github.com/wiki-ai/articlequality
  5. m:Wiki labels
  6. see our report meta:Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04
  7. Sage Ross, Structural completeness
  8. Keilana Effect paper
Reply to "ORES system: Open, transparent process"