User:Htriedman

I'm a Privacy Engineer with the Wikimedia Foundation's Security and Privacy Team. I focus on algorithmic approaches to privacy and fairness.

Algorithmic Accountability Sheets Proof of Concept
Following my previously-authored guidelines for good algorithmic governance, I've written three proof-of-concept algorithmic accountability sheets, one for each level of the algorithmic decision-making stack (dataset, model, service).

For intelligibility across WMF and the broader Wikimedia community, I've focused on English-language modeling. I'll be analyzing the enwiki edit quality dataset, the damaging/good faith edit models, and the larger ORES service.

In principle, these analyses can be replicated for any algorithmic component, in any language. If/when this process is formalized and put into production, algorithmic components and their analyses will be posted on Wikidata in order to ensure consistency and translatability between languages.

What is the motivation for creating this dataset?

 * The motivation for creating this dataset is to train an English language model that outputs a prediction of edit quality for the ORES service. Specifically, this dataset contains labels indicating, for ~20,000 edits from 2014-2015, whether each edit is "damaging" and whether it was made "in good faith".

Who created this dataset?

 * Aaron Halfaker (aaron.halfaker@gmail.com)

Who currently owns/is responsible for this dataset?

 * The WMF ML team

Who are the intended users of this dataset?

 * WMF employees, community members, and stakeholders

What should this dataset be used for?

 * Training English language models to predict damaging and good faith edits for a production context

What should this dataset not be used for?

 * Training models to predict English language text quality outside the context of MediaWiki services
 * Training models to predict Wikipedia edit quality for any language other than English

What community approval processes has this dataset gone through?

 * It was labeled through a crowd-sourced human computation effort on wikilabels, and ORES, the service that it helped train, is used for reviewing edits every day. Besides that (which may count as an implicit community approval process), there was no official approval process.

What internal or external changes could make this dataset deprecated or no longer usable?

 * A significant amount of time could pass, leaving the 2014-2015 edits unrepresentative of current editing behavior
 * Downstream use cases change
 * Upstream data sources change

How should this dataset be licensed?

 * Creative Commons Attribution ShareAlike 3.0

How is the data collected?

 * Unclear, but it seems like it may be randomly sampled from all revisions between April 2014 and April 2015

Is the data continuously updated? If it is, are there links to older versions of the dataset?

 * No, the data is not continuously updated.

If the data is labeled, how does that process work?

 * This data was labeled in 2015-2016 using wikilabels, a distributed human computation engine. Individual volunteers receive batches of 50 revisions to rate, and are asked to judge 1) whether a revision is in good faith or bad faith and 2) whether a revision is damaging or not damaging.
 * It is unclear whether or not individual judgments serve as final labels for the dataset, or whether multiple judgments are aggregated to compute a label.
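If multiple judgments per revision were aggregated, a simple majority vote would be one plausible scheme. This is purely an illustration of that possibility, not a description of the actual wikilabels pipeline, which (as noted above) is unclear:

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate multiple boolean judgments into one label by majority vote.

    Ties resolve conservatively to False ("not damaging") in this sketch;
    the real pipeline may handle ties differently, or not aggregate at all.
    """
    counts = Counter(judgments)
    return counts[True] > counts[False]

# Three hypothetical raters judged one revision on the "damaging" dimension:
print(majority_label([True, True, False]))  # majority says damaging
print(majority_label([False, True]))        # tie, so not damaging
```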

If the data is preprocessed or cleaned, how does that process work?

 * The revscoring package dynamically fetches the dataset, extracting features about:
    * the type of page a revision occurred on
    * the parent of a revision
    * characters, words, tokens, links, etc. added to or taken away from a page
    * changes in the number of bad words, dictionary words, and non-dictionary words
    * user privileges of the revision author
 * Some columns take the natural log of the feature they encode
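The log-transformed columns mentioned above can be sketched as follows. The feature names here are hypothetical, and I'm assuming a log(1 + x) form so that zero counts stay at zero; revscoring's real feature definitions live in its feature modules and should be consulted for the exact transform:

```python
import math

def log_transform(features):
    """Apply log(1 + x) to count-valued features so that very large
    edits don't dominate the feature space.

    Feature names are hypothetical stand-ins for revscoring's columns.
    """
    return {name: math.log(1 + value) for name, value in features.items()}

raw = {"chars_added": 250, "words_removed": 3, "external_links_added": 0}
print(log_transform(raw))
```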

How is the data distributed statistically?
Total samples: 19,264

Time distribution
Range: 2014-04-15 to 2015-04-14

Minimum month: 1,481 revisions (2014-06-14 to 2014-07-15)

Maximum month: 1,880 revisions (2015-03-15 to 2015-04-14)



Time since registration
Range: 0 to 13.566 years

Revision tags and mobile edits
Total number of revisions with tags: 759

Total number of revisions with at least one mobile edit tag: 497

Distribution of tags:

Geographic analysis of anon revisions
Total number of anon revisions: 3,467

IPv4 anon revisions: 3,305

IPv6 anon revisions: 162

Looking at countries which have contributed more than 25 revisions (76.8% of the available data), this data seems like it is well-distributed across the English-speaking world.
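The IPv4/IPv6 split above can be reproduced mechanically, since logged-out edits record an IP address in the username field. A minimal sketch using the standard library (toy addresses below; real data would come from the dataset itself):

```python
import ipaddress

def split_anon_revisions(usernames):
    """Partition editor 'usernames' into IPv4 and IPv6 groups.

    For logged-out (anon) edits the username field holds an IP address;
    strings that don't parse as IPs are registered usernames and skipped.
    """
    v4, v6 = [], []
    for name in usernames:
        try:
            ip = ipaddress.ip_address(name)
        except ValueError:
            continue  # registered username, not an anon edit
        (v4 if ip.version == 4 else v6).append(name)
    return v4, v6

v4, v6 = split_anon_revisions(["203.0.113.7", "2001:db8::1", "ExampleUser"])
print(len(v4), len(v6))
```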

Text data distributions
Among all revisions: n = 19,264
Among revisions judged as damaging: n = 745
Among revisions judged as good faith: n = 18,758

Are there any sensitive attributes contained in the dataset?
Sensitive attributes include:

 * username (and if no username, then IP address)
 * time since registration
 * page edited
 * exact timestamp of edit

This dataset can also be linked with other on-wiki data (timestamp, comments, mobile edits, etc.) through mwapi, and IP addresses can be resolved to (relatively precise) locations.

New editors (<1 year since account creation)
Total: 6,215

Anonymous editors
Total: 3,467

Mobile editors
Total: 497

Non-US anonymous editors
Total: 2,114

Which models and services rely on this dataset?
The enwiki edit quality good faith/damaging model (described below) trains on this dataset. That model is a part of the ORES service, which indirectly relies on this dataset.

What is the motivation behind creating this model?

 * To prioritize review of potentially damaging edits or vandalism. The model provides a guess at whether or not a given revision is damaging, and provides some probabilities to serve as a measure of its confidence level.

Who created this model?

 * Aaron Halfaker (aaron.halfaker@gmail.com) and Amir Sarabadani (amir.sarabadani@wikimedia.de)

Who currently owns/is responsible for this model?

 * The WMF ML Team

Who are the intended users of this model?

 * English Wikipedia uses the model as a service for facilitating efficient edit reviews. On an individual basis, anyone can submit a properly-formatted API call to ORES for a given revision and get back the result of this model.
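As a concrete illustration, a properly-formatted request against the ORES v3 API can be built like this. The path layout is based on ORES's public v3 API (/v3/scores/{context}/{revid}/{model}); the host and exact response shape should be checked against the live service documentation:

```python
def ores_score_url(wiki, rev_id, model, host="https://ores.wikimedia.org"):
    """Build an ORES v3 scoring URL for one revision and one model.

    Example: ores_score_url("enwiki", 1234567, "damaging") requests the
    damaging model's score for enwiki revision 1234567.
    """
    return f"{host}/v3/scores/{wiki}/{rev_id}/{model}"

print(ores_score_url("enwiki", 1234567, "damaging"))
```

Fetching that URL returns a JSON document containing the model's prediction and probability estimates for the revision.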

What should this model be used for?

 * This model should be used for prioritizing the review and potential reversion of damaging edits and vandalism on English Wikipedia.

What should this model not be used for?

 * This model should not be used as an ultimate arbiter of whether or not an edit ought to be considered damaging.


 * It should not be used for any other English-language wiki besides English Wikipedia, and shouldn't be used for other languages.

What community approval processes has this model gone through?

 * English Wikipedia decided (note: don't know where/when this decision was made, would love to find a link to that discussion) to use this model.

What internal or external changes could make this model deprecated or no longer usable?

 * Data drift means training data for the model is no longer usable
 * Doesn't meet desired performance metrics in production
 * English Wikipedia community decides to not use this model anymore

How should this model be licensed?

 * Creative Commons Attribution ShareAlike 3.0

What is the architecture of this model?

 * Scikit-learn Gradient Boosting Trees Classifier
 * Parameters:
    * Learning rate = 0.01
    * Max depth = 7
    * Max features = log2
    * Number of estimators = 700
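These hyperparameters map directly onto scikit-learn's classifier. The sketch below uses synthetic data purely to show the instantiation; the production model is of course trained on the full revscoring feature set, not this toy matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Instantiate with the hyperparameters listed in the model metadata.
clf = GradientBoostingClassifier(
    learning_rate=0.01,
    max_depth=7,
    max_features="log2",
    n_estimators=700,
    random_state=0,
)

# Tiny synthetic stand-in for the revscoring feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # fake "damaging" labels

clf.fit(X, y)
proba = clf.predict_proba(X[:1])[0]  # [P(not damaging), P(damaging)]
print(proba.sum())
```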

How does the model perform on test/real world data across different languages, different geographies, different devices, etc.?
To test this model, I found labeled data from the unfinished 2016 edit quality campaign. That campaign never reached its intended goal of 6,333 labeled revisions, but there are 3,843 usable damaging/not damaging labels in the dataset, encompassing edits from August 23, 2015 to August 22, 2016. This dataset uses the exact same schema as the data the model was trained on.

Of these 3,843 labels, 846 (~21.5%) of them were labeled damaging and 3,017 (~78.5%) of them were labeled not damaging.

The results of analyzing this data are below:

First I compared the metrics about training stored in the model metadata with the results of the model on the test set from a year later. Not all of the fields are filled in for the training data, but you get the drift. Next, I've broken down performance along the binary of new editors (<1 year since account creation) vs. not-new editors, anonymous editors vs. not-anonymous editors, and mobile vs. desktop editors.

Accuracy is below the average for the test set for new editors, anonymous editors, and mobile editors and above the average for old editors, non-anonymous editors, and desktop editors.
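The subgroup breakdown above amounts to computing accuracy inside and outside a boolean group mask. A minimal sketch with toy labels and predictions (the actual analysis runs over the 2015-2016 test set):

```python
def grouped_accuracy(y_true, y_pred, group_mask):
    """Accuracy within and outside a boolean subgroup mask,
    e.g. new vs. not-new editors, or anon vs. registered."""
    def acc(rows):
        rows = list(rows)
        return sum(t == p for t, p in rows) / len(rows)
    inside = [(t, p) for t, p, g in zip(y_true, y_pred, group_mask) if g]
    outside = [(t, p) for t, p, g in zip(y_true, y_pred, group_mask) if not g]
    return acc(inside), acc(outside)

# Toy data; is_new is a hypothetical "new editor" flag.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
is_new = [True, True, True, False, False, False]
print(grouped_accuracy(y_true, y_pred, is_new))
```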

tk add something about demographic parity, equal opportunity, equalized odds.

What about at the intersections of those attributes?
We have location data for anonymous editors in the dataset. As such, we can analyze them based on whether or not they're in the US. There isn't much difference in performance between these two groups, though both are worse than the overall test accuracy. Finally, we can look at other intersections. As expected, performance is far lower than on the test set as a whole, with accuracy around 0.72-0.74.

If this model is retrained, can we see how it has changed over time?
To my knowledge, this model is not retrained over time — it still uses the original dataset from 2014-2015.

How does this model mitigate data drift?
This model does not mitigate data drift.

If there are decision thresholds for classification, what are those thresholds?
The decision threshold for classification is the default value of 0.5. However, the model also provides confidence estimates, so users can get a sense of how certain the model is.
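In code, applying that 0.5 threshold to the model's probability output is a one-liner. The response shape below is an assumption based on typical ORES probability output, and whether the boundary case uses >= or > is my guess, not confirmed behavior:

```python
def classify(damaging_probability, threshold=0.5):
    """Turn the model's P(damaging) into a binary call.

    The >= at the boundary is an assumption; the service may use strict >.
    """
    return damaging_probability >= threshold

# ORES-style probability output for one revision (values are made up):
probs = {"true": 0.62, "false": 0.38}
print(classify(probs["true"]))
```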

Which services rely on this model?
ORES (service sheet below)

Which datasets does this model rely on?
The 2014-2015 edit quality dataset (datasheet above)

ORES Service Sheet
tk