ORES/Newcomerquality

Overview
The newcomer quality project aims to use machine learning to predict how damaging or goodfaith new editors on Wikipedia projects are within their first few edits. The technology builds on the ORES platform, aggregating the lower-level predictions about edits into higher-level predictions about edit sessions, and finally into predictions about users.

Community Motivation
Newcomer retention is one of the largest problems facing Wikipedias today. One approach that has found success is newcomer welcoming and mentoring programs such as en:Wikipedia:Teahouse (or fr:WP:Forum des Nouveaux). However, getting new editors to those forums usually involves either a) inviting all newcomers, which overwhelms mentors and potentially invites vandals, or b) inviting a subset of newcomers based on heuristics, which could miss some good editors. Artificial intelligence or machine learning could potentially bridge the gap by inviting only the best newcomers without humans having to sort through the hundreds or thousands of newly registered editors each day.

Technical Motivation
ORES can already predict the quality of single edits and articles; this project aims to extend that capability to sessions of multiple related edits. Being able to predict session quality paves the way for potential future tools, such as automatically detecting promising new editors or edit wars on pages. This idea is not new: since 2014 Snuggle has been trying to detect new editors that may have been bitten by vandal-fighters, but its infrastructure relies on pre-ORES technology and is not easily generalizable. Continuing that stream of work with ORES, we could start predicting labels for collections of edits of all kinds.

HostBot Integration
Since about 2016 en:User:HostBot has been working in tandem with the English Wikipedia TeaHouse to do the repetitive work of inviting newly registered editors. To keep the number of invitees manageable for TeaHouse hosts, it limits itself to inviting just 300 users among the approximately 2000 who qualify every day.

New page potential
Another possible use for the technology being developed is to classify collections of edits that all relate to a new page rather than a new user. In this way we could aid article creation and article deletion processes by classifying new pages as damaging or goodfaith.

Proposed Initial Experiment: TeaHouse invites
In order to test this new technology, we propose an A/B test between the current iteration of HostBot and an AI-powered prototype (HostBot2) to determine if this project can help retain newcomers better.

How does HostBot currently work?
Every day, HostBot searches for users who meet the following criteria:


 * 1) User registered within the last 2 days
 * 2) User made more than 5 edits
 * 3) User is not blocked

It then randomly selects 300 of the users meeting those criteria and invites them to the TeaHouse.
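
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the three criteria and the random draw, not HostBot's actual code: the `Newcomer` record, the function names, and the parameter names are all hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Newcomer:
    # Hypothetical record; the real bot reads this from the wiki database.
    name: str
    registered: datetime
    edit_count: int
    blocked: bool


def qualifies(user, now, max_age=timedelta(days=2), min_edits=5):
    """Apply the three criteria listed above to a single user."""
    return (
        now - user.registered <= max_age  # 1) registered within the last 2 days
        and user.edit_count > min_edits   # 2) made more than 5 edits
        and not user.blocked              # 3) is not blocked
    )


def select_invitees(users, now, limit=300, rng=None):
    """Randomly pick up to `limit` of the qualifying users."""
    rng = rng or random.Random()
    candidates = [u for u in users if qualifies(u, now)]
    return rng.sample(candidates, min(limit, len(candidates)))
```

Passing a seeded `random.Random` as `rng` makes the draw reproducible, which is convenient when comparing invite lists.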

How would HostBot2 work?
HostBot2 would perform the same operation as HostBot, inviting users to the TeaHouse, but it would use AI to prioritize the editors it invites rather than selecting them randomly.

The AI would prioritize editors by their predicted goodfaith-ness. (Note: it would alternatively be possible to prioritize the non-damaging-ness of editors, but we believe that the goodfaith measure is more in line with the TeaHouse's values.)

Another shortcoming of HostBot is that it operates more slowly, checking daily to see if users have crossed an edit threshold. HostBot2 could operate more quickly, making predictions for any user after their first or second edit, in close to real time.

How different would these two methods even be?
The AI-powered prototype would still respect the invite limit of 300 users per day, but it would select the 300 users it had the highest confidence were goodfaith. For instance, if a new user makes 5 edits on a page that is vaguely promotional or vain, they might not be blocked and HostBot might invite them. HostBot2, on the other hand, would predict them to be only moderately goodfaith and prefer less promotional editors in their place. As about 2,000 users register on English Wikipedia every day, this prioritization could make a big difference. See an example of the differences in the invite lists here.
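
As a rough sketch of the difference, the prioritized selection amounts to sorting candidates by predicted goodfaith probability and keeping the top 300. The function name and the input format (a mapping from user name to score) are assumptions made for this illustration, and the scores in the usage note are invented.

```python
def prioritize_invitees(goodfaith_by_user, limit=300):
    """Return up to `limit` user names, highest predicted goodfaith first.

    `goodfaith_by_user` maps a user name to the model's predicted
    probability that the user is editing in good faith (hypothetical input).
    """
    ranked = sorted(goodfaith_by_user.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:limit]]
```

With a limit of 2 and the scores `{"promo": 0.55, "a": 0.9, "b": 0.8}`, the moderately goodfaith `promo` account is left out in favour of `a` and `b`, whereas a random draw could have picked it.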

What are the exact parameters of the experiment?

 * 1) The test conducted would be an A/B test between HostBot (A) and HostBot2 (B).
 * 2) The exact statistical test and retention metric have not been finalized (we welcome your input). We would like to reuse as many of the statistical measures from previous papers on the TeaHouse as possible so that our results are comparable.
 * 3) The experiment would run for about a month.
 * 4) The experiment randomization would come from randomizing which days each version of HostBot was running.
 * 5) The experiment would be "blind", meaning that the hosts and invitees would not know which method was being used.
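
The day-level randomization in point 4 could look something like the sketch below. It assumes a 30-day run split evenly between the two bots, which matches the 15 days used in the risk estimate but has not been finalized; the function name and the seed parameter are hypothetical.

```python
import random
from datetime import date, timedelta


def assign_days(start, n_days=30, seed=None):
    """Map each experiment day to 'HostBot' (A) or 'HostBot2' (B).

    Assumption: a balanced design, so each arm runs on half of the days.
    """
    arms = ["HostBot"] * (n_days // 2) + ["HostBot2"] * (n_days - n_days // 2)
    random.Random(seed).shuffle(arms)  # random order, fixed arm sizes
    return {start + timedelta(days=i): arm for i, arm in enumerate(arms)}
```

Shuffling a fixed list of arm labels, rather than flipping a coin each day, guarantees that both versions run for the same number of days.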

What are the risks?
The main risk of conducting this experiment is that HostBot2 will not be as effective at inviting high-quality users as HostBot, and as such we will fail to invite the more deserving editors to the TeaHouse. In the very worst-case scenario, assuming HostBot2 runs on about half of the days of the month-long experiment, 300 users/day * 15 days = 4,500 users would not get the TeaHouse invites they would otherwise have received. This calculation is an upper bound on the risk, as it is likely that some of the users HostBot2 invites will be the same as those HostBot would invite.

Does committing to the experiment mean committing to switching to HostBot2?
No. For now we only want to run the experiment. If it is successful, a second community consultation could be held about instating the bot. If the experiment is unsuccessful, it would be a learning experience for the developers of the technology.

Who is running this experiment exactly?
I, Maximilianklein, would be the principal researcher and software engineer, in my capacity as a contractor for the Wikimedia Scoring Platform team. Experiment design consultation also comes from user:JTmorgan, author of HostBot.

Labelling Campaigns
Any machine learning project requires ground-truth labels. The labelling campaigns are:


 * enwiki campaign w:en:Wikipedia:Labels/Newcomer_session_quality

Definitions

 * A session of edits is all the edits which were made within one hour of each other; see the pioneering paper for motivation and validation. We chose to use the session, and not the user's history as a whole, in order to closely model the small dynamics that may be related to what a user does when they sit down for an edit session.
 * The goodfaith attribute is easily defined for a single edit, but what about for a collection of edits? We chose to use the definition that a session is goodfaith if and only if all the edits of the session appear to a human judge to be in goodfaith.
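
Both definitions are simple enough to sketch in code. The version below reads "within one hour of each other" as a gap of at most one hour between consecutive edits, which is an assumption about the intended rule; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def split_into_sessions(timestamps, cutoff=timedelta(hours=1)):
    """Group edit timestamps into sessions.

    Assumption: a new session starts whenever the gap since the
    previous edit is longer than `cutoff`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)  # continues the current session
        else:
            sessions.append([ts])    # gap too long: start a new session
    return sessions


def session_is_goodfaith(edit_labels):
    """A session is goodfaith if and only if every edit was judged goodfaith."""
    return all(edit_labels)
```

Under this reading, edits at 10:00, 10:30 and 11:20 form one session even though the first and last are more than an hour apart, because each consecutive gap is under an hour.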

Feature lists
The basic features used so far in predicting a session are:

 * basic statistics of the underlying goodfaith scores of the session's edits
 * slope and intercept of a line drawn through the goodfaith scores of the session's edits as a time-vector
 * temporal statistics of the inter-edit times of the session's edits
 * the self-reverts of the user
 * whether the user appears to be in an edit war
 * statistics about the namespaces used in the session's edits
 * see more in the ipynb.
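
To make the first three kinds of feature concrete, here is a minimal sketch that computes them for one session. It is not the project's feature code (see the ipynb for that): the feature names are invented, the line is fitted against edit order rather than wall-clock time, and timestamps are assumed to be in seconds.

```python
from statistics import mean, pstdev


def fit_line(ys):
    """Least-squares slope and intercept of ys against edit index 0..n-1."""
    n = len(ys)
    if n < 2:
        return 0.0, (ys[0] if ys else 0.0)  # a single point has no trend
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def session_features(goodfaith_scores, timestamps_s):
    """Build a feature dict for one session with at least one edit."""
    gaps = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    slope, intercept = fit_line(goodfaith_scores)
    return {
        "gf_mean": mean(goodfaith_scores),
        "gf_min": min(goodfaith_scores),
        "gf_max": max(goodfaith_scores),
        "gf_std": pstdev(goodfaith_scores),
        "gf_slope": slope,          # negative: scores worsen during the session
        "gf_intercept": intercept,
        "gap_mean": mean(gaps) if gaps else 0.0,  # inter-edit time statistics
        "gap_max": max(gaps) if gaps else 0.0,
    }
```

A negative slope indicates that the per-edit goodfaith scores decline over the course of the session, which a plain mean would not reveal.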

The model

 * The metric chosen for scoring models was precision at k=300, which mirrors the domain's problem of choosing the best 300 editors to invite each day.
 * Both gradient boosting and logistic regression appear to achieve approximately 95% precision in the top 300 most confident predictions. The prior for the goodfaith class is approximately 80% (most newcomer sessions are goodfaith). See more in the ipynb.
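
For readers unfamiliar with the metric, precision at k is the fraction of truly goodfaith sessions among the k predictions the model is most confident about. A minimal sketch follows; the function name and argument names are our own, not those used in the ipynb.

```python
def precision_at_k(y_true, scores, k=300):
    """Fraction of positives among the k highest-scored items.

    y_true: 1 for a goodfaith session, 0 otherwise.
    scores: the model's predicted goodfaith probability for each session.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)
```

Because the prior for the goodfaith class is about 80%, picking 300 sessions at random would be expected to score roughly 0.80 on this metric, so that is the baseline the reported 95% should be compared against.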

Interface

 * The classifier will be available as an open-source Python package at . This allows users to use a pre-trained model and make local predictions about new users.
 * The pretrained model is enwiki-only at first (this was so that the developers could also contribute to the labelling). We are eager to expand to more languages with volunteer labelling help.
 * Later on the plan would be to release the classifier as an API as part of the ORES platform.
 * The classifier is trained to give predictions on sessions. In order to get predictions for a user, a user id and maximum timestamp may be supplied, and heuristics can be used to aggregate the session predictions. At first, these heuristics are the mean prediction score and majority voting.
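
The two aggregation heuristics can be illustrated as follows. This is a sketch rather than the package's interface: the function name, the input format (pairs of session timestamp and predicted goodfaith probability), the 0.5 voting threshold, and the tie-breaking rule are all assumptions.

```python
from statistics import mean


def aggregate_user(sessions, max_timestamp=None, threshold=0.5):
    """Combine session-level predictions into one user-level prediction.

    sessions: list of (session_timestamp, goodfaith_probability) pairs.
    Sessions after `max_timestamp` are ignored when it is given.
    """
    probs = [p for ts, p in sessions if max_timestamp is None or ts <= max_timestamp]
    if not probs:
        return None  # no sessions to judge the user by
    votes = sum(p >= threshold for p in probs)
    return {
        "mean_score": mean(probs),                    # heuristic 1: mean score
        "majority_goodfaith": votes * 2 > len(probs), # heuristic 2: strict majority
    }
```

Note that the two heuristics can disagree: sessions scored 0.9 and 0.2 give a mean of 0.55 but no strict majority, so in this sketch a tie counts as not goodfaith.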

Future Directions and Open Questions

 * Change user labels from the (2) classes, damaging and goodfaith, to a more fine-grained taxonomy of (4) classes, adding vandal and golden; see taxonomy.
 * A well-researched method to aggregate session predictions into single user predictions, rather than the current heuristics.
 * Are damaging-but-goodfaith editors more easily detectable in sessions? An initial investigation.