ORES/Newcomerquality

Overview
The newcomer quality project aims to use machine learning to predict, within their first few edits, how damaging and goodfaith new editors to Wikipedia projects are. The technology builds on the ORES platform, aggregating its lower-level predictions about edits into higher-level predictions about edit sessions, and finally into predictions about users.

Community Motivation
Newcomer retention is one of the largest problems facing Wikipedias today. One approach that has found success is newcomer welcoming and mentoring programs such as the en:Wikipedia:Teahouse (or fr:WP:Forum des Nouveaux). However, getting new editors to those forums usually involves either a) inviting all newcomers, which risks overwhelming mentors and inviting vandals, or b) inviting a subset of newcomers based on heuristics, which could miss some good editors. Artificial intelligence or machine learning could potentially bridge the gap by inviting only the best newcomers, without humans having to sort through the hundreds or thousands of newly registered editors each day.

Technical Motivation
ORES can already predict the quality of single edits and articles; this project aims to extend that capability to sessions of multiple related edits. Being able to predict session quality paves the way for potential future tools, such as automatically detecting promising new editors or edit wars on pages. This idea is not new: since 2014, Snuggle has been trying to detect new editors who may have been bitten by vandal-fighters, but its infrastructure relies on pre-ORES technology and is not easily generalizable. Continuing that stream of work with ORES, we could start predicting labels for collections of edits of all kinds.

HostBot Integration
Since about 2016, en:User:HostBot has been working in tandem with the English Wikipedia TeaHouse to do the repetitive work of inviting newly registered editors. To keep the number of invitees manageable for TeaHouse hosts, it limits itself to inviting just 300 users among the approximately 2000 who qualify every day.

New page potential
Another possible use for the technology being developed is to classify collections of edits that all relate to a new page rather than a new user. In this way we could aid article creation and article deletion processes by classifying new pages as damaging or goodfaith.

Proposed Initial Experiment: TeaHouse invites
In order to test this new technology, we propose an A/B test between the current iteration of HostBot and an AI-powered prototype (HostBot2) to determine if this project can help retain newcomers better.

Please see the full experiment details on meta at: meta:Research:ORES-powered_TeaHouse_Invites

Labelling Campaigns
Any machine learning project requires ground-truth labels. The labelling campaigns are:


 * enwiki campaign w:en:Wikipedia:Labels/Newcomer_session_quality

Definitions

 * A session of edits is the set of all edits made within one hour of each other; see the pioneering paper for motivation and validation. We chose to use the session, and not the user's history as a whole, in order to closely model the small-scale dynamics of what a user does when they sit down for an edit session.
 * The goodfaith attribute is easily defined for a single edit, but what about a collection of edits? We chose the definition that a session is goodfaith if and only if every edit in the session appears to a human judge to be in good faith.
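
The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names are hypothetical, and treating a gap of exactly one hour as still belonging to the same session is an assumption the definition does not settle.

```python
from datetime import datetime, timedelta

# Assumption: a gap of exactly one hour still counts as the same session.
SESSION_GAP = timedelta(hours=1)


def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group edit timestamps into sessions (hypothetical helper).

    A new session starts whenever the time since the previous edit
    exceeds `gap`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # close enough to the previous edit
        else:
            sessions.append([ts])    # gap too large: open a new session
    return sessions


def session_is_goodfaith(edit_labels):
    """A session is goodfaith iff every edit was judged goodfaith."""
    return all(edit_labels)


edits = [
    datetime(2018, 5, 1, 10, 0),
    datetime(2018, 5, 1, 10, 30),
    datetime(2018, 5, 1, 11, 20),
    datetime(2018, 5, 1, 13, 0),
]
sessions = split_into_sessions(edits)
```

In this example the first three edits form one session, because each is within an hour of the one before it, and the 13:00 edit starts a second session.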

Feature lists
The basic features that are used in predicting a session so far are:


 * basic statistics of the underlying goodfaith scores of the session's edits
 * slope and intercept of a line fitted through the goodfaith scores of the session's edits, taken as a time-ordered vector
 * temporal statistics of the inter-edit times of the session's edits
 * the self-reverts of the user
 * whether the user appears to be in an edit war
 * statistics about the namespaces used in the session's edits
 * see more in the ipynb.
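
As a rough illustration of the first three kinds of features listed above, the sketch below derives them from per-edit goodfaith scores and edit times. The feature names and the choice of seconds since the first edit as the x-axis for the fitted line are assumptions made for the example; the features actually used are the ones in the ipynb. Self-reverts, edit-war detection and namespace statistics are left out.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs.

    Returns a flat line through the mean when xs have no spread
    (for example, a single-edit session).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, y_bar
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def session_features(edit_times, goodfaith_scores):
    """Build a feature dict for one session (hypothetical feature names).

    edit_times: seconds since the first edit of the session, ascending.
    goodfaith_scores: per-edit goodfaith probabilities, same order.
    """
    gaps = [b - a for a, b in zip(edit_times, edit_times[1:])]
    slope, intercept = fit_line(edit_times, goodfaith_scores)
    return {
        "gf_mean": mean(goodfaith_scores),
        "gf_min": min(goodfaith_scores),
        "gf_max": max(goodfaith_scores),
        "gf_std": pstdev(goodfaith_scores),
        "gf_slope": slope,
        "gf_intercept": intercept,
        "gap_mean": mean(gaps) if gaps else 0.0,
        "gap_max": max(gaps) if gaps else 0.0,
        "n_edits": len(edit_times),
    }


features = session_features([0, 60, 120], [0.5, 0.7, 0.9])
```

A positive slope, as in this example, would indicate that the edits look more goodfaith as the session goes on.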

The model

 * The metric chosen for scoring models was precision at k=300, which mirrors the domain problem of choosing the best 300 editors to invite each day.
 * Both gradient boosting and logistic regression appear to achieve approximately 95% precision in the top 300 most confident predictions. The prior for the goodfaith class is approximately 74% (most newcomer sessions are goodfaith). See more in the ipynb.
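
For readers unfamiliar with the metric, precision at k can be computed as in the minimal sketch below. The function name and the toy data are illustrative only and are not taken from the project's evaluation code.

```python
def precision_at_k(scores, labels, k=300):
    """Fraction of truly goodfaith items among the k highest-scored ones.

    scores: model confidence that each session (or user) is goodfaith.
    labels: ground-truth booleans in the same order as scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for _, label in top if label) / len(top)


# Toy data: four candidates, of which we can "invite" only the top two.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [True, False, True, True]
p_at_2 = precision_at_k(toy_scores, toy_labels, k=2)
```

Because only the top of the ranking is scored, a model can do well on this metric even if it is poorly calibrated for the remaining candidates, which matches the use case of a fixed daily invitation budget.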

Interface

 * The classifier will be available as an open-source Python package at . This allows users to use a pre-trained model and make local predictions about new users.
 * The pretrained model is enwiki-only at first (this was so that the developers could also contribute to the labelling). We are eager to expand to more languages with volunteer labelling help.
 * Later on the plan would be to release the classifier as an API as part of the ORES platform.
 * The classifier is trained to give predictions on sessions. To get a prediction for a user, a user id and a maximum timestamp may be supplied, and heuristics are used to aggregate the session predictions. At first, these heuristics are the mean prediction score and majority voting.
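
The two aggregation heuristics mentioned above might look like the following sketch. The function names and the 0.5 decision threshold are assumptions made for illustration; they are not the package's actual interface.

```python
def aggregate_by_mean(session_scores, threshold=0.5):
    """Average the per-session goodfaith scores, then threshold the mean.

    Returns the mean score and the resulting user-level decision.
    The 0.5 threshold is an assumption, not a project setting.
    """
    avg = sum(session_scores) / len(session_scores)
    return avg, avg >= threshold


def aggregate_by_majority(session_scores, threshold=0.5):
    """Threshold each session first, then take a strict majority vote."""
    votes = sum(1 for score in session_scores if score >= threshold)
    return votes * 2 > len(session_scores)


# Hypothetical scores for one user's three sessions.
user_sessions = [0.9, 0.4, 0.45]
mean_score, mean_decision = aggregate_by_mean(user_sessions)
majority_decision = aggregate_by_majority(user_sessions)
```

The example is chosen so that the two heuristics disagree: one very confident session pulls the mean above the threshold, while the majority vote sides with the two sessions that fall just below it.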

Future Directions and Open Questions

 * Change user labels from two classes (damaging and goodfaith) to a more fine-grained taxonomy of four, adding vandal and golden; see taxonomy.
 * A well-researched method to aggregate session predictions into a single user prediction, rather than the current heuristics.
 * Are damaging-but-goodfaith editors more easily detectable in sessions? An initial investigation.