Machine Learning/Modernization

An introduction to machine learning
Imagine we want to identify vandalism on Wikipedia in real-time. To accomplish this, we would need to examine every new edit to Wikipedia and decide whether it was vandalism. There are three approaches we can use:

First, enlist the help of some amazing volunteers. Humans are excellent at detecting vandalism. However, the volume and rate of new edits on Wikipedia mean that many thousands of human hours would be required to review each edit and identify vandalism. That is, a human-based approach is highly effective at identifying vandalism but runs into difficulty with the volume and rate of incoming Wikipedia edits.



Second, create a set of rules. A lot of vandalism is obvious, such as an edit containing a string of rude words. Because of this, we could create rules and label any edit that violates them as vandalism. For example, we could create a rule that labels as vandalism any edit containing a word from a pre-set list of rude words. This rule can be easily applied to all incoming Wikipedia edits, but it will miss vandalism that doesn't contain rude words, or that contains rude words not on the list. A rule-based approach can scale to the volume and rate of incoming Wikipedia edits but is not very effective.
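A rule like this can be sketched in a few lines of Python. The word list and example edits are invented for illustration:

```python
# A minimal rule-based vandalism check: flag any edit whose text
# contains a word from a pre-set list of rude words.
RUDE_WORDS = {"stupid", "idiot"}  # tiny placeholder list for illustration

def is_vandalism_by_rule(edit_text: str) -> bool:
    # Split the edit text into lowercase words and check each against the list.
    return any(word in RUDE_WORDS for word in edit_text.lower().split())

print(is_vandalism_by_rule("You are an idiot"))          # flagged as vandalism
print(is_vandalism_by_rule("Added a missing citation"))  # not flagged
```

Note how brittle this is: any rude word missing from the list, or any vandalism without rude words, slips through untouched.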



Third, we can use machine learning to combine the effectiveness of volunteers in detecting vandalism with the scalability of the rule-based approach. In the human-based approach, volunteers labeled edits as vandalism or not. In the rule-based approach, we applied a pre-set rule to incoming Wikipedia edits. In machine learning, we take a different approach: give an algorithm a set of edits and their human-created labels (i.e., vandalism or not) and have it “learn” what set of rules produces labels similar to those created by humans. The resulting trained model can then be applied to all incoming Wikipedia edits. This approach can scale like the rule-based approach while being approximately as good at detecting vandalism as the human-based approach.



In reality, Wikipedia uses all three approaches – volunteer editors, rule-based bots, and machine learning – to detect vandalism. While different, these approaches complement each other, each one mitigating the weaknesses of the others.

What is a model?
In machine learning, training algorithms use data to create models. While a training algorithm takes in data and outputs a model, a model takes in data (e.g., Wikipedia edits) and outputs a prediction (e.g., vandalism or not). There are many types of training algorithms, each creating models with different strengths and weaknesses. For example, a decision tree algorithm creates models that are highly explainable but are often not as high-performing (e.g., not as accurate) as other types of models.
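The split between training algorithm and model can be made concrete with a toy sketch: a hypothetical “training algorithm” that learns a single edit-count threshold from labeled edits, and the resulting “model” that predicts labels for new edits. All names and data here are invented for illustration:

```python
# Toy illustration of the training-algorithm/model split: the algorithm
# consumes labeled data and outputs a model (here, a learned threshold);
# the model consumes new data and outputs predictions.

def train_threshold_model(examples):
    """Pick the editor-edit-count threshold that best separates
    vandalism from non-vandalism in the labeled examples."""
    best_threshold, best_correct = 0, -1
    for threshold in range(0, 1001, 50):
        correct = sum(
            (edit_count < threshold) == is_vandalism
            for edit_count, is_vandalism in examples
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    # The "model" is just a function closing over the learned threshold.
    return lambda edit_count: edit_count < best_threshold

# Invented labeled data: (editor's edit count, labeled as vandalism?)
labeled_edits = [(3, True), (10, True), (800, False), (1200, False)]
model = train_threshold_model(labeled_edits)
print(model(5))     # brand-new editor's edit: predicted vandalism
print(model(5000))  # veteran editor's edit: predicted not vandalism
```

Real training algorithms learn far richer structures than one threshold, but the shape is the same: data goes in, a reusable predictor comes out.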

Models come in many forms, depending on the training algorithm used and the settings employed during training. A large model like Google’s Pathways Language Model is massive, containing 540 billion parameters (i.e., settings) that interact in complex ways to create a prediction. Other models are much simpler. Continuing our vandalism example, imagine we had used a decision tree algorithm to create a decision tree model. What exactly might that model look like?

As their name implies, decision tree training algorithms create decision trees. Here is a simple, fictional example of a decision-tree model that a training algorithm might create:



This decision-tree model was learned from Wikipedia edits and their human-created labels (i.e., vandalism or not): the training algorithm found that this tree produces labels very similar to those a human would make. Specifically, if an edit contains a rude word, it is likely vandalism. If it does not contain a rude word but the editor has fewer than 500 edits, it is likely vandalism. If the editor has more than 500 edits, the edit is likely vandalism when the article is about politics, and likely not vandalism otherwise.

This model can then be applied to every new edit to Wikipedia, for each edit outputting a predicted label of vandalism or not.
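The fictional decision tree described above could be written out as a simple function. The branches and the 500-edit cutoff come from the description; the function and field names are invented:

```python
# The fictional decision-tree model from the text, written as code.
# Each branch mirrors one node of the tree described above.

def predict_vandalism(edit: dict) -> bool:
    if edit["contains_rude_word"]:
        return True                      # rude word -> likely vandalism
    if edit["editor_edit_count"] < 500:
        return True                      # new editor -> likely vandalism
    return edit["is_politics_article"]   # politics article -> likely vandalism

edit = {
    "contains_rude_word": False,
    "editor_edit_count": 2000,
    "is_politics_article": False,
}
print(predict_vandalism(edit))  # prints False: likely not vandalism
```

A real decision-tree model is the same idea at scale: a cascade of learned if/else tests over features of the edit.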

An introduction to machine learning operations
Imagine we train the anti-vandalism model example from above. How do we deploy it into production? How do we quickly make improvements at the request of communities? How do we track how well it is performing? This is the area of machine learning operations (MLOps). MLOps is a specialty covering machine learning, developer operations (DevOps), and data engineering that focuses on managing the lifecycles of machine learning models.

When you have one professional relationship (i.e., user, customer, etc.), it is easy to keep track of their contact information and remember your previous conversations. However, as the number of relationships grows into the hundreds or even millions, managing all that contact information requires specialized tools, such as customer relationship management (CRM) systems like Salesforce. Similarly, when you have one machine learning model in production, it is easy to manage its lifecycle. However, as the number of machine learning models grows, managing all the different models under development, in production, and deprecated becomes difficult. Just as CRM systems manage customer relationships, MLOps systems manage an organization's machine learning models.

Two years ago, we selected Kubeflow, a popular open source Kubernetes-based MLOps system, as our in-house infrastructure for developing, deploying, and managing the lifecycles of Wikimedia’s machine learning models. This infrastructure was split into two parts (both running Kubeflow): Lift Wing, for managing the deployment of models in production, and Train Wing, for managing the development and maintenance of models.

Machine learning on Wikipedia
Machine learning has been used on Wikipedia for over a decade. The first uses of machine learning, and some of the most popular to this day, are bots created by the community to streamline various tasks. These bots are owned and maintained by the community, although many are hosted on Wikimedia’s Toolforge platform.

One of the most notable bots on Wikipedia is ClueBot NG, which has been active since at least 2012. ClueBot NG detects vandalism on Wikipedia using a machine learning model. The model is created using human-labeled training data. That is, volunteers use an interface to manually label an edit as vandalism or not. A training algorithm then uses that data to create the model, which then identifies new edits suspected of vandalism and reverts them.

In addition to community bots, Wikimedia itself uses, creates, deploys, and maintains hundreds of machine learning models serving various roles, from anti-vandalism to improving editor experiences.

One notable difference between how the Foundation and community volunteers use machine learning models is their integration into bots. As a general practice, Wikimedia does not use models to directly edit the contents of Wikipedia; instead, we use them to inform the actions of volunteer contributors. This is because it is the volunteer community, not Wikimedia, that stewards the contents of Wikipedia, and therefore, as a general practice, changes to content should be made by the individual communities. For example, ClueBot NG (discussed above) will revert some edits identified as vandalism, whereas Wikimedia’s damaging models feature (discussed below) is limited to highlighting the edit to volunteer editors. We will see more examples of this norm below.

Machine learning at Wikimedia
Wikimedia both builds and hosts its own machine learning models and uses third-party model APIs (e.g. Google Translate).

ORES
Since 2015, Wikimedia has built and hosted machine learning models that detect vandalizing edits on Wikipedia. These models and their supporting infrastructure grew into what is today known as ORES. Over the years, ORES has been discussed at length in news articles, reports, and academic research.

The ORES models are used in a variety of places both by Wikipedia and the community. For example, English Wikipedia’s Recent changes page flags edits predicted to be vandalism. In addition, Huggle, a community-built Wikipedia editing browser that helps editors detect vandalism, incorporates ORES-generated vandalism predictions into the user interface.

ORES models
Today ORES hosts about 110 machine learning models. These models are trained using a purpose-built open source model training library (RevScoring). ORES maintains four types of models: article quality, edit quality, item quality, and article topic. Most models are language-specific; for example, the Dutch article quality model is trained on data from the Dutch Wikipedia community and applied only to Dutch Wikipedia articles.

Article quality models predict the quality of Wikipedia articles, whether using the English Wikipedia’s 1.0 article rating scale or a community’s own rating scale. These ratings help editors review new article submissions and find existing articles that are opportunities for improvement. There are two different kinds of article quality models. First, draft quality models predict whether newly created articles should be flagged for deletion because they are identified as spam, vandalism, etc. Second, article quality models predict where an article will fall on an existing article rating scale. There are currently three draft quality and 18 article quality language-specific models available on ORES.

Edit quality models predict vandalizing or otherwise bad edits to existing articles. They are designed to help editors identify edits that need special or rapid attention. There are three different kinds of language-specific models: reverted, good faith, and damaging. Reverted edit quality models predict whether an edit will eventually be reverted by a subsequent human edit (whether due to vandalism or something else). Good faith edit quality models predict whether an edit was made in good faith (rather than with the intent of causing harm). These models are created using hand-labeled data from volunteers; there are 12 language-specific good faith edit quality models in production. Damaging edit quality models predict whether an edit is damaging to Wikipedia, which is useful both for human editors patrolling Wikipedia and for quality control bots. Like the good faith models, damaging edit quality models are trained using data labeled by human volunteers; there are 33 language-specific damaging edit quality models in production.

The item quality model is similar to article quality models but applied to Wikidata items. There is a single item quality model in use on Wikidata.

Article topic models predict where an article is best classified within a purpose-built topic taxonomy. These models are useful for human editors to curate new articles and identify gaps in articles on their topics. There are two different kinds of article topic models. Draft topic models predict the topic of a new article draft while article topic models predict the topic of existing articles. There is one draft topic model and five article topic models available for use.

ORES infrastructure
ORES uses a model hosting infrastructure built for the four types of models listed above. The infrastructure’s hardware includes 18 servers across two data centers. Predictions are served through a RESTful API that requires no authentication.
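A request to the ORES scoring API can be sketched as follows. The URL shape follows ORES’s public v3 scores endpoint; the context, revision ID, and the abridged response fields are illustrative examples, and the live response contains additional fields:

```python
# Sketch of querying the ORES scoring API (no authentication needed).
from urllib.parse import urlencode

def ores_score_url(context: str, model: str, rev_id: int) -> str:
    """Build a scores request for one model and one revision."""
    query = urlencode({"models": model, "revids": rev_id})
    return f"https://ores.wikimedia.org/v3/scores/{context}/?{query}"

url = ores_score_url("enwiki", "damaging", 34854345)
print(url)

# An abridged response nests the prediction under context -> scores:
sample_response = {
    "enwiki": {
        "scores": {
            "34854345": {
                "damaging": {
                    "score": {
                        "prediction": False,
                        "probability": {"false": 0.94, "true": 0.06},
                    }
                }
            }
        }
    }
}
score = sample_response["enwiki"]["scores"]["34854345"]["damaging"]["score"]
print(score["prediction"])  # the model predicts this edit is not damaging
```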

Most models on ORES take about one second to compute a prediction from the given inputs and return a response to an API query. Since this is too slow for many use cases, a pre-cache system is implemented within ORES. The pre-cache system runs models against new Wikipedia edits and caches the results, allowing for faster API response times.
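The pre-cache idea can be sketched simply: score each new edit as it arrives and store the result, so most API requests become dictionary lookups rather than fresh one-second model runs. The scoring function here is a hypothetical stand-in:

```python
# Sketch of a pre-cache: score edits as they arrive, serve cached results.
prediction_cache: dict[int, bool] = {}

def slow_model(rev_id: int) -> bool:
    """Stand-in for a model that takes ~1 second per prediction."""
    return rev_id % 2 == 0  # arbitrary placeholder logic

def on_new_edit(rev_id: int) -> None:
    # Called from the stream of incoming edits: score ahead of demand.
    prediction_cache[rev_id] = slow_model(rev_id)

def api_get_score(rev_id: int) -> bool:
    # Fast path: most requests hit the cache; fall back to the model.
    if rev_id not in prediction_cache:
        prediction_cache[rev_id] = slow_model(rev_id)
    return prediction_cache[rev_id]

on_new_edit(101)           # pre-computed when the edit arrived
print(api_get_score(101))  # served from the cache, not recomputed
```

The trade-off is spending compute on edits nobody ever queries in exchange for fast responses on the ones they do.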

The ORES infrastructure also includes a MediaWiki extension that integrates ORES predictions into Wikipedia’s RecentChanges page.

For human labeling of data, a purpose-built collaborative labeling tool (Wikilabels) is used. This tool is used to create the language-specific labels used for training article quality and edit quality models.

Other Wikimedia models
In addition to machine learning models hosted on ORES, Wikimedia also creates and hosts a variety of models elsewhere in the Wikimedia technical infrastructure. For example, the Machine Learning and Research teams built and maintain hundreds of language-specific models as part of the Add-A-Link structured task project. The Add-A-Link structured task uses machine learning to recommend potential edits (specifically, adding a link to existing text that will point to another article) to new editors to allow them to achieve some high-quality edits quickly and easily. Another example is the content translation recommendation tool, which uses a Wikimedia-created machine learning model to recommend articles for translation to editors.

Third-party models
Wikimedia also uses a number of third-party models. These models are not hosted by Wikimedia but rather are accessed using an API. These models include:


 * TurnItIn copyright violation detection API for detecting copyright infringement of new content.
 * Google Translate and Yandex Translate APIs in the content translation tool to help human editors translate articles.

Machine learning modernization efforts
The ORES infrastructure was a major step forward for Wikimedia’s machine learning infrastructure and has served its purpose well for many years. However, as the breadth and scale of machine learning applications increased, a modernization effort became necessary. We are working on a multi-year effort to modernize Wikimedia’s machine learning systems and processes, building new systems that allow for the rapid, safe development and deployment of a wide variety of models.

Guiding principles

 * Scalable - One of the lessons we have learned from ORES is that any MLOps system maintained by Wikimedia must be quickly scalable. The machine learning needs of various teams at Wikimedia, users, and the community vary widely over time and the infrastructure must be able to adapt without significant redesign.
 * Agile - Adoption of modern software engineering best practices means moving away from the waterfall model that has been common practice in the past in model development. Instead, both process workflows and technical systems need to be designed to allow rapid iteration on models as they are developed. This means model creators (whether Wikimedia staff or volunteer technical contributors) can quickly see changes reflected in production, evaluate them, and use that information to improve the models.
 * Collaboration - Machine learning at Wikimedia operates in the open – our work is highly visible on-wiki and in our public source code repository, even when it is unfinished. Leaning into this superpower means creating social-technical systems that allow quick collaborations between staff and technical contributors.
 * Public Best Practice Example Of Ethical ML - Wikimedia’s open model of development and principles as a community organization means it is in a position of not only adopting ethical machine learning approaches but also publicly advocating for them and demonstrating their effectiveness. The goal is for the public to learn from our efforts around ethical machine learning, both successes and failures.
 * Human In The Loop - Wikipedia is, at the end of the day, a human-centric project. This means that wherever machine learning models hosted by Wikimedia are used on the wikis, their use should be clear and unambiguous to human users, with an easily accessed path for human feedback.
 * Accessibility - As a collaborative community project, machine learning at Wikimedia should move beyond open source as an end goal and instead focus on accessibility. While open source remains a major principle, a focus on accessibility means making it easy for community members and technical contributors with different levels of skill and expertise in machine learning to participate in model development and governance.
 * Community Governance - Wikimedia supports the free knowledge movement, and in that spirit, communities must have control over machine learning models that affect them.

Benefits of modernization

 * Centralize model lifecycle management - Currently, machine learning models are hosted in different areas around the Wikimedia technical infrastructure. This dispersion makes the rapid, effective management of the lifecycle of machine learning models (i.e. development, iteration, deprecation) time consuming and haphazard. The goal is to centralize the management of machine learning models hosted by Wikimedia into a single infrastructure, allowing for fast, responsive iterations regardless of the model.
 * Rapid automated deployment and iteration - Model deployments at Wikimedia are too often slow and haphazard. Modern MLOps approaches and systems will make changing a model quick and safe, allowing for fast iterative development and responsiveness to user needs.
 * Regulatory compliance - With machine learning models increasingly under scrutiny by governments and regulators, a modern MLOps system will better situate Wikimedia to evaluate the impact of regulatory changes and respond quickly.
 * Scalability - Machine learning use at Wikimedia has grown rapidly over the years. Our modernization work is building the infrastructure to adapt to rapid scaling changes quickly, without significant system redesigns or software changes.
 * Wide range of use cases - ORES focused exclusively on hosting tree-based machine learning models. While powerful, there is a growing need to support a broader range of machine learning approaches and libraries, including GPU-powered computer vision.
 * Greater collaboration - Engaging with machine learning projects hosted by Wikimedia should be easy, well-documented, and fun. This means building in open, accessible ways that invite timely engagements from the community and that are quickly reflected in production. The question we are focused on is: how long does it take a technical contributor interested in machine learning to meaningfully contribute?
 * Community governance - Communities need simple, clear, publicized pathways to govern the machine learning models that affect them. The modernization effort is focused on building those pathways.