Machine Learning/Modernization

An introduction to machine learning
Imagine we want to identify vandalism on Wikipedia in real-time. To accomplish this, we would need to examine every new edit to Wikipedia and decide whether it was vandalism. There are three approaches we can use:

First, enlist the help of some amazing volunteers. Humans are excellent at detecting vandalism. However, the volume and rate of new edits on Wikipedia means that many thousands of human hours would be required to review each edit and identify vandalism. That is, a human-based approach is highly effective at identifying vandalism but runs into difficulty due to the volume and rate of incoming Wikipedia edits.



Second, create a set of rules. A lot of vandalism is obvious, such as an edit containing a string of rude words. Because of this, we could create rules that label any edit matching them as vandalism. For example, we could create a rule that labels any edit containing a word from some pre-set list of rude words as vandalism. This rule can be easily applied to all incoming Wikipedia edits, but it will miss vandalism that doesn’t contain rude words or that contains rude words not in the list. A rule-based approach can scale to the volume and rate of incoming Wikipedia edits but is not very effective.
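As a minimal sketch of such a rule, the snippet below flags any edit that contains a word from a pre-set list. The word list and the function name are hypothetical placeholders chosen for illustration, not the rules used by any actual Wikipedia bot.

```python
import re

# Hypothetical placeholder list; a real list would be curated per language.
RUDE_WORDS = frozenset({"stupid", "idiot", "dumb"})

def is_vandalism_by_rule(edit_text, rude_words=RUDE_WORDS):
    """Return True if the edit contains any word from the pre-set list."""
    words = re.findall(r"[a-z']+", edit_text.lower())
    return any(word in rude_words for word in words)

print(is_vandalism_by_rule("This article is stupid"))      # True: word is in the list
print(is_vandalism_by_rule("This article is moronic"))     # False: rude, but not in the list
print(is_vandalism_by_rule("The moon is made of cheese"))  # False: vandalism without rude words
```

The last two calls illustrate the weakness described above: the rule is cheap to apply to every edit, but it only catches what its list anticipates.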



Third, we can use machine learning to combine the effectiveness of volunteers in detecting vandalism with the scalability of the rule-based approach. In the human-based approach, volunteers labeled edits as vandalism or not. In the rule-based approach, we applied a pre-set rule to incoming Wikipedia edits. In machine learning, we take a different approach: give an algorithm a set of edits and their human-created labels (i.e., vandalism or not) and have it “learn” what set of rules produces labels similar to those created by humans. The resulting trained model can then be applied to all incoming Wikipedia edits. This approach can scale like the rule-based approach while being approximately as good at detecting vandalism as the human-based approach.



In reality, Wikipedia uses all three approaches – volunteer editors, rule-based bots, and machine learning – to detect vandalism. The approaches complement one another, each mitigating the weaknesses of the others.

What is a model?
In machine learning, training algorithms use data to create models. While a training algorithm takes in data and outputs a model, a model takes in data (e.g., Wikipedia edits) and outputs a prediction (e.g., vandalism or not). There are many types of training algorithms, each creating models with different strengths and weaknesses. For example, a decision tree algorithm creates models that are highly explainable but are often not as high-performing (e.g., not as accurate) as other types of models.
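To make the distinction between a training algorithm and a model concrete, here is a toy sketch in plain Python. The training function takes labeled data and returns a model; the model takes an input and returns a prediction. The feature (an editor's prior edit count), the data, and all names are invented for illustration and do not describe any Wikimedia model.

```python
from typing import Callable, List, Tuple

# Each example: (editor's prior edit count, human label: True = vandalism).
# The numbers are invented purely for illustration.
LabeledEdit = Tuple[int, bool]
Model = Callable[[int], bool]

def train_threshold_model(examples: List[LabeledEdit]) -> Model:
    """Toy training algorithm: find the edit-count threshold that best
    reproduces the human labels, then return a model that applies it."""
    best_threshold, best_correct = 0, -1
    for threshold in sorted({count for count, _ in examples}):
        # Candidate rule: editors with fewer than `threshold` edits are flagged.
        correct = sum((count < threshold) == label for count, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct

    def model(edit_count: int) -> bool:
        return edit_count < best_threshold

    return model

training_data = [(3, True), (10, True), (40, True),
                 (700, False), (1200, False), (9000, False)]
model = train_threshold_model(training_data)  # training: data in, model out
print(model(5), model(5000))                  # prediction: data in, label out
```

With this toy data the learned threshold is 700, so the final line prints `True False`.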

Models come in many forms, depending on the training algorithm used and the settings employed during training. A large model like Google’s Pathways Language Model is massive, containing 540 billion parameters (i.e., settings) that interact in complex ways to create a prediction. Other models are much simpler. Continuing our vandalism example, imagine we had used a decision tree algorithm to create a decision tree model. What exactly might that model look like?

As their name implies, decision tree training algorithms create decision trees. Here is a simple, fictional example of a decision-tree model that a training algorithm might create:



In this example, the training algorithm has learned from seeing Wikipedia edits and their human-created labels (i.e., vandalism or not) that this decision tree produces labels very similar to those a human would make. Specifically, if an edit contains a rude word, it is likely vandalism. If it does not contain a rude word but the editor has fewer than 500 edits, it is also likely vandalism. For editors with more edits than that, the edit is likely vandalism if the article is about politics and is otherwise not likely vandalism.
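The fictional tree described above can also be written as a short function, which shows why decision tree models are considered highly explainable. This is only a sketch: the `Edit` fields are hypothetical names, and treating an editor with exactly 500 edits as experienced is an assumption, since the description does not say which branch that case takes.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    # Hypothetical features; names are chosen for illustration only.
    contains_rude_word: bool
    editor_edit_count: int
    article_is_about_politics: bool

def predict_vandalism(edit: Edit) -> bool:
    """Walk the fictional decision tree from the root to a leaf."""
    if edit.contains_rude_word:
        return True                        # rude word: likely vandalism
    if edit.editor_edit_count < 500:
        return True                        # new editor: likely vandalism
    # Experienced editor (assumed to include exactly 500 edits):
    # the prediction depends only on the article's topic.
    return edit.article_is_about_politics

print(predict_vandalism(Edit(False, 12, False)))    # True
print(predict_vandalism(Edit(False, 4000, False)))  # False
```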

This model can then be applied to every new edit to Wikipedia, for each edit outputting a predicted label of vandalism or not.

An introduction to machine learning operations
Imagine we train the anti-vandalism model example from above. How do we deploy it into production? How do we quickly make improvements at the request of communities? How do we track how well it is performing? This is the area of machine learning operations (MLOps). MLOps is a specialty covering machine learning, developer operations (DevOps) and data engineering that focuses on managing the lifecycles of machine learning models.

When you have one professional relationship (e.g., a user or customer), it is easy to keep track of their contact information and remember your previous conversations. However, as the number of relationships grows into the hundreds or even millions, managing all that contact information requires specialized tools such as customer relationship management (CRM) systems like Salesforce. Similarly, when you have one machine learning model in production, it is easy to manage its lifecycle. However, as the number of machine learning models grows, managing all the different models under development, in production, and deprecated becomes difficult. Similar to how CRM systems manage customer relationships, MLOps systems manage an organization’s machine learning models. Two years ago, we selected Kubeflow, a popular open source Kubernetes-based MLOps system, as our in-house infrastructure for developing, deploying, and managing the lifecycles of Wikimedia’s machine learning models. This infrastructure was split into two parts (both running Kubeflow): Lift Wing, for managing the deployment of models in production, and Train Wing, for managing the development and maintenance of models.

Machine learning on Wikipedia
Machine learning has been used on Wikipedia for over a decade. The first uses of machine learning, and some of the most popular to this day, are bots created by the community to streamline various tasks. These bots are owned and maintained by the community, although many are hosted on Wikimedia’s Toolforge platform.

One of the most notable bots on Wikipedia is ClueBot NG, which has been active since at least 2012. ClueBot NG detects vandalism on Wikipedia using a machine learning model. The model is created using human-labeled training data. That is, volunteers use an interface to manually label an edit as vandalism or not. A training algorithm then uses that data to create the model, which then identifies new edits suspected of vandalism and reverts them.

In addition to community bots, Wikimedia itself uses, creates, deploys, and maintains hundreds of machine learning models serving various roles, from anti-vandalism to improving editor experiences.

One notable difference between how the Foundation and community volunteers use machine learning models is their integration into bots. As a general practice, Wikimedia does not use models to directly edit the contents of Wikipedia; instead, we use them to inform the actions of volunteer contributors. This is because it is the volunteer community, not Wikimedia, that stewards the contents of Wikipedia, and therefore changes in content should, as a general practice, be made by individual communities. For example, ClueBot NG (discussed above) will revert some edits identified as vandalism, whereas Wikimedia’s Damaging models feature (discussed below) is limited to highlighting the edit to volunteer editors. We will see more examples of this norm below.

Machine learning at Wikimedia
Wikimedia both builds and hosts its own machine learning models and uses third-party model APIs (e.g. Google Translate).

ORES
Since 2015, Wikimedia has built and hosted machine learning models that detect vandalizing edits on Wikipedia. These models and their infrastructure grew into what is today known as ORES. Over the years, ORES has been discussed at length in news articles, reports, and academic research.

The ORES models are used in a variety of places both by Wikipedia and the community. For example, English Wikipedia’s Recent changes page flags edits predicted to be vandalism. In addition, Huggle, a community-built Wikipedia editing browser that helps editors detect vandalism, incorporates ORES-generated vandalism predictions into the user interface.

ORES models
Today ORES hosts about 110 machine learning models. These models are trained using a purpose-built open source model training library (RevScoring). ORES maintains four types of models: article quality, edit quality, item quality, and article topic. Most models are language-specific; for example, the Dutch article quality model is trained on data from the Dutch Wikipedia community and applied only to Dutch Wikipedia articles.

Article quality models predict the quality of Wikipedia articles, whether using the English Wikipedia’s 1.0 article rating scale or a community’s own rating scale. These ratings help editors review new article submissions and find existing articles that are opportunities for improvement. There are two different kinds of article quality models. First, draft quality models predict whether newly created articles should be flagged for deletion because they are identified as spam, vandalism, etc. Second, article quality models predict where an article will fall on an existing article rating scale. There are currently three draft quality and 18 article quality language-specific models available on ORES.

Edit quality models predict vandalizing or otherwise bad edits to existing articles. They are designed to help editors identify edits that need special or rapid attention. There are three different kinds of language-specific models: reverted, good faith, and damaging. Reverted edit quality models predict whether an edit will eventually be reverted by a subsequent human edit (whether due to vandalism or something else). There are 12 language-specific reverted edit quality models in production. Good faith edit quality models predict whether an edit was made in good faith (rather than with the intent of causing harm). These models are created using hand-labeled data from volunteers. Damaging edit quality models predict whether an edit is damaging to Wikipedia. This is useful for both human editors patrolling Wikipedia and quality control bots. Similar to the good faith models, damaging edit quality models are trained using data labeled by human volunteers. There are 33 language-specific damaging edit quality models in production.

The item quality model is similar to article quality models but applied to Wikidata items. There is a single item quality model in use on Wikidata.

Article topic models predict where an article is best classified within a purpose-built topic taxonomy. These models are useful for human editors to curate new articles and identify gaps in articles on their topics. There are two different kinds of article topic models. Draft topic models predict the topic of a new article draft while article topic models predict the topic of existing articles. There is one draft topic model and five article topic models available for use.

ORES infrastructure
ORES uses a model hosting infrastructure built for hosting the four types of models listed above. The infrastructure’s hardware includes 18 servers across two data centers. Predictions are served to users using a RESTful API that requires no authentication by users.
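As a rough sketch of what using such an API looks like, the snippet below builds a score URL and reads a prediction out of a response. The host name, the `/v3/scores/<wiki>/<revision>/<model>` path, and the response layout are assumptions not stated in this document, and the sample response contains made-up numbers. No network request is made and no credentials are attached, in keeping with the unauthenticated design described above.

```python
import json
from urllib.parse import quote

ORES_BASE_URL = "https://ores.wikimedia.org"  # assumed host, not stated in the text

def build_score_url(wiki, rev_id, model, base_url=ORES_BASE_URL):
    """Build a score URL of the assumed form /v3/scores/<wiki>/<rev_id>/<model>."""
    return f"{base_url}/v3/scores/{quote(wiki)}/{int(rev_id)}/{quote(model)}"

def extract_prediction(response_body, wiki, rev_id, model):
    """Pull the prediction and class probabilities out of a response
    shaped like SAMPLE_RESPONSE below (an assumed layout)."""
    payload = json.loads(response_body)
    score = payload[wiki]["scores"][str(rev_id)][model]["score"]
    return score["prediction"], score["probability"]

# Illustrative response body with made-up numbers; not real ORES output.
SAMPLE_RESPONSE = json.dumps({
    "enwiki": {"scores": {"123456": {"damaging": {"score": {
        "prediction": False,
        "probability": {"false": 0.93, "true": 0.07},
    }}}}}
})

print(build_score_url("enwiki", 123456, "damaging"))
print(extract_prediction(SAMPLE_RESPONSE, "enwiki", 123456, "damaging"))
```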

Most models on ORES take about one second to compute a prediction from the given inputs and return a response to an API query. Since this is too slow for many use cases, a pre-cache system is implemented within ORES. The pre-cache system runs models against new Wikipedia edits and caches the results, allowing for faster API response times.
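The idea behind pre-caching can be sketched in a few lines. This is an illustration of the general technique only, not the ORES implementation: the class, its method names, and the stand-in model are all hypothetical.

```python
from typing import Callable, Dict

class PrecachedScorer:
    """Score edits ahead of time so later lookups skip the slow model call."""

    def __init__(self, score_fn: Callable[[int], float]):
        self._score_fn = score_fn          # the slow model call
        self._cache: Dict[int, float] = {}
        self.model_calls = 0               # counts how often the slow path ran

    def _compute(self, rev_id: int) -> float:
        self.model_calls += 1
        return self._score_fn(rev_id)

    def on_new_edit(self, rev_id: int) -> None:
        """Called when a new edit arrives: score it now and store the result."""
        self._cache[rev_id] = self._compute(rev_id)

    def get_score(self, rev_id: int) -> float:
        """Serve an API request: use the cached score, or compute on a miss."""
        if rev_id not in self._cache:
            self._cache[rev_id] = self._compute(rev_id)
        return self._cache[rev_id]

def fake_model(rev_id: int) -> float:
    # Stand-in for a real model; returns a deterministic made-up score.
    return (rev_id % 100) / 100

scorer = PrecachedScorer(fake_model)
scorer.on_new_edit(1042)        # pre-cache at edit time
print(scorer.get_score(1042))   # served from the cache
print(scorer.model_calls)       # the model ran only once
```

The trade-off is that every new edit is scored whether or not anyone asks for it, in exchange for fast responses to the requests that do arrive.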

The ORES infrastructure also includes a MediaWiki extension that integrates ORES predictions into Wikipedia’s RecentChanges page.

For human labeling of data, a purpose-built collaborative labeling tool (Wikilabels) is used. This tool is used to create the language-specific labels used for training article quality and edit quality models.

Other Wikimedia models
In addition to machine learning models hosted on ORES, Wikimedia also creates and hosts a variety of models elsewhere in the Wikimedia technical infrastructure. For example, the Machine Learning and Research teams built and maintain hundreds of language-specific models as part of the Add-A-Link structured task project. The Add-A-Link structured task uses machine learning to recommend potential edits (specifically, adding a link to existing text that will point to another article) to new editors to allow them to achieve some high-quality edits quickly and easily. Another example is the content translation recommendation tool, which uses a Wikimedia-created machine learning model to recommend articles for translation to editors.

Third-party models
Wikimedia also uses a number of third-party models. These models are not hosted by Wikimedia but rather are accessed using an API. These models include:


 * TurnItIn copyright violation detection API for detecting copyright infringement of new content.
 * Google Translate and Yandex Translate APIs in the content translation tool to help human editors translate articles.

Machine learning modernization efforts
The ORES infrastructure was a major step forward for Wikimedia’s machine learning infrastructure and has served its purpose well for many years. However, as the breadth and scale of machine learning application needs increased, a modernization effort became necessary. We are working on a multi-year effort to modernize Wikimedia’s machine learning systems and processes, building new systems that allow for the rapid, safe development and deployment of a wide variety of models.

Guiding principles

 * Scalable - One of the lessons we have learned from ORES is that any MLOps system maintained by Wikimedia must be quickly scalable. The machine learning needs of various teams at Wikimedia, users, and the community vary widely over time and the infrastructure must be able to adapt without significant redesign.
 * Agile - Adoption of modern software engineering best practices means moving away from the waterfall model that has been common practice in the past in model development. Instead, both process workflows and technical systems need to be designed to allow rapid iteration on models as they are developed. This means model creators (whether Wikimedia staff or volunteer technical contributors) can quickly see changes reflected in production, evaluate them, and use that information to improve the models.
 * Collaboration - Machine learning at Wikimedia operates in the open – our work is highly visible on-wiki and in our public source code repository, even when it is unfinished. Leaning into this superpower means creating social-technical systems that allow quick collaborations between staff and technical contributors.
 * Public Best Practice Example Of Ethical ML - Wikimedia’s open model of development and principles as a community organization means it is in a position of not only adopting ethical machine learning approaches but also publicly advocating for them and demonstrating their effectiveness. The goal is for the public to learn from our efforts around ethical machine learning, both successes and failures.
 * Human In The Loop - Wikipedia is, at the end of the day, a human-centric project. This means that wherever machine learning models hosted by Wikimedia are being used on wikis, their use should be clear and unambiguous to human users and include an easily accessed path for human feedback.
 * Accessibility - As a collaborative community project, machine learning at Wikimedia should move beyond open source as an end goal and instead focus on accessibility. While open source remains a major principle, a focus on accessibility means making it easy for community members and technical contributors with different levels of skill and expertise in machine learning to participate in model development and governance.
 * Community Governance - Wikimedia supports the free knowledge movement, and in that spirit, communities must have control over machine learning models that affect them.

Benefits of modernization

 * Centralize model lifecycle management - Currently, machine learning models are hosted in different areas around the Wikimedia technical infrastructure. This dispersion makes the rapid, effective management of the lifecycle of machine learning models (i.e., development, iteration, deprecation) time-consuming and haphazard. The goal is to centralize the management of machine learning models hosted by Wikimedia into a single infrastructure, allowing for fast, responsive iterations regardless of the model.
 * Rapid automated deployment and iteration - Model deployments at Wikimedia are too often slow and haphazard. Modern MLOps approaches and systems will make changing a model quick and safe, allowing for fast iterative development and responsiveness to user needs.
 * Regulatory compliance - With machine learning models increasingly under scrutiny by governments and regulators, a modern MLOps system will better situate Wikimedia to evaluate the impact of regulatory changes and respond quickly.
 * Scalability - Machine learning use at Wikimedia has grown rapidly over the years. Our modernization work is building the infrastructure to adapt to rapid scaling changes quickly, without significant system redesigns or software changes.
 * Wide range of use cases - ORES focused exclusively on hosting tree-based machine learning models. While these models are powerful, there is a growing need to support a broader range of machine learning approaches and libraries, including GPU-powered computer vision.
 * Greater collaboration - Engaging with machine learning projects hosted by Wikimedia should be easy, well-documented, and fun. This means building in open, accessible ways that invite timely engagements from the community and that are quickly reflected in production. The question we are focused on is: how long does it take a technical contributor interested in machine learning to contribute meaningfully?
 * Community governance - Communities need simple, clear, publicized pathways to govern the machine learning models that are affecting them. The modernization effort is focused on building those pathways.

Machine learning workflow
Model development inside Wikimedia has in the past often fallen into a de facto waterfall approach: a product team asks for a model to be created and passes the request to a model creator (often a researcher) who creates a small-scale model and then passes it to an engineering team to deploy. This approach has sometimes caused friction during the handover process and made iterative improvement slow. The Machine Learning team is working with the Research team and other model creators to implement a Day-One agile approach. In this new workflow, a machine learning engineer from the Machine Learning team works with a model creator on model deployment from the first day of the project. The goal is that the model is deployed on the first day it is created and again at every new iteration during its development, so that issues that appear in production are identified and resolved early in the model creation process.

Lift Wing
A significant goal of the modernization effort is the creation of a new MLOps infrastructure able to handle the expanding needs of Wikimedia and the community. After investigating various options, the open source MLOps library Kubeflow was selected to be the core of the new infrastructure. This infrastructure is divided into two parts: Lift Wing and Train Wing.



Lift Wing is a production Kubernetes cluster hosting KServe, a serverless inference service. Lift Wing hosts the production machine learning models and responds to requests for predictions.
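To give a sense of what requesting a prediction from a KServe-hosted model can look like, the sketch below builds (but does not send) an HTTP request using the `/v1/models/<name>:predict` path from KServe's v1 inference protocol. The host, the model name, and the JSON payload are hypothetical; the actual Lift Wing endpoints and input formats are not specified in this document.

```python
import json
from urllib.request import Request

def build_predict_request(base_url, model_name, payload):
    """Build (but do not send) a POST request in the KServe v1 style."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_predict_request(
    "https://inference.example.org",  # hypothetical host
    "enwiki-damaging",                # hypothetical model name
    {"rev_id": 123456},               # hypothetical input format
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here because the host is a placeholder.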

There are a number of benefits to this approach. First, Lift Wing provides a centralized, generic platform for managing all machine learning models hosted by Wikimedia, whether large deep learning language models or small unsupervised models. Second, KServe’s serverless approach provides more flexibility in model deployment. Models that are used more frequently are given more resources. It is even possible for models that are not currently in active use to lie dormant with minimal resource consumption and then be “spun up” when predictions are requested by users. Third, Lift Wing’s model deployment process allows for rapid iteration of models as they are developed, including dedicated sandbox and staging environments.

As part of a larger API initiative, Lift Wing will be accessible through Wikimedia’s API Gateway (api.wikimedia.org).

Migration from ORES to Lift Wing
As part of the infrastructure modernization, Lift Wing will eventually replace the existing ORES infrastructure. This migration will happen in eight stages:


 * Stage 1 - Place the ORES infrastructure into maintenance mode. ORES is still live and accessible and maintained by the Machine Learning team. However, development has slowed to allow resources to be shifted to the migration. This has already been completed.
 * Stage 2 - Copy all models hosted on ORES onto Lift Wing. The end result of this step is that all models currently hosted on ORES will also be accessible on Lift Wing. When this stage is complete, users can continue to make API calls to ORES or start the process of moving over to Lift Wing.
 * Stage 3 - Test Lift Wing in parallel with ORES. The goal of this stage is to confirm that Lift Wing produces the same outputs as ORES for users.
 * Stage 4 - Lift Wing goes “live”. At this stage both Lift Wing and ORES are live and available to users.
 * Stage 5 - ORES Deprecation Communication. All known users of ORES will receive a direct communication about the future deprecation of ORES. Relevant documentation will also include a note about the future deprecation with a timeline and links for additional information. We plan for the first communication to take place at least six months before ORES is deprecated.
 * Stage 6 - ORES user investigation and migration assistance. Not all users of ORES are documented. For this reason, the Machine Learning team will use usage data to attempt to identify and contact the remaining users of ORES. In addition, the team will provide users of ORES with migration assistance to Lift Wing.
 * Stage 7 - Lightswitch Test (proposed). The worst-case scenario is that the Machine Learning team is unable to contact an ORES user and thus they do not have a chance to migrate to Lift Wing before ORES is deprecated. For this reason, the team is exploring a light switch test, where ORES outputs are replaced with an error message containing the deprecation communication for an hour. With luck, this will identify additional ORES users early enough that they have ample time to migrate.
 * Stage 8 - ORES is shut down. Relevant documentation is adjusted to reflect the shutdown.
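The parallel test in Stage 3 amounts to scoring the same revisions on both systems and comparing the results. A minimal sketch of that comparison follows; the score format (class label mapped to probability), the tolerance, and the sample numbers are assumptions for illustration, not the team's actual test harness.

```python
import math
from typing import Dict, List

Score = Dict[str, float]  # class label -> probability (assumed format)

def find_mismatches(old_scores: Dict[int, Score],
                    new_scores: Dict[int, Score],
                    tolerance: float = 1e-6) -> List[int]:
    """Return revision IDs whose scores differ between the two systems."""
    mismatched = []
    for rev_id, old in old_scores.items():
        new = new_scores.get(rev_id)
        # A missing revision or a different set of classes is a mismatch.
        if new is None or old.keys() != new.keys():
            mismatched.append(rev_id)
            continue
        # Otherwise compare each class probability within the tolerance.
        if any(not math.isclose(old[k], new[k], abs_tol=tolerance) for k in old):
            mismatched.append(rev_id)
    return mismatched

# Made-up scores for two revisions from each system.
ores_scores = {1: {"true": 0.1, "false": 0.9}, 2: {"true": 0.8, "false": 0.2}}
lift_wing_scores = {1: {"true": 0.1, "false": 0.9}, 2: {"true": 0.6, "false": 0.4}}

print(find_mismatches(ores_scores, lift_wing_scores))  # [2]
```

Comparing with a small tolerance rather than exact equality avoids flagging differences that come only from floating-point rounding.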

Current status and timeline
Lift Wing is currently live and hosting machine learning models. The copying of ORES models onto Lift Wing is well underway. However, Lift Wing is not yet connected to the API Gateway, and until that is complete, the ORES migration timeline is paused.

Train Wing
As described above, Lift Wing is Wikimedia’s Kubernetes cluster for hosting and serving machine learning models. Train Wing is a second Kubernetes cluster running Kubeflow for model creation, experimentation, and development. Where direct operational access to Lift Wing is restricted to Wikimedia SREs and machine learning engineers, Train Wing will provide model creators beyond the team an environment for model development through a graphical user interface (GUI). From this GUI, researchers or other model creators can build new machine learning models, manage model versions, iterate on model development, schedule automatic retraining, and (eventually) self-deploy models into production (i.e., deploy the model to Lift Wing).



On Train Wing’s training cluster, data scientists and ML engineers can define their entire model training process in functional Python code, and Kubeflow will orchestrate the resource management process for them.
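To illustrate what a training process expressed as functions can look like, here is a sketch in plain Python. It deliberately does not use the Kubeflow SDK, so it shows only the general shape (each step a function, the pipeline a composition of steps) and not how Train Wing pipelines will actually be written. The data, the steps, and the toy model are all invented.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[int, bool]  # (editor edit count, is_vandalism); invented toy data

def load_data() -> List[Example]:
    return [(3, True), (10, True), (700, False), (1200, False)]

def split(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    # Alternate examples between the training set and the test set.
    return examples[::2], examples[1::2]

def train(train_set: List[Example]) -> Callable[[int], bool]:
    # Toy model: a threshold halfway between the most active vandal and
    # the least active good-faith editor seen in the training set.
    vandal_counts = [count for count, label in train_set if label]
    clean_counts = [count for count, label in train_set if not label]
    threshold = (max(vandal_counts) + min(clean_counts)) / 2
    return lambda edit_count: edit_count < threshold

def evaluate(model: Callable[[int], bool], test_set: List[Example]) -> float:
    correct = sum(model(count) == label for count, label in test_set)
    return correct / len(test_set)

def training_pipeline() -> Dict[str, float]:
    # The whole process is a composition of the steps above.
    train_set, test_set = split(load_data())
    model = train(train_set)
    return {"accuracy": evaluate(model, test_set)}

print(training_pipeline())
```

Because each step is a self-contained function with explicit inputs and outputs, an orchestrator can in principle run the steps separately and allocate resources per step.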

Current status and timeline
Train Wing is currently planned for deployment as part of the new shared Data Science and Engineering (DSE) Kubernetes cluster. Work on this new cluster is underway but in its early stages. The goal is for release within the current fiscal year.

Accessibility
Machine learning at Wikimedia cannot be a black box, as it so often is in other applications around the internet. We must not ask users to trust us when it comes to the safety, ethics, or capabilities of our models but instead, in the “wiki way”, provide evidence for our claims. As a community-governed project, we must focus on making the experience of collaborating and engaging with machine learning as good as possible.

A central part of accomplishing a great user experience for our work is creating easy points of accessibility for technical contributors, community members, and the public. This means changing how we do work and how we document work. There are three elements of accessibility in the modernization effort: model cards, GitLab, and open data access.

Model cards
First proposed in Mitchell et al. (2018), model cards are a form of short documentation describing a machine learning model. The purpose of the documentation is to provide users and the public with information about the model and its uses. Wikimedia has over 100 machine learning models in production, but there is at present no consistent way for interested people to learn about specific models and discuss them. As part of the modernization effort, model cards are being developed for all machine learning models hosted on Lift Wing.
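As an illustration of how lightweight a model card can be, the sketch below represents one as a small data structure and renders it as wikitext. The fields are hypothetical and only loosely inspired by the kinds of sections model cards typically contain; they are not Wikimedia's model card template, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelCard:
    # Hypothetical fields chosen for illustration.
    name: str
    intended_use: str
    training_data: str
    caveats: List[str] = field(default_factory=list)

    def to_wikitext(self) -> str:
        """Render the card as a wiki page section with subsections."""
        lines = [
            f"== {self.name} ==",
            "=== Intended use ===", self.intended_use,
            "=== Training data ===", self.training_data,
        ]
        if self.caveats:
            lines.append("=== Caveats ===")
            lines.extend(f"* {caveat}" for caveat in self.caveats)
        return "\n".join(lines)

card = ModelCard(
    name="Example damaging model",
    intended_use="Highlight possibly damaging edits for human review.",
    training_data="Edits labeled by volunteers.",
    caveats=["Predictions are suggestions, not decisions."],
)
print(card.to_wikitext())
```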

The model cards will be shared on meta.wikimedia.org. Since the primary audience for the model cards is Wikipedia editors, keeping the model cards as wiki pages maintains a seamless user experience with the same UI, features (e.g., watchlists and talk pages), and workflows they are already familiar with. The goal is for the model card to be a first stop for people interested in discussing or contributing to a machine learning model.

Since each model card is a wiki page, its talk page provides a familiar on-wiki location for discussion of specific models. The watchlist feature also allows Wikipedians to monitor changes to model cards (and thus to be informed of changes to the models) directly from their regular Wikipedia accounts.

A single model card has been created as a proof of concept. Final revisions are underway, after which we will begin rolling out model cards for all models on Lift Wing.

Code and data access
Currently, the code and data for creating models are spread over a number of repositories on GitHub and Gerrit. Experience has demonstrated that it is not easy for a potential contributor to understand how to access or replicate the model code.

An objective for the modernization efforts is that user experience for engaging with the models will improve to the point that a technical contributor should be able to identify the model they want to contribute to, run the code in their own environment if they wish, and contribute changes back into our code repository.

Community governance
Communities need the ability to govern the models that impact them. In practical terms, this means communities need to be able to request that models be changed, turned off, or created. This should happen not through personal connections between community members and Wikimedia staff but through public, documented processes that are resilient to the coming and going of staff and volunteers.

This leads to some currently unanswered questions. For example, what constitutes proof of consensus by a community? Can a single community member request a model be turned off or changed? Can a member of one Wikipedia community request a change that affects another Wikimedia community?

We hope to explore and answer those questions in time and develop processes that allow communities the control they need.

Current status and timeline
Community governance processes are currently planned for development, but work has not yet started.