
Machine Learning

From mediawiki.org

Welcome to the Wikimedia Foundation's Machine Learning Team homepage.

Our team oversees the development and management of machine learning models for end users, as well as the infrastructure required for designing, training, and deploying these models.

Current projects

For archived projects, see this list.

Contact us

Have a question? Want to talk to the team or our community of volunteers about machine learning? Here are the best ways to connect with us.

Team Chat

Discuss machine learning and watch the team work by joining our public IRC chatroom, #wikimedia-ml, on irc.libera.chat.

Active Work Board

Have a particular task you want to discuss or work on? Join our public Phabricator board. Visit our work board.

What's new?

    • We are continuing to integrate the article-country model into Liftwing. The article-country model predicts which countries any particular article is relevant to, and it's an extension of the article-topic model, which we have used for years.
    • We're trying different approaches to build vLLM (a high-throughput and memory-efficient system designed for serving large language models) and ROCm (AMD's open source software stack for running compute workloads on AMD GPUs) with Ubuntu. This is part of the work of making production LLMs on Liftwing possible.
    • We're currently working on configuring the ML Lab servers. These are for model training.
    • Updated the rec-api image deployment model. Deployed the reference need model to production.
    • Following up on a recurring issue reported by the Structured Content team: the MediaDetection API can access the logo-detection endpoint via mwdebug1001.eqiad.wmnet and mwdebug2001.codfw.wmnet, but can't access it on k8s-mwdebug.
    • Adding logo-detection documentation to the API portal docs.
    • Investigating occasional slow queries on LiftWing when using some RevScoring models.
    • Continuing the remaining work on the pre-save revertrisk model. This model is designed to provide a vandalism prediction before an edit is saved to Wikipedia (and thus before it has a revision ID).
    • Work continues on upgrading kserve to 0.13.
    • Initializing install config for the GPU hosts in eqiad
    • Apologies on my end for the delay in updates; I got COVID.
    • Work continues on the Logo Detection model. We made an example logo-detection model-server that processes base64 image objects instead of image URLs and sent it to the Structured Content team for their thoughts.
    • Work continues on the HuggingFace model server.
    • General bug fixes and improvements.
    • Work continues on the Logo Detection model. The issue we discussed this week as a team was whether the encoded image would be sent directly to LiftWing or whether we'd instead receive a URL of the image's location, which LiftWing would then access and download from. This matters because it affects the size of the REST payload, particularly with batched requests.
    • We are still working on the HuggingFace model server GPU issue (i.e. it won't recognize our AMD GPU). There are a number of possibilities as to why, but we want this resolved before we finalize our order for this fiscal year.
    • A number of misc bug fixes and improvements.
    • Our big Istio refactoring is underway (slides)! This refactoring will allow us to move a lot of networking logic out of individual model containers. For example, currently if there were a change to the `discovery.wmnet` endpoint (WMF's internal endpoint for APIs), we'd have to update hundreds of individual model containers and redeploy them. This refactoring removes this need entirely.
    • We've been deploying AMD's open source software stack (ROCm) inside each k8s node, but we suspect this has been unnecessary (and actually causing some problems) because PyTorch already has a version of ROCm included in the library. This work is being prioritized because completing it is a requirement for making the large GPU order we have planned later in the quarter.
    • We are preparing a patch that enables the logo-detection model server to access external URLs using internal k8s endpoints. This is part of the changes we needed to make to deploy the model.
    • Continuing to test the HuggingFace model server image on our Lift Wing nodes. This work was paused for the week while the engineer attended the Wikipedia Hackathon in Tallinn.
    • Lift Wing caching work has been paused until the Istio refactoring is complete.
    • Reviewing and testing the big patch for the ORES extension. The ORES extension provides a way to see the probability that a particular edit is reverted for all edits on the recent changes page of many wikis. The patch integrates the new revert risk model into the extension so that volunteers can use that new model when hunting down potential vandalism.
    • We're still doing some tweaks for the image processing for the logo detection model, specifically restricting the image processing to trusted domains that host Wikimedia Commons images.
    • We have a big Istio (Istio is the service mesh for k8s that controls how microservices share data with each other) refactoring proposal under discussion. On Tuesday the team will have a special meeting to discuss the proposed refactoring and decide on the path forward. I'll post the slides next week if people are interested.
    • The logo detection model is being moved to the experimental namespace. This will be a moment where we can test the model in a production setting to make sure that it has the performance that we want. This work is being coordinated really closely with the structured content team to make sure it meets their needs.
    • The ML and Research Airflow Pipeline Sprint has started this week. This is an effort to see how we can use Airflow pipelines and GPUs on the existing Hadoop Cluster to train models.
    • Work continues on the Cassandra clusters that will be part of the caching solution.
    • Work continues on the Hugging Face model server image. This is an effort that we're working on that will allow us to easily host many of the models that are available on Hugging Face onto Lift Wing directly. This is actually a really interesting project because it's an easy way for the community to experiment with the models that they might want to host on Lift Wing and even propose models that they might want to have on Lift Wing.
    • We are working with the data center operations team on the procurement of new machines with GPUs. The current status is that we are working with the vendor on an issue around the availability of a particular server configuration and looking at some alternatives.
    • Chris on vacation. No update this week.
    • Big win for the week: Our HuggingFace Docker image patch has been reviewed and approved. This Docker image allows us to deploy HuggingFace models quickly onto LiftWing, in a way that will speed up our development process going forward.
    • Continuing to integrate the logo-detection prototype into a KServe custom model server that will be hosted on LiftWing.
    • Working on the revertrisk-multilingual (RRML) GPU image and ensuring the RRML model is compatible with torch 2.x (e.g. checking that predictions are still correct, since the model was trained with 1.13).
    • We are still working on the logo detection model for Wikimedia Commons. The current status is that we have confirmed with the product team working on the feature that the model is returning the expected outputs. The next step is to look at input validation and image size limits. The open question we are discussing with the product team is whether resizing of images should be done inside Lift Wing or prior to the image being sent to Lift Wing. Resizing is important because the logo detection model expects an image of a certain size.
    • Work / banging our heads continues on the pytorch base image. For those following along, we are working with Service Ops to make a reasonably sized docker image that contains pytorch and ROCm support. If the base image is too big it becomes a problem for our Docker registry and we are trying to be good stewards of that common resource. Turns out it is harder than we thought.
    • More work is happening on Lift Wing caching. We are still working out how we want Lift Wing (specifically KServe's Istio) to talk to the Cassandra servers.
    • A new version of the Language Agnostic Revert Risk model has been deployed to staging and is currently undergoing load testing.
    • More work on the HuggingFace model server integration with Lift Wing. Once we crack this we will be able to deploy most models on HuggingFace quickly.
    • We stood up a Wikimedia community of practice for ML this week. The goal is to provide a space for all the folks around WMF that are working on the technical side of ML to share insights and learn together. Currently there are folks from a number of teams in the community of practice, including ML, Research, Content Translation, and others.
    • We are still waiting for our test GPUs (one server with two MI210s) to be installed in the data center. Once we have confirmed that this configuration works well in our infrastructure (a few days of testing max), we can continue with the full order.
    • I am starting work on a white paper that surveys all the work the Wikimedia-verse is doing around AI. This includes models WMF hosts, advocacy work done by WMF, work by volunteers, etc. If you know some people I should talk with, definitely reach out.
    • We are really pushing hard on getting caching deployed. The reason is that caching lets us take full advantage of the CPUs we have now by pre-caching predictions. The end result for users is that a prediction that might take 500ms would take a fraction of that time. The current status of the work is that our SRE is trying to get Lift Wing to speak to the Cassandra servers.
    • Our SLO dashboards need to be fixed. They are giving some wild numbers that are clearly incorrect. Our team is working with folks to figure it out.
    • Work on the Logo Detection model continues. The request to host this model comes from the Structured Content team. The goal is to predict logos in Wikimedia Commons because logos account for a significant chunk of files that receive a deletion request.
    • We are continuing to try to load the HuggingFace model server onto Lift Wing. When completed this offers the potential to load a model hosted on HuggingFace into Lift Wing quickly and easily, opening a huge new library of models for folks to use.
    • We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
    • We are continuing to work on hosting HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
    • We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
    • Our Cassandra databases for an eventual caching system are in production. Still more work to do, but it's a good start.
    • General updates and bug fixes.
    • Sorry for the update being one day late; I (Chris) attended the Strategy meeting in NYC and am writing this update from the plane back.
    • An issue we are facing is that WMF's Docker registry is set up for smaller Docker images (~2 GB). However, the Docker images of the team can get pretty big because of ROCm/PyTorch (~6-8 GB). We are working out how to resolve that. There are a number of strategies we can use, from optimizing the image layers better to requesting that the max Docker image size limit be increased.
    • As a partial solution to the above, we installed Dragonfly, which is a peer-to-peer layer between our Kubernetes cluster and the WMF Docker registry. We will also work on some other improvements.
    • We are continuing to work on including HuggingFace's prebuilt model server in Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides. (T357986) This isn't done yet, but it would be really nice to have.
    • Fixing a reported bug about an inconsistent data type for article quality scores on ptwiki. The error was caused by the mixed schema of the responses returned by ORES. (T358953)
    • We made our server hardware request for the next fiscal year. The short version is: GPUs.
    • GPU order is underway. We are in the process of ordering a series of servers to use for training and inference. Each server will have two MI210 AMD GPUs. Most will be reserved for model inference (specifically, larger models like LLMs), but we will use two servers (4 GPUs) to create a model training environment. This model training environment will start very small and scrappy but will hopefully grow into a place for automated retraining of models and the standardization of model training approaches. As a next step, a single server is on its way to our data center; once it is tested, we will make the full order.
    • Work on caching for Lift Wing continues. We are in the process of making a large order of GPUs. However, to optimize our resource use, one of the best strategies available to us is to conduct model inference using our existing CPUs. This is not always possible, for example in cases where the set of possible model inputs is not finite. However, in cases where the possible inputs are finite, we can cache the predictions for those inputs and then serve them to users rapidly with minimal compute used. This is a similar system to the one originally used on ORES.
    • The pentesting of Lift Wing continues. The testing is being done by a third party contractor and is examining our vulnerability to malicious code.
    • Wikimedia's branding team has come out with some suggestions for the naming of machine learning tools and models. The hope is that our naming is more systematic and less ad-hoc.
    • Chris helped organize and attend an event in Bellagio, Italy to craft a research agenda for researchers interested in Wikipedia. That research agenda is available here.
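
Several of the updates above describe caching predictions when the set of possible model inputs is finite, so that they can be served with minimal compute. A minimal sketch of that idea follows. It is only an illustration, not the Lift Wing implementation: the class name, the `toy_model` function, and the use of an in-memory dictionary as a stand-in for an external store such as Cassandra are all assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Iterable


class PrecomputedPredictionCache:
    """Hypothetical cache: serve stored predictions, fall back to the model."""

    def __init__(self, model: Callable[[Hashable], float]) -> None:
        self._model = model
        # In-memory dict used here as a stand-in for an external store.
        self._store: Dict[Hashable, float] = {}
        # Counts how often the (expensive) model is actually invoked.
        self.model_calls = 0

    def _predict(self, key: Hashable) -> float:
        self.model_calls += 1
        return self._model(key)

    def warm(self, keys: Iterable[Hashable]) -> None:
        """Pre-compute predictions for a known, finite set of inputs."""
        for key in keys:
            self._store[key] = self._predict(key)

    def get(self, key: Hashable) -> float:
        """Return a cached prediction, computing and storing it on a miss."""
        if key in self._store:
            return self._store[key]
        value = self._predict(key)
        self._store[key] = value
        return value


def toy_model(revision_id: int) -> float:
    """Placeholder scoring function standing in for a real model."""
    return (revision_id % 10) / 10.0


if __name__ == "__main__":
    cache = PrecomputedPredictionCache(toy_model)
    cache.warm([1, 2, 3])    # three model calls happen ahead of time
    print(cache.get(2))      # served from the store, no model call
    print(cache.model_calls)
```

In this sketch, requests for pre-warmed inputs never reach the model, which is the property that makes the approach attractive when inference is slow relative to a lookup.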