Machine Learning

From mediawiki.org

Welcome to the home page of the Wikimedia Foundation's Machine Learning team.

Our team oversees the development and management of machine learning models for end users, as well as the infrastructure required for designing, training, and deploying these models.

Current projects

For archived projects, see this list.

Contact us

Have a question? Want to talk to the team or our volunteer community about machine learning? Here are the best ways to reach us.

Team Chat

Discuss machine learning and watch the team work by joining our public IRC chatroom, #wikimedia-ml, on irc.libera.chat.

Active Work Board

Have a particular task you want to discuss or work on? Join our public Phabricator board. Go to our work board

What's new?

    • The logo detection model is being moved to the experimental namespace. This will be a moment where we can test the model in a production setting to make sure that it has the performance that we want. This work is being coordinated really closely with the structured content team to make sure it meets their needs.
    • The ML and Research Airflow Pipeline Sprint has started this week. This is an effort to see how we can use Airflow pipelines and GPUs on the existing Hadoop cluster to train models.
    • Work continues on the Cassandra clusters that will be part of the caching solution.
    • Work continues on the Hugging Face model server image. This is an effort that we're working on that will allow us to easily host many of the models that are available on Hugging Face onto Lift Wing directly. This is actually a really interesting project because it's an easy way for the community to experiment with the models that they might want to host on Lift Wing and even propose models that they might want to have on Lift Wing.
    • We are working with the data center operations team on the procurement of new machines with GPUs. The current status is that we are working with the vendor to resolve an issue around the availability of a particular server configuration and looking at some alternatives.
    • Chris on vacation. No update this week.
    • Big win for the week: Our HuggingFace Docker image patch has been reviewed and approved. This Docker image allows us to deploy HuggingFace models quickly onto LiftWing, which will speed up development going forward.
    • We are continuing to integrate the logo-detection prototype into a KServe custom model server that will be hosted on LiftWing.
    • Work continues on the revertrisk-multilingual GPU image, to ensure the RRML model is compatible with torch 2.x (e.g. that predictions are correct, as the model was trained with 1.13).
    • We are still working on the logo detection model for Wikimedia Commons. The current status is that we have confirmed with the product team working on the feature that the model is returning the expected outputs. The next step is to look at input validation and image size limits. The open question we are discussing with the product team is whether resizing of images should be done inside Lift Wing or prior to the image being sent to Lift Wing. Resizing is important because the logo detection model expects an image of a certain size.
    • Work / banging our heads continues on the pytorch base image. For those following along, we are working with Service Ops to make a reasonably sized docker image that contains pytorch and ROCm support. If the base image is too big it becomes a problem for our Docker registry and we are trying to be good stewards of that common resource. Turns out it is harder than we thought.
    • More work is happening on Lift Wing caching. We are still working out how we want Lift Wing (specifically KServe's Istio) to talk to the Cassandra servers.
    • A new version of the Language Agnostic Revert Risk model has been deployed to staging and is currently undergoing load testing.
    • More work on the HuggingFace model server integration with Lift Wing. Once we crack this we will be able to deploy most models on HuggingFace quickly.
    • We stood up a Wikimedia community of practice for ML this week. The goal is to provide a space for all the folks around WMF that are working on the technical side of ML to share insights and learn together. Currently there are folks from a number of teams in the community of practice, including ML, Research, Content Translation, and others.
    • We are still waiting for our test GPUs (one server with two MI210s) to be installed in the data center. Once we confirm that this configuration works well in our infrastructure (a few days of testing max) we can continue with the full order.
    • I am starting work on a white paper that surveys all the work the Wikimedia'verse is doing around AI. This includes models WMF hosts, advocacy work done by WMF, work by volunteers, etc. If you know some people I should talk with, definitely reach out.
    • We are really pushing hard on getting caching deployed. The reason is that caching lets us take full advantage of the CPUs we have now by pre-caching predictions. The end result for users is that a prediction that might take 500ms would take a fraction of that time. The current status of the work is that our SRE is trying to get Lift Wing to speak to the Cassandra servers.
    • Our SLO dashboards need to be fixed. They are giving some wild numbers that are clearly incorrect. Our team is working with folks to figure it out.
    • Work on the Logo Detection model continues. The request to host this model comes from the Structured Content team. The goal is to predict logos in Wikimedia Commons because logos account for a significant chunk of files that receive a deletion request.
    • We are continuing to try to load the HuggingFace model server onto Lift Wing. When completed this offers the potential to load a model hosted on HuggingFace into Lift Wing quickly and easily, opening a huge new library of models for folks to use.
    • We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
    • We are continuing to work on hosting HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
    • We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
    • Our Cassandra databases for an eventual caching system are in production. There is still more work to do, but it's a good start.
    • General updates and bug fixes.
    • Sorry for the update being one day late, Chris (I) attended the Strategy meeting in NYC and is writing this update from the plane back.
    • An issue we are facing is that WMF's Docker registry is set up for smaller Docker images (~2GB). However, the team's Docker images can get pretty big because of ROCm/PyTorch (~6-8GB). We are working out how to resolve that. There are a number of strategies we can pursue, from optimizing the image layers better to requesting that the maximum Docker image size limit be increased.
    • As a partial solution to the above, we installed Dragonfly, which is a peer-to-peer layer between our Kubernetes cluster and the WMF Docker registry. We will also work on some other improvements.
    • We are continuing to work on including HuggingFace's prebuilt model server in Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides (T357986). This isn't done yet but it would be really nice to have.
    • We are fixing a reported bug about inconsistent data types for article quality scores on ptwiki. The error was caused by the mixed schema of the responses returned by ORES. (T358953)
    • We made our server hardware request for the next fiscal year. The short version is: GPUs.
    • The GPU order is underway. We are in the process of ordering a series of servers to use for training and inference. Each server will have two MI210 AMD GPUs. Most will be reserved for model inference (specifically, larger models like LLMs), but we will use two servers (4 GPUs) to create a model training environment. This model training environment will start very small and scrappy but will hopefully grow into a place for automated retraining of models and the standardization of model training approaches. As a next step, a single server is on its way to our data center; once it has been tested, we will place the full order.
    • Work on caching for Lift Wing continues. We are in the process of making a large order of GPUs. However, to optimize our resource use, one of the best strategies available to us is to conduct model inference using our existing CPUs. This is not always possible, for example when the set of possible model inputs is not finite. However, in cases where the possible inputs are finite, we can cache the predictions for those inputs and then serve them to users rapidly with minimal compute. This is similar to the system that was originally used on ORES.
    • The pentesting of Lift Wing continues. The testing is being done by a third party contractor and is examining our vulnerability to malicious code.
    • Wikimedia's branding team has come out with some suggestions for the naming of machine learning tools and models. The hope is that our naming is more systematic and less ad-hoc.
    • Chris helped organize and attend an event in Bellagio, Italy to craft a research agenda for researchers interested in Wikipedia. That research agenda is available here.
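
The caching idea described in the updates above (precompute predictions when the set of possible inputs is finite, then serve them without re-running the model) can be sketched roughly as follows. This is an illustrative sketch only, not Lift Wing's implementation: the `PredictionCache` class, the `fake_revert_risk` function, and the use of a Python dict in place of Cassandra are all assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Iterable


class PredictionCache:
    """Toy prediction cache. A dict stands in for a real store such as Cassandra."""

    def __init__(self, predict: Callable[[Hashable], float]) -> None:
        self._predict = predict
        self._store: Dict[Hashable, float] = {}
        self.model_calls = 0  # counts how often the model actually runs

    def _compute(self, key: Hashable) -> float:
        self.model_calls += 1
        return self._predict(key)

    def precache(self, keys: Iterable[Hashable]) -> None:
        """Run the model ahead of time for a known, finite set of inputs."""
        for key in keys:
            if key not in self._store:
                self._store[key] = self._compute(key)

    def get(self, key: Hashable) -> float:
        """Serve from the cache; fall back to the model on a miss."""
        if key not in self._store:
            self._store[key] = self._compute(key)
        return self._store[key]


def fake_revert_risk(rev_id: int) -> float:
    """Hypothetical stand-in for an expensive model call."""
    return (rev_id % 100) / 100.0


cache = PredictionCache(fake_revert_risk)
cache.precache(range(1000, 1010))  # ten known revision IDs, scored in advance
score = cache.get(1005)            # cache hit: the model is not run again
```

In this sketch the model runs once per input during `precache`, and later `get` calls for those inputs cost only a lookup. Inputs outside the precomputed set still work, but pay the full inference cost on first request.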