Machine Learning

From mediawiki.org

Welcome to the Wikimedia Foundation's Machine Learning Team homepage.

Our team oversees the development and management of machine learning models for end users, as well as the infrastructure required for designing, training, and deploying these models.

Current projects

For archived projects, see this list.

Contact us

Have a question? Want to talk to the team or our community of volunteers about machine learning? Here are the best ways to connect with us.

Team Chat

Discuss machine learning and watch the team work joining our public IRC chatroom #wikimedia-ml connect on irc.libera.chat.

Active Work Board

Have a particular task you want to discuss or work on, join our public Phabricator board. Visit our work board

What's new?

    • We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
    • We are continuing to work on hosting HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
    • We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
    • Our Cassandra databases for an eventual caching system is in production. Still more work to do but its a good start.
    • General updates and bug fixes.
    • Sorry for the update being one day late, Chris (I) attended the Strategy meeting in NYC and is writing this update from the plane back.
    • An issue we are facing is that WMF's docker registry is set up for smaller docker images (~2GBs). However, the docker images of the team can get pretty big because of ROCm/Pytorch (~6-8GB). We are working out how to resolve that. There a number of strategies can do, from optimizing the image layers better to requesting the max docker image size limit to be increased.
    • As a partial solution to the above, we installed Dragonfly, which is a peer-2-peer layer between our Kubernetes cluster and the WMF docker registry. We will also work on some other improvements.
    • We are continue working on including HuggingFace's prebuilt model server into Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides. (T357986). This isn't done yet but it would be really nice to have.
    • Fixing a bug reported about inconsistent data type for article quality scores on ptwiki. The error as because of the mixed schema of the responses returned by ORES.(T358953)
    • We made our server hardware request for the next fiscal year. The short version is: GPUs.
    • GPU order is underway. We are in the process of ordering a series of servers to use for training and inference. Each server will have two MI210 AMD GPUs. Most will be reserved for model inference (specifically, larger models like LLMs), but we will use two servers (4 GPUs) to create a model training environment. This model training environment will start very small and scrappy but will hopefully grow into a place for automated retraining of models and the standardization of model training approaches. The next steps are a single server will on its way to our data center, once this is tested we will make the full order.
    • Work on caching for Lift Wing continues. We have in the process of making a large order of GPUs. However, to optimize our resource use, one of the best strategies we can do is conduct model inference using our existing CPUs. This is not always possible, for example cases when the set of possible model inputs is not finite. However, in cases where the possible inputs are finite we can cache the predictions for those inputs and then serve them to users rapidly with minimal compute used. This is a similar system to that which was originally used on ORES.
    • The pentesting of Lift Wing continues. The testing is being done by a third party contractor and is examining our vulnerability to malicious code.
    • Wikimedia's branding team has come out with some suggestions for the naming of machine learning tools and models. The hope is that our naming is more systematic and less ad-hoc.
    • Chris helped organize and attend an event in Bellagio, Italy to craft a research agenda for researchers interested in Wikipedia. That research agenda is avaliable here.