Topic on Talk:Machine Learning

Machine Learning Weekly Update Nov 17, 2022

CAlbon (WMF) (talkcontribs)

I have let writing this weekly update slide, and my apologies for that. Here is this week's update, and I will make an effort to keep posting these since folks seem to find them useful.

  • NLLB-200 model
    • Context: The Content Translation Tool currently uses Meta’s NLLB-200 model for translation between smaller languages. The model is already in production, but it is hosted on Meta’s AWS account, and we have been informed that hosting will end Jan 1st. The goal is to migrate the NLLB-200 model onto WMF’s AWS account before Jan 1st to prevent any loss of service in the Content Translation Tool.
    • We have the model working on WMF’s AWS SageMaker: we can hit it and get a prediction (a rough sketch of such a call follows this section). It isn’t an MVP yet, but it proves we can do it. Now it is about configuring things correctly and connecting the Content Translation Tool to it. We are on track to make the Jan 1st deadline.
    • I have purposely separated the work on the NLLB-200 model into two parts: 1) resolving the crisis of having the model on Meta’s AWS account when that account is going to end in 45 days, and 2) the ongoing maintenance, development, and support of both the AWS instance and NLLB specifically. The benefit is that we were able to move fast and are currently on track to resolve the crisis before the deadline. The cost is that we are deferring important conversations about supporting this model and supporting AWS; those will eventually have to be worked out.
    • There is also the issue of cost: since we don’t use AWS in the regular course of our work, we don’t have a large budget for it. I’m monitoring the cost and we will see how things go.
    • It is worth noting that eventually moving the NLLB-200 model off AWS and onto WMF’s own infrastructure will spark a larger conversation around the WMF’s policy towards open source. WMF has long used AMD GPUs because their drivers and software are open source. However, many large models, including NLLB-200, at best assume and at worst require NVIDIA GPUs, whose software stack is only partially open source.
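
For the curious, “hit it and get a prediction” looks roughly like the sketch below. This is a minimal sketch rather than our production client: the endpoint name, region, and payload shape are assumptions, though the language codes follow NLLB-200’s FLORES-200 convention.

```python
# Minimal sketch (not our production client) of invoking an NLLB-200 model
# hosted on an AWS SageMaker endpoint. Endpoint name, region, and payload
# shape are assumptions for illustration.
import json

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

payload = {
    "text": "Machine learning is fun.",
    "source_lang": "eng_Latn",  # NLLB-200 uses FLORES-200 language codes
    "target_lang": "ibo_Latn",  # English -> Igbo
}

response = runtime.invoke_endpoint(
    EndpointName="nllb-200",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```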
  • API Gateway
    • Context: There has been a plan for over two years for an API Gateway (api.wikimedia.org) through which the public and other users can access all the APIs provided by WMF. We are working on connecting Lift Wing to that API Gateway as a prerequisite for an MVP launch.
    • This is moving forward, but slowly. Notably, Tobias had to divert 50% of the time he was spending on the API Gateway with Hugh to work on the NLLB-200 model. (A sketch of what a gateway call to Lift Wing might look like follows this section.)
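
To make the goal concrete, here is a minimal sketch of what a public call to a Lift Wing model through the API Gateway might look like once the two are connected. The URL path, model name, payload, and auth scheme are assumptions; the actual routes will be settled as part of the MVP work.

```python
# Hypothetical sketch of calling a Lift Wing model via the API Gateway.
# The route, model name, and token handling are assumptions.
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-damaging:predict"

resp = requests.post(
    URL,
    json={"rev_id": 12345},                              # model input
    headers={"Authorization": "Bearer <access-token>"},  # gateway-issued token
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. a probability that the revision is damaging
```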
  • Add-A-Link
    • Context: As part of the Structured Tasks project in the Product Department, Add-A-Link uses machine learning to recommend easy edits to new editors to make the onboarding process easier and more mobile-friendly.
    • The 6th round of model training has started, putting us at roughly 120 of ~300 models trained. The models are live and in production, but not on Lift Wing, because the project started before Lift Wing existed. Migrating the models to Lift Wing and decommissioning the current model serving system is in our plans for the future.
  • Model Cards
    • Context: As the first step in our efforts to be a best-practice public example of applied ethical ML, we are creating a wiki model card for every model hosted on Lift Wing. We have been working on a proof of concept for a few months and are now starting to roll out model cards into production.
    • We had a kickoff meeting for working on production model cards last week. The team (Chris, Hal, Isaac, and Kevin) has been discussing some of the practicalities of the model card design (e.g., do we even need programmatic content before models are trained on Train Wing?).
    • Following an agile approach, the current step is for Kevin to draft one model card; we’ll discuss and iterate on the card next week.
  • Lift Wing
    • Context: Lift Wing is the Kubernetes cluster for hosting and serving production machine learning models at WMF. It is close to an MVP launch.
    • Work continues on experimenting with Benthos for streaming Lift Wing model predictions to EventGate; specifically, we are working with the Observability team on monitoring the application. (A rough sketch of the event flow follows this section.)
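
Benthos itself is configured declaratively rather than in code, but as a mental model, the pipeline we are experimenting with does roughly the following: take a model prediction, wrap it in an event envelope, and POST it to EventGate. The URL, stream name, and schema below are assumptions for illustration.

```python
# Rough Python stand-in for the Benthos pipeline: wrap a Lift Wing
# prediction in an event envelope and POST it to EventGate. The URL,
# stream name, and schema are assumptions.
import datetime
import uuid

import requests

EVENTGATE_URL = "https://eventgate.example.wmnet/v1/events"  # placeholder

def prediction_to_event(prediction: dict) -> dict:
    """Wrap a raw model prediction in an EventGate-style envelope."""
    return {
        "$schema": "/mediawiki/revision/score/1.0.0",  # assumed schema
        "meta": {
            "stream": "liftwing.predictions",  # assumed stream name
            "id": str(uuid.uuid4()),
            "dt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
        "prediction": prediction,
    }

def publish(prediction: dict) -> None:
    event = prediction_to_event(prediction)
    resp = requests.post(EVENTGATE_URL, json=[event], timeout=10)
    resp.raise_for_status()

publish({"model": "enwiki-damaging", "rev_id": 12345, "score": 0.03})
```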
  • DSE Cluster
    • Context: The DSE Cluster is an experiment with a cross-team shared cluster between Machine Learning and Data Engineering. The goal is to benefit from economies of scale and cross-team experience by building a single cluster that hosts both machine learning and data engineering workloads.
    • Two weeks ago we were at a decision point: should the DSE cluster fork from the greater SRE k8s processes and systems so it can upgrade from Kubernetes 1.16 to 1.23? We’ve made that decision: the DSE Cluster is not going to fork. The cost of a fork is too high for any real benefit. Instead, the DSE Cluster will work with WMF’s other teams that run clusters to upgrade to 1.23 together. Luca and Janis are leading this effort.
    • Importantly and more broadly, a k8s special interest group (SIG) has been created with the goal of coordinating and organizing cross-team k8s efforts across the Foundation. This group met yesterday and made a number of decisions on the structure of the group and its scope.