Talk:Machine Learning

About this board

A liveblog and forum about machine learning at Wikimedia. Create new topics to ask questions or post updates.

I am reworking the team page

3
CAlbon (WMF) (talkcontribs)

I am reworking the layout of the team page. The goal is to made a new section to post my weekly update on the work of team. For this year this update has been an internal update as part of OKRs and Annual Planning, but I realized a few months ago that if I added some additional context to the updates, it would be valuable for the community to see.

HaeB (talkcontribs)

Thanks for reviving these regular public updates, they are valuable indeed!

Regarding context: Yes, including explanations, links etc. is great. Just don't let the perfect be the enemy of the good - even a barebones update that requires some googling (or Phabricator searching) to understand is better than none.

CAlbon (WMF) (talkcontribs)

Thanks, this is good to hear. I'll add some phab links where I can and will try to provide more context.

Reply to "I am reworking the team page"
Seawolf35 (talkcontribs)

Recently, I have found that on the Wikipedia recent changes list the edit highlighting by ORES has disappeared. Is this because of the new open source infrastructure? Any solutions other than to just wait.

CAlbon (WMF) (talkcontribs)
Seawolf35 (talkcontribs)

Ok, thanks!

CAlbon (WMF) (talkcontribs)

Hey Seaworlf35, the problem has been fixed and edit highlighting should be visible again.

Seawolf35 (talkcontribs)

Ok, thank you

Reply to "ORES Revision Scoring"

Exploring Wikidata Anti-Vandalism

1
BrokenSegue (talkcontribs)

I'm looking into the feasibility of designing an anti-vandal bot for Wikidata (potentially using OpenAI tech if I can secure funding). I understand that you have built tools for handling revscoring in the general case but Wikidata items are a special snowflake in some senses. Are there tools being built that would work for Wikidata. [[Machine learning models]] lists "wikidatawiki-goodfaith" which is I assume what I want. But is that publicly usable? Thanks for any direction.

Reply to "Exploring Wikidata Anti-Vandalism"

How is ORES trained?

4
Novem Linguae (talkcontribs)

Howdy folks. 1) Was just wondering how ORES is trained? For example, the "Vandalism" tags in PageTriage's Special:NewPagesFeed. Is there software somewhere where volunteers are presented with various pages and click a "vandalism yes/no?" button? If so where is the software? I'd like to check it out. 2) What are the PageTriage ORES configuration settings such as false positive target rate? I assume that this is a setting that can be adjusted up/down, which I assume is how anti-vandalism bot ClueBot NG achieves such a good false positive rate. I assume it's a tradeoff between false positives, and letting stuff slip through the cracks. 3) Any other hints about how ORES works in relation to PageTriage? I'll probably write some documentation about it. Thanks.

Ponor (talkcontribs)

I believe this is where edits are labeled, per wiki and on some quite old selection of edits. IDK if some other training options are in place. Would love to learn too!

Tzusheng (talkcontribs)

Hey @Novem Linguae and @Ponor, me and some researchers are currently working on a project where we build a system that facilitates curating up-to-date data for training and evaluating ML models used in Wikipedia, including but not limited to ORES. We plan to recruit a small group of people for pilot testing around June. Please let me know if you're interested in participating or learning more about the project. Thanks!

CAlbon (WMF) (talkcontribs)

Hi @Novem Linguae, there is some general information here. The models are sometimes trained using human curated training data (like @Ponor mentioned). Other times, data such as whether an edit was reverted is used. The models just output probability, the thresholds are hardcoded in the mediawiki extension itself.


Additionally, we are planning on deprecating the current ORES/Revscoring models in favor of more modern models such as RevertRisk and Outlink Topic Model which cover multiple languages and take advantage of tool such as BERT. The ORES models will still be available for legacy reasons but we won't be updating them.

Reply to "How is ORES trained?"

Recommendations for (introductory) machine learning course

3
Michael Große (WMDE) (talkcontribs)

Hey 👋

we at the WMDE Wikidata team will soonish (this year?) introduce a new feature that will affect how we evaluate the quality of Items. That means, we will probably need to retrain / update the articlequality model for Wikidata and maybe others.

For that reason in particular, I want to get somewhat deeper machine learning knowledge. I've done Andrew Ng's Introduction to Machine Learning course a few years ago, before we last retrained and extended that model for Wikidata.

So I was wondering if you had any courses you would recommend? Especially for someone like me, who wants to upskill in order to work with / contribute to the articlequality model in particular and WMF/Wikimedia ML infrastructure in general.


Thanks!

CAlbon (WMF) (talkcontribs)

Hey Michael! I don't know about courses, but Sebastian Raschka is probably the best ML educator out there right now. He has a course and a book I believe.

Michael Große (WMDE) (talkcontribs)

Thank you! I will look him up :)

Reply to "Recommendations for (introductory) machine learning course"

Damaging Filter Disappears on the English Wikipedia?

3
Tzusheng (talkcontribs)

I noticed that the damaging filter disappeared from the Recent Changes page on the English Wikipedia this morning (Eastern Time). Only the user intent prediction remains. Is there a rationale or announcement behind this change? Thanks!

CAlbon (WMF) (talkcontribs)

Hi Tzusheng, my apologies on the late reply, I was out of the office. I haven't heard anything about a change and on our end the ML team hasn't change anything. That said also don't see the damaging filter in the Recent Changes page. Let me investigate and get back to you.

CAlbon (WMF) (talkcontribs)
Reply to "Damaging Filter Disappears on the English Wikipedia?"

Machine Learning Modernization Project

1
CAlbon (WMF) (talkcontribs)

Hi All! I'm back from vacation!


After far too long we just published a page on our machine learning modernization work, which includes modern serving, model cards, and other plans. I hope this sparks some interesting discussions with you all about what we are working on and how we can work together.

Reply to "Machine Learning Modernization Project"
CAlbon (WMF) (talkcontribs)

My apologies but no update this week because I have been out sick.

Michael Große (WMDE) (talkcontribs)

Hope you get fully well soon!

Reply to "No Update This Week"

Machine Learning Weekly Update Dec 7, 2022

1
CAlbon (WMF) (talkcontribs)
  • We have hit another milestone for Lift Wing. Our ORES model infrastructure hosts ~110 machine learning models and for the first time all of those models are also publicly available on Lift Wing through the API Gateway. We still need to work out the details of reasonable rate limiting and optimizing performance (maybe use the API Gateway as a simple cache?), but you can access all the models right now without any internal WMF permissions. We will have some tutorials up soon for the community and an ask for testers to help us out.
  • To be clear: if folks are currently using ORES, nothing will change with ORES for at least an entire year. We are working on a complete year-long migration plan for ORES users to Lift Wing that includes outreach, tutorials, and technical support, and that plan will only begin once Lift Wing is officially launched in a few months.
  • After talking to AI ethics experts, wiki community members, WMF staff members, and frankly anyone else who would talk to us, we are working on generating model cards for all models on Lift Wing. The goal is for the model cards to be the main point of contact for questions, discussions, and ultimate governance around machine learning models hosted by WMF. The model cards will be individual articles on Mediawiki.org to make it easy for the community to use the tools they are familiar with. We should have some things to show everyone very soon.
Reply to "Machine Learning Weekly Update Dec 7, 2022"

Machine Learning Weekly Update Nov 30, 2022

1
CAlbon (WMF) (talkcontribs)
  • NLLB-200 deployment
    • Major process continues on getting the NLLB-200 deployment live and running for the Content Translation Tool. I am confident we will make the January 1st deadline.
  • API Gateway
    • Tobias and Hugh made some major processes over the last two days and they are working on a patch that will allow the API Gateway to be used with Lift Wing. Specifically, when the patch is tested and rolled out Lift Wing will effectively be silently soft-launched on the API Gateway, making over 100 machine learning models available to everyone. The timeline for the patch being pushed to production is a few days.
    • After the patch is released I will start publishing some tutorials on getting started using Lift Wing and will ask folks both inside WMF and the community to start experimenting to help find bad user experiences and technical bugs.
  • Add-A-Link
    • Steady progress on the Add-A-Link models. Kevin continues to train and deploy new models while evaluating their performance.
  • Model Cards
    • We are having weekly standup meetings on model cards as we start to make them. We should start with the first of the model cards published in the next two weeks.
  • Lift Wing
    • The current focus of the Lift Wing work is on model performance and the k8s 1.23 upgrade.
    • Model Performance: Some of the larger models we are currently working with the Research team on are ~4GB loaded, which is causing prediction times to be over ten seconds. This is obviously too slow for any real-world case and we are exploring which out of a wide variety of strategies for improvement is best, from breaking the large model into smaller ones, optimizing the structure of the models, increasing the number of pods, etc.
    • K8s 1.23 update: Luca and Yannis are working through a list of tasks as part of the 1.23 update. We are making solid progress but it is also a major task.
  • DSE Cluster
    • Work has now started on tackling what we have called the “Kerbarrier”, which is the fact that Kubernetes and Hadoop use very different security models. Kubernetes uses a certificate-based approach while Hadoop uses a symmetric key cryptographic approach (called Kerberos). Building a way of bridging the gap between these two approaches so nodes on clusters can access HDFS has been a major challenge we have to know we would need to solve eventually and one of the reasons for starting the DSE Cluster experiment.
Reply to "Machine Learning Weekly Update Nov 30, 2022"