Moderator Tools/Automoderator

The Moderator Tools team is exploring a project to build an 'automoderator' tool for Wikimedia communities. It would provide automated prevention or reversion of bad edits based on scoring from a machine learning model.

Our hypothesis is: If we enable communities to automatically prevent or revert obvious vandalism, moderators will have more time to spend on other activities.

In simpler terms, we're looking to build software which performs a similar function to anti-vandalism bots such as ClueBot NG, SeroBOT, and Dexbot, but to make it available to all language communities.

We will be researching and exploring this idea during the rest of 2023, and expect to begin engineering work by the start of the 2024 calendar year.

Goals

 * Reduce moderation backlogs by preventing bad edits from entering patroller queues.
 * Give moderators confidence that automoderation is reliable and is not producing significant false positives.
 * Ensure that editors caught in a false positive have clear avenues to flag the error / have their edit reinstated.


 * Are there other goals we should consider?

Model
This project will leverage the new revert risk models developed by the Wikimedia Foundation Research team. [model card links?]

These models can calculate a score for every revision denoting the likelihood that the edit should be reverted. We envision providing communities with a way to set a threshold for this score, above which edits would be automatically prevented or reverted.
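
As a rough illustration of how such a threshold might work, the sketch below requests a revert risk score for a single revision from the language-agnostic model hosted on Lift Wing and compares it against a community-chosen cut-off. The endpoint name, the response shape, and the 0.95 threshold are illustrative assumptions, not final design decisions.

```python
import requests

# Hypothetical community-chosen cut-off; not a recommendation.
REVERT_RISK_THRESHOLD = 0.95

# Assumed Lift Wing endpoint for the language-agnostic revert risk model.
LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)


def revert_risk_score(lang: str, rev_id: int) -> float:
    """Return the model's probability that the revision should be reverted."""
    response = requests.post(LIFTWING_URL, json={"lang": lang, "rev_id": rev_id})
    response.raise_for_status()
    # Response shape assumed from the public model documentation.
    return response.json()["output"]["probabilities"]["true"]


def should_act_on(lang: str, rev_id: int) -> bool:
    """Check a single revision against the configured threshold."""
    return revert_risk_score(lang, rev_id) >= REVERT_RISK_THRESHOLD
```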


 * Do you have any concerns about these models?
 * What percentage of false positive reverts would be the maximum you or your community would accept?

Potential solution
While the exact form of this project is still being explored, the following are some feature ideas we are considering, beyond the basics of preventing or reverting edits which meet a revert risk threshold.

Testing
If communities have options for how strict they want the automoderator to be, we need to provide a way to test those thresholds in advance. This could look like AbuseFilter’s testing functionality, whereby recent edits can be checked against the tool to understand which edits would have been reverted at a particular threshold.
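
A minimal sketch of what such back-testing could look like: it pulls a sample of recent edits from the Action API and reports how many would have exceeded a given threshold. The wiki, sample size, threshold, and the Lift Wing endpoint and response shape are placeholder assumptions.

```python
import requests

LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)
ACTION_API_URL = "https://en.wikipedia.org/w/api.php"  # placeholder wiki


def revert_risk_score(lang: str, rev_id: int) -> float:
    """Probability that a revision should be reverted, per the model."""
    resp = requests.post(LIFTWING_URL, json={"lang": lang, "rev_id": rev_id})
    resp.raise_for_status()
    return resp.json()["output"]["probabilities"]["true"]


def recent_edit_ids(limit: int = 50) -> list[int]:
    """Fetch revision IDs of the most recent edits via the Action API."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit",
        "rcprop": "ids",
        "rclimit": limit,
        "format": "json",
    }
    resp = requests.get(ACTION_API_URL, params=params)
    resp.raise_for_status()
    return [rc["revid"] for rc in resp.json()["query"]["recentchanges"]]


def backtest(threshold: float, lang: str = "en") -> None:
    """Report how many recent edits would have been acted on at a threshold."""
    rev_ids = recent_edit_ids()
    caught = [r for r in rev_ids if revert_risk_score(lang, r) >= threshold]
    print(f"{len(caught)}/{len(rev_ids)} recent edits score above {threshold:.2f}")


backtest(0.95)
```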


 * How important is this kind of testing functionality for you? Are there any testing features you would find particularly useful?

Community configuration
A core aspect of this project will be to give moderators clear configuration options for setting up the automoderator and customising it to their community’s needs. Rather than simply reverting all edits which meet a threshold, we could, for example, provide filters to exclude editors in certain user groups or edits to certain pages.
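
To make this concrete, one hypothetical shape for a per-community configuration is sketched below, combining a threshold with exemption filters. Every field name and value here is invented for illustration and is not part of any agreed design.

```python
# Hypothetical configuration; field names and values are illustrative only.
AUTOMODERATOR_CONFIG = {
    "enabled": True,
    "revert_risk_threshold": 0.97,  # score above which an edit is acted on
    "action": "revert",             # or "prevent", if edits are blocked pre-save
    "exempt_user_groups": ["sysop", "bot", "autopatrolled"],
    "exempt_pages": ["Wikipedia:Sandbox"],
    "exempt_namespaces": [2],       # e.g. skip User: pages
}


def is_exempt(edit: dict, config: dict = AUTOMODERATOR_CONFIG) -> bool:
    """Return True if community filters say the edit should be left alone."""
    return (
        bool(set(edit["user_groups"]) & set(config["exempt_user_groups"]))
        or edit["page_title"] in config["exempt_pages"]
        or edit["namespace"] in config["exempt_namespaces"]
    )
```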


 * What configuration options do you think you would need before using this software?
 * Who should be able to configure the automoderator?

False positives

 * [New user process]
 * [Improving the model]

Other open questions

 * How could we provide information and a clear path for editors on the receiving end of a false positive to have their edit reinstated, in a way which isn’t abused by vandals?
 * If your community uses a volunteer-maintained anti-vandalism bot, what has your experience of that bot been? How would you feel if it stopped working?
 * What data for this tool should we track so that we can evaluate how successful it is?