Product Safety and Integrity/Detecting abusive content
|
Detecting abusive content
Use AI models trained on Wikipedia policies to detect bad-faith editing
|
As part of our anti-abuse strategy, we are exploring the use of artificial intelligence models tuned to follow community policies, to detect bad-faith editing. This can include doxxing, harassment, spam, and other kinds of vandalism, and would complement community processes for detecting and handling such edits.
There has been significant investment in AI models honed for policy enforcement around abusive content, including open models like gpt-oss-safeguard (which we are evaluating). These models allow us to develop a generic model that is guided by a plain language policy, that outlines what the community means by harassment, or Wikimedia-specific policies like "silly vandalism", and to have the AI model apply these community definitions when examining an edit.
Beyond centrally-managed detection, there is also the potential to include these signals in community-managed systems like AbuseFilter. For example, if a long-term abuser is known to create certain types of edits, these could be encoded into a policy whose results could then be made available for use in a filter, instead of needing to continually readjust the specific variables and expressions in a filter. (See the gpt-oss-safeguard example policy prompts for additional examples.) Various use cases could use the same underlying model, which makes this scalable for our infrastructure.
First project: detecting content that should be suppressed
[edit]In May and June 2026, we are focusing on identifying content that should be suppressed on English Wikipedia. For example: edits that intentionally or unintentionally add personally identifying information, like addresses, phone numbers, or email addresses.
We are building a dataset and evaluation framework that will help the Wikimedia Foundation and community members assess different models and content policy texts.
Our goal is to provide a recommendation of a model, or combination of models, that would allow for various use cases:
- warning users via an Edit Check that they may be adding content that should be suppressed, to reduce the frequency of unintentional additions
- notifying users with extended rights about content that very likely should be suppressed, to reduce the time-to-live of that content
- automatically suppressing some kinds of content, if we can arrive at a high enough precision to justify this approach.
We chose suppressed content as our starting point because it's well structured in on-wiki logs, and is high-risk content that would benefit from improved efficiency in detection and mitigation.
Timeline
[edit]- -
- Establish a dataset of high-certainty content that should and should not be suppressed
- -
- Tune and evaluate different models and combinations of models, and policy texts
- Recommend the best model or combination of models with thresholds for various use cases (flagging content, use in an Edit Check, and automatic suppression)
Contact
[edit]FAQ
[edit]How is this different from revert risk or ORES models?
Anti-abuse AI models allow us to craft policy text for the type of policy violation we are interested in detecting. Revert risk and quality and intent models are valuable signals, but they are not adaptable to assessing whether a given revision adds, for example, content that should be suppressed, or to provide a reason as to why a revision is likely to be reverted. Anti-abuse AI models are likely to provide communities with additional precision for use in automated and manual workflows.
Will you be automatically deleting some bad-faith activity, or is this only about flagging it?
In the first phase, during the evaluation, we want to optimize for two distinct use cases.
The first use case is to identify a level of precision that is highly accurate, but has some false positives, and would provide value for users with extended rights to manually review and take action on.
One other case is to identify a configuration that allows for high enough precision that communities would feel confident in automatic reverts or preventions of saves. This level of precision would likely miss many borderline cases.
Are you planning on expanding this on more wikis?
We are evaluating whether AI models like this can achieve sufficient precision to be viable at real-world abuse detection on the wikis. We intend to focus this on language-specific models for now, to optimize for precision. If this is viable, we would expect to evaluate the same questions for additional languages.