These are thoughts about potential scalability problems with JADE, an extension that allows editors to submit editorial judgments about wiki pages, revisions, and diffs.
For basic context and background
For implementation details
Can we save space by combining all judgments about a page into a single JADE page?
This is tempting at first sight, but:
- The JADE page will grow with the number of judged revisions of the wiki page. Each edit saves the full page text to content storage, and a page holding multiple judgments will be edited more often. Content size, traffic, and storage will therefore be substantially higher for these records, and much of it will be wasted in the common case of working with judgments on a single revision.
- We save on the number of pages created but there is no reduction in the number of revisions made.
- This trick only works to compress revisions, and doesn't have an easy analogue for other entity types being judged.
- Storing multiple, unrelated items in one page is an antipattern. Two revisions on a page are only indirectly related.
- Conflicts are more likely as we make the pages longer.
- UI will need to be developed to keep things straightforward for end-users and tool developers.
How will we control JADE's growth?
There are several approaches we will take.
1. Staged deployment to various integration points (Huggle, New Pages Patrol, etc.)
First of all, we're conducting user testing with the help of Daisy Chen and Prateek Saxena in order to validate and explore potential use cases. Then, we're planning to integrate with existing workflows by storing judgments that are already being made into our JADE storage. The workflows we're looking at are currently New Pages Patrol, Recent Changes Patrol, FlaggedRevs, and PageTriage.
For each of these workflows, we'll be able to set up integration that can be controlled by configuration, per-wiki. We'll turn on a single workflow for a single wiki, analyze the impact, and iterate. During this transparent integration phase, we'll be able to revert any configuration as needed. In a later phase, we may introduce a dedicated JADE workflow on-wiki, or we may augment existing workflows to take advantage of JADE capabilities, for example collecting a freeform text comment when patrolling pages.
In the event that one of these integrations has to be disabled, the procedure will be to notify the community and deploy the configuration change to disable that integration. The effect will be that new data will stop flowing in from that workflow, which does not prevent the workflow from continuing with its legacy stores, but does have an impact on data consumers. During the first phase, integrated workflows will be "soft" coupled, so they gracefully degrade to not duplicating data into JADE, transparently to the end user. In the future, an outage will be more significant because software will rely on incoming data and might be unusable or useless without JADE.
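The per-wiki switches and "soft" coupling described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the `INTEGRATIONS` table, the `testwiki` entry, the workflow keys, and the `record_patrol` helper are all hypothetical names. The point it shows is that the legacy write always happens, while the JADE write is optional and its failure is swallowed.

```python
import logging

logger = logging.getLogger("jade_integration")

# Hypothetical per-wiki, per-workflow switches (not the real configuration format).
INTEGRATIONS = {
    "testwiki": {"new_pages_patrol": True, "flagged_revs": False},
}


def is_enabled(wiki, workflow):
    """Integrations are off unless explicitly switched on for this wiki."""
    return INTEGRATIONS.get(wiki, {}).get(workflow, False)


def record_patrol(wiki, workflow, rev_id, legacy_store, jade_store):
    """Write to the legacy store, then mirror into JADE if enabled.

    Returns True only if the judgment was also stored in JADE.
    """
    # The existing workflow never depends on JADE being available.
    legacy_store(rev_id)

    if not is_enabled(wiki, workflow):
        return False
    try:
        jade_store(rev_id)
        return True
    except Exception:
        # Soft coupling: degrade to "not duplicated" instead of failing the action.
        logger.warning("JADE write failed for revision %s; continuing", rev_id)
        return False
```

Disabling an integration then amounts to flipping one entry in the table to `False`; the workflow itself keeps running against its legacy store.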
2. Limit editing by user right
MediaWiki provides the functionality to limit edits to various namespaces by user right. On large wikis, "rollback" functionality is limited to users in the "rollbacker" or "sysop" user groups. Similarly, New Pages Feed only works for users in the "patrollers" group. We will limit editing of the JADE namespace to users in these groups to start off. By working with these users, we'll learn their work patterns and decide, based on empirical data, when to open up contribution to less privileged users.
3. Bot supervision and blocking
We will rely on the ordinary community mechanisms for managing bot privileges and behavior. Extension deployment will come with a phase of community discussion, in which we explain why bots are not welcome to spam this namespace with automated predictions. If the community disagrees on this point, we can pause deployment until there is a resolution.
In the event that a bot goes haywire and begins to add JADE content such as automated predictions, it can be blocked temporarily.
How much storage do we need?
Our limiting factor is the human labor time needed to create judgments. We can use the sum of all existing review workflow volumes as an upper bound, which is around 1% of total edits. This gives us a maximum of c. 500,000 annual JADE actions on a large wiki such as English Wikipedia, or 4,000,000 annual JADE actions over all wikis combined. We have fairly granular control over the rate of growth as explained above, so this volume won't be reached for several years. We're able to keep growth within whatever limits are dictated by the available resources.
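As a rough check of the arithmetic above, the upper bound is just the review fraction times the annual edit volume. In this sketch the edit volumes are assumptions back-derived from the figures in the text (500,000 / 1% and 4,000,000 / 1%), not measured values, and `max_annual_judgments` is a hypothetical helper name.

```python
# Share of edits touched by existing review workflows (figure from the text).
REVIEW_FRACTION = 0.01

# Assumed annual edit volumes, back-derived from the estimates in the text.
ANNUAL_EDITS = {
    "enwiki": 50_000_000,
    "all wikis": 400_000_000,
}


def max_annual_judgments(annual_edits, review_fraction=REVIEW_FRACTION):
    """Upper bound on JADE actions per year if every existing review became a judgment."""
    return round(annual_edits * review_fraction)


for wiki, edits in ANNUAL_EDITS.items():
    print(f"{wiki}: at most {max_annual_judgments(edits):,} JADE actions per year")
```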
Is this scaling with revisions?
Mathematically, it's possible to create a judgment for every revision, and even to create loops in which judgments are themselves judged. Luckily, the human labor limit mentioned above will come into play long before we reach scaling proportional to revisions. The risk is the same as the potential for editors to make multiple talk page edits for every content page edit: the software affords the capability when needed, but editors in aggregate don't have the energy to ever do such a thing.
The potential scenario of judging a judgment is also fine, an intended use case even. Since only 1% of edits are reviewed, we can assume that the same ratio will hold for judgments themselves, therefore out of every 10,000 judgments there will be 100 judgments of judgments, and 1 judgment of a judgment of a judgment. This is clearly a vanishing term as we go towards higher-order judgments.
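The vanishing term can be made concrete with a short calculation. This sketch assumes, as the paragraph does, that the same 1% review ratio applies at every level; `judgments_by_order` is a hypothetical helper. Under that assumption the counts form a geometric series, so the total across all orders stays below `first_order / (1 - ratio)`, about 1% more than the first-order judgments alone.

```python
def judgments_by_order(first_order, review_fraction=0.01, max_order=5):
    """Expected number of judgments at each order, starting with judgments of edits."""
    counts = []
    expected = float(first_order)
    for _ in range(max_order):
        counts.append(expected)
        # Only a fixed fraction of each level is itself judged.
        expected *= review_fraction
    return counts


counts = judgments_by_order(10_000)
total = sum(counts)

# Limit of the geometric series if the recursion continued forever.
bound = 10_000 / (1 - 0.01)

for order, count in enumerate(counts, start=1):
    print(f"order {order}: {count:g}")
print(f"total over {len(counts)} orders: {total:.4f} (bound {bound:.4f})")
```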
Aren't we planning to record bot judgments?
We're focusing on human judgment. Bot or other types of heuristic judgments are much lower quality and not particularly interesting for our purposes.
What about rogue bots?
One of the main concerns that has been raised is that we're creating a potential crime of opportunity: a vast new namespace which invites massive, meaningless contributions where every new revision gets judged by bots.
The plan is that rogue editing that goes against suggestions will be flagged for blocking. Normal rate-limiting should limit the damage until stronger measures can be taken.
Bots and their activities are highly regulated, especially on big wikis. When they run out of control, the event is short-lived. See https://commons.wikimedia.org/wiki/File:ORES_-_Facilitating_re-mediation_of_Wikipedia%27s_socio-technical_problems.pdf for a comprehensive review.
How are schema migrations handled?
The JSON content schema for the Judgment namespace, the public API and the PHP API will all be stable interfaces. We'll follow https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy, so on the occasions that we can't provide a transparent migration or permanent access using an old API, we'll notify stakeholders of the breaking change well in advance and will help migrate legacy usages and content. Unfortunately, in that scenario we might need to take downtime for some of the more extreme migrations.
Another possibility is that breaking changes will be handled by introducing a new ContentModel, but it's not a very satisfactory thought.
What if we're wrong and need to migrate away from wiki pages entirely?
This is the scariest of the disaster scenarios because, for example, wiki page storage is not meant to be deleted. If we migrate away, we never get to reclaim the storage, and we leave a big mess behind us.
What is the long-term future of JADE storage?
We may decide to migrate to "structured storage" once it's mature, assuming it can support our requirements above. Our ideal is to have a native, structured store where analytical queries are able to access the judgment fields, but with all the affordances of wiki pages. Word on the street is that Marko Obrovac et al. are working on exactly this.
https://phabricator.wikimedia.org/T196547 - Prior discussion of scalability concerns
https://www.mediawiki.org/wiki/Extension:JADE - Current implementation