Bitfields for rev_deleted

Requirements
We sometimes need to remove specific revisions from public view:
 * Copyright infringement inserted into histories
 * Libel/etc inserted into histories
 * People put their own personal information in by mistake (name, IP address, etc)
 * Vandals put other people's personal information (phone numbers, full names, addresses) in maliciously

Additionally:
 * Often we need only to suppress the content; continuing to show the comment and username is useful in providing context for those viewing the history later.
 * However, sometimes the material that needs to be removed is in the comment or user_text fields.
 * With a large pool of admins, we may also need to hide material from admins as well as from the general public.

Currently, the Oversight tool is widely used for this purpose, but selective deletion and restoration of pages may sometimes be utilised.

Bitfield values for rev_deleted
Current code contains some limited support for indicating that revisions should be hidden if rev_deleted is set. This field is already present in the database, and is an 8-bit TINYINT.

This field size is suitable for a small bitfield to provide slightly more options than the simple boolean on/off originally envisioned.

Proposed values:


 * 0 - normal, all-visible
 * 1 - content visible only to admins
 * 2 - summary visible only to admins
 * 4 - username visible only to admins
 * 8 - steward/oversight-only: regular admins can't view or undelete either
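
Because the proposed values are powers of two, they can be combined with bitwise OR and tested with bitwise AND. A minimal Python sketch of that idea follows (MediaWiki itself is PHP; the constant names and the `describe` helper are illustrative assumptions, not part of the proposal):

```python
# Hypothetical names for the proposed rev_deleted bits.
DELETED_TEXT = 1        # content visible only to admins
DELETED_COMMENT = 2     # summary visible only to admins
DELETED_USER = 4        # username visible only to admins
DELETED_RESTRICTED = 8  # steward/oversight-only

LABELS = {
    DELETED_TEXT: "content",
    DELETED_COMMENT: "summary",
    DELETED_USER: "username",
    DELETED_RESTRICTED: "oversight-only",
}

def describe(rev_deleted):
    """List the restrictions encoded in a rev_deleted value."""
    return [label for bit, label in LABELS.items() if rev_deleted & bit]

# Hide the summary and the username, but leave the content visible.
flags = DELETED_COMMENT | DELETED_USER
```

Every combination of the four bits is at most 15, so it fits comfortably in the existing 8-bit column.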


 * What about restricting anons and non-autoconfirmeds, too? Rob Church (talk) 01:21, 12 April 2006 (UTC)
 * I can't see the usefulness for this. Data should be hidden in case of copyright infringement/personal infos/etc. There's no reason to restrict these data just to anonymous or autoconfirmed users. --84.221.209.77 14:57, 3 December 2006 (UTC)
 * Well, Gmaxwell liked the idea on this very talk page. Titoxd (?!?) 04:23, 18 January 2007 (UTC)
 * I see a problem with the "steward" part: unless SUL happens before that, this will be limited to *local* stewards (which don't exist), or the script would somehow have to fetch the right from Meta. Darkoneko 14:56, 3 October 2007 (UTC)
 * Presumably "stewards" actually means the user class(es) that are/is defined as having access to revision deletion—for WMF purposes, this means Oversighters. Anthøny 00:02, 9 April 2008 (UTC)
 * Correct. Aaron 02:27, 10 April 2008 (UTC)


 * A canonical example for logged in only: Someone vandalizes an article on a famous person on EnWP to claim he is dead. The next revision reverts, but the trouble maker starts forwarding around the permlink.  The general public is too clueless to notice the big "THIS IS AN OLD REVISION" notice at the top.  So currently the site is forced to delete that revision to prevent the confusion. Meanwhile people can argue that Wikipedia is covering up the existence of vandalism because of this. ... and many users who would not be confused by the revision lose the ability to see it. Researchers lose the ability to study it, etc.  Rev deleting to logged in only would address the issue without the negative side effects.  --Gmaxwell 09:14, 16 June 2008 (UTC)
 * As long as it's only the text of the revision that's deleted, it's not so bad. Researchers and people with a clue can be given the "deletedhistory" right, which can already be separated from adminship. (Not that I foresee the Wikipedia community allowing such a thing, but that's not a technical problem, it's a social one.)  Anthony 18:40, 15 September 2008 (UTC)

Using the entire bit field limits options for future expansion, and seems to allow a lot of combinations that aren't useful. Does it really make sense to hide the username and keep the text? It may be better either to enumerate the options we want to support, or at least to split the 8 bits up and have values for sections, e.g.:
 * $bit & 3 = what's hidden
   * 0 = nothing
   * 1 = text
   * 2 = text + summary
   * 3 = entire revision
 * ($bit >> 2) & 3 = who can see it (i.e. current en wp groups)
   * 0 = autoconfirmed
   * 1 = rollback
   * 2 = admin
   * 3 = steward

That uses half the bits, supports more functionality, and has fewer useless combinations. -Steve Sanbeg 18:41, 6 August 2008 (UTC)
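
The split layout suggested above can be sketched as a pair of pack/unpack helpers. This is only an illustration in Python; the function names and the label lists are assumptions made for the example:

```python
# Bits 0-1: what is hidden. Bits 2-3: lowest group that can still see it.
WHAT_HIDDEN = ["nothing", "text", "text + summary", "entire revision"]
WHO_CAN_SEE = ["autoconfirmed", "rollback", "admin", "steward"]

def pack(what, who):
    """Pack two 2-bit codes into a single value."""
    if not (0 <= what <= 3 and 0 <= who <= 3):
        raise ValueError("each code must fit in two bits")
    return (who << 2) | what

def unpack(bits):
    """Inverse of pack(): return (what, who)."""
    return bits & 3, (bits >> 2) & 3

# "text + summary" hidden from everyone below admin.
value = pack(2, 2)
what, who = unpack(value)
```

Only the low four bits are used, which leaves the upper four bits of the column free for later use.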


 * "Does it really make sense to hide the username and keep the text?" - Would have been a better solution than full oversight in the case of SlimVirgin's oversighted contributions. Anthony 18:27, 15 September 2008 (UTC)
 * Is there no way to add another bitfield later? Nathan Larson 07:32, 3 November 2008 (UTC)

Export issues
The export format may require modification to deal with this properly; marked revisions should be included, with their unwanted parts excised and marked as such.



Database dump issues
See download.wikimedia.org

Oldimage and Logging tables can no longer be entirely public.

Code changes, secure by default?
Some Revision getter methods will now return empty placeholder data if there are deletion markings for that field:
 * getUser, getUserText, getComment, getText

Those fields can be retrieved in all cases using a new 'raw' getter:
 * getRawUser, getRawUserText, getRawComment, getRawText

I've set it up this way as a more secure default: calling functions that don't know to check for permissions will have the restricted data hidden from them. Currently the main violators of such an arrangement will be things that read data directly out of the recentchanges table, or from the revision table without using the Revision wrapper class.

isDeleted now takes a bitfield constant, so you can ask it which fields to check against.

A new Revision method, userCan, also takes a bitfield and checks if $wgUser is of sufficient privilege to access the given field(s).
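
The secure-by-default arrangement described in this section might look roughly like the following Python sketch. It is not the actual PHP class: the group names `sysop` and `oversight` are assumptions, and the viewer's groups are passed in explicitly rather than read from a global such as $wgUser.

```python
DELETED_TEXT, DELETED_COMMENT, DELETED_USER, DELETED_RESTRICTED = 1, 2, 4, 8

class Revision:
    def __init__(self, text, comment, user_text, deleted=0):
        self._text = text
        self._comment = comment
        self._user_text = user_text
        self.deleted = deleted

    def is_deleted(self, field):
        """True if any of the given field bits is set."""
        return (self.deleted & field) != 0

    def user_can(self, field, groups):
        """True if a viewer in `groups` may see the given field(s)."""
        if not self.is_deleted(field):
            return True
        if self.is_deleted(DELETED_RESTRICTED):
            return "oversight" in groups
        return "sysop" in groups or "oversight" in groups

    # Guarded getters: callers that forget to check permissions get nothing.
    def get_text(self):
        return "" if self.is_deleted(DELETED_TEXT) else self._text

    def get_comment(self):
        return "" if self.is_deleted(DELETED_COMMENT) else self._comment

    def get_user_text(self):
        return "" if self.is_deleted(DELETED_USER) else self._user_text

    # Raw getters: the caller is responsible for calling user_can() first.
    def get_raw_text(self):
        return self._text

    def get_raw_comment(self):
        return self._comment

    def get_raw_user_text(self):
        return self._user_text
```

A caller that knows about permissions would check `user_can()` and then use the raw getter; a caller that does not know simply sees empty values.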

Linker now has some convenience methods which take a revision parameter, do permission checks and set appropriate formatting:
 * revComment (wraps commentBlock)
 * revUserLink (wraps makeUserLink, another new method for making basic userpage-or-contribs links)

Code paths that must be verified as safe:
 * Page view
 * Page oldid edit
 * Transclusion
 * Edit rollback
 * Undo
 * Null edit stuff (like moves/protects)
 * Special:undelete
 * Recentchanges
 * RSS feeds
 * Object caching (diffs)

Changes to logging table
It would be useful to extend this to logs as well. This requires a log_deleted column. Often, vandals with inappropriate names are blocked, or they create pages with such names, which are then deleted. The offending names then reappear in logs and recent changes.

This obviously requires a schema change, but allows for the same methods of hiding to be used everywhere. However, speciallog will need permission checks sprinkled throughout; this is doable, but annoying.

Changes to recentchanges table
When a revision or log event is partially hidden, the recentchanges entry must also be dealt with. A new column, rc_deleted, was added. This allows for the same methods of hiding to be used everywhere. However, changeslist will need permission checks sprinkled throughout; this is doable, but annoying.

Changes to oldimage table
The addition of oi_deleted will be useful to further suppress deleted images. The oi_sha1 key can store the file key across deletion.

Changes to filearchive table
The addition of fa_deleted will be useful to further suppress deleted images. In the future, when the FileStore conversion is made, all images will use this table, just with different directory grouping ("old" as well as "deleted"). This will allow images to be hidden without having to delete them first, and will let the fa_deleted flag persist upon restoration.

Inappropriate usernames
Adding an option to special:blockip to hide the name from public areas (block list, userlist and block log) is also needed for usernames that contain libel or personal information. This can be accomplished by adding ipb_deleted to the block table and adding a JOIN to the listuser query.
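
The JOIN mentioned here could take roughly the following shape. The sketch uses Python's sqlite3 with an in-memory database purely for illustration; the table layout (`user`, `ipblocks`, `ipb_user`) and the `list_users` helper are assumptions, and only ipb_deleted comes from the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE ipblocks (ipb_user INTEGER, ipb_deleted INTEGER DEFAULT 0);
    INSERT INTO user VALUES (1, 'Alice'), (2, 'OffensiveName'), (3, 'Bob');
    INSERT INTO ipblocks VALUES (2, 1), (3, 0);
""")

def list_users(conn, show_hidden=False):
    """Return user names, leaving out those whose block has ipb_deleted set."""
    rows = conn.execute(
        """
        SELECT user_name
        FROM user
        LEFT JOIN ipblocks ON ipb_user = user_id
        WHERE ? OR ipb_deleted IS NULL OR ipb_deleted = 0
        ORDER BY user_name
        """,
        (1 if show_hidden else 0,),
    ).fetchall()
    return [name for (name,) in rows]
```

The LEFT JOIN keeps users who have never been blocked; a user with several block rows would appear more than once in this simplified query.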

Interaction with current deletion system
If the archive system is to be kept, the addition of ar_deleted is required in order to maintain the visibility bitfield for revisions, and more importantly to secure revisions "hidden from other sysops". Otherwise, deleting a page discards this information. This requires some permission checks in specialundelete.
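
A minimal Python sketch of carrying the bitfield through deletion and restoration is given below. The row dictionaries, the column names other than rev_deleted and ar_deleted, and the two helper functions are hypothetical; the value 8 is the steward/oversight-only bit from the proposed values.

```python
DELETED_RESTRICTED = 8  # steward/oversight-only bit

def archive_row(rev_row):
    """Page deletion: copy the visibility bits into the archive row."""
    return {"ar_rev_id": rev_row["rev_id"], "ar_deleted": rev_row["rev_deleted"]}

def restore_row(ar_row, viewer_is_oversight):
    """Undeletion: refuse restricted rows for ordinary sysops, keep the bits otherwise."""
    if ar_row["ar_deleted"] & DELETED_RESTRICTED and not viewer_is_oversight:
        raise PermissionError("revision is hidden from sysops")
    return {"rev_id": ar_row["ar_rev_id"], "rev_deleted": ar_row["ar_deleted"]}
```

Without the ar_deleted copy in `archive_row`, a delete followed by an undelete would bring the revision back with every field visible again.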

Revision deletion is useful for removing severely inappropriate comments/names from revisions or hiding the text without causing misleading diffs or attribution issues. The current deletion system will still be useful for eliminating entire pages, red-linking them, and possibly for maintenance database sweeps that purge old items to save space.

Interface