Requests for comment/Page deletion

From mediawiki.org
Request for comment (RFC)
Page deletion
Component General
Creation date
Author(s) Platonides
Document status in draft

This page serves as ground for discussion on the best way to integrate RevDelete with normal page deletion. Wikitech thread: [1] [2] [3]

Particularly, moving page deletion to a RevDelete mechanism would solve:

  • Bugzilla:21279 RevDelete + Normal deletion break log links.
  • Bugzilla:18104 Deleted revisions should be identified by id, not timestamp.
  • Easy comparison between deleted revisions and existing ones.
  • It is really expensive to delete pages with lots of revisions.
  • Page protection can be kept on delete + undelete cycles.
  • Groups of revisions that are deleted are merged with all other deleted revisions for that page making them indistinguishable from each other.
  • page_id changes when a page is deleted and then restored.

History[edit]

Aaron's patch had the following fields:

  • deleted_namespace
  • deleted_page_title
  • deleted_page_id
  • deleted_page_suppressed
  • deleted_on_timestamp

Current status[edit]

At Architecture meetings/RFC review 2013-11-20, ^d asked "what the schema would look like if a 1.5-style complete refactor happened...throw out all assumptions and do things the Right Way." Tim replied, "it would probably be a field but a table has some advantages for b/c". Aaron noted that he "actually coded the table option ages ago in branch not that any of that would be usable now". See bugzilla:11402.

The next step is "Leucosticte to expand on 'new table' design, optionally start on prototype . . . present a single option (a table), and maybe give a few more details about how that will work Then we'll accept it . . . maybe just give some indicative SQL showing the kind of query Special:Contributions and Special:DeletedContributions would do . . . obviously a change to deletion is needed, we're not going to reject the whole concept . . . it's just a matter of getting the right level of detail before coding starts, so that you don't end up wasting too much time on back-and-forth in code review . . . you want to show the Special:Undelete SQL smaple in the rfc too . . . and don't make it a separate page like I did in that patch ;)". Special:Contributions and Special:DeletedContributions might share more code.

Comparison[edit]

New table New field
+ Easier upgrade. Doesn't require a schema change on page. - Harder upgrade. Requires a schema change on page
With support for deleted pages:
- Requires a way to merge revisions from different pages.
- How to view/compare revisions from different page sets?
+ name_title index is not polluted with many useless entries - name_title index is polluted with many useless entries
+ Backward compatible with (most) third party programs / old extensions, depending on how/whether they interact with the archive table - Bigger exposure of deleted data from third party programs / old extensions
+ Existing queries would not need to be modified to add the "is not deleted" condition - All existing queries would need to be modified to add the "is not deleted" condition
+ Does not compromise the uniqueness of (page_id, page_title) + All page_ids live in one table.
- MediaWiki user needs mysql DELETE right on page + MediaWiki user doesn't need mysql DELETE right on page
- Creates yet another table + Doesn't create yet another table

Proposed solutions[edit]

archived_page table[edit]

Use a table similar to page which holds page content when it is deleted.

Deleting a page is moving the page entry to a new entry in the archived_page table.

We could have different deleted pages: groups of revisions which once existed at the same history. So revisions are no longer mixed. Each deleted page has a different page_id (although they share namespace and title). If page "foo" gets created (page_id 1), then deleted, then recreated (page_id 2), then deleted, then recreated (page_id 3), the delete_page table would have two different rows (dp_id 2 and 3) for the two deleted revision histories. In such case, page creation produces a new page_id.

Otherwise, page creation recovers the entry from archived_page. (What do you mean by "recovers the entry"? Would it then have the same page ID as before? According to Aaron Schulz and Tim Starling, yes)

PRO: page JOIN already done in core. Also, most places want to join to get the page for other reasons anyway, so this has some "secure by default" nature to it.

page_deleted field[edit]

Add a page_deleted field to the table page. The page is marked on the page_deleted bitfield as no longer existing. Another option would be to define that page doesn't exist if it has no visible [latest] edit (page_latest=0?). (What would be more efficient?)

Possibly add a rev_logid field to hold the log_id of the deletion event. At any rate, log_params or another logging table field should probably hold, in serialized form, all the rev_ids of revisions deleted and restored in a particular log action so that these actions can be easily reversed.

Page creation reuses the same page entry if it exists.

CON: Unique key conflicts with live pages and deleted pages with the same name (Is that a bug or a feature? Why should it be considered a "conflict" rather than something potentially useful? Isn't the conflict resolved by having a different ar_logid for revisions deleted in different page deletion actions?)
Currently name_title is a unique index on (page_namespace,page_title), it would need to be expanded to something like non-unique (page_namespace,page_title, page_deleted) [there may be several deleted instances of the same page].
CON: We will need to add "rev_page = page_id AND page_deleted = 0" checks in revision queries.
CON: What happens to the old page entry? Moved to another table? Or stored in serialized form in a field? Is there even any data on that row that needs to be stored, or is this whole "CON" moot?
It is kept in the page table (moving to another table is the previous proposal). Only page_id and page_content_model would be required, but others are interesting to keep, too: page_restrictions, page_counter, page_latest.

Bugzilla[edit]

  • bugzilla:55398 — Move page deletion to a RevDelete mechanism; kill archive table (fire optional)
  • bugzilla:11402 — Deleting of pages with high number of revisions makes server cry (contains an outdated patch but can be looked at for inspiration)