Revision refactor

From mediawiki.org

Background

Todo: copy over relevant core of info from User:Brooke Vibber/Compacting the revision table round 2

TL;DR:

  • major refactor of 'revision' table and some other core tables
  • taking some wide string fields & indexes out of 'revision' table to make it easier to work with (comments, user/IP actors, content models/formats)
  • taking some string fields out of 'revision' table to make them reusable in other places like logging (comments, user/IP actors)
  • prepping schema to allow multiple content objects with distinct roles per revision ("multi-content revisions")
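
A minimal sketch of the "reusable comment" idea, using Python's sqlite3 module purely for illustration (this is not MediaWiki code). The column names comment_id, rev_comment_id and log_comment_id are hypothetical; only comment_text appears in these notes. Whether identical comment strings should share one row is also an assumption here, not a decision recorded above.

```python
import sqlite3

# Hypothetical schema: revision and logging both point at a shared comment
# table instead of each carrying its own wide string column.
SCHEMA = """
CREATE TABLE comment (
    comment_id   INTEGER PRIMARY KEY,
    comment_text TEXT NOT NULL UNIQUE
);
CREATE TABLE revision (
    rev_id         INTEGER PRIMARY KEY,
    rev_comment_id INTEGER NOT NULL REFERENCES comment(comment_id)
);
CREATE TABLE logging (
    log_id         INTEGER PRIMARY KEY,
    log_comment_id INTEGER NOT NULL REFERENCES comment(comment_id)
);
"""


def comment_id_for(db, text):
    """Return the id of the comment row for `text`, creating it if needed."""
    row = db.execute(
        "SELECT comment_id FROM comment WHERE comment_text = ?", (text,)
    ).fetchone()
    if row is not None:
        return row[0]
    return db.execute(
        "INSERT INTO comment (comment_text) VALUES (?)", (text,)
    ).lastrowid


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# A page protection typically produces both a revision and a log entry
# with the same summary; both rows can reference the same comment.
db.execute(
    "INSERT INTO revision (rev_id, rev_comment_id) VALUES (1, ?)",
    (comment_id_for(db, "Protected page"),),
)
db.execute(
    "INSERT INTO logging (log_id, log_comment_id) VALUES (1, ?)",
    (comment_id_for(db, "Protected page"),),
)
```

With this layout the 'revision' rows stay narrow, and the same comment row serves both tables.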

Work plan

Current rough work plan for the revision refactor; this will be updated with phab links:

Finish up schema redesign work:

  • update comment table per last week’s consensus
  • decide on keeping/killing content_format
  • decide on keeping/killing content & rev sha1 hashes
  • decide on slot role being in slots vs being in content (feels cleaner to keep it in slots, and relatively low cost)
  • double-check indexes on actor
  • use actor, comment tables for logging and images
  • update and clean up stray bits

Finish up the proof-of-concept SQL updater:

  • update the updater patch sql with the above updates

Start on real updater & transition:

  • describe updater in doc in more detail
  • create a schema batch-updater class that can update revision, logging, etc rows in-place during transition mode
  • split the patch sql into two pieces (one that adds new tables/fields, one that removes old fields)
  • create a proper installer/updater module that uses the batch-updater class for the middle part between the start and finish sql patches
    • this will reduce the separation of code paths for small-site/3rd-party and large-site/wikimedia conversions
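
The three-part shape described above (a start patch, an in-place batch update, a finish patch) could look roughly like the following. This is a sketch in Python with sqlite3, not the actual updater; rev_comment_text (old wide column), rev_comment_id (new reference) and comment_id are hypothetical names, and only the comment migration is shown.

```python
import sqlite3

# Start patch: only adds new tables/fields, so old code keeps working.
START_PATCH = """
CREATE TABLE comment (
    comment_id   INTEGER PRIMARY KEY,
    comment_text TEXT NOT NULL UNIQUE
);
ALTER TABLE revision ADD COLUMN rev_comment_id INTEGER;
"""
# A separate finish patch would drop rev_comment_text once no row needs it.


def migrate_batch(db, batch_size):
    """Convert up to batch_size unmigrated rows in place; return how many."""
    rows = db.execute(
        "SELECT rev_id, rev_comment_text FROM revision "
        "WHERE rev_comment_id IS NULL ORDER BY rev_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for rev_id, text in rows:
        db.execute(
            "INSERT OR IGNORE INTO comment (comment_text) VALUES (?)", (text,)
        )
        (comment_id,) = db.execute(
            "SELECT comment_id FROM comment WHERE comment_text = ?", (text,)
        ).fetchone()
        db.execute(
            "UPDATE revision SET rev_comment_id = ? WHERE rev_id = ?",
            (comment_id, rev_id),
        )
    db.commit()  # one commit per batch keeps each transaction small
    return len(rows)


def run_updater(db, batch_size=100):
    """Run batches until nothing is left; return the total rows migrated."""
    total = 0
    while True:
        done = migrate_batch(db, batch_size)
        if done == 0:
            return total
        total += done


# Old-style table with the wide comment string stored inline.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_comment_text TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [(1, "typo"), (2, "revert"), (3, "typo"), (4, "expand"), (5, "revert")],
)
db.executescript(START_PATCH)
migrated = run_updater(db, batch_size=2)
```

Because the batch step only touches rows that still lack the new reference, the same loop can be run all at once by a small-site installer or throttled over a long period on a large site, and it is safe to restart.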

Globals:

  • create a global config var for the transition state (old, transition, new)
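
One way to picture the three-state switch, sketched in Python rather than as a real MediaWiki config var. The names SchemaStage, writes_old, writes_new and revision_fields are hypothetical, as are the column names; the reading that "transition" means both old and new fields are in use is an assumption based on the notes that follow.

```python
from enum import Enum


class SchemaStage(Enum):
    OLD = "old"                # only the old columns exist
    TRANSITION = "transition"  # both old and new columns are in use
    NEW = "new"                # only the new columns/tables remain


def writes_old(stage):
    """Old columns are still populated until the finish patch has run."""
    return stage in (SchemaStage.OLD, SchemaStage.TRANSITION)


def writes_new(stage):
    """New columns are populated once the start patch has run."""
    return stage in (SchemaStage.TRANSITION, SchemaStage.NEW)


def revision_fields(stage):
    """Columns a manual revision lookup would need for the given stage."""
    fields = ["rev_id"]
    if writes_old(stage):
        fields.append("rev_comment_text")  # hypothetical old column
    if writes_new(stage):
        fields.append("rev_comment_id")    # hypothetical new column
    return fields


# The config value would arrive as a plain string.
stage = SchemaStage("transition")
fields = revision_fields(stage)
```

Keeping the stage checks in small helpers like these means callers ask "do I write the old column?" instead of comparing against the raw config string everywhere.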

Updating the Revision class:

  • in constructor, accept initialized data from actor, comment table columns when available
  • lazy-load actor_text, comment_text, content as necessary
  • add new columns/tables to the various internals-exposed APIs like which columns need to be fetched for a manual Revision lookup (depending on transition switch)
  • join and fetch those columns when available (depending on transition switch)
  • insert those new columns & tables when available (depending on transition switch)
  • start looking at how to build a new, MCR-friendly, future-friendly API for fetching, storing, and querying revisions
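
The "accept pre-joined data in the constructor, otherwise lazy-load" pattern from the first two points can be sketched as follows. This is illustrative Python, not the Revision class; LazyRevision, load_comment_text and rev_comment_id are hypothetical names.

```python
class LazyRevision:
    """Holds a revision row; fetches comment_text only if it was not joined."""

    def __init__(self, row, load_comment_text):
        self._row = dict(row)
        # Callable taking a comment id and returning its text (assumed to
        # hit the comment table in a real implementation).
        self._load_comment_text = load_comment_text

    @property
    def comment_text(self):
        if "comment_text" not in self._row:
            # Not supplied by the caller's query: load once and cache.
            self._row["comment_text"] = self._load_comment_text(
                self._row["rev_comment_id"]
            )
        return self._row["comment_text"]


# Stand-in for the comment table, with a log of lookups performed.
COMMENTS = {7: "fix typo"}
calls = []


def load_comment_text(comment_id):
    calls.append(comment_id)
    return COMMENTS[comment_id]


# Row from a query that already joined the comment table: no extra lookup.
joined = LazyRevision(
    {"rev_id": 1, "rev_comment_id": 7, "comment_text": "fix typo"},
    load_comment_text,
)
# Row from a query that fetched only the revision columns.
bare = LazyRevision({"rev_id": 2, "rev_comment_id": 7}, load_comment_text)
```

The same shape would apply to actor_text and content; only the loader differs.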

Updating the Logging class:

  • todo: investigate in more detail what needs fixing
  • update to work with actor & comment tables
    • with lazy-loading
    • depending on transition switch

Updating page deletion:

  • todo: poke around in the non-Revision bits of page deletion to handle the new schemas

Updating recent changes:

  • put this off for now, use the existing summary table

Other things to prep:

  • do we need reversion info in revision? A separate tracking table is probably best.

Updating xml import:

  • either prep this for MCR or just handle the single content items for now
  • update Revision API usage if necessary

Updating xml export:

  • either prep this for MCR or just start thinking about it for later

Updating editing:

  • can continue to use high-level article edit API for now
  • start thinking about this for MCR though

Updating other core:

  • audit other internal code that touches revision
  • check API modules that expose revision queries etc. and update them as needed
    • Looking at core modules that directly query the 'revision' table:
      • ApiQueryRevisionsBase and its 4 subclasses will need significant work.
      • ApiQueryContributors will need minor query adjustments to use rev_actor
      • ApiQueryRecentChanges might need to drop or replace rcprop=sha1
      • ApiQueryUserContributions will need updates for rev_actor and rev_comment, plus some thought on how to handle ucprop=size|sizediff with the loss of rev_len.
  • check Special pages that expose revision queries etc. and update them as needed
  • check maintenance scripts that expose revision queries etc. and update them as needed
  • set up a todo list and smash em all down…

Updating extensions:

  • audit extension repos for direct revision table usage, see how much fun this will be
  • set up a todo list and prioritize them
  • consider an extension.json compatibility check


Revision API cleanliness thoughts:

  • Revision should be a mostly ‘dumb’ object, with services to fetch and store
  • replace various Revision static getters with service that takes a $db
  • the new Revision([]) && $rev->insertOn($db) pattern is godawful
    • replace it with a store interface that applies modifications on a prior revision or emptiness
  • replace text-related fetchers & compressors/decompressors with a good interface to Content fetching
  • the revision deletion/visibility features lead to odd APIs for fetching metadata that accept an audience and user param
    • consider changing these to one interface that requests a view object for a given user-or-public-or-raw, then just use that view object
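
A rough sketch of the "request a view object once, then just use it" idea from the last point, in Python for illustration. All names here (Hidden, Audience, RevisionData, RevisionView, may_see_hidden) are hypothetical, and the flag values are arbitrary; the point is only that the audience check happens once, in the view, rather than in every getter call.

```python
from dataclasses import dataclass
from enum import Enum, IntFlag


class Hidden(IntFlag):
    """Which parts of a revision have been hidden (illustrative values)."""
    TEXT = 1
    COMMENT = 2
    USER = 4


class Audience(Enum):
    PUBLIC = "public"  # anonymous/public view
    USER = "user"      # view for one specific user
    RAW = "raw"        # no filtering at all


@dataclass(frozen=True)
class RevisionData:
    """The 'dumb' record: plain fields, no access logic."""
    comment: str
    user: str
    hidden: Hidden = Hidden(0)


class RevisionView:
    """Read-only view of a revision as seen by one audience."""

    def __init__(self, data, audience, may_see_hidden=False):
        self._data = data
        # Decide once whether hidden fields are visible through this view.
        self._unrestricted = audience is Audience.RAW or (
            audience is Audience.USER and may_see_hidden
        )

    def _field(self, value, flag):
        if self._unrestricted or not (self._data.hidden & flag):
            return value
        return None  # hidden from this audience

    @property
    def comment(self):
        return self._field(self._data.comment, Hidden.COMMENT)

    @property
    def user(self):
        return self._field(self._data.user, Hidden.USER)


data = RevisionData(comment="secret", user="Alice", hidden=Hidden.COMMENT)
public_view = RevisionView(data, Audience.PUBLIC)
raw_view = RevisionView(data, Audience.RAW)
```

Callers then read public_view.comment or raw_view.comment without passing an audience and user to each accessor.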

Content API thoughts:

  • keep them clean: either do something very simple now that can be extended for the full MCR world, or think it through before exposing a public API

Comment API thoughts:

  • if we’re going to reuse comments in multiple places, and give them optional data params, then encapsulating sounds wise.
  • consider adapting or replacing the machine-readable metadata from logging?
  • feed a comment object with its context into the comment-rendering functions
    • context is page, target page, actor, potentially other things like target section and logging params :D
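
To make the "comment object plus context" idea concrete, here is a toy sketch in Python. Comment, CommentContext and render_comment are hypothetical names, and the output format is invented for the example; real comment rendering would produce HTML and handle much more.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CommentContext:
    """Where a comment is being shown; fields follow the list above."""
    page: str
    actor: str
    target_page: Optional[str] = None
    target_section: Optional[str] = None


@dataclass(frozen=True)
class Comment:
    """Comment text plus optional machine-readable data params."""
    text: str
    data: dict = field(default_factory=dict)


def render_comment(comment, context):
    """Render a comment using its context instead of loose parameters."""
    prefix = ""
    if context.target_section:
        prefix = f"{context.page}#{context.target_section}: "
    return f"{prefix}{comment.text} ({context.actor})"


ctx = CommentContext(page="Main Page", actor="Alice", target_section="History")
rendered = render_comment(Comment("fix typo"), ctx)
```

Passing one context object keeps the rendering signature stable as more context (logging params, target pages) is added later.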

Actor API thoughts:

  • may also need a consistent way to pass around references to Actor info