User:Brion VIBBER/Compacting the revision table round 2

Overview
Per ongoing discussion in ArchCom and at WikiDev17 about performance, future requirements, and future-proofing for table size it's proposed to do a major overhaul of the  table, combining the following improvements:


 * Normalization of frequently duplicated data to separate tables, reducing the dupe strings to integer keys
 * Separation of content-specific from general-revision metadata to support:
 * Multi-content revisions allowing for storing of multiple content blobs per revision -- not related to compaction, but of great interest for structured data additions planned for multimedia and articles
 * general reduction in  table width / on-disk size will make schema changes easier in future
 * trying to avoid inconsistencies in live index deployments
 * ideally all indexes should fit on all servers, making it easier to switch database backend around in production

cf other & older notes:
 * User:Brion VIBBER/Compacting the revision table
 * Multi-Content_Revisions/Content_Meta-Data
 * Wikimedia Developer Summit/2017/Scaling the Wikimedia database schema

Provisional
/tables.sql (needs updating)

TODO: update archive table to match revision

TODO: maybe switch text table around and change the text_id ref to a URL? (In progress of thought)

TODO: plan updates to other tables that have the user_text pattern.

Thoughts

 * That seems like a lot of tables!
 * Most of them are the small tables for inlining strings -- content models, content formats, content slot roles for MCR, and user refs/IP addresses for user_entry. These should save a fair chunk of duplicated space. Additionally the MCR split between revision & content makes each of the two tables smaller and more malleable.


 * What happened to rev_text_id?
 * content.cont_address replaces it.
 * It may be an open question whether we want to make that change immediately, or whether to change the 'text' table as well, etc.


 * Why isn't rev_deleted moved to content?
 * rev_deleted is a bitfield and most of its options apply to things that aren't part of a Content object, such as the edit comment and the username. If separately "rev-deleting" just one content item is needed, a second bitfield or flag will need to be added on the content table too...


 * What about rev_len, rev_sha1 -- do they belong in content?
 * Not sure about this. Do we need to keep the fields for summing from multiple content objects?


 * How hard will it be to change queries?
 * Those that WHERE on rev_user/rev_user_text directly, or read fields directly, etc will need to be updated. :(
 * Things that just use Revision::*FromConds and the accessor functions will be able to fall back to lazy loading without
 * Stuff that touches rev_text_id directly will need changing.
 * Stuff that wants to pre-join a bunch of data may need changing. May be able to add abstractions on Revision function to hide some of that, or build new abstractions that are less freaky.
 * Stuff that inserts manually will need updating; stuff that uses Revision accessors should remain safe, but should update to MCR-specific interfaces in future


 * What would a transition look like? What kind of background processing and what kind of downtime to expect?
 * We'll need a transitionary mode in MediaWiki where a background process runs filling out the new tables. This may take some time -- several weeks sounds likely for the biggest sites. This may increase load on servers and will require additional disk space.
 * Most likely we will first add the new fields to revision, allowing them to be filled out without disturbing ongoing actions, and then remove the now redundant fields afterwards.
 * In principle, once the background process is complete, it should be possible to switch a wiki to read-only, flip its mode, and then switch back to read-write with little downtime for editors.
 * This could also allow quickly aborting/reverting to the previous state of revisions if things look bad in read-only... but if a rollback after going read-write is desired, that's a lot freakier to deal with.


 * What about manual queries (tool labs, etc) that might not want to update?
 * In principle a view can be created that replicates the classic view of the revision table... except for that text_id maybe.


 * Space and performance concerns about indexes; possibility of "sharding" different sub-tables with specific indexes?
 * Could consider extra tables to support usage patterns, as long as they all fit on standard servers.

Compatibility view
Something like this could work for providing a back-compatible view of revision:

Main pain points are rev_text_id, if we change it to a full address in content it's harder to just use a reference. And whether to explicitly pull a main content slot or what...

Denormalization
Part of the space usage of the revision table is the several indexes on it:

(Question: are there any extra indexes added in production or tools that might be needed too?)

There might or might not be benefit to creating denormalized summary tables containing only a few keys rows and then the indexes needed, like this:

This can then be joined to the main tables to get the rest of the information. Does this make sense? We're not sure... yet :)

Questions and comments
Use the talk page pls