Requests for comment/image and oldimage tables

From MediaWiki.org
Jump to navigation Jump to search
Request for comment (RFC)
image and oldimage tables
Component Database
Creation date 2014-09-24
Author(s) Leucosticte, Krinkle
Document status accepted
--Krinkle 23:09, 6 February 2017 (UTC)
See Phabricator.
Database2014-09-24Leucosticte, KrinkleT589--Krinkle 23:09, 6 February 2017 (UTC)

Problems[edit]

  1. Database table image has no auto-incrementing primary key.
  2. Database table oldimage has no primary key, at all.
  3. File revisions as a concept (within the code and API) should have unique identifier. Right now we only have "current file title + upload timestamp". The lack of a unique id for file revisions leads to race conditions, makes the table hard to index and hard to query.
  4. Uploading file revisions should not involve moving rows across tables, or replacing entire rows.

Use cases[edit]

  • End-user improvements
    • T139294: Persistent media links for files - Right now we only have permalinks for non-current file revisions. The current file revision is only accessible by the canonical url for the file. This in contrast to page revisions, which always have a stable revision id.
    • Enable creation of secondary database tables that associate data with files and/or their revisions (e.g. T33257: File properties and License integration). This is not feasible with the current schema due to a lack of primary keys.
  • API Improvements
    • prop=imageinfo - Use a proper file revision ID instead of fragile timestamp as identifier.
  • Technical debt
    • LocalFile::recordUpload2() - Eliminate timestamp kludge.
    • Eliminate the need to query two tables when needing to find a file revision.
    • Eliminate the need to move rows across tables when uploading a file. This is an anti-pattern in relational databases and makes certain improvements to database stability and performance hard or impossible (as found at Wikimedia Foundation).
    • Stability: Having a separate file and filerevision table would solve data inconsistency issues. A request will resolve the pointer once and all other queries will fetch information about the same revision. It also makes it more attractive to query a replica instead of the master.

Possible solutions[edit]

1. Add primary keys[edit]

Rejected. This solution would be relatively easy to implement, but does not solve Problem 4.

2. Separate file and file revision[edit]

Advantages[edit]

  • Eliminates the legacy handling of needing to interact with two tables for most operations (image and oldimage).
  • Aligns the file tables with the design of the page and revision tables.
  • Creates proper semantic separation between file entities and the representation of their revisions.

Effective change[edit]

  • Add fields:
    • img_id: New primary key for image.
    • img_latest: Pointer to oi_id for the current revision of a file (similar to page.page_latest)
    • oi_id: Primary key of oldimage.
    • oi_img: Pointer to img_id of the file this revision corresponds to (similar to revision.rev_page)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce fields in file to only those needed for current revisions only.
    • Currently indexed:  img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Used for contribution history and creation of archive name. Replaced by filerevision.
    • img_user_text: Remove. Used for contribution history, ApiQueryAllImages, SpecialMIMEsearch. Query from filerevision instead (using a join, if needed).
    • img_sha1: Keep. Used for duplication detection for current revisions.
    • img_media_type, img_major_mime, img_minor_mime: Keep. Used by Special:MIMESearch for current revisions. A separate initiative may start after this RFC is approved and implemented to consider removing this in favour of a SearchEngine-based approach, and/or to find a way to efficiently from filerevision instead.
    • img_size: Remove. Not indexed. Used by checkImages.php (error check) and Special:MediaStatistics (SUM query). Can be queried by joining against filerevision instead.
    • img_width, img_height: Remove. Not indexed. Only store in filerevision instead.
    • img_bits: Remove. Not indexed. Not queried. Only store in filerevision instead.
    • img_description: Remove. Not indexed. Only store in filerevision instead. Can be queried there if needed.
    • img_user: Remove. Only store in filerevision instead. Can be queried there if needed. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages) to join against user for filtering bots. Can join against filerevision instead. Though should probably use recentchanges.rc_bot instead.

Migration strategy[edit]

Exact schema migration script to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default db updater, and an opt-in maintenance script for large farms such as Wikimedia's). A few ideas so far:

  • Tim Starling comment #2747454:
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (Not be visible from the image view)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.

3. Replace with MCR[edit]

Migrate all file information to page and revision tables and migrate meta data and content references by using Multi-Content Revisions per T107595.

We could decide to stop using the image/file tables entirely, in favour of storing the information in the wiki page/revision tables, with file-backend references similar to what we do for ExternalStorage. We could keep "file" as a simplified secondary data table (similar to "categories").

See also[edit]

  • T589: [RFC] image and oldimage tables
  • T73198: Image and oldimage tables lack auto-incrementing primary keys.