Requests for comment/image and oldimage tables
|image and oldimage tables|
--Krinkle 23:09, 6 February 2017 (UTC)
- Database table image has no auto-incrementing primary key.
- Database table oldimage has no primary key, at all.
- File revisions as a concept (within the code and API) should have unique identifier. Right now we only have "current file title + upload timestamp". The lack of a unique id for file revisions leads to race conditions, makes the table hard to index and hard to query.
- Uploading file revisions should not involve moving rows across tables, or replacing entire rows.
- End-user improvements
- T139294: Persistent media links for files - Right now we only have permalinks for non-current file revisions. The current file revision is only accessible by the canonical url for the file. This in contrast to page revisions, which always have a stable revision id.
- Enable creation of secondary database tables that associate data with files and/or their revisions (e.g. T33257: File properties and License integration). This is not feasible with the current schema due to a lack of primary keys.
- API Improvements
- prop=imageinfo - Use a proper file revision ID instead of fragile timestamp as identifier.
- Technical debt
- LocalFile::recordUpload2() - Eliminate timestamp kludge.
- Eliminate the need to query two tables when needing to find a file revision.
- Eliminate the need to move rows across tables when uploading a file. This is an anti-pattern in relational databases and makes certain improvements to database stability and performance hard or impossible (as found at Wikimedia Foundation).
- Stability: Having a separate file and filerevision table would solve data inconsistency issues. A request will resolve the pointer once and all other queries will fetch information about the same revision. It also makes it more attractive to query a replica instead of the master.
1. Add primary keys
Rejected. This solution would be relatively easy to implement, but does not solve Problem 4.
2. Separate file and file revision
- Eliminates the legacy handling of needing to interact with two tables for most operations (image and oldimage).
- Aligns the file tables with the design of the page and revision tables.
- Creates proper semantic separation between file entities and the representation of their revisions.
- Add fields:
- Rename tables:
- Add missing rows in
filerevisionfor current revisions, based on rows from
- Reduce fields in
fileto only those needed for current revisions only.
- Currently indexed:
- img_timestamp: Remove. Used for contribution history and creation of archive name. Replaced by filerevision.
- img_user_text: Remove. Used for contribution history, ApiQueryAllImages, SpecialMIMEsearch. Query from
filerevisioninstead (using a join, if needed).
- img_sha1: Keep. Used for duplication detection for current revisions.
- img_media_type, img_major_mime, img_minor_mime: Keep. Used by Special:MIMESearch for current revisions. A separate initiative may start after this RFC is approved and implemented to consider removing this in favour of a SearchEngine-based approach, and/or to find a way to efficiently from
- img_size: Remove. Not indexed. Used by checkImages.php (error check) and Special:MediaStatistics (SUM query). Can be queried by joining against
- img_width, img_height: Remove. Not indexed. Only store in
- img_bits: Remove. Not indexed. Not queried. Only store in
- img_description: Remove. Not indexed. Only store in
filerevisioninstead. Can be queried there if needed.
- img_user: Remove. Only store in
filerevisioninstead. Can be queried there if needed. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages) to join against
userfor filtering bots. Can join against filerevision instead. Though should probably use recentchanges.rc_bot instead.
- Currently indexed:
Exact schema migration script to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default db updater, and an opt-in maintenance script for large farms such as Wikimedia's). A few ideas so far:
- Tim Starling comment #2747454:
filerevision. Create a view called
- Add new fields to
- (Disable uploads.)
- Also move
filerevision. (Not be visible from the
- (Deploy new MediaWiki version that uses file/filerevision.)
- (Re-enable uploads)
- Drop image and oldimage.
- Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
3. Replace with MCR
Migrate all file information to page and revision tables and migrate meta data and content references by using Multi-Content Revisions per T107595.
We could decide to stop using the image/file tables entirely, in favour of storing the information in the wiki page/revision tables, with file-backend references similar to what we do for ExternalStorage. We could keep "file" as a simplified secondary data table (similar to "categories").