Talk:Requests for comment/Content model storage

About this board

phab:T105652 was approved by ArchCom on 2015-07-29, , but we didn't have an implementation plan, so in August 2016, we're revisiting it (as discussed in Phab:E261). The choices in front of us:

a. Implement this RFC as originally approved

b. Implement a modified proposal fully described by T142980, with one extra table but no new columns, to also cater to the needs of multi content revisions (T107595).

c. Remain with the status quo until an alternative proposal is developed

Start a new topic

page_content_model_id and multiple formats

6 comments • 23:37, 15 August 2016 7 years ago

6

PleaseStand (talkcontribs)

Currently, there is no page_content_format field, and "Each row in the content_model table will refer to a unique tuple of (content_model, content_format)." If a page's content model allows multiple formats, which model ID would be used? Perhaps there should be separate content_model and content_format tables.

Reply 00:29, 14 July 2015 8 years ago

Legoktm (talkcontribs)

Yeah, that's a good idea. I talked with @Duesentrieb briefly and he liked the idea of having two tables. Will update the RfC shortly.

Reply 18:30, 22 July 2015 8 years ago

Duesentrieb (talkcontribs)

Separate content_model and content_format tables are much better, thanks.

Perhaps it would also be useful to put the default format for each model into the content_model table, as cm_default_format. This would be useful during migration, for filling in NULL values in fields like rev_content_format.

Reply 14:18, 5 August 2016 7 years ago

Legoktm (talkcontribs)

Maybe I'm misunderstanding what you meant, but wouldn't we fill in NULL with whatever ContentHandler::getDefaultFormat() returns? Why does it need to be in the database table?

Reply 10:57, 12 August 2016 7 years ago

Duesentrieb (talkcontribs)

Having it in the database would allow us to fill in the blanks without leaving the database. It's an option to consider, which may come in handy for some use cases. But you may very well be right that we don't need it.

Reply 19:30, 13 August 2016 7 years ago

Legoktm (talkcontribs)

I would rather keep it in PHP, mostly because you already need PHP code to "fill in the blanks" for things like namespace names. In any case, I'd like to leave it out for now because I don't think we should have the default specified in two places (PHP and database). Plus one of the goals of this RfC is to make changing the defaults easier, and this would somewhat work against that.

Reply 23:37, 15 August 2016 7 years ago

Reply to "page_content_model_id and multiple formats"

Separate table for content meta-data

2 comments • 11:06, 15 August 2016 7 years ago

2

Duesentrieb (talkcontribs)

Instead of adding columns to various tables (page, revision, archive), I suggest to create a separate table that holds meta-data about revision content (at least the model and format, but we will also want things like the role/slot, blob address, and hash there later, for multi-content-revision support, see Phab:T107595). The table would have at least these fields:

CREATE TABLE /*_*/content (
  cnt_revision INT NOT NULL,
  cnt_model INT NOT NULL,
  cnt_format INT NOT NULL,
  PRIMARY KEY (cnt_revision)
) /*$wgDBTableOptions*/;

This table can then be used to acquire the model and format for a given revision by joining cnt_revision against page_current, rev_id, or ar_rev_id.

If we want to support multiple content "slots" per revision (as per Phab:T107595), cnt_revision would no longer be sufficient to identify the desired content. A cnt_role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). cnt_role would reference a content_role table defined in the same way as content_model and content_format. cnt_revision and cnt_role form a unique key. The table would then look like this:

CREATE TABLE /*_*/content (
  cnt_revision INT NOT NULL,
  cnt_role INT NOT NULL,
  cnt_model INT NOT NULL,
  cnt_format INT NOT NULL,
  -- more fields to add for multi-content-revision support:
  -- cnt_address, cnt_hash, cnt_logical_size, cnt_is_primary, etc
  PRIMARY KEY (cnt_revision, cnt_role)
) /*$wgDBTableOptions*/;

CREATE TABLE /*_*/content_role (
  cr_id smallint NOT NULL PRIMARY KEY AUTO_INCREMENT,
  cr_role VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);

When joining against page_current, rev_id, etc., cnt_role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.

Reply Edited 11:06, 15 August 2016 7 years ago

Legoktm (talkcontribs)

My only concern with splitting to another table is making sure that the tables stay in sync, and if they get out of sync, handling failure gracefully. While it mostly should work properly with our transaction handling, we occasionally have random freak accidents where page_latest isn't updated properly or something, so it's not 100% perfect.Using the same table (revision/archive) would neatly avoid that problem.

Reply 10:53, 12 August 2016 7 years ago

Reply to "Separate table for content meta-data"

Table naming

2 comments • 13:27, 13 July 2015 8 years ago

2

PleaseStand (talkcontribs)

Manual:Coding conventions/Database#Table naming says, "Table names should be singular nouns: user, page, revision, etc." So shouldn't the new table's name be content_model, not content_models, in order to follow this convention?

09:49, 13 July 2015 8 years ago

Legoktm (talkcontribs)

I wasn't aware of that, updated. Thanks!

10:28, 13 July 2015 8 years ago

Database table or PHP array

3 comments • 06:45, 13 July 2015 8 years ago

3

Florianschmidtwelzow (talkcontribs)

Hi legoktm,

I think this proposal makes sense and should be done. I have one (maybe dumb) qiestion: The mapping of conten model ids and formats sounds like it could be done as an php array in a global var, too. Maybe I don't see any point you considered, but wouldn't it be possible to write this as a global array instead of a new database table, and if yes, why we shouldn't do that?

Thanks for your answer!

10:48, 12 July 2015 8 years ago

Legoktm (talkcontribs)

It could. But then we run into the same probably we currently have with namespaces. Each extension randomly picks a number it uses for namespaces which end up conflicting.

Setting it as a global would also encourage usage by developers, which we don't want. They should not be aware of what the id is, and shouldn't care either.

This proposal is similar to what we currently do with change tags. Everywhere in code (except for the ChangeTags class) we refer to tags by their string name, but in the database they are stored as integers, and there is a table that maps the integer to the string name.

21:49, 12 July 2015 8 years ago

Florianschmidtwelzow (talkcontribs)

Ok, yeah, that makes sense. Thanks for your explanation! :)

06:45, 13 July 2015 8 years ago

Alternative proposal - Use ENUM instead of a new table

2 comments • 02:30, 13 July 2015 8 years ago

2

Wp mirror (talkcontribs)

I added a new section at the bottom of your RFC. It contains statistics from which I conclude that using ENUM would be a good alternative.

Reply 01:09, 13 July 2015 8 years ago

Legoktm (talkcontribs)

Hi,

Using an enum is not feasible. First off, the page/revision/archive tables are in MediaWiki core, and should not contain any references to extensions (Flow, Wikibase, Scribunto, etc.). Currently any extension can define arbitrary content models and formats, like MassMessage's MassMessageList content type (example).

Also, using an enum also means we need to do a schema change any time we wish to add a new content model or format.

Finally, I would have preferred discussing your alternative proposal on the talk page before we added it to the RfC. I've removed it from the page now because it doesn't fit our requirements.

Reply 02:30, 13 July 2015 8 years ago

Reply to "Alternative proposal - Use ENUM instead of a new table"

There are no older topics