Talk:Requests for comment/Content model storage

About this board

phab:T105652 was approved by ArchCom on 2015-07-29, , but we didn't have an implementation plan, so in August 2016, we're revisiting it (as discussed in Phab:E261). The choices in front of us:

a. Implement this RFC as originally approved

b. Implement a modified proposal fully described by T142980, with one extra table but no new columns, to also cater to the needs of multi content revisions (T107595).

c. Remain with the status quo until an alternative proposal is developed

page_content_model_id and multiple formats

6
PleaseStand (talkcontribs)

Currently, there is no page_content_format field, and "Each row in the content_model table will refer to a unique tuple of (content_model, content_format)." If a page's content model allows multiple formats, which model ID would be used? Perhaps there should be separate content_model and content_format tables.

Legoktm (talkcontribs)

Yeah, that's a good idea. I talked with @Duesentrieb briefly and he liked the idea of having two tables. Will update the RfC shortly.

Duesentrieb (talkcontribs)

Separate content_model and content_format tables are much better, thanks.

Perhaps it would also be useful to put the default format for each model into the content_model table, as cm_default_format. This would be useful during migration, for filling in NULL values in fields like rev_content_format.

Legoktm (talkcontribs)

Maybe I'm misunderstanding what you meant, but wouldn't we fill in NULL with whatever ContentHandler::getDefaultFormat() returns? Why does it need to be in the database table?

Duesentrieb (talkcontribs)

Having it in the database would allow us to fill in the blanks without leaving the database. It's an option to consider, which may come in handy for some use cases. But you may very well be right that we don't need it.

Legoktm (talkcontribs)

I would rather keep it in PHP, mostly because you already need PHP code to "fill in the blanks" for things like namespace names. In any case, I'd like to leave it out for now because I don't think we should have the default specified in two places (PHP and database). Plus one of the goals of this RfC is to make changing the defaults easier, and this would somewhat work against that.

Reply to "page_content_model_id and multiple formats"

Separate table for content meta-data

2
Duesentrieb (talkcontribs)

Instead of adding columns to various tables (page, revision, archive), I suggest to create a separate table that holds meta-data about revision content (at least the model and format, but we will also want things like the role/slot, blob address, and hash there later, for multi-content-revision support, see Phab:T107595). The table would have at least these fields:

CREATE TABLE /*_*/content (
  cnt_revision INT NOT NULL,
  cnt_model INT NOT NULL,
  cnt_format INT NOT NULL,
  PRIMARY KEY (cnt_revision)
) /*$wgDBTableOptions*/;

This table can then be used to acquire the model and format for a given revision by joining cnt_revision against page_current, rev_id, or ar_rev_id.

If we want to support multiple content "slots" per revision (as per Phab:T107595), cnt_revision would no longer be sufficient to identify the desired content. A cnt_role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). cnt_role would reference a content_role table defined in the same way as content_model and content_format. cnt_revision and cnt_role form a unique key. The table would then look like this:

CREATE TABLE /*_*/content (
  cnt_revision INT NOT NULL,
  cnt_role INT NOT NULL,
  cnt_model INT NOT NULL,
  cnt_format INT NOT NULL,
  -- more fields to add for multi-content-revision support:
  -- cnt_address, cnt_hash, cnt_logical_size, cnt_is_primary, etc
  PRIMARY KEY (cnt_revision, cnt_role)
) /*$wgDBTableOptions*/;

CREATE TABLE /*_*/content_role (
  cr_id smallint NOT NULL PRIMARY KEY AUTO_INCREMENT,
  cr_role VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);

When joining against page_current, rev_id, etc., cnt_role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.

Legoktm (talkcontribs)

My only concern with splitting to another table is making sure that the tables stay in sync, and if they get out of sync, handling failure gracefully. While it mostly should work properly with our transaction handling, we occasionally have random freak accidents where page_latest isn't updated properly or something, so it's not 100% perfect.Using the same table (revision/archive) would neatly avoid that problem.

Reply to "Separate table for content meta-data"
PleaseStand (talkcontribs)
Legoktm (talkcontribs)

I wasn't aware of that, updated. Thanks!

Florianschmidtwelzow (talkcontribs)

Hi legoktm,

I think this proposal makes sense and should be done. I have one (maybe dumb) qiestion: The mapping of conten model ids and formats sounds like it could be done as an php array in a global var, too. Maybe I don't see any point you considered, but wouldn't it be possible to write this as a global array instead of a new database table, and if yes, why we shouldn't do that?

Thanks for your answer!

Legoktm (talkcontribs)

It could. But then we run into the same probably we currently have with namespaces. Each extension randomly picks a number it uses for namespaces which end up conflicting.

Setting it as a global would also encourage usage by developers, which we don't want. They should not be aware of what the id is, and shouldn't care either.

This proposal is similar to what we currently do with change tags. Everywhere in code (except for the ChangeTags class) we refer to tags by their string name, but in the database they are stored as integers, and there is a table that maps the integer to the string name.

Florianschmidtwelzow (talkcontribs)

Ok, yeah, that makes sense. Thanks for your explanation! :)

Alternative proposal - Use ENUM instead of a new table

2
Wp mirror (talkcontribs)

I added a new section at the bottom of your RFC. It contains statistics from which I conclude that using ENUM would be a good alternative.

Legoktm (talkcontribs)

Hi,

Using an enum is not feasible. First off, the page/revision/archive tables are in MediaWiki core, and should not contain any references to extensions (Flow, Wikibase, Scribunto, etc.). Currently any extension can define arbitrary content models and formats, like MassMessage's MassMessageList content type (example).

Also, using an enum also means we need to do a schema change any time we wish to add a new content model or format.

Finally, I would have preferred discussing your alternative proposal on the talk page before we added it to the RfC. I've removed it from the page now because it doesn't fit our requirements.

Reply to "Alternative proposal - Use ENUM instead of a new table"
There are no older topics