Topic on Talk:Multi-Content Revisions/Database Schema

<code>slot_page</code> and <code>content_page</code>

6
Summary by Daniel Kinzler (WMDE)

Should we put the page ID into the new slots and revision tables, for sharding, analysis, and recovery?

Anomie (talkcontribs)

I'd like to hear from JCrespo as to whether this is needed or desirable. I'm skeptical that access patterns will actually make sharding by page ID more efficient.

The proposed "slot history" would potentially be more efficient if the slot table were sharded by page ID, since it will be having to join that table to hit the slot_inherited field. Even without sharding, I suppose the DB could select on slot_page, slot_inherited rather than using rev_page then filtering based on the join.

I can't think of any common situation where sharding the content table by page ID would be more efficient. Nor can I think of a situation where we'd want to use content_page directly in a query.

Daniel Kinzler (WMDE) (talkcontribs)

As far as I know, there is only one use case where we query the revision table without looking for a specific page: user contributions. In all other situations we can and should provide a page ID.

Sharding isn't the only reason to have this, though. It also provides some low-cost redundancy that should help a lot with analysis and repair.

Anomie (talkcontribs)

API list=allrevisions also queries the revision table without looking for a specific page.

RecentChanges and Watchlist, both API and web UI, have many pages to query but join revision by rev_id rather than page, so they'd likely see little or no benefit from this sharding either.

Also, API prop=revisions can specify up to 5000 specific pages and fetch one revision per page, so sharding would be of no benefit there either.

I also note that sharding gives no advantage if you're fetching information for just one revision, which covers the most common cases. And it's probably not that useful if you're fetching information for just two revisions either, which covers diffs. Page histories probably don't need the content and may not even need the slot information. What's left? Just API prop=revisions in single-page mode and dumps?

I'm even more skeptical of this proposal if the only remaining justification is "analysis and repair". How often do we ever need to repair, and how often would there be an analysis that needs to worry about optimization at this level and that would even benefit from it?

Daniel Kinzler (WMDE) (talkcontribs)

The main reason to introduce sharding is not query performance. The main reason is keeping the total size of tables down, to keep them manageable for schema updates. On-disk size is also a concern.

At least this is from what I gathered from the feedback I got from Jaime and others during earlier rounds of MCR schema discussions. My original proposal does not have the page ID here and did not consider sharding. I added this to reflect discussions we hat at the Summit and at the Hackathon.

In any case: sharding the revision table is not part of this proposal. I'm only suggesting to define the new slots and content tables in a way that allows sharding to be introduced easily. The comments on sharding the revision table are idle musing at this point. Changing revision to use page_id+rev_id as its primary key would indeed be a bit of work, and a separate project.

As to the cases you mentioned:

  • Listing all revisions would need some extra logic to iterate over all shards. Order would not be by revision ID or page ID, but by shard+page-rev. This needs extra code, but should not be a problem for performance.
  • Anything that selects by or joins against rev_id would need to be changed to include page_id in the condition, if we start sharding the revision table.
  • prop=revisions is a problem without page IDs. Without page IDs, finding specific revisions becomes expensive.
  • When selecting a single revision, or two revisions, of a single page, sharding gives a slight advantage, since you are hitting a smaller DB index (provided you also know the page id, which you should).

But none of this is required for MCR, nor implied by the proposed schema.

Daniel Kinzler (WMDE) (talkcontribs)

@Anomie: do you think slot_page and content_page are problematic or need further discussion? Otherwise, I think this thread can be closed.

Anomie (talkcontribs)

I really want to hear from Jaime on this one. The schema would be a lot cleaner without them.

Reply to "<code>slot_page</code> and <code>content_page</code>"