Talk:Multi-Content Revisions/Content Meta-Data


I do not support the current proposal; I will create an alternative one

JCrespo (WMF) (talkcontribs)

Multi-Content Revisions/Content Meta-Data#Scalability This is wrong, and it doesn't take into account that you are basically consolidating, for all future, things such as image_revisions, (page) revisions, category_revisions, and all other types of content revisions into the same couple of tables, generating hugely tall tables, when we already have problems with the current sizes and no one has produced a proper partitioning solution that works in all cases. There is currently partitioning rolled out for logging and revisions, but it is completely undocumented and doesn't work except in a few cases (because it requires index changes). I do not see you solving that issue.

For those people who support this "clean" proposal: know that it will be slower than more conservative approaches, which could keep the same idea without slowing down the queries or making maintenance much slower. Slow maintenance == slow deployments == slow MediaWiki growth.

You should keep the meta-information for revisions of different kinds in separate tables; otherwise this will not work, for performance reasons but also for maintenance reasons (think of the multiple extensions that will add information to the same table which, in your own words, "we will never clean up"). In the same way, some twenty other extensions created content and were later discontinued (but at least they stored their data in their own separate sets of tables).
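To make this concrete (the table names below are invented for illustration, not a concrete proposal yet), the same linking idea can be kept while giving each kind of content its own narrow table:

  -- One meta table per kind of content revision, instead of one shared table.
  -- Each table can be maintained, partitioned or dropped independently.
  CREATE TABLE page_content_meta (
    pcm_revision_id BIGINT UNSIGNED NOT NULL,
    pcm_content_id  BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (pcm_revision_id, pcm_content_id)
  );
  -- ...and likewise image_content_meta, category_content_meta, and so on.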

This may all seem clean and nice on paper to the casual observer, but I do not see you giving actual numbers based on the actual performance issues we are currently suffering. And you have shown me many times that you clearly lack the DB knowledge to take care of those ("we could just use MySQL partitioning, right?"). Which is not a big deal at all, except for the fact that you actively ignore the warnings from the people who suffer these issues every day, and end up harassing them by email just to get your point across.

I tried to work with you, but as you do not want me to, I will have to present my own alternative multi-content revision proposal, one that takes into account realistic maintenance and performance issues and includes a migration plan that requires minimal changes to the database, so it can be deployed faster, is more backwards compatible, and does not produce 10000-million-row tables.

Daniel Kinzler (WMDE) (talkcontribs)

Thank you for the feedback. I have clarified some points of my proposal that perhaps led to some misconceptions. If you have an alternative approach that allows us to cover the same needs at a lower cost, I'm very interested in hearing about it. In the meantime, I would like to respond to some of the points you made:

  • Extensions don't add revisions; users do. One edit, one revision. The only thing that would create more revisions is users making more edits.
  • "category_revisions, and all types of content revisions" are already there; they are just either in the wikitext or in associated pages. Editing them already creates a revision now, and this will not change. We just gain a dedicated editing interface and an integrated page history.
  • "we will never clean up" refers to the fact that we will not throw away user-generated content. Once it is there, its entire history needs to be kept available.
  • Replacing image_revisions (that is, oldimage) with MCR was an idea floated in the context of the oldimage RFC, but it's not part of this proposal, nor is any commitment to that idea implied.
  • 10000 million rows (in a table containing just two integers) is a pessimistic extrapolation for the case of quadrupling our content, which at the current doubling rate will take 16 years (two doublings at roughly eight years each).
  • Partitioning strategies for MySQL tables are out of scope for this proposal, but should of course be considered if partitioning is a blocker to deploying MCR. It would make sense to have a separate RFC to discuss partitioning options.
  • From my limited understanding of MySQL partitions, partitioning by content model or role does not seem a good match for our access patterns, as fetching the content for a single revision would hit multiple partitions. We should rather partition by the page ID modulo the partition count, so that all information for the revisions of a single page lands on the same partition. Partitioning by namespace would also be an option. But as I said, that's a topic for a separate RFC.
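A rough sketch of that partitioning idea in MySQL (the table and column names here are illustrative only, not the proposed schema; HASH partitioning is a modulo on the given expression):

  -- Partition the revision-to-content linking table by page ID modulo the
  -- partition count, so all revisions of a page stay on one partition.
  CREATE TABLE slots (
    slot_page_id     INT UNSIGNED    NOT NULL, -- partition key
    slot_revision_id BIGINT UNSIGNED NOT NULL,
    slot_content_id  BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (slot_page_id, slot_revision_id, slot_content_id)
  ) PARTITION BY HASH (slot_page_id) PARTITIONS 64;

Note the catch: MySQL requires the partitioning column to be part of every unique key, so the "just two integers" table would have to carry the page ID as a third column; that is exactly the kind of index change mentioned above.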

Anyway, I'm very curious to see your alternative proposal for integrating the revisions of multiple documents in a single page history, with minimal changes to the database. Let me know when you have a draft I can look at.


"Phase II: Migrate Archive" makes no sense

Anomie (talkcontribs)

Phase II: Migrate Archive says "Set MediaWiki to not remove rows from the text table when deleting revisions", but MediaWiki only ever did that between r7318 (30 Jan 2005) and r8772 (1 May 2005). The ar_text and ar_flags fields have been unused except for legacy rows since r8772. The sub-bullets there don't make sense either.

And unless you're adding logic to MediaWiki to copy data out of the content table back into the text table on undeletion, this Phase II would totally break undeletion.

What really needs to happen is one of the following:

  • Migrate rows that have non-null ar_text and ar_flags into the text table, blanking those fields and filling in ar_text_id in the process (sketched below). Then populateContentTable.php would have to handle both the revision and archive tables.
  • When populateContentTable.php processes the archive table, have it go from ar_text+ar_flags→content in one step. I'd recommend this, and it eliminates Phase II entirely.
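For illustration, the first option would amount to something like the following for each legacy row (a sketch only: batching, locking and error handling are omitted, and using ar_id as the row key is an assumption):

  -- Copy one legacy archive row's inline text into the text table, then
  -- point ar_text_id at the new row and blank the old fields.
  INSERT INTO text (old_text, old_flags)
    SELECT ar_text, ar_flags FROM archive WHERE ar_id = @row;
  UPDATE archive
    SET ar_text_id = LAST_INSERT_ID(), ar_text = '', ar_flags = ''
    WHERE ar_id = @row;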
Daniel Kinzler (WMDE) (talkcontribs)

Thanks for pointing this out, Anomie. I'm a bit fuzzy on the state and content of the archive table on the live cluster. I'll adapt the proposal according to your suggestion.
