Talk:Revtagging

From mediawiki.org

Separation of OLTP/OLAP

"No, in order to answer the above questions we will need to combine revtagging data and data from the MediaWiki database"

When combining revtagging data with online data in the database, can this be done through the API or something similar? I'm thinking there is no need to record, at event time, data that is already online, as long as the revtag is recorded with the event, since the rest can be extracted later.

Are there any other requirements here that require intermixing of transactional data and analytical logging? Tychay (talk) 00:22, 20 June 2012 (UTC)

Tagging using Bitmaps

Using bitmaps to implement tagging would be fast and space-efficient. enwiki has ~17m users, so at one bit per user a single tag's bitmap would take roughly 2 MB.
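A minimal sketch of the bitmap idea, assuming one bit per user ID and one bitmap per tag. The `TagBitmap` class, the tag names and the user IDs are hypothetical and only illustrate the technique; this is not an existing MediaWiki facility.

```python
class TagBitmap:
    """One bit per user ID; bit i is set when user i carries the tag."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)  # 8 users per byte

    def set(self, user_id):
        self.bits[user_id >> 3] |= 1 << (user_id & 7)

    def test(self, user_id):
        return bool(self.bits[user_id >> 3] & (1 << (user_id & 7)))

    def count(self):
        # Number of users carrying the tag.
        return sum(bin(b).count("1") for b in self.bits)

    def intersect(self, other):
        # Users carrying both tags: a bytewise AND of the two bitmaps.
        result = TagBitmap(self.size)
        result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return result


NUM_USERS = 17_000_000  # approximate enwiki user count quoted above

mobile = TagBitmap(NUM_USERS)   # hypothetical tag
ab_test = TagBitmap(NUM_USERS)  # hypothetical tag
for uid in (3, 42, 16_999_999):
    mobile.set(uid)
for uid in (42, 100):
    ab_test.set(uid)

both = mobile.intersect(ab_test)
```

Each bitmap here occupies 2,125,000 bytes, and combining two tags is a single pass of bitwise ANDs rather than a join.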

MediaWiki core already has a revision tagging feature

I'm a little confused by this page. MediaWiki core already has a revision tagging feature. It's described at Manual:Tags. It involves the following database tables:

Is this tagging system insufficient? If so, why? --MZMcBride (talk) 19:22, 13 November 2012 (UTC)

Good point! However, there are a couple of concerns:
  1. some analytical data (A-B test campaigns, for example) doesn't need to be stored as RecentChanges tags. (You don't want Special:Tags to get all cluttered ;-).)
  2. extracting data during analytical processing is expensive, especially when it is buried inside the RecentChanges db structure (not only that, it needs to be extracted from the ts_tags blob in the data field)
  3. data is not stored efficiently in tags (the tag text is stored each time); it would be more efficiently stored as a bitfield or something similar, at least for analytics. (The existing format is efficient for RecentChanges, FlaggedRevs, AbuseFilter, etc., since it is only one blob covering all those cases and the only time it needs to be accessed is when the revision is accessed.)
I think that when this information is coincident, we can (and should) put it in the transactional database using ChangeTags::AddTags(). I think what RevTagging is asking is that the analytical system also be "pinged" with the same information via a call to the pixel service, so that analysis doesn't need to join against the database with a LIKE '%tagname%' query --Tychay (talk) 04:01, 16 November 2012 (UTC)
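The bitfield storage suggested in point 3 could look roughly like the sketch below, assuming each tag is assigned a fixed bit position. The `RevTag` flags and the helper functions are hypothetical names chosen for illustration, not part of MediaWiki.

```python
from enum import IntFlag


class RevTag(IntFlag):
    # Hypothetical tag-to-bit assignments; a real system would need a registry.
    MOBILE_EDIT = 1
    AB_TEST_A = 2
    AB_TEST_B = 4


def encode(tag_names):
    """Pack a list of tag names into one integer, one bit per tag."""
    field = RevTag(0)
    for name in tag_names:
        field |= RevTag[name]
    return int(field)


def has_tag(field, tag):
    """Test a single tag with a bitwise AND instead of a text match."""
    return bool(RevTag(field) & tag)


packed = encode(["MOBILE_EDIT", "AB_TEST_B"])
```

With this layout the tag text is stored once, in the flag definition, and each revision carries a single small integer.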
Note that if there were a use for it, we could easily switch to a database format that allows individual tags to actually be queried. Daniel Friesen (Dantman) (talk) 04:30, 16 November 2012 (UTC)
I'm not sure what you mean by "buried inside the RecentChanges db structure". --MZMcBride (talk) 04:37, 16 November 2012 (UTC)
Point taken, I was confused between change_tag and tag_summary. I didn't realize they are denormalized with respect to each other. However, let's say we are using change_tag and not tag_summary, and we want to find a count of all edits tagged with "mobileedit". We still have SELECT COUNT(*) FROM change_tag AS ct, revision AS r WHERE ct.ct_tag='mobileedit' AND ct.ct_rev_id=r.rev_id AND r.rev_timestamp>(some number). The index on change_tag is a multi-column key on ct_rev_id and ct_tag, so the query would first select ALL revisions in the last 30 days before doing the join. (My bad: there is already a fourth multi-column index with ct_tag as its first column; if that is used or hinted, the query shouldn't be too expensive.)
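The count query above can be tried against a toy schema. This is a sketch using SQLite, with only the columns named in the query; the real MediaWiki tables have more columns, and the index name `tag_rev_demo` and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the MediaWiki tables (assumption).
    CREATE TABLE revision (
        rev_id        INTEGER PRIMARY KEY,
        rev_timestamp TEXT NOT NULL        -- YYYYMMDDHHMMSS
    );
    CREATE TABLE change_tag (
        ct_rev_id INTEGER,
        ct_tag    TEXT NOT NULL
    );
    -- Index led by ct_tag, so the tag filter can be applied first.
    CREATE INDEX tag_rev_demo ON change_tag (ct_tag, ct_rev_id);
""")

conn.executemany("INSERT INTO revision VALUES (?, ?)", [
    (1, "20121001000000"),
    (2, "20121110000000"),
    (3, "20121120000000"),
    (4, "20121125000000"),
])
conn.executemany("INSERT INTO change_tag VALUES (?, ?)", [
    (1, "mobileedit"),
    (2, "mobileedit"),
    (3, "othertag"),
    (4, "mobileedit"),
])

cutoff = "20121101000000"  # stands in for "(some number)"
(tagged_count,) = conn.execute(
    """
    SELECT COUNT(*)
    FROM change_tag AS ct
    JOIN revision AS r ON ct.ct_rev_id = r.rev_id
    WHERE ct.ct_tag = 'mobileedit'
      AND r.rev_timestamp > ?
    """,
    (cutoff,),
).fetchone()
```

Whether a given database engine actually picks the tag-first index depends on its planner and statistics, so the plan should be checked on the real data.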
Not to mention that Special:Tags would be cluttered with tags.
So the cost is:
  1. the database footprint (storage of an extra row or two per revision: one row per tag, plus the recent changes row(s)),
  2. the extra time to join against a live data store (which could be alleviated if we made better use of memcache to hold intermediates),
  3. the clutter in special pages.
I think many revtags could be implemented as edittags, as you mention. Note, though, that the queries I have provided are not accurate: edittags are part of recentchanges, so the addition or removal of a tag is tracked inside recentchanges, and some additional tracking might be needed to take that into account. For instance, recentchanges gets purged periodically; what happens to change_tag and tag_summary? --Tychay (talk) 02:09, 27 November 2012 (UTC)