Flow/Database

From mediawiki.org

See sql/tables.json for the up-to-date schema, though without these example queries.

flow_workflow[edit]

  • defines a flow instance, which is based on workflow_namespace and workflow_title_text (corresponding to a MediaWiki page title/ns).
CREATE TABLE `flow_workflow` (
  `workflow_id` binary(16) NOT NULL,
  `workflow_wiki` varchar(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `workflow_namespace` int(11) NOT NULL,
  `workflow_page_id` int(10) unsigned NOT NULL,
  `workflow_title_text` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `workflow_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `workflow_last_update_timestamp` binary(14) NOT NULL,
  `workflow_lock_state` int(10) unsigned NOT NULL,
  `workflow_type` binary(16) NOT NULL,
  PRIMARY KEY (`workflow_id`),
  KEY `flow_workflow_lookup` (`workflow_wiki`,`workflow_namespace`,`workflow_title_text`),
  KEY `flow_workflow_update_timestamp` (`workflow_last_update_timestamp`); 
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • index `flow_workflow_lookup` is used to look up a particular workflow on a page based on workflow namespace and title, example query:
mysql> explain select * from flow_workflow where workflow_wiki = "mediawikiwiki" and workflow_namespace = "103" and workflow_title_text = "LDAP_Authentication" order by workflow_id DESC limit 1;
+------+-------------+---------------+------+----------------------+----------------------+---------+-------------------+------+----------------------------------------------------+
| id   | select_type | table         | type | possible_keys        | key                  | key_len | ref               | rows | Extra                                              |
+------+-------------+---------------+------+----------------------+----------------------+---------+-------------------+------+----------------------------------------------------+
|    1 | SIMPLE      | flow_workflow | ref  | flow_workflow_lookup | flow_workflow_lookup | 327     | const,const,const |  365 | Using index condition; Using where; Using filesort |
+------+-------------+---------------+------+----------------------+----------------------+---------+-------------------+------+----------------------------------------------------+
1 row in set (0.00 sec)

Better to use page id if you have it:

mysql> explain select * from flow_workflow where workflow_wiki = "mediawikiwiki" and workflow_page_id = 461183 order by workflow_id DESC limit 1;
+------+-------------+---------------+-------+----------------------+---------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys        | key     | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+----------------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | flow_workflow | index | flow_workflow_lookup | PRIMARY | 11      | NULL |    2 | Using where |
+------+-------------+---------------+-------+----------------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

flow_topic_list[edit]

  • discussion flow to topic flow association so we can pull a list of topics for a particular discussion
CREATE TABLE `flow_topic_list` (
  `topic_list_id` binary(16) NOT NULL,
  `topic_id` binary(16) DEFAULT NULL,
  UNIQUE KEY `flow_topic_list_pk` (`topic_list_id`,`topic_id`),
  KEY `flow_topic_list_topic_id` (`topic_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • index `flow_topic_list_topic_id` used to pull a list of topic_id for a topic_list_id, example query:
mysql> explain SELECT *  FROM `flow_topic_list`   WHERE topic_list_id = unhex('050B8FE7DE931041240478')  ORDER BY topic_id DESC LIMIT 500;
+------+-------------+-----------------+------+---------------+---------+---------+-------+------+--------------------------+
| id   | select_type | table           | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+------+-------------+-----------------+------+---------------+---------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | flow_topic_list | ref  | PRIMARY       | PRIMARY | 11      | const |  650 | Using where; Using index |
+------+-------------+-----------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)
mysql> explain SELECT *  FROM `flow_topic_list`,`flow_tree_revision`,`flow_revision`  WHERE (tree_rev_id = rev_id) AND (tree_rev_descendant_id = topic_id) AND topic_list_id = unhex('050B8FE7DE931041240478') ORDER BY rev_id DESC LIMIT 500;
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+-----------------------------------------------------------+
| id   | select_type | table              | type   | possible_keys                       | key                         | key_len | ref                                   | rows | Extra                                                     |
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+-----------------------------------------------------------+
|    1 | SIMPLE      | flow_topic_list    | ref    | PRIMARY,flow_topic_list_topic_id    | PRIMARY                     | 11      | const                                 |  650 | Using where; Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | flow_tree_revision | ref    | PRIMARY,flow_tree_descendant_rev_id | flow_tree_descendant_rev_id | 11      | flowdb.flow_topic_list.topic_id       |    1 |                                                           |
|    1 | SIMPLE      | flow_revision      | eq_ref | PRIMARY                             | PRIMARY                     | 11      | flowdb.flow_tree_revision.tree_rev_id |    1 |                                                           |
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+-----------------------------------------------------------+
3 rows in set (0.00 sec)

EWWW temporary AND filesort. Anyways...

  • index `flow_topic_list_topic_id` is used to look up the topic_list_id for a topic_id, example query:
mysql >  explain SELECT * FROM `flow_topic_list` WHERE topic_id = unhex('050B8FE7DEB71041240478') LIMIT 1;
+------+-------------+-----------------+------+--------------------------+--------------------------+---------+-------+------+--------------------------+
| id   | select_type | table           | type | possible_keys            | key                      | key_len | ref   | rows | Extra                    |
+------+-------------+-----------------+------+--------------------------+--------------------------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | flow_topic_list | ref  | flow_topic_list_topic_id | flow_topic_list_topic_id | 11      | const |    1 | Using where; Using index |
+------+-------------+-----------------+------+--------------------------+--------------------------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

flow_tree_revision[edit]

  • topic/post content revisions, it has one to many relation to flow_revision table
flow_tree_revision | CREATE TABLE `flow_tree_revision` (
  `tree_rev_descendant_id` binary(16) NOT NULL,
  `tree_rev_id` binary(16) NOT NULL,
  `tree_orig_user_id` bigint(20) unsigned NOT NULL,
  `tree_orig_user_ip` varbinary(39) DEFAULT NULL,
  `tree_orig_user_wiki` varbinary(64) NOT NULL,
  `tree_parent_id` binary(16) DEFAULT NULL,
  PRIMARY KEY (`tree_rev_id`),
  UNIQUE KEY `flow_tree_descendant_rev_id` (`tree_rev_descendant_id`,`tree_rev_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • index `flow_tree_descendant_rev_id` - mainly joins with flow_revision table and accessed by either `tree_rev_descendant_id` or `tree_rev_id`

example query:

mysql> explain SELECT *  FROM `flow_tree_revision` JOIN `flow_revision` `rev` ON ((tree_rev_id = rev_id))  WHERE tree_rev_descendant_id = unhex('04934C2C5C049D5BF77CEA')  ORDER BY rev_id DESC LIMIT 1;
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+--------------------------------------------------------+
| id   | select_type | table              | type   | possible_keys                       | key                         | key_len | ref                                   | rows | Extra                                                  |
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+--------------------------------------------------------+
|    1 | SIMPLE      | flow_tree_revision | ref    | PRIMARY,flow_tree_descendant_rev_id | flow_tree_descendant_rev_id | 11      | const                                 |    3 | Using index condition; Using temporary; Using filesort |
|    1 | SIMPLE      | rev                | eq_ref | PRIMARY                             | PRIMARY                     | 11      | flowdb.flow_tree_revision.tree_rev_id |    1 |                                                        |
+------+-------------+--------------------+--------+-------------------------------------+-----------------------------+---------+---------------------------------------+------+--------------------------------------------------------+
2 rows in set (0.00 sec)

flow_revision[edit]

  • individual revision storage for header/topic/post
CREATE TABLE `flow_revision` (
  `rev_id` binary(16) NOT NULL,
  `rev_type` varchar(16) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `rev_user_id` bigint unsigned NOT NULL,
  `rev_user_ip` varbinary(39) DEFAULT NULL,
  `rev_user_wiki` varbinary(64) NOT NULL,
  `rev_parent_id` binary(16) DEFAULT NULL,
  `rev_flags` tinyblob NOT NULL,
  `rev_content` mediumblob NOT NULL,
  `rev_change_type` varbinary(255) DEFAULT NULL,
  `rev_mod_state` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `rev_mod_user_id` bigint(20) unsigned DEFAULT NULL,
  `rev_mod_user_ip` varbinary(39) DEFAULT NULL,
  `rev_mod_user_wiki` varbinary(64) DEFAULT NULL,
  `rev_mod_timestamp` varchar(14) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `rev_mod_reason` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `rev_last_edit_id` binary(16) DEFAULT NULL,
  `rev_edit_user_id` bigint(20) unsigned DEFAULT NULL,
  `rev_edit_user_ip` varbinary(39) DEFAULT NULL,
  `rev_edit_user_wiki` varbinary(64) DEFAULT NULL,
  `rev_content_length` int(11) NOT NULL DEFAULT '0',
  `rev_previous_content_length` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`rev_id`),
  UNIQUE KEY `flow_revision_unique_parent` (`rev_parent_id`),
  KEY `flow_revision_user` (`rev_user_id`,`rev_user_ip`,`rev_user_wiki`),
  KEY `flow_revision_type_id` (`rev_type`,`rev_type_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • This table is primarily accessed by the primary key rev_id

flow_tree_node[edit]

  • closure table implementation of tree storage in sql
CREATE TABLE `flow_tree_node` (
  `tree_ancestor_id` binary(16) NOT NULL,
  `tree_descendant_id` binary(16) NOT NULL,
  `tree_depth` smallint(6) NOT NULL,
  PRIMARY KEY (`tree_ancestor_id`,`tree_descendant_id`),
  UNIQUE KEY `flow_tree_constraint` (`tree_descendant_id`,`tree_depth`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • example queries:
mysql> explain SELECT tree_ancestor_id,tree_descendant_id  FROM `flow_tree_node`  WHERE tree_ancestor_id IN (unhex('04934C2C5B0CA0771C21C2'), unhex('04934C2C5C049D5BF77CEA'));+------+-------------+----------------+-------+---------------+---------+---------+------+------+--------------------------+
| id   | select_type | table          | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+------+-------------+----------------+-------+---------------+---------+---------+------+------+--------------------------+
|    1 | SIMPLE      | flow_tree_node | range | PRIMARY       | PRIMARY | 11      | NULL |   83 | Using where; Using index |
+------+-------------+----------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
mysql> explain SELECT tree_ancestor_id, tree_depth  FROM `flow_tree_node` WHERE tree_descendant_id = unhex('04C0E28D6C1CB60D82B91A');
+------+-------------+----------------+------+----------------------+----------------------+---------+-------+------+--------------------------+
| id   | select_type | table          | type | possible_keys        | key                  | key_len | ref   | rows | Extra                    |
+------+-------------+----------------+------+----------------------+----------------------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | flow_tree_node | ref  | flow_tree_constraint | flow_tree_constraint | 11      | const |    7 | Using where; Using index |
+------+-------------+----------------+------+----------------------+----------------------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

flow_wiki_ref[edit]

 CREATE TABLE `flow_wiki_ref` (
  `ref_src_object_id` binary(11) NOT NULL,
  `ref_src_object_type` varbinary(32) NOT NULL,
  `ref_src_workflow_id` binary(11) NOT NULL,
  `ref_src_namespace` int(11) NOT NULL,
  `ref_src_title` varbinary(255) NOT NULL,
  `ref_target_namespace` int(11) NOT NULL,
  `ref_target_title` varbinary(255) NOT NULL,
  `ref_type` varbinary(16) NOT NULL,
  `ref_src_wiki` varbinary(16) NOT NULL,
  `ref_id` binary(11) NOT NULL,
  PRIMARY KEY (`ref_id`),
  KEY `flow_wiki_ref_idx_v2` (`ref_src_wiki`,`ref_src_namespace`,`ref_src_title`,`ref_type`,`ref_target_namespace`,`ref_target_title`,`ref_src_object_type`,`ref_src_
  KEY `flow_wiki_ref_revision_v2` (`ref_src_wiki`,`ref_src_namespace`,`ref_src_title`,`ref_src_object_type`,`ref_src_object_id`,`ref_type`,`ref_target_namespace`,`re
) ENGINE=InnoDB DEFAULT CHARSET=binary

flow_ext_ref[edit]

CREATE TABLE `flow_ext_ref` (
  `ref_src_object_id` binary(11) NOT NULL,
  `ref_src_object_type` varbinary(32) NOT NULL,
  `ref_src_workflow_id` binary(11) NOT NULL,
  `ref_src_namespace` int(11) NOT NULL,
  `ref_src_title` varbinary(255) NOT NULL,
  `ref_target` blob NOT NULL,
  `ref_type` varbinary(16) NOT NULL,
  `ref_src_wiki` varbinary(16) NOT NULL,
  `ref_id` binary(11) NOT NULL,
  PRIMARY KEY (`ref_id`),
  KEY `flow_ext_ref_idx_v2` (`ref_src_wiki`,`ref_src_namespace`,`ref_src_title`,`ref_type`,`ref_target`(255),`ref_src_object_type`,`ref_src_object_id`),
  KEY `flow_ext_ref_revision_v2` (`ref_src_wiki`,`ref_src_namespace`,`ref_src_title`,`ref_src_object_type`,`ref_src_object_id`,`ref_type`,`ref_target`(255))
) ENGINE=InnoDB DEFAULT CHARSET=binary

Sample queries[edit]

For non-private wikis on the WMF cluster, these should be run on analytics-store. From stat1003, run:

mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet

If you are not using a separate Flow cluster, you can connect to the right database (mywiki), remove the DB prefixes below (flowdb., mediawikiwiki.).

Users posting to a Flow board (may not include summary or header/description)[edit]

 select user_name
   from mediawikiwiki.user
  where user_id in
        (select distinct rev_user_id
           from flowdb.flow_workflow
                    inner join
                flowdb.flow_tree_node       on workflow_id = tree_ancestor_id
                    inner join
                flowdb.flow_tree_revision   on tree_descendant_id = tree_rev_descendant_id
                    inner join
                flowdb.flow_revision        on tree_rev_id = rev_id
          where workflow_wiki='mediawikiwiki'
            and workflow_page_id in
                (select page_id
                   from mediawikiwiki.page
                  where page_namespace = <NAMESPACE_ID>
                    and page_title like <PAGE TITLE WITHOUT NAMESPACE PREFIX>
                )
        )
;