Flow/Architecture

Purpose of this Document
This document is intended to be an early brainstorming session regarding the first flow prototype. Final implementation will likely only tangentially look like what you see here now. In other words this is a pre-implementation brain dump which will be referenced while building the first prototype. In my experience with the ideas for internal architecture laid out the first implementation doesn't take long to actually convert into code.

DISCLAIMER: All decisions, especially in regards to storage layout, are fluid. We fully expect to delete the data created by the prototype on extensiondb1 before deploying the final solution.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are bad, mmkay
In many cases templates are used to encourage workflow within local wikis. The goal for the various workflow models is to be dynamic enough to be managed from the wiki (by local administrators) to cover use cases currently handled by workflow suggestions inside templates. In other words flow will implement a whole bunch of Lego pieces, and the individual community's will stick them together into the various workflows they need.

WorkflowObject
Globally unique identifier of an individual instance of a workflow. A discussion topic is a FlowObject instance. A request for deletion of page Abc is a FlowObject instance. etc.


 * ID - maybe UUID
 * Home wiki ?
 * Only needed if we decide to have a single combined flow that supports multiple wikis.
 * May not be needed even then.
 * Home page
 * Essentially where is this flow accessed from/was created
 * createDate
 * lastUpdateTime
 * creatorId
 * does this distinctly identify the user? likely needs to be combined with homeWiki to decide exactly which user this is
 * If we need two pieces of data to uniquely identify a user then any and all relationships (tags/revisions/etc) must also store those two pieces of data.
 * At a minimum we will need some way to map from a creatorId into a displayable username.
 * The common solution to this is to denormalize and store the users name in the record
 * Could we piggy back off of whatever the single sign on does?
 * title/name
 * Possibly tied into wikidata for automatic localization?
 * summaryText
 * lockState
 * Enumeration, not boolean. Allows for multiple states beyond locked/unlocked,
 * workflowModel
 * contentLanguage
 * Workflows may occur in different languages, this can help act as a filter for workflows a user can understand.

Considerations
For example, if an open (based on lockState?) request for deletion exists on a specific article, cannot open a new one. If a closed RFD exists for a page it should possibly be linked to a newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community. The prototype implementation may hard code these types of considerations to be fleshed out at a later point.

ID Generation
First step to enabling horizontal scalability by removing dependence on a central sequence to give out id numbers. As we plan to use these ids within mysql for inserts they need to be constantly increasing, randomly generated ids cause performance issues in mysql's storage system.

UUID

 * MediaWiki contains a UIDGenerator class which generates either 88 or 128 bit uuids
 * uses file locking on the application server to guarantee uniqueness in non-random uuids
 * Can generate timestamp prefixed uuid's which fulfills the requirement for constantly increasing id values
 * It would be pointless to store the GUID's in their text representation, must be binary(16) or decimal(39).
 * text->binary: UNHEX( '7b95ee9a9d8a4f3283e2c413cc764b2d' )
 * binary->text: HEX( binary_uuid_column )

Ticket Servers
http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

tl/dr: The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (     `id` bigint(20) unsigned NOT NULL auto_increment,      `stub` char(1) NOT NULL default '',      PRIMARY KEY  (`id`),      UNIQUE KEY `stub` (`stub`)    ) ENGINE=MyISAM

SELECT * from Tickets64 returns a single row that looks something like:

+---+--+   | id                | stub | +---+--+   | 72157623227190423 |    a | +---+--+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a'); SELECT LAST_INSERT_ID;

To prevent SPOF they use this on multiple servers, with one chosen at random when an insert needs to be done, which serve up differing ids using mysql auto-increment-offset and auto-increment-increment settings.

Time + shard + autoincr
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

https://github.com/twitter/snowflake

tl/dr: Instagram combines 3 different values into a 64 bit number. The first 41 bits of the id is a timestamp in ms, so the keys naturally increase and provide a logical sorting order across shards. Snowflake from twitter uses the same id generation method, but with worker_id rather than shard_id. Really its about the same thing.


 * 41 bits for time in milliseconds (gives us 70 years of IDs with a custom epoch)
 * 13 bits that represent the logical shard ID
 * 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs per shard per millisecond

This is done with a stored procedure in postgresql in their version. Our setup with MariaDB as of 5.3 can serve up millisecond timestamps, and the REPLACE INTO method from can be used for the auto-incrementing sequence. We should prefer to use the database ms timestamp so the timestamp and auto-incr counter come from same host strengthening the uniqueness of the id. If we wish to separate id generation from the database connection we could request ops to run the snowflake id generation service, but the added complexity of yet another service may not be necessary. We can also shuffle the bit counts around a little bit to get different tradeoffs, the proposed 864 logical shards will need only 10 bits for the shard id.

Other
As i see it, there are basically three ways to choose/store a shard_id:


 * 1) Generate a UUID and calculate a shard from it, that will get us random distribution of objects across the shards which is unacceptable for our use case.
 * 2) Store an independant column that holds the shard id or something that a shard id can be calculated from like page_id or user_id
 * 3) Choose your own shard id and integrate the into the UUID

By choosing our own shard key in options 2 & 3 we can co-locate all FlowObjects related to the same article onto the same shard. By integrating the shard id into the UUID as in option 3 we simplify an extra field out of all relationships and queries that interface with the FlowObject, the tradeoff is we have to map to a logical shard once and we cannot change that implementation down the road. We find option 2 offers the best middle ground between the competing factors and keeps the implementation simpler.

Proposed ID Generation Scheme
Tables will all have composite primary keys. The first primary key will be the partition key. The partition key will be either user_id or page_id. This means the ID's we generate within flow do not have any effect on sharding, they just need to be unique.

We cannot use auto-incr columns for any ids within flow, keeping things properly sync'd when doing splits would be unwise. Like the auto-incr field we should use an id that is unique and always increasing. The simplest solution for this is the timestamp based 128bit UUID's. These do incur the penalty of needing 128bits to ensure uniqueness, when we are very unlikely to need ids of even 64bits were they issued similar to auto-incr starting t 0. We find the simplicity of implementation and reduction in moving parts of 128bit uuid's outweighs their cost and will utilize them for all ids generated within flow.

Benefits:
 * No external moving parts or network requests. Id's generating on the client and guaranteed unique
 * Everything with an Id has a built in creation timestamp
 * Unique: UUID's are specifically designed for this distributed id generation
 * Available, Uncoordinated: Client generates its own keys, no network partitions to worry about
 * Directly Sortable: The ids should be sortable without loading the full objects that the represent. This sorting is by creation timestamp.

Downsides:
 * Sort order can be off as much as time drift between application servers
 * They are maintained with reasonably similar clocks, the small differences dont matter
 * 128bit id's means larger indexes that take more memory( what % though? not double )
 * Dealing with memory issues is not as big of a concern after sharding

I think the benefits outway the downsides, other opinions?

For more information about sharding, see

WorkflowMetadata
Each flow object maps to some type of workflow metadata. The metadata is specific to the type of workflow that is being worked with. The specific type of metadata to expect is defined by the workflowModel field of the FlowObject.

The various models need to be programmable such that local wikis can use them as they need and not be locked into pre-programmed ideas. Templates currently allow a great deal of flexibility with no automated enforcement. These models must represent a middle ground between template flexibility and pre-programmed strict workflows.


 * 1 to Many mapping between FlowObject and the Workflow Metadata rows
 * Each row of the metadata is a key/value pair and the object id
 * Metadata needs to be rather abstract to support all the different use cases. Likely it could be implemented like user options are currently.
 * Collecting all the metadata for an object would be something like SELECT metadata_key,metadata_value FROM flow_metadata where metadata_object_id = '12345'

WorkflowModel
Each wiki will be able to define their own models. In the initial prototype the discussion model will be hardcoded implementation, but the model storage will store the info needed to turn the wiki's model definition into an instantiated object capable of understanding the metadata and performing actions.

For wiki defined models there are at least two options:
 * Real wiki page containing the definition in the WDL format
 * Special page and corresponding model table in database

Possible workflows to support include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

Special page implementation could offer a hook for models to provide options needed to configure the model on wiki. A Db table can then store the configured model to a unique name allowing it to be used as a model on any articles flow. A model should not be limited to a single type of block, it should be able to string together multiple blocks into a single result page. For example the Discussion is a Summary block at the top of the page and a TopicList block below that. The TopicList is a Block acting as a wrapper around a set of Topic Blocks. A Topic Block is a set of SinglePost Blocks arranged in a tree. (This may be too abstract)

The implementation must be generic enough so each Block is its own subscribable workflow object; gaining subscriptions, tagging, and other properties of the object. The implementation must gracefully handle updates to the topics triggering notifications at the discussion level.


 * FlowDiscussion
 * 0 or 1 Summary
 * 1 to Many: FlowPostSingular


 * FlowRequestForDeletion
 * 1 Summary (Reason for request)
 * 1 to Many: FlowEnumeratedLines
 * FlowBlockNotice
 * 1 Summary (Block Reason)
 * Functional Elements
 * Button for 'appeal this block'
 * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular


 * And Many More

WorkflowAction
This typically represents an action performed by a single user. In some cases like the Summary object there is a single object representing all actions, but internally it will reference the Revisions table which maintains the single-action relationship. In general cases this could be a reply to a message or a vote in a consensus discussion, etc. There will typically be a 1 to Many relationship between a WorkflowObject and WorkflowActions as defined by the WorkflowModel.


 * FlowSummary - Summary objects are 1 to 1 with a particular model, allowing for arbitrary content to be displayed with the models that require a summary.
 * objectId
 * text
 * comment


 * FlowEnumeratedLines
 * objectId
 * Enum value (e.g. vote yes/no, etc.)
 * text
 * comment


 * FlowPostSingular - A tree structure representing a discussion. Possible methods of storing trees at Flow_Portal/Architecture/Discussion_Storage


 * objectId
 * If sharding by object_id, it may be required for consistency sake to always include the object uuid.
 * createdByUserId
 * replyToFlowPostSingularId?
 * revision
 * content
 * summary
 * etc.

Revision
Posts, Summarys, and possibly other content within Flow needs to be revisioned much like normal wiki pages. Ideally we want to reuse the existing revisions code that is already written. Revisions within flow will be slightly different from revisions of a wiki page though, most notably wiki revisions store a full pages content, but we will have many small fragments that should be individually stored.

The simplest way to store all these fragments would be to create a wiki page for each possible FlowObject. The wiki page's revision content would then be a multi-part content type that stores the various fragments.
 * In an ideal world we should use a multi-part content type that allows us to pull individual fragments out of the document using simple string indexing, rather than having to parse the content of the entire revision before being able to pluck out just the fragments which are interesting to us.
 * The parsoid team will be starting on something along these lines in July: https://bugzilla.wikimedia.org/show_bug.cgi?id=49143

Store Flow revisions in the master database in the standard Revisions table

 * Flow revisions don't map directly to a pageId or Title object, we would need a fake page (probably in a custom namespace thats not user editable)
 * The total number of new revisions per day/week/etc shouldn't change very much versus existing workflows that also edit wiki pages directly
 * Currently every new post or summary update is already a new revision
 * Flow will most likely be storing the HTML+RDFa from Visual Editor rather than wikitext that most of the core revisio entries use.
 * ContentHandler exists and can be used to store 'different' types of data in the revision text.
 * enwiki primary database is already heavily loaded, even if we are only using it for the revisions table and pushing the content to ExternalStorage.
 * Would lose some portion of the benefit from sharding the Flow database

Copy from the core Revision object just the revision functionality that flow needs in a new object?

 * There are many years of built up bug fixes and knowledge within the current Revision implementation, ideally we want to benefit from that instead of re-implementing.
 * Any future upgrades/bugfixes/etc to the core revision object will also have to be duplicated into the flow revision to maintain parity.
 * In other words: Maintaining feature parity will be a constant battle, perhaps not a great idea.

Start from scratch with a brand new revisions implementation?

 * Flow revisions have a different use case than Article revisions
 * Articles have hundreds of revisions
 * If flow stores a revision per post, most posts will only ever have a single revision
 * Other pieces of flow, like Discussion Summarys, will have perhaps dozens or hundreds of revisions but still not nearly as many as the article its related to.
 * My intuition is flow objects will have a fraction of the # of revisions of the same text that wiki articles do, but how to be more certain?
 * Revisions are directly related to flow objects, their pk is (page_id, revision_id) with a fk to object_id
 * Do we need a secondary index from revision_id -> page_id ?
 * Our preferred sharding strategy is to pick the correct db at the beginning of the request and not change, but we still need the page_id for indexed lookups
 * Could keep an index on just revision_id for lookups when we know we are already on the correct shard, since revision_id's are unique. This does add some small overhead.
 * Seems a bit of a hack to have page_id in the PK but not use it for queries.
 * Could attempt to define an interface that is shareable between Flow and Core Revisions.
 * Still copy the userCan series of functions, the API for this is reasonable and should be continued
 * Keep use of content handler
 * Should we still separate revision content from the revision with a text table?
 * Might keep managing things easier to stay consistent with the current Revision implementation.
 * The actual content when rendering pages will typically come from memcache so not worried about needing multiple queries or joins to get revision + content.
 * Gives the opportunity to simplify revision functionality to that which flow requires

Other thoughts

 * With flow's sharded database model we dont need to push revision content to external storage?
 * Content storage is a separate concern from the revisions, the revisions should continue to point somewhere else(text table/external storea/revisions dont care) for the actual storage. See Flow_Portal/Architecture/Discussion_Storage
 * Materialized views on revision+content data will be stored in memcache, negating any performance penalties of joining revision against a text table or fetching from external store.

Revisions, Abuse Filter, and Patrolling
If we move outside the core Revision scheme, we still need to handle tasks like the Abuse Filter, Patrolling, and more. I met Krinkle while he is in SF and got a quick 20 minute rundown on how some of this works, and all I can say at this point is its an organically designed system that would be significantly difficult to come up with a new solution for. It would also be significantly difficult to integrate into the existing system.

I'm kind of hoping that we can ignore this part at the beginning, and later with the help of Krinkle find some way to integrate with the existing solution.

Problems which need to be addressed:
 * Flow wants to store the HTML+RDFa directly from VE. The current system for auto-magically detecting abuse works by regex'ing against wiki text.  The two are not compatible.
 * Something about Bots/IRC/etc. Basically the bots
 * Anyone who knows more about this system than i do (everyone) feel free to expand this section.

Where does a post's content actually get stored?
FlowPostSingular should probably just be metadata about the post. The literal post content should likely be stored in either wiki markup or the parsoid DOM representation of the wiki markup. It should be stored in a separate table comparable to existing revision content to facilitate moderation requirements like tweaking the content of a post.

See Content Storage

A Users Flow (or feed of interesting things)
All flow objects will be subscribe-able by any number of users Generating the users feed is a sort on the subscribed flow objects ordered by last updated date. SQL is going to hate us, mysql cant answer this with a single index. Even if it could the objects are sharded across multiple database clusters.
 * Independant subscription table implementing a Many to One relationship between flow objects and users
 * Tables sharded by the object_id field, or the user_id field?
 * Basically depends on which set of queries we want to optimize for, and which will have to fan-out.

Thats ok because we plan to always read it from the cache. We will pre-build this cache as the objects are updated. This will require maintaining two sets of lists. One a simple set, per object, of all users subscribed to that object, and the other a sorted set, per user, containing object ids they are subscribed to ordered by the last update time and capped to a reasonable size(100?). Redis sorted sets will never have contain duplicate ids; adding the same value twice will just update its score(timestamp).

In relation to above, we must investigate the performance implications of a flow object that has 50k subscriptions. I don't see any particular performance issues in this case for the user with 50k subs, although their usability may decline as the proposed 100 item list will only hold a short period of time. I don't think you can reasonably expect usability from 50k subscriptions anyways.

Remembering what a user has already seen
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know what WorkflowModel's have been previously seen by the user
 * Remembering a boolean true seen status for every model instance multiplied by all editors is a ton of data and seems sub-optimal.
 * Could remember the last date a user viewed items from that WorkflowObject within the subscription and only display items newer than that
 * Each topic within a discussion is its own WorkflowObject, so the memory is more granular than just the main page, but is it enough?
 * Need to figure out how that interacts with the suggested redis implementation of subscriptions suggested above.

Suppressed Revisions
Need comparable functionality
 * Could use real wiki pages ala LiquidThreads
 * Could implement a 'work alike' along with some sort of interface to generalize the current suppressed revision code (likely rather difficult)

Search
Handled user side in javascript, in the backend, or both (probably both)?
 * Hooking into lucene extension?
 * Hooking into solr cloud?
 * What elements do the UI call for in search results?

Tags
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
 * Conceptually it might be simpler to store them separately.
 * Storing the separately also provides a much stronger guarantee of not accidently displaying private tags
 * If sharding by page_id, where to store table mapping tag_name to object_id? Don't want to have to query every shard.
 * Figure out how to push the tags into SolrCluster and allow it to do our tag searches?
 * Need to flesh out the use cases for tags to know if SolrCluster is a complete solution or just partial.

So you want to find a talk page
MediaWiki should work great without flow. After installing the flow extension it should continue to work great and the talk pages will be flow discussions.

Currently you take the title of the page and look it up within the NS_TALK namespace. With flow we need something different, but conceptually similar?

Old talk pages: should the current urls still point to the old talk pages, or should they move and flow replaces them on those urls?
 * Current Talk pages comments likely(?) link to Talk:Something directly. For best results they should go to the new flow discussions?
 * If so, then there is also Talk:Something and we would like to continue pointing to the correct data.
 * Same concern as the 2 points above, but with urls from the internet at large. do talk pages get linked directly from outside with any frequency?

Talk page urls:
 * An article in NS_MAIN: /wiki/Talk:Volcano
 * An article in any other NS: /wiki/NameOfNamespace_Talk:Volcano

Kiss flow urls for first prototype: * /wiki/Special:Flow/NameOfNamespace:Volcano?flow=discussion

Possible Flow urls? Not currently supported in core.
 * An article in NS_MAIN: /wiki/Flow:Volcano/Talk
 * An article in any other NS: /wiki/NameOfNamespace_Flow:Volcano/Talk

Mapping URL's to Flow objects
The path Flow:Volcano/Talk should not be some super special case, there should be some way to attach specific types of flows to default paths. Visiting the page when nothing currently exists must work much like the current system, where the user is given the opportunity to create that specific thing. While we may initially hard-code /Talk into the flow prototype, it would be much better if the mappings from a name to some defaulted type of flow object is managed on wiki by the community. Additionally i18n and l10n considerations need to be taken into account as not every wiki uses the word 'Talk' for their talk pages. Mostly this matters in relation to GUID generation and what we use as the GUID namespace/name.

For the prototype i propose the following:

/wiki/Special:Flow/NameOfNamespace:Volcano?flow=discussion&name=foo


 * The flow query parameter refers to the name of a Workflow Definition. Every workflow definition must have a unique name(on a per-wiki basis).  Flow will have, by default, a few Workflow Definitions pre-defined.  One of these will be a 'discussion' workflow, and if the flow query parameter is not provided it will be defaulted to 'discussion'.
 * I'm uncertain at this point in time how i18n and l10n considerations come into play with this parameter, partially because it is defined on-wiki in the general case.


 * The name query parameter refers to the Flow Objects name. The flow object table will have a unique index on ( wiki, namespace, title, definitionId, name ).  This allows for multiple instances of a specific workflow to exist at the same time.  Workflows that require only a single object instance to exist must use a blank string as their name.
 * TBH i'm not entirely convinced object names are a great solution here, but I'm undecided on what is a better solution. We need some method of allowing multiple Flow Objects to exist for a given wiki+title+definition to account for workflow like RFD which can be completed and Locked.  In the future it must be possible to create a new RFD and have the old one arround for historical purposes.

Performance Considerations
How much data will flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximatly 26M articles across all wiki's. Not all of these will have talk page, but many will. Assuming they all have talk pages ranging from just a post or two, to a couple thousand posts on the largest, we should expect a lower bound of perhaps 100M individual replies will need to be stored in perhaps 20M seperate discussion graphs. If each reply consumes 1kb of space that puts a lower bound of at least 100GB of post content before we even get into metadata, indexes, etc.

And that is just the discussions, flow will need to handle many more workflows than just discussions.

To help get an idea of the space required i applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:


 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:


 * 5 sections per talk page
 * 2.5 signatures per section
 * 12 signatures per talk page

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle. It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

ext/mysqlnd
When sharding it is nearly impossible to get away from having a small selection of queries that must be answered by querying multiple shards. To cover this use case flow can query the shards serially with the current ext/mysql extension, but we should not settle for such a mediocre solution.

As of php 5.4 the ext/mysqlnd extension has replaced ext/mysql. ext/mysqlnd contains full backwards compatability with the mysql_* series of functions and is also available in packaged form for use in php 5.3. ext/mysqlnd includes asynchronous query support giving us the ability to contact all shards we are interested in at the same time. This still suffers from long-tail distribution as the number of shards contacted increases, but is obviously prefered over serial querys.

I propose we develop a testing protocol to certify 5.3 + mysqlnd for use on the foundations production application servers. This is basically a backwards compatible driver change, and should require absolutly no changes to existing MediaWiki code.

What needs to be done to get the ball rolling on this, as it will likely take at least a month to certify?


 * First step likely puppet config and labs instance. But this is a far cry from a full test
 * To get full advantage Database.php needs to be mysqli aware: https://gerrit.wikimedia.org/r/#/c/62173/7
 * To take advantage of async queries, some sort of API to utilize them will need to be developed.

Sharding key
The most natural key choices for flow are
 * the page id
 * the user id

Most data should shard by page id. Some data like subscriptions (and private tags? what else) should shard by user id. Even if subscriptions are the primary data stored in the db sharded by user_id i think the current watchlist scalability issues indicate we will need a db sharded by user_id.

To simplify loading/dumping shard data all tables must include the shard key at the begining of their primary key. Most tables will require a composite primary key due to this requirement.

For flows use case there are a couple types of user requests:


 * 1) Requests that contain the namespace + title
 * 2) Requests that contain the Flow Object UUID(and probably still contain namespace+title)
 * 3) Requests for an individual users flow(or feed)
 * 4) Requests involving search/tagging/etc

With the first two types of request, to choose a shard when sharding by page_id we must first lookup namespace+title and find the page_id it refers to. We basically require an unsharded global database from which to perform this lookup. Currently this goes through the mediawiki link cache which should be sufficient. If at some point in the future the page table was sharded following the same scheme we will have the same limitation requiring an unsharded lookup database or a massive fan-out query under the hood.

For the last two types of requests we will be querying information about flow objects from cache/solr/etc. This will be slightly complicated by the requirement to store a composite id with both the page_id and the object_id being referred to. Searches that return specific posts may instead return page_id and post_id, etc. We will need to look into how best to represent composite key references in SolrCloud.

Sharding schemes
We will handle partitioning (i.e., dividing exclusive ranges of data across many hosts) similar to twitters Gizzard by mappings ranges of data to particular shards. These mappings are stored in a forwarding table that specifies lower-bound of a numerical range and what shard that data in that range belongs to.

The forwarding table will be a part of the wiki configuration? Are there better options? As a part of the wiki configuration seems reasonable, but it may be 'slow' to roll out cluster updates in this way. Initially we will put the forwarding table into the wiki config, but I'm open to other suggestions.

Shard Colocation
Databases will hold a contiguous range of the sharding value. The only question is: should the db sharded by user_id be stored in the same mysql database as the db sharded by page_id?

To allow for options in cluster configuration i think we should make it possible to use different names for dbs sharded by page_id and user_id, but that may be more generic than we need. YAGNI and all that. I'm not opposed to having different names, I just don't think we should implement it unless requested by ops.

Splitting reads and writes to the master during shard split
The procedure for splitting a shard calls for moving all master reads to the new master, waiting for the changes to propogate to application servers, then moving all master writes to the new database. The current DatabaseBase.php does not support this, we will need to come up with a solution for gracefully handling the swap over.

Potentially we can implement a wrapper class that takes in two DataBase objects, one for old master and one for new master. It would offer the full interface and direct reads to one side and writes to the other. This breaks the guarantee of reading your own writes, we will have to be careful to not make that assumption even when operating only against master. I'm not currently aware of other options, but we should look into what else is possible.

Are there any concerns with regards to the load balancer implementation?

Data that doesn't naturally shard
There are a couple pieces of data which cannot be sharded by user_id or object_id. Examples of this data are the Workflow Definitions which are defined per wiki. A workflow definition is all the information describing a user-created workflow.

The simple method involves using the primary wiki database for this data, this feels like a perfectly valid solution for such a small amount of data. The other option is to replicate this data to all shards so it can be joined against. From an operations perspective this can get a bit hairy, it involves slaving the shard masters to a 'global shard master' that contains only the shared data. I don't think its worth the complexity. It is a trade off though, it necessitates client side collection(join) of the definitions when performing an action like generating a users flow.

Other sharding implementations
Does anyone offer servers or libraries that could simplify our implementation?
 * MySQL NDB Cluster?
 * There are no durable commits on disk. Commits are replicated, but there is no guarantee that logs are flushed to disk on commit.
 * Not as free as MariaDB
 * Its use case is very narrow: Its good at primary key lookups.  Anything else and it falls on its face.
 * MariaDB Galera Cluster
 * All data is stored on all machines, a write is only complete when it is committed to every active machine in the cluster.
 * Uses a custom commit protocol that saves time vs. 2-phase commit
 * Doesn't scale writes nearly as well as partitioning/sharding


 * External to MYSQL, there are horizontally sharding proxies like HarvestDB(not OSS, uses custom protocol), twitters Gizzard(OSS but no longer maintained), CitusDB(not OSS), and Spock Proxy(OSS, no updates since 2009).


 * There is one horizontally sharding proxy still receiving active updates: shard-query, but it attempts to transform mysql into a column-store for OLTP processing, not quite what we are looking for.


 * While not a sharding implementation, MariaDB does have something called multi-source replication. We can setup one server which slaves to all existing shards and offers a consistent view across the entire sharded data set for certain types of queries.   If we utilize this feature it should be strictly for analytics/reporting capabilities and not used as part of the main Flow implementation.

Other options would be to move to a data store that gracefully and naturally handles our multi-master sharde. Other data stores may be better targeted for our use case, but there is also perhaps a significant knowledge gap not only with application development, but also supporting client libraries may need to be improved and operations would have to build out new standards with regards to running a different kind of data store cluster.

When is a shard chosen?
There are a few possibilities with regards to when, in the code, to chose a shard:


 * Session based sharding. In this case we look at the data in the incoming request and decide which shard it should run against.  With the required guarantee that all data the request is interested in is on a single shard this is by far the simplest to implement.


 * Transaction based sharding. Determines which shard to used based on the begining of a transaction.  Can be achieved either through an explicit API, or by parsing and extracting the shard id from the query.  Other schemes also include including comments in the SQL, but mediawiki does not, for the most part, deal directly with sql and instead uses the Database abstraction.


 * Statement based sharding. Determines which shard to used based on each individual statement.  By far the most complex, and the most versatile.

So, what should we do?
It may add complexity, but the only way to scale with regards to users and articles is to have two data stores for flow. One data store must be sharded by article, the other data store must be sharded by user_id. Flow will also make some small use of the global wiki db (the one from wfGetDB ).

To accomplish this is application code i propose we implement a simple DbFactory object which defaults to session based shard access, set once per request. This must be available to each component that wants to read/write the database(i'd prefer to not use a global, but thats how most of mediawiki handles this concern. Breaks a few rules, but from a practical standpoint it works)

The constructor should take an array mapping from minimum accepted shard value to the name of the cluster handling these values. We can instantiate two different DbFactory objects; one for the database sharded by user_id, and the other for the database sharded by page_id. These objects will be thin wrappers around wfGetLB returning the proper LB/DB associated with the named cluster.

Most requests within Flow will only need to access a single shard from each Factory, pre-chosen when the request is initialized. Both the User making the request and the page the flow is being constructed for are available at the very beginning of the request. We must initialize the DbFactorys with the correct sharding keys as early as possible and not muck around with changing shards in the middle of the request.

Completely independant from the DbFactory, to prevent 'accidentally' changing the active shard, we should create some class capable of querying multiple shards via a fan-out strategy. Its use should be discouraged for general querying. The primary use case i see for the fan out queries is inserting the same data for multiple users, perhaps

UPDATE user_subscriptions SET object_last_updated = :now where object_id = :some id

I dont think we should be using exactly the above query(for several reasons), its just an example of a query that would need to be run against all user shards to handle some sort of data update.

Initially we will use Echo for delivering users information about replys/changes/etc. but eventually we will move to the flow 'board', 'feed', or whatever name it eventually ends up with. This board will likely require pushing data about an object to multiple user-shards in a sort of 'inbox' strategy similar to Echo( alternatively we could take the Echo approach of vertical partitioning with no data retention? I would rather go the other way though, eventually moving Echo from vertical partitioning into the database horizontally partitioned by user_id but that has its own can of worms for another day).

It will be provided with the same map of minimum shard value to cluster name as the DbFactory, but provide some sort of fan-out abstraction. I'm uncertain at this point how the fan-out queries should operate. We really don't want to resort to serially accessing all shards, but we can not easily utilize MYSQLI_ASYNC yet.
 * Have pinged ops regarding the installation of ext/mysqlnd, they are receptive to the changes
 * Possible ways of moving forward include 5.3+mysqlnd, 5.4, or even 5.5. This decision will likely be made outside flow as it effects the full mediawiki ecosystem.
 * The beginings of Mysqli support have been merged to Database.php here: https://gerrit.wikimedia.org/r/#/c/62130/, actual support for Mysqli is being developed in branches linked from there.
 * This has impedance mis-match issues with async queries. We will need some bit of extended API to use the Database.php query building abilities with the async fan-out queries.
 * Does the fan out solution need to be able to determine which shards to query by parsing the query?
 * There are existing solutions within php to parse sql into some form of AST. They may be heavy to run, and will be difficult to cache unless we use bound parameters to normalize the query string vs escaping.
 * We will already be using mysqli to access async queries. We can utilize client side bound parameters, but we will lose the current query building functionalities of Database.php like IN generation which also performs parameter escaping(not that difficult to extract the functionality).

Caching possibilities
See Flow_Portal/Architecture/Memcache

Crazy Idea
One flow instance, many wikis. What if Flow was an independent wiki used as a service by other wikis. This is mostly brainstorming, probably too complex to tackle effectively. Ideas here could possibly propagate into the main implementation ideas if viable.

Benefits:
 * The same flow can be referred from any wiki. For example, commons and en.wiki
 * The database can be independent from any specific wiki
 * All wikis share benefit from work by ops team necessary to provide sharded database support.
 * It could (possibly) replace the mapping of url -> page giving flow full control of its URL structure.

Downsides:
 * cross-site javascript requests?
 * Difficulty for the project to be used in general mediawiki installs outside wikimedia?
 * Many, many more that I probably don't know yet

Questions:
 * How would you refer to a flow from a different wiki via markup?
 * Use standard interwiki links?
 * What do the urls look like for those flows? Do user browsers communicate directly with the central flow instance, or do users talk to the wiki and the wiki talks to flow?
 * If flow is independant, how does flow notify echo to generate notifications?

Inconsequential Things to Consider (or not)

 * use php namespaces?

Reference Material

 * en.wiki info about flow
 * Tumblr mysql sharding presentation
 * Ruby scripts for handling operations side of tumblr sharding