Flow/Architecture

Purpose of this Document
This document is an early brainstorming session for the first Flow prototype. The final implementation will likely resemble what you see here only tangentially; in other words, this is a pre-implementation brain dump which will be referenced while building the first prototype. In my experience, once the ideas for the internal architecture are laid out, the first implementation doesn't take long to convert into code.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are bad, mmkay
In many cases templates are used to encourage workflow within local wikis. The goal for the various workflow models is to be dynamic enough to be managed from the wiki (by local administrators) to cover the use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

WorkflowObject
Globally unique identifier for an individual instance of a workflow. A discussion topic is a WorkflowObject instance; a request for deletion of page Abc is a WorkflowObject instance; etc. This is the object the db would be horizontally sharded over (a rough sketch as a value object follows the field list below).
 * ID - maybe a GUID
 * Home wiki?
   * Only needed if we decide to have a single combined Flow that supports multiple wikis.
   * May not be needed even then.
 * Home page
   * Essentially where this flow is accessed from / was created.
 * createDate
 * lastUpdateTime
 * creatorId
   * Does this distinctly identify the user? It likely needs to be combined with homeWiki to decide exactly which user this is.
   * If we need two pieces of data to uniquely identify a user, then any and all relationships (tags/revisions/etc.) must also store those two pieces of data.
   * At a minimum we will need some way to map from a creatorId to a displayable username.
     * The common solution is to denormalize and store the user's name in the record.
   * Could we piggyback off of whatever single sign-on does?
 * title/name
   * Possibly tied into Wikidata for automatic localization?
 * summaryText
 * lockState
   * Enumeration, not boolean. Allows for multiple states beyond locked/unlocked.
 * workflowModel
 * contentLanguage
   * Workflows may occur in different languages; this can help act as a filter for workflows a user can understand.
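
A minimal sketch of the fields above as a plain PHP value object. Everything here is an illustrative placeholder rather than a final schema:

class WorkflowObject {
    public $id;              // 64-bit id; see "ID Generation" below
    public $homeWiki;        // only needed for a multi-wiki Flow, if at all
    public $homePage;        // where this flow was created / is accessed from
    public $createDate;
    public $lastUpdateTime;
    public $creatorId;       // may need pairing with homeWiki to identify a user
    public $title;           // display title/name
    public $summaryText;
    public $lockState;       // enumeration, not boolean
    public $workflowModel;   // selects the Metadata/Model classes to use
    public $contentLanguage; // language filter for readers
}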

Considerations
For example, if an open (based on lockState?) request for deletion exists for a specific article, a new one cannot be opened. If a closed RFD exists for a page, it should possibly be linked from a newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community. The prototype implementation may hard-code these types of considerations, to be fleshed out at a later point.

ID Generation
This is the first step to enabling horizontal scalability: removing the dependence on a central sequence to hand out id numbers. As we plan to use these ids for inserts within mysql, they need to be monotonically increasing; randomly generated ids cause performance issues in mysql's storage engine (InnoDB clusters rows by primary key, so random inserts cause page splits).

UUID

 * MediaWiki contains a UIDGenerator class which generates either 88 or 128 bit UUIDs
 * Uses file locking on the application server to guarantee uniqueness of non-random UUIDs
 * Can generate timestamp-prefixed UUIDs, which fulfills the requirement for monotonically increasing id values
 * It would be pointless to store the GUIDs in their text representation; the column must be binary(16) (see the sketch below).
   * text->binary: UNHEX( '7b95ee9a9d8a4f3283e2c413cc764b2d' )
   * binary->text: HEX( binary_uuid_column )
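
A short sketch of generating and storing such an id, assuming core's UIDGenerator signatures (newTimestampedUID128 accepting an output base) and a hypothetical flow_object table with a binary(16) object_id column:

$uid = UIDGenerator::newTimestampedUID128( 16 ); // time-prefixed hex string
$dbw = wfGetDB( DB_MASTER );
$dbw->insert(
    'flow_object',
    array( 'object_id' => pack( 'H*', $uid ) ), // binary form, same as UNHEX()
    __METHOD__
);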

Ticket Servers
http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

tl/dr: The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
    `id` bigint(20) unsigned NOT NULL auto_increment,
    `stub` char(1) NOT NULL default '',
    PRIMARY KEY (`id`),
    UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM

SELECT * from Tickets64 returns a single row that looks something like:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 | a    |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

To prevent a SPOF they run this on multiple servers, choosing one at random when an insert needs to be done; the servers hand out disjoint id sequences via mysql's auto-increment-offset and auto-increment-increment settings.
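
A sketch of consuming such ticket servers from PHP; the $ticketDbs pool of already-configured connections is hypothetical:

function newTicket64( array $ticketDbs ) {
    // Pick one of the ticket masters at random; each hands out a
    // disjoint sequence via auto-increment-offset/-increment.
    $dbw = $ticketDbs[ array_rand( $ticketDbs ) ];
    $dbw->query( "REPLACE INTO Tickets64 (stub) VALUES ('a')", __METHOD__ );
    return $dbw->insertId(); // LAST_INSERT_ID() on this connection
}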

Time + shard + autoincr
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

https://github.com/twitter/snowflake

tl/dr: Instagram combines 3 different values into a 64 bit number. Because the first 41 bits of the id are a timestamp, the keys naturally increase and provide a logical sort order across shards. Snowflake from Twitter uses the same id generation method, but with a worker_id rather than a shard_id; it amounts to the same thing.

 * 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
 * 13 bits that represent the logical shard ID
 * 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond

In their version this is done with a stored procedure in PostgreSQL. Our setup with MariaDB (as of 5.3) can serve up millisecond timestamps, and the REPLACE INTO method from the ticket servers above can be used for the per-shard sequence. If we wish to separate id generation from the database connection we could ask ops to run Twitter's snowflake id generation service, but the added complexity of yet another service may not be necessary.
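
A minimal sketch of the Instagram-style bit packing (41/13/10 split as listed above), assuming a 64-bit PHP build and a custom epoch:

function packId( $msSinceCustomEpoch, $shardId, $seq ) {
    return ( $msSinceCustomEpoch << 23 ) // 41 bits of milliseconds
        | ( ( $shardId % 8192 ) << 10 )  // 13 bits of logical shard id
        | ( $seq % 1024 );               // 10 bits of per-ms sequence
}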

Other

 * The shard key most naturally fits as a hash of the article wiki + ns + title
 * Allows for multiple objects attached to the same wiki + ns + title to easily be loaded/saved at the same time

Proposed ID Generation Scheme
Because we want to shard on a hash of wiki+ns+title, we need to either create an additional shard_id column or integrate the chosen shard_id into the identifier itself. I propose we go with the latter, and use an approach like Instagram's and twitter snowflake's to generate ids.

I propose we use a two-layer scheme for generating shard ids:
 * 864 logical shards; 864 divides evenly into 2, 3, 4, 6, 8, 12, 16, etc. physical shards
 * Gives us 10 bits for the logical shard, 12 bits for the db sequence number, and 42 bits for the timestamp (139 years in ms)
 * Logical shard determined via consistent hashing: crc32( json_encode( array( 'wiki' => $wiki, 'nstext' => $nstext, 'text' => $titleText ) ) ) % 864
 * Logical shards are forever; they do not change
 * Logical shards mapped to physical shards with a ring list (see the sketch after this list)
   * e.g. Server A serves 0-431, server B serves 432-863
 * Splitting a shard involves promoting 2 slaves to master, with each receiving half (or really, an arbitrary portion) of the previous master's ring allocation.
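
A sketch of the two layers under the proposed 42/10/12 split; the ring structure and function names are hypothetical:

function logicalShard( $wiki, $nstext, $titleText ) {
    // Consistent hash from the proposal above; logical shards never change.
    $key = json_encode( array(
        'wiki' => $wiki, 'nstext' => $nstext, 'text' => $titleText
    ) );
    return crc32( $key ) % 864;
}

function physicalShard( $logical, array $ring ) {
    // $ring maps inclusive logical-shard ranges to physical servers,
    // e.g. array( array( 0, 431, 'dbA' ), array( 432, 863, 'dbB' ) ).
    foreach ( $ring as $range ) {
        list( $min, $max, $server ) = $range;
        if ( $logical >= $min && $logical <= $max ) {
            return $server;
        }
    }
    throw new MWException( "No physical shard covers logical shard $logical" );
}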

Benefits:
 * Flow requests that are not for a user's 'Flow' page (subscriptions) will have the wiki+ns+title available in the request parameters, making the choice of shard happen once at the beginning of the request
 * ID generation does not require any new servers; it uses an auto-increment field on the shard itself.
 * Shard ID, creation timestamp, and unique identifier are compacted into a single 64 bit value
 * The object and any optional tables (summary, topic list, etc.) that have a 1 to 1 relationship with the object can have a single primary key rather than a more complex composite key which includes the shard_id
 * Unique: the ID format supports 4096 new ids per ms per shard (12 sequence bits). A quick test against a laptop mariadb instance (so, not representative) generated 1 id per 2ms with a single client, and 1 id per 0.5ms with 16 clients (though each client waited ~6ms). This basically indicates the format supports many more unique ids than we can generate, leaving room for growth.
 * Available: if the shard you want to write to is available, then the ID generation service is also up. No mismatch.
 * Uncoordinated: each shard is effectively its own id generation service.
 * Compact: other systems use 128 bit ids which contain a timestamp and some random info. If you need a shard id that isn't based on the main id you need 128 bits + the shard id bits. This scheme fits the timestamp and the shard id combined into 64 bits.
 * (Roughly) time ordered: (according to twitter snowflake) the ids are guaranteed to be k-sorted within 1s.
 * Directly sortable: the ids should be sortable without loading the full objects that they represent; this sort order should match the time ordering above.

Downsides:
 * A clock moving backwards on the database server could trigger insert errors from duplicate keys
   * When splitting shards we have to ensure the old and new servers have synced clocks, or we risk duplicates
   * Not too hard to just generate a new ID and try again; this shouldn't happen very often
   * Could use the real twitter snowflake service, which stalls id generation when it detects the clock moving backwards
     * It uses a slightly different balance of bits than proposed above, but is likely easily patchable (or our ideas can change even more easily)
 * Ties IDs to the current architecture?

I think the benefits outweigh the downsides; other opinions?

WorkflowMetadata
Each flow object maps to some type of workflow metadata. The metadata is specific to the type of workflow being worked with. The specific type of metadata to expect (and therefore the type of Metadata and Model objects to use) is defined by the workflowModel field of the WorkflowObject.

The various models need to be programmable such that local wikis can use them as they need and are not locked into pre-programmed ideas. Templates currently allow a great deal of flexibility with no automated enforcement. These models must represent a middle ground between template flexibility and pre-programmed strict workflows.

 * 1 to Many mapping between WorkflowObject and the workflow metadata rows
 * Each row of the metadata is a key/value pair plus the object id
 * Metadata needs to be rather abstract to support all the different use cases. It could likely be implemented much like user options are currently.
 * Collecting all the metadata for an object would be something like SELECT metadata_key, metadata_value FROM flow_metadata WHERE metadata_object_id = '12345' (sketched below)
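
A sketch of that query through MediaWiki's database wrapper, collecting rows into a key => value map (table and column names as above):

$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'flow_metadata',
    array( 'metadata_key', 'metadata_value' ),
    array( 'metadata_object_id' => $objectId ),
    __METHOD__
);
$metadata = array();
foreach ( $res as $row ) {
    $metadata[$row->metadata_key] = $row->metadata_value;
}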

WorkflowModel
Each wiki will be able to define its own models. In the initial prototype the discussion model will be a hardcoded implementation, but the model storage will store the info needed to turn a wiki's model definition into an instantiated object capable of understanding the metadata and performing actions.

For wiki defined models there are at least two options:
 * Real wiki page containing the definition in the WDL format
 * Special page and corresponding model table in database

Possible workflows to support include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

The special page implementation could offer a hook for models to provide the options needed to configure the model on-wiki. A db table can then store the configured model under a unique name, allowing it to be used as a model on any article's flow. A model should not be limited to a single type of support object; it should be able to string together models. For example, a Discussion is a set of Topics, each containing a title, an optional Summary, and a reply tree. The implementation should be generic enough that each topic can be its own subscribable workflow object, gaining subscriptions, tagging, and the other properties of the object. The implementation must gracefully handle updates to the topics triggering notifications at the discussion level.

FlowDiscussion

 * 0 or 1 Summary
 * 1 to Many: FlowPostSingular

FlowRequestForDeletion

 * 1 Summary (Reason for request)
 * 1 to Many: FlowEnumeratedLines

FlowBlockNotice

 * 1 Summary (Block Reason)
 * Functional Elements
   * Button for 'appeal this block'
   * Completely dynamic and described by a 'Workflow Description Language', but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular

WorkflowAction
This typically represents an action performed by a single user. In some cases, like the Summary object, there is a single object representing all actions, but internally it references the Revisions table, which maintains the single-action relationship. In general this could be a reply to a message, a vote in a consensus discussion, etc. There will typically be a 1 to Many relationship between a WorkflowObject and WorkflowActions, as defined by the WorkflowModel.

FlowSummary
Summary objects are 1 to 1 with a particular model, allowing for arbitrary content to be displayed with the models that require a summary.
 * objectId
 * text
 * comment

FlowEnumeratedLines

 * objectId
 * Enum value (e.g. vote yes/no, etc.)
 * text
 * comment

FlowPostSingular

 * objectId
   * There are going to be so many posts that it may be better to link from the flow object metadata to the root of a discussion rather than linking all posts back to their object directly. Finding which topic a post belongs to would then be a 2-step process: find the root node related to the post, then find the metadata that links to that root post.
   * If sharding by object_id, it may be required for consistency's sake to always include the object_id.
 * createdByUserId
 * replyToFlowPostSingularId?
 * revision
 * content
 * summary
 * etc.

FlowPostSingular is really a DAG (directed acyclic graph); alternatively, a tree, which is a special case of a DAG, likely works for this use case.
 * Are there any benefits to using a DAG over a tree?
   * Probably not, compared to the added complexity in SQL.
   * A tree is an undirected graph in which any two vertices are connected by exactly one simple path.

Possible methods of storing trees in sql:
 * Moved to Flow_Portal/Architecture/Discussion_Storage

Revision
Posts, Summaries, and possibly other content within Flow need to be revisioned much like normal wiki pages. Ideally we want to reuse the existing revision code that is already written. Revisions within flow will be slightly different from revisions of a wiki page, though; most notably, wiki revisions store a full page's content, while we will have many small fragments that should be individually stored.

The simplest way to store all these fragments would be to create a wiki page for each possible FlowObject. The wiki page's revision content would then be a multi-part content type that stores the various fragments.
 * In an ideal world we should use a multi-part content type that allows us to pull individual fragments out of the document using simple string indexing, rather than having to parse the content of the entire revision before being able to pluck out just the fragments which are interesting to us.
 * The parsoid team will be starting on something along these lines in July: https://bugzilla.wikimedia.org/show_bug.cgi?id=49143

What if we stored flow revisions in the master database in the standard revision table?
 * Flow revisions don't map directly to a pageId or Title object; we would need a fake page (probably in a custom namespace that's not user editable)
 * The total number of new revisions per day/week/etc. shouldn't change very much versus existing workflows that also edit wiki pages directly
   * Currently every new post or summary update is already a new revision
 * May or may not be relevant, but Flow will most likely be storing the annotated HTML from Parsoid rather than the wikitext that the rest of the core revision table remembers.

What if we re-implement the bare minimum of revision functionality that flow needs in a new object?
 * Any future upgrades/bugfixes/etc to the core revision object will also have to be duplicated into the flow revision, if we want similar functionality.
 * There are many years of built up bug fixes and knowledge within the current Revision implementation, ideally we want to benefit from that instead of re-implementing.
 * In other words: Not a great idea

Could we reuse most of the existing Revision functionality without being tied to the master db revision table?
 * Could just store everything in the main database
   * For the prototype, this is probably the best bet. Moving forward we may want to consider moving the flow revision storage into the same db cluster as flow.
 * Could extend the Revision object into our own which replaces the static methods that choose which database to use
 * Could just be careful when loading/storing revisions such that we always pass in the database we want to work with (see the sketch below)
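
A heavily hedged sketch of that last bullet: route every flow revision read/write through an explicit connection instead of letting the revision code pick one. The 'flow' external cluster name is an assumption; wfGetLBFactory() and getExternalLB() are existing core entry points:

function flowRevisionDB( $index = DB_MASTER ) {
    // Connection to the (assumed) flow cluster rather than the wiki's
    // main cluster; callers pass this wherever revisions are loaded
    // or stored instead of relying on wfGetDB() internally.
    return wfGetLBFactory()->getExternalLB( 'flow' )->getConnection( $index );
}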

What happens to content the article maintainers wish to display above a talk page, like guidelines or whatever?
This ties into the ability to string together multiple Actions within a single model. The top action in a discussion is a Summary, which can then be treated as an "unstructured" area to be filled with whatever content the article maintainers wish to put there.

Possibly within the discussion, as opposed to at the very top, users may wish to have a scratchpad where multiple users can tweak a bit of wiki text that is intended to be added to the article being discussed or some such. This should be gracefully handled.

Where does a post's content actually get stored?
FlowPostSingular should probably just be metadata about the post. The literal post content should likely be stored as either wiki markup or the Parsoid DOM representation of the wiki markup. It should be stored in a separate table, comparable to revisions, to facilitate moderation requirements like tweaking the content of a post.

Previous Implementations
In LiquidThreads the posts are stored as literal wiki pages within the main wiki master database.
 * This is not scalable. Loading more and more onto a singular master database just spells doom and gloom for everyone.
 * Horribly slow. Pages are parsed on the way out every time.
   * The individual pages shouldn't be parsed every time; the parser cache should hold onto the html for 30 days.
   * If LT is just reading from the parser cache, and it is still way too slow, that means we cannot simply store the html per posting in sql; some sort of optimizations that cache at a higher level would need to be applied.
   * It is counterintuitive that the parser cache wouldn't work here; some investigation is necessary into what exactly makes LQT slow.

Moving Forward
Storing post data as pre-parsed HTML may be faster to assemble on the fly.
 * Is this basically what Parsoid's output is? Can we feed that html back into VE for new revisions of the content?

There seem to be basically 2 options for storing the content (not just for posts, but for any WorkflowAction that has user-definable wiki content):
 * Continue using the text and revision tables in the master database
 * Create new text and revision tables in the flow database
 * See Flow_Portal/Architecture

A Users Flow (or feed of interesting things)
All flow objects will be subscribe-able by any number of users. Generating a user's feed is a sort over the subscribed flow objects, ordered by last updated date.
 * Independent subscription table implementing a many-to-many relationship between flow objects and users
   * SQL is going to hate you; mysql can't answer this with a single index
 * Real answer: Echo
   * In the current UI prototype a user's feed is independent from the user's notifications
   * Echo already has everything in place for generating a flow; we just need to fire the events so echo knows about them and remembers them (see the sketch after this list)
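
A sketch of firing such an event, assuming Echo's EchoEvent::create() interface; the 'flow-topic-updated' type and the extra keys are hypothetical and would need to be registered with Echo:

EchoEvent::create( array(
    'type' => 'flow-topic-updated', // hypothetical type, registered with Echo
    'agent' => $user,               // the user who performed the action
    'title' => $title,              // page the workflow is attached to
    'extra' => array( 'flow-object-id' => $objectId ),
) );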

Remembering what a user has already seen
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know which WorkflowModels have been previously seen by the user.
 * Remembering a boolean seen status for every model instance, multiplied by all editors, is a ton of data and seems sub-optimal.
 * Could remember the last date a user viewed items from that WorkflowObject within the subscription and only display items newer than that (sketched after this list)
 * Each topic within a discussion is its own WorkflowObject, so the memory is more granular than just the main page, but is it enough?
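
A minimal sketch of that last-viewed approach, storing one timestamp per (user, WorkflowObject) subscription and treating anything newer as unseen; the item shape is hypothetical:

function unseenItems( array $items, $lastViewedTimestamp ) {
    $unseen = array();
    foreach ( $items as $item ) {
        // Compare against the timestamp stored on the subscription row.
        if ( $item->lastUpdateTime > $lastViewedTimestamp ) {
            $unseen[] = $item;
        }
    }
    return $unseen;
}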

Suppressed Revisions
Need comparable functionality
 * Could use real wiki pages a la LiquidThreads
 * Could implement a 'work-alike' along with some sort of interface to generalize the current suppressed-revision code (likely rather difficult)

Search
Handled user side in javascript, in the backend, or both (probably both)?
 * Hooking into lucene extension?
 * Hooking into solr cloud?
 * What elements does the UI call for in search results?

Tags
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
   * Conceptually it might be simpler to store them separately.
   * Storing them separately also provides a much stronger guarantee of not accidentally displaying private tags.
 * If sharding by object_id, where do we store the table mapping tag_name to object_id? We don't want to have to query every shard.

Possible caching ideas / issues to keep in mind

 * In an ideal world we can fetch the rendered HTML fragments without having to ask the main db for the actual content
 * Can store the adjacency list for each top-level post (FlowPostSingular with no parent id) in memcache to prevent the recursive query inside the db (see the sketch below)
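
A sketch of that second bullet through MediaWiki's cache layer ($wgMemc); loadReplyTreeFromDB() is a hypothetical placeholder for the recursive lookup:

$key = wfMemcKey( 'flow', 'adjacency-list', $rootPostId );
$tree = $wgMemc->get( $key );
if ( $tree === false ) {
    $tree = loadReplyTreeFromDB( $rootPostId ); // hypothetical recursive query
    $wgMemc->set( $key, $tree, 86400 ); // hold the tree for a day
}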

So you want to find a talk page
MediaWiki should work great without flow. After installing the flow extension it should continue to work great and the talk pages will be flow discussions.

Currently you take the title of the page and look it up within the NS_TALK namespace. With flow we need something different, but conceptually similar?

Old talk pages: should the current urls still point to the old talk pages, or should they move, with flow replacing them at those urls?
 * Comments on current talk pages likely(?) link to Talk:Something directly. For best results should those links go to the new flow discussions?
 * If so, then there is also the existing Talk:Something content, and we would like to continue pointing to the correct data.
 * Same concern as the 2 points above, but with urls from the internet at large: do talk pages get linked directly from outside with any frequency?

Talk page urls:
 * An article in NS_MAIN: /wiki/Talk:Volcano
 * An article in any other NS: /wiki/NameOfNamespace_Talk:Volcano

KISS flow urls for the first prototype:
 * /wiki/Special:Flow/Talk?title=NameOfNamespace:Volcano

Possible Flow urls? Not currently supported in core.
 * An article in NS_MAIN: /wiki/Flow:Volcano/Talk
 * An article in any other NS: /wiki/NameOfNamespace_Flow:Volcano/Talk

Mapping URL's to Flow objects
The path Flow:Volcano/Talk should not be some super special case; there should be some way to attach specific types of flows to default paths. Visiting the page when nothing currently exists must work much like the current system, where the user is given the opportunity to create that specific thing. While we may initially hard-code /Talk into the flow prototype, it would be much better if the mappings from a name to some default type of flow object were managed on-wiki by the community. Additionally, i18n and l10n considerations need to be taken into account, as not every wiki uses the word 'Talk' for its talk pages. Mostly this matters in relation to GUID generation and what we use as the GUID namespace/name.

How to move forward? Unsure at this juncture. Possibly the prototype will defer this solution for later and hardcode it.

Performance Considerations
How much data will flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have talk pages, but many will. Assuming they all have talk pages ranging from just a post or two up to a couple thousand posts on the largest, we should expect a lower bound of perhaps 100M individual replies stored in perhaps 20M separate discussion graphs. If each reply consumes 1KB of space, that puts a lower bound of at least 100GB of post data.

And that is just the discussions, flow will need to handle many more workflows than just discussions.

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:

 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:

 * 5 sections per talk page
 * 2.5 signatures per section
 * 12 signatures per talk page

These pages were of course built up over time, but they give us a general idea of the size of the problem we need to handle. It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Horizontally sharding the master database
With GUIDs we could, for example, divide the GUID key space into some future-proof number of buckets (4096?). At first all buckets could be assigned to a single master db, but as things grow, buckets could be distributed across multiple masters. Other options for sharding include load-balanced sequential id generation, which is conceptually simpler.

Sharding key
The most natural key choices for flow are
 * the object id
 * a hash of wiki+ns+title

Sharding schemes

 * ranges - assign different ranges of ids to different servers. With GUIDs this is reasonable; with sequential ids, not so much
 * hash or modulus - preferred when using sequential ids; gives a fairly even distribution. Downsides include it being fairly difficult to add new shards or rebalance existing shards

Number of Shards
How many shards to start with? The difficulty of rebalancing shards suggests we should start with a somewhat future-proof number. Initially the shards can be co-located within the same MySQL instance; as future storage/IO requirements dictate, some of the shards can be relocated onto independent masters.

Shard Colocation
Do shards coexist within a DB schema (same table), a MySQL instance (different table, same server), or a physical machine (multiple mysql instances per machine)?
 * Upon initial review, co-locating shards within the same mysql instance but in independent tables may provide the most natural forward progression in terms of redistributing shards between servers.

Data that doesn't naturally shard

 * When sharding by the object id, a user's flow (essentially a timeline?) cannot be easily built without querying all possible masters. The long-tail distribution basically says this will always be slow, especially since php does not (easily) support async queries.
   * For this reason it would be preferable to build on top of what Echo already provides in terms of pre-building events to display to a user. If we generate the events as they happen and insert them into a user's notifications, then the query is localized to a single server.
 * Similar concerns apply to tags: lookups of all tags on a specific object will hit a single shard, but lookups by tag name will have to hit all shards.

The sharding implementation
Does MySQL offer anything that makes our job easier?
 * MySQL Cluster?
   * There are no durable commits on disk. Commits are replicated, but there is no guarantee that logs are flushed to disk on commit.
 * External to MySQL, there are horizontal sharding proxies like HarvestDB (not OSS, uses a custom protocol), Twitter's Gizzard (OSS but no longer maintained), CitusDB (not OSS), and Spock Proxy (OSS, no updates since 2009).
 * Internal to standard MySQL, AFAIK there is nothing currently; we would need an application-side sharding implementation. Simplest would be for anything in flow that needs a database to request the database associated with a particular shard key (e.g. the wiki+ns+title hash).

In general, it seems that horizontal sharding will introduce some serious complexity. With the main goal at this point being to get something running, it may be best not to shard at this time, but to keep sharding in mind so that a future transition to it will not be a show stopper.

Another option would be to move to a data store that gracefully and naturally handles our conditions. Other data stores may be better targeted at our use case, but there is also a potentially significant knowledge gap: not only in application development, but supporting client libraries may need to be improved, and operations would have to build out new standards for running a different kind of data store cluster.

Caching possibilities
Too early to really know anything about what this will look like. Profile, Review, Cache as required.

Crazy Idea
One flow instance, many wikis. What if Flow were an independent wiki used as a service by other wikis? This is mostly brainstorming, and probably too complex to tackle effectively, but ideas here could propagate into the main implementation if viable.

Benefits:
 * The same flow can be referenced from any wiki, for example commons and en.wiki
 * The database can be independent from any specific wiki
 * All wikis share the benefit of the ops work necessary to provide sharded database support.
 * It could (possibly) replace the mapping of url -> page giving flow full control of its URL structure.

Downsides:
 * cross-site javascript requests?
 * Difficulty for the project to be used in general mediawiki installs outside wikimedia?
 * Many, many more that I probably don't know yet

Questions:
 * How would you refer to a flow from a different wiki via markup?
   * Use standard interwiki links?
 * What do the urls look like for those flows? Do user browsers communicate directly with the central flow instance, or do users talk to the wiki and the wiki talks to flow?
 * If flow is independent, how does flow notify echo to generate notifications?

Inconsequential Things to Consider (or not)

 * use php namespaces?
 * one class per file?
