Flow/Architecture

Purpose of this Document
This document is an early brainstorming session on the first Flow prototype. The final implementation will likely resemble what you see here only tangentially. In other words, this is a pre-implementation brain dump that will be referenced while building the first prototype. In my experience, once the ideas for the internal architecture are laid out, converting them into a first implementation doesn't take long.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are bad, mmkay
In many cases templates are used to encourage workflow within local wikis. The goal for the various workflow models is to be dynamic enough to be managed from the wiki (by local administrators) to cover use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

WorkflowObject
Globally unique identifier of an individual instance of a workflow. A discussion topic is a FlowObject instance. A request for deletion of page Abc is a FlowObject instance. etc. This is the object the db would be horizontally sharded over.
 * ID - maybe GUID
 * Home wiki ?
 * Only needed if we decide to have a single combined flow that supports multiple wikis.
 * May not be needed even then.
 * Home page
 * Essentially where is this flow accessed from/was created
 * createDate
 * lastUpdateTime
 * creatorId
 * does this distinctly identify the user? likely needs to be combined with homeWiki to decide exactly which user this is
 * If we need two pieces of data to uniquely identify a user then any and all relationships (tags/revisions/etc) must also store those two pieces of data.
 * At a minimum we will need some way to map from a creatorId into a displayable username.
 * Could have a table that maps username + wiki to an internal id.
 * If the database is sharded then over time it is very likely that every possible user will have an entry on every shard, is that reasonable to allow?
 * Is it confusing that the user may have different internal ids on different shards?
 * Could we piggy back off of whatever the single sign on does?
 * title/name
 * Possibly tied into wikidata for automatic localization?
 * summaryText
 * lockState
 * Enumeration, not boolean. Allows for multiple states beyond locked/unlocked.
 * workflowModel
 * contentLanguage
 * Workflows may occur in different languages, this can help act as a filter for workflows a user can understand.
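
As a rough illustration, the field list above could be sketched as a plain record type. This is only a sketch under assumptions: the Python names, the default values, and the specific LockState members are hypothetical and not part of the design; the document only says that lockState is an enumeration with more than two states.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class LockState(Enum):
    # Hypothetical states; the text only asks for "beyond locked/unlocked".
    OPEN = "open"
    LOCKED = "locked"
    CLOSED = "closed"


def _now():
    return datetime.now(timezone.utc)


@dataclass
class WorkflowObject:
    home_page: str
    creator_id: int
    title: str
    workflow_model: str
    content_language: str = "en"
    summary_text: str = ""
    lock_state: LockState = LockState.OPEN
    # Random (version 4) GUID, one of the id options under discussion.
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    create_date: datetime = field(default_factory=_now)
    last_update_time: datetime = field(default_factory=_now)


# A discussion topic is one WorkflowObject instance (sample values).
topic = WorkflowObject(
    home_page="Talk:Volcano",
    creator_id=42,
    title="Sources for eruption dates",
    workflow_model="discussion",
)
```

The home wiki field is left out here because the list above marks it as possibly unnecessary.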

Considerations
For example, if an open (based on lockState?) request for deletion exists on a specific article, cannot open a new one. If a closed RFD exists for a page it should possibly be linked to a newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community. The prototype implementation may hard code these types of considerations to be fleshed out at a later point.
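
A hard-coded version of the request-for-deletion rule described above might look like the following. It is a minimal sketch: the `Rfd` record, the string lock states, and the function name are assumptions made for illustration, not a proposed API.

```python
from dataclasses import dataclass


@dataclass
class Rfd:
    page: str
    lock_state: str  # "open" or "closed"; hypothetical values


def can_open_rfd(page, existing):
    """Return (allowed, related_closed) for a new request on `page`."""
    on_page = [r for r in existing if r.page == page]
    # An open request on the same page blocks a new one.
    allowed = not any(r.lock_state == "open" for r in on_page)
    # Closed requests are returned so they can be linked for context.
    related = [r for r in on_page if r.lock_state == "closed"]
    return allowed, related


existing = [Rfd("Abc", "closed"), Rfd("Xyz", "open")]
allowed_abc, context_abc = can_open_rfd("Abc", existing)
allowed_xyz, _ = can_open_rfd("Xyz", existing)
```

A generic, wiki-definable form of this rule would replace the two hard-coded conditions with data supplied by the model.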

Why GUID?

 * First step to enabling horizontal scalability by removing dependence on a central sequence to give out id numbers
 * Assuming somewhat even distribution of GUIDs can bucket and horizontally shard by GUID (more on this towards the end of the document)
 * The flow object and flow metadata will be used more like a key/value store mapping from GUID to the appropriate hash of information. The relational aspect is limited to the WorkflowModels.
 * On the other hand, MySQL reportedly doesn't work well with UUIDs as primary keys. Because rows are stored in a clustered index on the primary key (InnoDB), MySQL works best when data is inserted in sequential key order.
 * Not decided yet, could certainly use standard auto-increment ids

If we do use GUIDs
 * It would be pointless to store the GUID's in their text representation, must be binary(16).
 * text->binary: UNHEX( '7b95ee9a9d8a4f3283e2c413cc764b2d' )
 * binary->text: HEX( binary_uuid_column )
 * Putting the - in the right place is just an output issue
 * Can most of the tables within flow still use auto increment ids?
 * Any reference to an auto increment id must also be accompanied by the GUID it is related to as they will not be unique across the shards.
 * If we also allow auto-increment ids, what happens when redistributing buckets to a new database server?
 * Is it as simple as just ensuring the auto increment counter for each table on the new db matches the previous bucket location?
 * Sounds more complex than necessary
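
On the application side, the text/binary conversion from the list above is a small round trip. The sketch below uses Python's standard `uuid` module and the same sample value as the UNHEX example; it only illustrates the conversion and says nothing about how the column would be accessed from MediaWiki.

```python
import uuid

text = "7b95ee9a9d8a4f3283e2c413cc764b2d"

guid = uuid.UUID(hex=text)
binary = guid.bytes                    # 16 raw bytes, what binary(16) would hold
round_trip = binary.hex()              # lowercase equivalent of HEX(column)
dashed = str(uuid.UUID(bytes=binary))  # dashes are purely an output concern
```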

Alternatives to GUIDs in a horizontally sharded database
 * http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
 * tl;dr: use a single-row table on 2 load balanced servers giving out odd/even numbers from a sequential primary key. Before inserting any data you would query the id service for a new id, then insert into the sharded db.
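
The odd/even idea can be shown with an in-memory stand-in. This is a sketch, not the Flickr implementation: the `TicketServer` class is hypothetical, and a real deployment would back each counter with a database table rather than a Python iterator.

```python
import itertools


class TicketServer:
    """Hands out every `step`-th integer starting at `offset`."""

    def __init__(self, offset, step=2):
        self._counter = itertools.count(offset, step)

    def next_id(self):
        return next(self._counter)


# One server gives out odd ids, the other even ids, so they never collide.
servers = [TicketServer(1), TicketServer(2)]
balancer = itertools.cycle(servers)  # stand-in for a round-robin load balancer

ids = [next(balancer).next_id() for _ in range(6)]
```

With a perfect round robin the ids come out consecutive; if one server handles more requests than the other, the ids stay unique but are no longer strictly ordered by time.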

WorkflowMetadata
Each flow object maps to some type of workflow metadata. The metadata is specific to the type of workflow that is being worked with. The specific type of metadata to expect (and therefore the type of Metadata and Model objects to use) is defined by the workflowModel field of the FlowObject.

The various models need to be programmable such that local wikis can use them as they need and not be locked into pre-programmed ideas. Templates currently allow a great deal of flexibility with no automated enforcement. These models must represent a middle ground between template flexibility and pre-programmed strict workflows.


 * 1 to Many mapping between FlowObject and the Workflow Metadata rows
 * Each row of the metadata is a key/value pair and the object id
 * Metadata needs to be rather abstract to support all the different use cases. Likely it could be implemented like user options are currently.
 * Collecting all the metadata for an object would be something like SELECT metadata_key,metadata_value FROM flow_metadata where metadata_object_id = '12345'
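
The key/value layout and the query above can be tried out with an in-memory SQLite database. The column names follow the SELECT in the list; the table definition, the composite primary key, and the sample keys (`summary_id`, `root_post_id`) are assumptions made for the sake of a runnable example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE flow_metadata ("
    " metadata_object_id TEXT NOT NULL,"
    " metadata_key TEXT NOT NULL,"
    " metadata_value TEXT,"
    " PRIMARY KEY (metadata_object_id, metadata_key))"
)
db.executemany(
    "INSERT INTO flow_metadata VALUES (?, ?, ?)",
    [
        ("12345", "summary_id", "9"),
        ("12345", "root_post_id", "77"),
        ("99999", "summary_id", "3"),
    ],
)

# One object maps to many rows; collecting them yields a plain dict.
rows = db.execute(
    "SELECT metadata_key, metadata_value FROM flow_metadata"
    " WHERE metadata_object_id = ?",
    ("12345",),
)
metadata = dict(rows)
```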

WorkflowModel
Each wiki will be able to define its own models. In the initial prototype the discussion model will be a hardcoded implementation, but the model storage will store the info needed to turn the wiki's model definition into an instantiated object capable of understanding the metadata and performing actions.

For wiki defined models there are at least two options:
 * Real wiki page containing the definition in the WDL format
 * Special page and corresponding model table in database

Possible workflows to support include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

A special page implementation could offer a hook for models to provide the options needed to configure the model on wiki. A db table can then store the configured model under a unique name, allowing it to be used as a model on any article's flow. A model should not be limited to a single type of support object. It should be able to string together models. For example, the Discussion is a set of Topics, each containing a title, an optional Summary and a reply tree. The implementation should be generic enough that each topic can be its own subscribable workflow object, gaining subscriptions, tagging, and other properties of the object. The implementation must gracefully handle updates to the topics triggering notifications at the discussion level.

FlowDiscussion

 * 0 or 1 Summary
 * 1 to Many: FlowPostSingular

FlowRequestForDeletion

 * 1 Summary (Reason for request)
 * 1 to Many: FlowEnumeratedLines

FlowBlockNotice

 * 1 Summary (Block Reason)
 * Functional Elements
 * Button for 'appeal this block'
 * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular

WorkflowAction
This typically represents an action performed by a single user. In some cases like the Summary object there is a single object representing all actions, but internally it will reference the Revisions table which maintains the single-action relationship. In general cases this could be a reply to a message or a vote in a consensus discussion, etc. There will typically be a 1 to Many relationship between a WorkflowObject and WorkflowActions as defined by the WorkflowModel.

FlowSummary
Summary objects are 1 to 1 with a particular model, allowing for arbitrary content to be displayed with the models that require a summary.
 * objectId
 * text
 * comment

FlowEnumeratedLines

 * objectId
 * Enum value (e.g. vote yes/no, etc.)
 * text
 * comment

FlowPostSingular

 * objectId
 * There are going to be so many posts that it may be better to link from the flow object metadata to the root of a discussion. Finding out which topic a post belongs to would then be a 2 step process of finding the root node related to the post and then finding the metadata that links to that root post.
 * createdByUserId
 * replyToFlowPostSingularId?
 * revision
 * content
 * summary
 * etc.
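
The two-step lookup mentioned under objectId (post to root post, then root post to flow object) could work roughly as follows. The dictionaries stand in for database tables, and all ids are made-up sample data.

```python
# post id -> id of the post it replies to (None for a root post)
reply_to = {1: None, 2: 1, 3: 2, 4: None, 5: 4}

# root post id -> flow object id, as stored in the object's metadata
root_to_object = {1: "topic-a", 4: "topic-b"}


def find_root(post_id):
    # Step 1: follow reply pointers up to the root of the discussion.
    while reply_to[post_id] is not None:
        post_id = reply_to[post_id]
    return post_id


def topic_for_post(post_id):
    # Step 2: find the flow object whose metadata links to that root.
    return root_to_object[find_root(post_id)]
```

In SQL the first step is the expensive one, since each hop is another lookup unless the root id is stored alongside each post.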

FlowPostSingular is really a DAG (directed acyclic graph). Alternatively a tree is a special case of a DAG that likely works in this use case.
 * Are there any benefits to using a DAG over a tree?
 * Probably not, in comparison to the complexity in SQL.
 * A tree is an undirected graph in which any two vertices are connected by exactly one simple path.

Possible methods of storing trees in sql:


 * Moved to Flow_Portal/Architecture/Discussion_Storage

What happens to content the article maintainers wish to display above a talk page, like guidelines or whatever?
This ties into the ability to string together multiple Actions within a single model. The top action in a discussion is a Summary, which can then be treated as an "unstructured" area to be filled with whatever content the article maintainers wish to put there.

Possibly within the discussion, as opposed to at the very top, users may wish to have a scratchpad where multiple users can tweak a bit of wiki text that is intended to be added to the article being discussed or some such. This should be gracefully handled.

Where does a post's content actually get stored?
FlowPostSingular should probably just be metadata about the post. The literal post content should likely be stored in either wiki markup or the parsoid DOM representation of the wiki markup. It should be stored in a separate table comparable to revisions to facilitate moderation requirements like tweaking the content of a post.


 * In LiquidThreads the posts are stored as literal wiki pages within the main wiki master database
 * This is not scalable. Loading more and more things onto a singular master database just spells doom and gloom for everyone.
 * Horribly slow. Pages are parsed on the way out every time
 * The individual pages shouldn't be parsed every time, the parser cache should hold onto the html for 30 days.
 * If LQT is just reading from the parser cache, and it's way too slow, that simply means we cannot store the html per posting in sql. Some sort of optimization would need to be applied that caches at a higher level
 * It is counter intuitive that the parser cache wouldn't work here, some investigation is necessary into what exactly makes LQT slow.


 * Duplicate the wiki page table structure into an independent master
 * Can the Revision class be extended and reused in such a manner? A cursory look says it's probably possible with only minor adjustments to the existing class to use late static binding.
 * Potentially dangerous if running the unclustered configuration and any of the table references fails to be updated
 * Revisions in mw link to the page and user tables, but revisions in flow will be referred to from more than one type of database record in flow (summaries, comments, etc.) and will not have their own users table


 * Storing post data as pre-parsed HTML may be faster to assemble on the fly

A User's Flow (or feed of interesting things)
All flow objects will be subscribable by any number of users. Generating the user's feed is a sort of the subscribed flow objects ordered by last updated date.
 * Independent subscription table implementing a Many to One relationship between flow objects and users
 * SQL is going to hate you, MySQL can't answer this with a single index
 * Real answer: Echo
 * In the current UI prototype a user's feed is independent from the user's notifications
 * Echo already has everything in place for generating a flow, we just need to fire the events so echo knows about them and remembers them
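
Ignoring storage entirely, the feed described above is just "the objects I subscribe to, newest update first". A minimal in-memory sketch, with hypothetical names and sample data:

```python
from datetime import datetime

# flow object id -> last updated time (sample data)
last_updated = {
    "topic-a": datetime(2024, 3, 1, 12, 0),
    "topic-b": datetime(2024, 3, 3, 9, 30),
    "topic-c": datetime(2024, 3, 2, 18, 45),
}

# user -> set of subscribed flow object ids
subscriptions = {"alice": {"topic-a", "topic-b"}, "bob": {"topic-c"}}


def feed(user):
    # Sort the user's subscribed objects by last update, newest first.
    return sorted(
        subscriptions.get(user, ()),
        key=last_updated.__getitem__,
        reverse=True,
    )
```

The difficulty noted in the list is that in SQL the filter (subscriptions) and the sort key (last updated) live in different tables, so a single index cannot serve both.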

Remembering what a user has already seen
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know which WorkflowModels have been previously seen by the user
 * Remembering a boolean true seen status for every model instance multiplied by all editors is a ton of data and seems sub-optimal.
 * Could remember the last date a user viewed items from that WorkflowObject within the subscription and only display items newer than that
 * Each topic within a discussion is its own WorkflowObject, so the memory is more granular than just the main page, but is it enough?
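
The "last viewed date" option stores one timestamp per user and object instead of one flag per item. A sketch of how unseen items would then be selected, using made-up names and sample data:

```python
from datetime import datetime

# (user, object id) -> last time the user viewed that object
last_seen = {("alice", "topic-a"): datetime(2024, 3, 2)}

# (object id, time of action, text) for each action on an object
items = [
    ("topic-a", datetime(2024, 3, 1), "older reply"),
    ("topic-a", datetime(2024, 3, 3), "newer reply"),
]


def unseen(user, object_id):
    # A user with no record has seen nothing, so everything is new.
    cutoff = last_seen.get((user, object_id), datetime.min)
    return [
        text
        for oid, when, text in items
        if oid == object_id and when > cutoff
    ]
```

The trade-off is that a user who read only some of the newer items cannot be represented; everything after the cutoff counts as unseen.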

Suppressed Revisions
Need comparable functionality
 * Could use real wiki pages ala LiquidThreads
 * Could implement a 'work alike' along with some sort of interface to generalize the current suppressed revision code (likely rather difficult)

Search
Handled user side in javascript, in the backend, or both (probably both)?
 * Hooking into lucene extension?
 * For user side to be useful we will need to include collapsed post content in the resulting html page, simply hiding that info via css.

Tags
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
 * Conceptually it might be simpler to store them separately.
 * Storing them separately also provides a much stronger guarantee of not accidentally displaying private tags

Possible caching ideas / issues to keep in mind

 * In an ideal world can fetch the rendered HTML fragments without having to ask the main db about the actual content
 * Can store the adjacency list for each top level post (FlowPostSingular with no parent id) in memcache to prevent the recursive query inside db
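
The adjacency-list idea can be sketched as follows: build a parent-to-children map once per top level post, keep it in a cache, and walk it in application code instead of issuing a recursive query. The plain dict stands in for memcache, and the cache key format and post ids are hypothetical.

```python
from collections import defaultdict

# (post id, parent id) pairs for one topic; the root has no parent
rows = [(1, None), (2, 1), (3, 1), (4, 2)]


def build_adjacency(rows):
    children = defaultdict(list)
    for post_id, parent_id in rows:
        children[parent_id].append(post_id)
    return dict(children)


cache = {}  # stand-in for memcache
cache["flow:tree:1"] = build_adjacency(rows)


def render_order(adjacency, root):
    """Depth-first order in which posts would be displayed."""
    order = [root]
    for child in adjacency.get(root, []):
        order.extend(render_order(adjacency, child))
    return order


order = render_order(cache["flow:tree:1"], 1)
```

Any new reply would have to update or invalidate the cached entry for its root post.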

So you want to find a talk page
MediaWiki should work great without flow. After installing the flow extension it should continue to work great and the talk pages will be flow discussions.

Currently you take the title of the page and look it up within the NS_TALK namespace. With flow we need something different, but conceptually similar?

Old talk pages: should the current urls still point to the old talk pages, or should they move and flow replaces them on those urls?
 * Current Talk pages comments likely(?) link to Talk:Something directly. For best results they should go to the new flow discussions?
 * If so, then there is also Talk:Something and we would like to continue pointing to the correct data.
 * Same concern as the 2 points above, but with urls from the internet at large. Do talk pages get linked directly from outside with any frequency?

Talk page urls:
 * An article in NS_MAIN: /wiki/Talk:Volcano
 * An article in any other NS: /wiki/NameOfNamespace_Talk:Volcano

KISS flow urls for first prototype:
 * /wiki/Special:Flow/Talk?title=NameOfNamespace:Volcano

Possible Flow urls? Not currently supported in core.
 * An article in NS_MAIN: /wiki/Flow:Volcano/Talk
 * An article in any other NS: /wiki/NameOfNamespace_Flow:Volcano/Talk

Mapping URLs to Flow objects
The path Flow:Volcano/Talk should not be some super special case, there should be some way to attach specific types of flows to default paths. Visiting the page when nothing currently exists must work much like the current system, where the user is given the opportunity to create that specific thing. While we may initially hard-code /Talk into the flow prototype, it would be much better if the mappings from a name to some defaulted type of flow object is managed on wiki by the community. Additionally i18n and l10n considerations need to be taken into account as not every wiki uses the word 'Talk' for their talk pages. Mostly this matters in relation to GUID generation and what we use as the GUID namespace/name.
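
If GUIDs are derived from a namespace and a name, one possible shape is a name-based (version 5) UUID. The sketch below is only an illustration of why the choice of name matters; the namespace URL, the `|` separator, and the function name are assumptions.

```python
import uuid

# Hypothetical namespace; any fixed UUID shared by all servers would do.
FLOW_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/flow")


def flow_guid(wiki, title, flow_name):
    # The same (wiki, title, flow name) always yields the same GUID.
    return uuid.uuid5(FLOW_NAMESPACE, f"{wiki}|{title}|{flow_name}")


a = flow_guid("enwiki", "Volcano", "Talk")
b = flow_guid("enwiki", "Volcano", "Talk")
c = flow_guid("dewiki", "Vulkan", "Diskussion")
```

Because the name is part of the hash input, using a localized word such as 'Talk' in it ties the GUID to that wording, which is the i18n concern raised above.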

How to move forward? unsure at this juncture. Possibly the prototype will defer this solution for later and hardcode.

Performance Considerations
How much data will flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have a talk page, but many will. Assuming they all have talk pages ranging from just a post or two to a couple thousand posts on the largest, we should expect that a lower bound of perhaps 100M individual replies will need to be stored in perhaps 20M separate discussion graphs. If each reply consumes 1kb of space, that puts a lower bound of at least 100GB on post data.
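
The back-of-the-envelope numbers above can be written out explicitly. The only assumption added here is that "1kb" means 1000 bytes.

```python
replies = 100_000_000      # estimated individual replies
graphs = 20_000_000        # estimated separate discussion graphs
bytes_per_reply = 1_000    # "1kb", taken as 1000 bytes

total_gb = replies * bytes_per_reply / 1e9
replies_per_graph = replies / graphs
```

That gives 100 GB of post data and an average of 5 replies per discussion graph.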

And that is just the discussions, flow will need to handle many more workflows than just discussions.

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:


 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:


 * 4.8 sections per talk page
 * 2.4 signatures per section
 * 11.7 signatures per talk page
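
For reference, the averages follow directly from the three counts reported for the dump:

```python
pages, sections, signatures = 43952, 211904, 514916

sections_per_page = sections / pages            # about 4.8
signatures_per_section = signatures / sections  # about 2.4
signatures_per_page = signatures / pages        # about 11.7
```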

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle. It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Horizontally sharding the master database
With GUIDs we could, for example, divide the GUID key space into some future-proof number of buckets (4096?). At first all buckets could be assigned to a single master db, but as things grow buckets could be distributed across multiple masters. Other options for sharding include load balanced sequential id generation, which is conceptually simpler.
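
A sketch of the bucket idea, assuming 4096 buckets taken from the top 12 bits of the GUID and a lookup table from bucket to master. The server names and the choice of which bits to use are hypothetical; even bucket fill relies on the GUIDs being random.

```python
import uuid

NUM_BUCKETS = 4096  # 2**12, the figure floated in the text


def bucket_for(guid):
    # Top 12 bits of the 128-bit value select one of 4096 buckets.
    return guid.int >> (128 - 12)


# At first every bucket lives on a single master.
bucket_to_master = {b: "db1" for b in range(NUM_BUCKETS)}

# Later: move the upper half of the buckets to a second master.
for b in range(NUM_BUCKETS // 2, NUM_BUCKETS):
    bucket_to_master[b] = "db2"


def master_for(guid):
    return bucket_to_master[bucket_for(guid)]


g = uuid.UUID("7b95ee9a9d8a4f3283e2c413cc764b2d")
```

Moving a bucket changes only the lookup table and the rows in that bucket; the GUID-to-bucket mapping itself never changes.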

Sharding key
The most natural key to shard on in flow is the object id.

Sharding schemes

 * ranges - assign different ranges of ids to different servers. With GUIDs this is reasonable, with sequential ids not so much
 * hash or modulus - preferred when using sequential ids, gives fairly even distribution. Downsides include being fairly difficult to add new shards or rebalance existing shards
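
The rebalancing downside of a plain modulus scheme can be made concrete by counting how many ids land on a different shard when a fifth shard is added to four. This is a small illustration with sequential sample ids, not a measurement of any real system.

```python
def shard_by_modulus(object_id, num_shards):
    return object_id % num_shards


ids = range(1, 10_001)

# Count ids whose shard changes when going from 4 shards to 5.
moved = sum(
    1 for i in ids if shard_by_modulus(i, 4) != shard_by_modulus(i, 5)
)
fraction_moved = moved / len(ids)
```

Here 80% of the rows would have to move, which is why a fixed number of buckets mapped onto servers is attractive: adding a server relocates whole buckets rather than rehashing everything.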

Number of Shards
How many shards to start with? Difficulties in rebalancing the shards suggest we should start with a somewhat future-proof number of shards. Initially the shards can be co-located within the same MySQL instance; as future storage/IO requirements demand, some of the shards can be relocated onto independent masters.

Shard Colocation
Do shards coexist within a DB schema (same table), a MySQL instance (different table, same server), or a physical machine (multiple MySQL instances per machine)?
 * Upon initial review, co-locating shards within the same MySQL instance but in independent tables may provide the most natural forward progression in terms of redistributing shards between servers.

Data that doesn't naturally shard

 * When sharding by the object id, a user's flow (timeline essentially?) cannot be easily built without querying all possible masters. Long tail distribution basically says this will always be slow, especially since PHP does not (easily) support async queries.
 * For this reason it would be preferable to build on top of what Echo already provides in terms of pre-building events to display to a user. If we generate the events as they happen, to be inserted into a user's notifications, then the query is localized to a single server.
 * Similar concerns with regards to tags, lookups of all tags on a specific object will hit a single shard, but lookups by tag name will have to hit all shards.

The sharding implementation
Does MySQL offer anything that makes our job easier?
 * MySQL Cluster?
 * There are no durable commits on disk. Commits are replicated, but there is no guarantee that logs are flushed to disk on commit.
 * External to MySQL, there are horizontally sharding proxies like HarvestDB (not OSS, uses custom protocol), Twitter's Gizzard (OSS but no longer maintained), CitusDB (not OSS), and Spock Proxy (OSS, no updates since 2009).
 * Internal to standard MySQL AFAIK it does not currently, we would need an application side sharding implementation. Simplest would be for anything in flow that needs a database to request the database associated with a particular

In general, it seems that horizontal sharding will introduce some serious complexity to things. With the main goal at this point to be getting something running it may be best to not shard at this time, but keep it in mind so that in the future if it is required a transition to sharding will not be a show stopper.

Another option would be to move to a data store that gracefully and naturally handles our conditions. Other data stores may be better targeted at our use case, but there is perhaps a significant knowledge gap in application development, supporting client libraries may need to be improved, and operations would have to build out new standards for running a different kind of data store cluster.

Caching possibilities
Too early to really know anything about what this will look like. Profile, Review, Cache as required.

Crazy Idea
One flow instance, many wikis. What if Flow was an independent wiki used as a service by other wikis? This is mostly brainstorming, probably too complex to tackle effectively. Ideas here could possibly propagate into the main implementation ideas if viable.

Benefits:
 * The same flow can be referred from any wiki. For example, commons and en.wiki
 * The database can be independent from any specific wiki
 * All wikis share the benefit of the work the ops team must do to provide sharded database support.
 * It could (possibly) replace the mapping of url -> page giving flow full control of its URL structure.

Downsides:
 * cross-site javascript requests?
 * Difficulty for the project to be used in general mediawiki installs outside wikimedia?
 * Many, many more that I probably don't know yet

Questions:
 * How would you refer to a flow from a different wiki via markup?
 * Use standard interwiki links?
 * What do the urls look like for those flows? Do user browsers communicate directly with the central flow instance, or do users talk to the wiki and the wiki talks to flow?
 * If flow is independent, how does flow notify echo to generate notifications?

Inconsequential Things to Consider (or not)

 * use php namespaces?
 * one class per file?

Reference Material

 * en.wiki info about flow
 * Tumblr mysql sharding presentation