Flow/Architecture

Purpose of this Document
This document is intended to be an early brainstorming session regarding the first flow prototype. The final implementation will likely only tangentially resemble what you see here now. In other words, this is a pre-implementation brain dump which will be referenced while building the first prototype. In my experience, once the ideas for the internal architecture are laid out, the first implementation doesn't take long to convert into code.

DISCLAIMER: All decisions, especially with regard to storage layout, are fluid. We fully expect to delete the data created by the prototype on extensiondb1 before deploying the final solution.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are bad, mmkay
In many cases templates are used to encourage workflow within local wikis. The goal for the various workflow models is to be dynamic enough to be managed from the wiki (by local administrators) and to cover the use cases currently handled by workflow suggestions inside templates. In other words, flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

WorkflowObject
Globally unique identifier of an individual instance of a workflow. A discussion topic is a FlowObject instance. A request for deletion of page Abc is a FlowObject instance. etc.


 * ID - maybe UUID
 * Home wiki ?
   * Only needed if we decide to have a single combined flow that supports multiple wikis.
   * May not be needed even then.
 * Home page
   * Essentially where this flow is accessed from/was created
 * createDate
 * lastUpdateTime
 * creatorId
   * Does this distinctly identify the user? It likely needs to be combined with homeWiki to decide exactly which user this is.
   * If we need two pieces of data to uniquely identify a user then any and all relationships (tags/revisions/etc) must also store those two pieces of data.
   * At a minimum we will need some way to map from a creatorId into a displayable username.
     * The common solution to this is to denormalize and store the user's name in the record
     * Could we piggyback off of whatever the single sign on does?
 * title/name
   * Possibly tied into Wikidata for automatic localization?
 * summaryText
 * lockState
   * Enumeration, not boolean. Allows for multiple states beyond locked/unlocked.
 * workflowModel
 * contentLanguage
   * Workflows may occur in different languages, this can help act as a filter for workflows a user can understand.

Considerations
Creating a workflow object may depend on the objects that already exist. For example, if an open (based on lockState?) request for deletion exists on a specific article, a new one cannot be opened. If a closed RFD exists for a page, it should possibly be linked to a newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community. The prototype implementation may hard code these types of considerations, to be fleshed out at a later point.
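A minimal sketch of how the prototype might hard code the RFD restriction described above. The `LockState` values, the in-memory `objects` list, the field names and the `open_request` helper are all hypothetical and only illustrate the rule; they are not a proposed schema.

```python
from enum import Enum


class LockState(Enum):
    # Hypothetical states; the real enumeration is undecided.
    OPEN = "open"
    CLOSED = "closed"


# Stand-in for the flow object table (illustration only).
objects = []


def open_request(page, workflow):
    """Create a new workflow object unless an open one already exists."""
    previous = [
        o for o in objects
        if o["page"] == page and o["workflow"] == workflow
    ]
    if any(o["lock_state"] is LockState.OPEN for o in previous):
        raise ValueError(f"an open {workflow} already exists for {page}")
    new = {
        "id": len(objects) + 1,
        "page": page,
        "workflow": workflow,
        "lock_state": LockState.OPEN,
        # Link earlier (closed) requests for context.
        "related": [o["id"] for o in previous],
    }
    objects.append(new)
    return new


first = open_request("Abc", "request-for-deletion")
first["lock_state"] = LockState.CLOSED
second = open_request("Abc", "request-for-deletion")
print(second["related"])
```

A generic, wiki-definable version would replace the hard coded check with rules read from the workflow definition.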

ID Generation
Flow will be vertically partitioned into a single database shared by all wikis. We will use typical auto-increment within MySQL.

ObjectMetadata
Each flow object maps to some type of workflow metadata. The metadata is specific to the type of workflow the object is assigned to. What is actually stored in the metadata is up to the individual blocks that make up the workflow. These might be things like additional voting options.

The various models need to be programmable such that local wikis can use them as they need and not be locked into pre-programmed ideas. Templates currently allow a great deal of flexibility with no automated enforcement. These models must represent a middle ground between template flexibility and pre-programmed strict workflows. For that reason the workflow metadata is essentially a key -> value mapping for arbitrary keys on any flow object, much like user options.


 * 1 to Many mapping between FlowObject and the Workflow Metadata rows
 * Each row of the metadata is a key/value pair and the object id
 * Metadata needs to be rather abstract to support all the different use cases. Likely it could be implemented like user options are currently.
 * Collecting all the metadata for an object would be something like: SELECT metadata_key, metadata_value FROM flow_metadata WHERE metadata_object_id = '12345'
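The key/value layout above can be sketched with an in-memory SQLite database. The table and column names come from the example query; the column types, the composite primary key and the helper functions are assumptions made for illustration, and the real storage would be MySQL.

```python
import sqlite3

# Hypothetical schema: one row per (object, key) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE flow_metadata (
        metadata_object_id TEXT NOT NULL,
        metadata_key       TEXT NOT NULL,
        metadata_value     TEXT,
        PRIMARY KEY (metadata_object_id, metadata_key)
    )
    """
)


def set_metadata(object_id, key, value):
    """Insert or overwrite one key/value pair for a flow object."""
    conn.execute(
        "INSERT OR REPLACE INTO flow_metadata VALUES (?, ?, ?)",
        (object_id, key, value),
    )


def get_metadata(object_id):
    """Collect every key/value pair for one flow object into a dict."""
    rows = conn.execute(
        "SELECT metadata_key, metadata_value FROM flow_metadata "
        "WHERE metadata_object_id = ?",
        (object_id,),
    )
    return dict(rows.fetchall())


# Example keys are made up; blocks decide what they actually store.
set_metadata("12345", "voting_options", "keep,delete,merge")
set_metadata("12345", "quorum", "5")
set_metadata("67890", "voting_options", "support,oppose")
print(get_metadata("12345"))
```

Values are stored as plain strings here; how structured values are encoded is left open.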

Workflow
Each wiki will be able to define its own workflows. In the initial prototype the discussion model will be a hardcoded implementation, but we must keep in mind the requirement to support multiple kinds of workflows. Initially, to keep things simple, we should use a workflow table in SQL which will map from a unique wiki + workflow name to the various wiki-specific options that have been configured. After the initial enwiki release we will likely move to a wiki page in WDL (Workflow Description Language) format for workflow definitions. The options set within a workflow definition can range from something as simple as a list of voting options to Lua scripts that perform a bit of complex logic.

For wiki defined workflows there are at least two options:
 * Real wiki page containing the definition in the WDL format, can perhaps utilize ContentHandler and json?
 * Special page and corresponding model table in database

Possible workflows to support include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

A workflow will be composed of a set of Blocks (kind of like widgets in a dashboard). The workflow should be able to string together multiple blocks into a single result page.

For example the Discussion workflow is a Summary block at the top of the page and a TopicList block below that. The TopicList is a Block acting as a wrapper around a set of Topic Blocks. A Topic Block is a set of SinglePost Blocks arranged in a tree. (This may be too abstract)

A User's Board, when viewed by the owner, will be a separate workflow. When viewed by most users it will be a standard Discussion workflow. When viewed by the owner it is the standard Summary block with the SubscriptionList block below that. The user is subscribed to their own board by default. This means they will be subscribed to all threads created on that board. They can also subscribe to any thread or discussion page (any flow object) on any WMF wiki, and it will also appear within the same SubscriptionList.

Rather than attempting to make each piece of a block (the topics) subscribable/taggable/etc, Topics within flow will be their own workflow object with an associated Topic workflow definition. The workflow will be a single block, the Topic block. Subscriptions and tagging will be available to anything with an underlying flow_object.


 * FlowDiscussion
   * 0 or 1 Summary
   * 1 to Many: FlowPostSingular

 * FlowTopic
   * exactly 1 FlowTreeNode with the exact same id, as the root of the post tree
   * 1 to Many: Revisions
   * Stored in some tree representation, likely closure table or materialized path

 * FlowRequestForDeletion
   * 1 Summary (Reason for request)
   * 1 to Many: FlowEnumeratedLines

 * FlowBlockNotice
   * 1 Summary (Block Reason)
   * Functional Elements
     * Button for 'appeal this block'
     * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
   * 1 to Many: FlowPostSingular

 * And Many More

WorkflowAction
This typically represents an action performed by a single user. In some cases like the Summary object there is a single object representing all actions, but internally it will reference the Revisions table which maintains the single-action relationship. In general cases this could be a reply to a message or a vote in a consensus discussion, etc. There will typically be a 1 to Many relationship between a WorkflowObject and WorkflowActions as defined by the WorkflowModel. Most of those WorkflowActions will probably be 1 to Many with the revisions table.


 * FlowSummary - Summary objects are 1 to 1 with a particular model, allowing for arbitrary content to be displayed with the models that require a summary.
   * objectId
   * text
   * comment

 * FlowEnumeratedLines
   * objectId
   * Enum value (e.g. vote yes/no, etc.)
   * text
   * comment

 * FlowPostSingular - A tree structure representing a discussion. Possible methods of storing trees are described at Flow_Portal/Architecture/Discussion_Storage
   * objectId
     * If sharding by object_id, it may be required for consistency's sake to always include the object uuid.
   * createdByUserId
   * replyToFlowPostSingularId?
   * revision
   * content
   * summary
   * etc.

Revision
Posts, Summaries, and possibly other content within Flow need to be revisioned much like normal wiki pages. Ideally we want to reuse the existing revisions code that is already written. Revisions within flow will be slightly different from revisions of a wiki page though. Most notably, wiki revisions store a full page's content, but we will have many small fragments that should be individually stored.

Extend the existing revision implementation to use our database by default
 * Very reasonable and pragmatic approach
 * Core revisions map to a page_id. In flow, individual blocks and pieces of blocks (like posts in a thread) need their own revisions.
 * The easiest way to support this is a 'type' field within the revision, and a matching 1 to Many table for each type from the source table id to its revision list. Something like a summary can use a summary_revision table with the Flow Object id as its pk, as there is only 1 summary allowed per flow object. The posts can have a separate tree_node_revision table that maps from post ids to their revision lists.
 * Needs to use the external store servers specific to flow content

Copy from the core Revision object just the revision functionality that flow needs in a new object?

 * There are many years of built up bug fixes and knowledge within the current Revision implementation, ideally we want to benefit from that instead of re-implementing.
 * Any future upgrades/bugfixes/etc to the core revision object will also have to be duplicated into the flow revision to maintain parity.
 * In other words: Maintaining feature parity will be a constant battle, perhaps not a great idea.
 * Is feature parity really necessary?
 * Much of what we want to do is done by the revisions code already

Start from scratch with a brand new revisions implementation?

 * Flow revisions have a different use case than Article revisions
 * Articles have hundreds of revisions
 * If flow stores a revision per post, most posts will only ever have a single revision
 * Other pieces of flow, like Discussion Summaries, will have perhaps dozens or hundreds of revisions but still not nearly as many as the article they relate to.
 * Revisions are directly related to flow objects, their pk is (page_id, revision_id) with a fk to object_id
 * Could attempt to define an interface that is shareable between Flow and Core Revisions
 * is there any benefit to that? Where would you want to accept either a flow revision or a wiki page revision as the same thing?
 * Still copy the userCan series of functions, the API for this is reasonable and should be continued
 * Keep use of content handler
 * Revision content should continue to point to a 'text' table
 * Text table content should continue to be stored in External Store. Flow will have its own external store servers separate from the main wiki ES servers.
 * The actual content when rendering pages will typically come from memcache so not worried about needing multiple queries or joins to get revision + content.
 * Even still, we will need to extend the current External Store implementation. Currently if we need content for 100 different revisions we have to make 100 serial requests. We need to extend the implementation such that it intelligently batches queries to the same server to make the minimum number of possible round trips.  Before even doing that it should multiGet what it can out of memcache.
 * Gives the opportunity to simplify revision functionality to that which flow requires
 * Is there actually much simplification going on?
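The External Store batching idea from the list above (check the cache first, then make one round trip per server instead of one per revision) can be sketched as follows. The "server/blob_id" address format, the plain dict standing in for memcache and the `fetch_many_from_server` callback are hypothetical stand-ins; this is not the actual External Store API.

```python
from collections import defaultdict


def fetch_contents(addresses, cache, fetch_many_from_server):
    """Return {address: content}, using the cache first and then
    exactly one batched request per server for whatever is missing."""
    # Step 1: take everything we can from the cache (the "multiGet").
    found = {a: cache[a] for a in addresses if a in cache}

    # Step 2: group the remaining blob ids by the server that holds them.
    missing_by_server = defaultdict(list)
    for address in dict.fromkeys(addresses):  # de-duplicate, keep order
        if address not in found:
            server, blob_id = address.rsplit("/", 1)
            missing_by_server[server].append(blob_id)

    # Step 3: one round trip per server, then refill the cache.
    for server, blob_ids in missing_by_server.items():
        for blob_id, content in fetch_many_from_server(server, blob_ids).items():
            address = f"{server}/{blob_id}"
            found[address] = content
            cache[address] = content
    return found


# Fake backend that records how many round trips were made.
store = {"es1": {"1": "a", "2": "b"}, "es2": {"7": "c"}}
round_trips = []


def fake_fetch_many(server, blob_ids):
    round_trips.append(server)
    return {blob_id: store[server][blob_id] for blob_id in blob_ids}


cache = {"es1/1": "a"}
result = fetch_contents(["es1/1", "es1/2", "es2/7"], cache, fake_fetch_many)
print(result, round_trips)
```

In this sketch three revisions cost two round trips, and a repeated call for the same addresses would be served entirely from the cache.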

Other thoughts

 * With flow's sharded database model, perhaps we don't need to push revision content to external storage?
 * Content storage is a separate concern from the revisions. The revisions should continue to point somewhere else (text table, external store; revisions don't care) for the actual storage. See Flow_Portal/Architecture/Discussion_Storage
 * Materialized views on revision+content data will be stored in memcache, negating any performance penalties of joining revision against a text table or fetching from external store.

Revisions, Abuse Filter, and Patrolling
If we move outside the core Revision scheme, we still need to handle tasks like the Abuse Filter, Patrolling, and more. I met Krinkle while he was in SF and got a quick 20 minute rundown on how some of this works, and all I can say at this point is that it's an organically designed system that would be significantly difficult to come up with a new solution for. It would also be significantly difficult to integrate into the existing system.

I'm kind of hoping that we can ignore this part at the beginning, and later with the help of Krinkle find some way to integrate with the existing solution.

Problems which need to be addressed:
 * Flow wants to store the HTML+RDFa directly from VE, essentially as an optimization to avoid the parser and parser cache entirely. The current system for auto-magically detecting abuse works by regex'ing against wiki text.  The two are not compatible.
 * Something about Bots/IRC/etc. Basically the bots
 * Anyone who knows more about this system than I do (everyone), feel free to expand this section.

What gets stored in a post?
The post is really a denormalization of the first revision (for who created it and when) and the last revision (for the content). I propose that the tree nodes representing a post contain no extra metadata beyond what's needed to represent the tree, and utilize a 1 to Many table to relate the tree node id to a list of revisions. We can denormalize the full data needed into memcache. Falling back to query the minimum and maximum revisions should be a simple point lookup within the 1 to Many table.
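A minimal sketch of that fallback lookup, using an in-memory SQLite database. The tree_node_revision table name appears earlier in this document; the column names, and the assumption that revision ids increase over time so that the minimum id is the first revision and the maximum id is the latest, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tree_node_revision (
        node_id       INTEGER NOT NULL,
        rev_id        INTEGER NOT NULL,
        rev_user      TEXT NOT NULL,
        rev_timestamp TEXT NOT NULL,
        rev_content   TEXT NOT NULL,
        PRIMARY KEY (node_id, rev_id)
    )
    """
)
conn.executemany(
    "INSERT INTO tree_node_revision VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "Alice", "2013-04-01T10:00:00Z", "frist post"),
        (1, 11, "Alice", "2013-04-01T10:01:00Z", "first post"),
        (1, 15, "Bob", "2013-04-02T08:30:00Z", "first post (tidied)"),
    ],
)


def load_post(node_id):
    """Build the denormalized post: creator and creation time from the
    first revision, content from the last one."""
    first = conn.execute(
        "SELECT rev_user, rev_timestamp FROM tree_node_revision "
        "WHERE node_id = ? ORDER BY rev_id ASC LIMIT 1",
        (node_id,),
    ).fetchone()
    if first is None:
        return None
    last = conn.execute(
        "SELECT rev_content, rev_timestamp FROM tree_node_revision "
        "WHERE node_id = ? ORDER BY rev_id DESC LIMIT 1",
        (node_id,),
    ).fetchone()
    return {
        "creator": first[0],
        "created": first[1],
        "content": last[0],
        "last_edited": last[1],
    }


post = load_post(1)
print(post)
```

Both queries walk the (node_id, rev_id) primary key from one end, which is what makes them point lookups. The result is what would be written into memcache.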

A Users Board
All flow objects will be subscribable by any number of users. Generating the user's feed is a sort on the subscribed flow objects ordered by last updated date. SQL is going to hate us; MySQL can't answer this with a single index.
 * Independent subscription table implementing a Many to Many relationship between flow objects and users

That's OK because we plan to always read it from the cache. We will pre-build this cache as the objects are updated. This will require maintaining two sets of lists: a simple set, per object, of all users subscribed to that object, and a sorted set, per user, containing the object ids they are subscribed to, ordered by last update time and capped to a reasonable size (100?). Redis sorted sets never contain duplicate ids; adding the same value twice just updates its score (timestamp). They sound like a great solution for this pre-caching.
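The two structures can be sketched in plain Python to show the intended behaviour. This only emulates sorted-set semantics with dicts; it does not talk to Redis, and the function names and the `FEED_CAP` constant are hypothetical.

```python
from collections import defaultdict

FEED_CAP = 100  # the "reasonable size" from the text, still undecided

# Per object: the simple set of subscribed users.
subscribers = defaultdict(set)
# Per user: object id -> last update timestamp (acts as the sorted set).
feeds = defaultdict(dict)


def subscribe(user_id, object_id):
    subscribers[object_id].add(user_id)


def touch(object_id, timestamp):
    """Called when a flow object is updated: push it into every
    subscriber's feed, updating the score if it is already present."""
    for user_id in subscribers[object_id]:
        feed = feeds[user_id]
        feed[object_id] = timestamp  # same id twice just updates the score
        if len(feed) > FEED_CAP:
            # Drop the least recently updated entry to respect the cap.
            oldest = min(feed, key=feed.get)
            del feed[oldest]


def read_feed(user_id):
    """Object ids for one user, most recently updated first."""
    feed = feeds[user_id]
    return sorted(feed, key=feed.get, reverse=True)


subscribe("alice", "topic1")
subscribe("alice", "topic2")
subscribe("bob", "topic1")
touch("topic1", 10)
touch("topic2", 20)
touch("topic1", 30)
print(read_feed("alice"), read_feed("bob"))
```

Note that `touch` does work proportional to the number of subscribers of the updated object, so heavily subscribed objects make updates expensive.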

In relation to the above, we must investigate the performance implications of a flow object that has 50k subscriptions, since every update to it has to be pushed into 50k per-user lists. I don't see any particular performance issues in the reverse case of a user with 50k subs, although their usability may decline as the proposed 100 item list will only cover a short period of time. I don't think you can reasonably expect usability from 50k subscriptions anyway.

Remembering what a user has already seen
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know which WorkflowModels have previously been seen by the user.
 * Remembering a boolean true seen status for every model instance multiplied by all editors is a ton of data and seems sub-optimal.
 * Could remember the last date a user viewed items from that WorkflowObject within the subscription and only display items newer than that
 * Each topic within a discussion is its own WorkflowObject, so the memory is more granular than just the main page, but is it enough?
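The "last date viewed" option from the list above could look something like this. It is a sketch only: the in-memory dict, the integer timestamps and the item shape are assumptions, and it stores one value per (user, WorkflowObject) pair instead of one flag per item.

```python
# (user_id, object_id) -> timestamp of the user's last visit
last_seen = {}


def unseen_items(user_id, object_id, items):
    """Return only the items updated after the user's last visit."""
    cutoff = last_seen.get((user_id, object_id), 0)
    return [item for item in items if item["updated"] > cutoff]


def mark_seen(user_id, object_id, timestamp):
    last_seen[(user_id, object_id)] = timestamp


posts = [
    {"id": "p1", "updated": 100},
    {"id": "p2", "updated": 200},
    {"id": "p3", "updated": 300},
]
before = unseen_items("alice", "topic1", posts)
mark_seen("alice", "topic1", 200)
after = unseen_items("alice", "topic1", posts)
print([p["id"] for p in before], [p["id"] for p in after])
```

The trade-off is granularity: anything older than the cutoff counts as seen, whether or not the user actually read it.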

Suppressed Revisions
Need comparable functionality
 * The existing revision code for performing the restrictions is fairly simple, so we should reuse it where possible. We may have to implement our own user interface to this, although if it could be integrated into the standard suppressed revisions that would be preferable.

Search
Handled user side in javascript, in the backend, or both (probably both)?
 * SolrCloud is on the horizon, we will likely want to hook into that
 * Is it possible SolrCloud can handle our tagging searches as well? Sounds like exactly the kind of thing
 * What elements does the UI call for in search results?

Tags
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
 * Conceptually it might be simpler to store them separately.
 * Storing them separately also provides a much stronger guarantee of not accidentally displaying private tags
 * I don't know much about SolrCloud yet, but it seems like it would be a strong contender for simplifying tagging

So you want to find a talk page
MediaWiki should work great without flow. After installing the flow extension it should continue to work great and the talk pages will be flow discussions.

Currently you take the title of the page and look it up within the NS_TALK namespace. With flow we need something different, but conceptually similar?

Old talk pages: should the current urls still point to the old talk pages, or should they move and flow replaces them on those urls?
 * Current talk page comments likely(?) link to Talk:Something directly. For best results they should go to the new flow discussions?
 * If so, then there is also Talk:Something and we would like to continue pointing to the correct data.
 * Same concern as the 2 points above, but with urls from the internet at large. Do talk pages get linked directly from outside with any frequency?

Talk page urls:
 * An article in NS_MAIN: /wiki/Talk:Volcano
 * An article in any other NS: /wiki/NameOfNamespace_Talk:Volcano

KISS flow urls for first prototype:
 * /wiki/Special:Flow/NameOfNamespace:Volcano?flow=discussion

Possible Flow urls? Not currently supported in core (to my knowledge; there is quite possibly a hook of some sort).
 * An article in NS_MAIN: /wiki/Flow:Volcano/Talk
 * An article in any other NS: /wiki/NameOfNamespace_Flow:Volcano/Talk

Mapping URLs to Flow objects
The path Flow:Volcano/Talk should not be some super special case; there should be some way to attach specific types of flows to default paths. Visiting the page when nothing currently exists must work much like the current system, where the user is given the opportunity to create that specific thing. While we may initially hard-code /Talk into the flow prototype, it would be much better if the mapping from a name to some default type of flow object were managed on wiki by the community. Additionally, i18n and l10n considerations need to be taken into account, as not every wiki uses the word 'Talk' for its talk pages. Mostly this matters in relation to GUID generation and what we use as the GUID namespace/name.

For the prototype I propose the following:

/wiki/Special:Flow/NameOfNamespace:Volcano?flow=discussion&name=foo


 * The flow query parameter refers to the name of a Workflow Definition. Every workflow definition must have a unique name (on a per-wiki basis). Flow will have, by default, a few Workflow Definitions pre-defined. One of these will be a 'discussion' workflow, and if the flow query parameter is not provided it will default to 'discussion'.
 * I'm uncertain at this point in time how i18n and l10n considerations come into play with this parameter, partially because it is defined on-wiki in the general case.


 * The name query parameter refers to the Flow Object's name. The flow object table will have a unique index on ( wiki, namespace, title, definitionId, name ). This allows for multiple instances of a specific workflow to exist at the same time. Workflows that require only a single object instance to exist must use a blank string as their name.
 * TBH I'm not entirely convinced object names are a great solution here, but I'm undecided on what is a better solution. We need some method of allowing multiple Flow Objects to exist for a given wiki+title+definition to account for workflows like RFD which can be completed and Locked. In the future it must be possible to create a new RFD and have the old one around for historical purposes.
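A small sketch of how the proposed prototype url could be taken apart, with the defaults described above ('discussion' when flow is missing, a blank string when name is missing). The `parse_flow_url` helper and the returned dict keys are hypothetical, and splitting namespace from title on the first colon is a simplification that ignores main-namespace titles containing a colon.

```python
from urllib.parse import parse_qs, urlsplit

PREFIX = "/wiki/Special:Flow/"


def parse_flow_url(url):
    """Split a prototype flow url into the pieces that would be used
    to look up a flow object."""
    parts = urlsplit(url)
    if not parts.path.startswith(PREFIX):
        raise ValueError("not a Special:Flow url")
    target = parts.path[len(PREFIX):]

    # Simplification: treat everything before the first colon as the
    # namespace, and no colon as the main namespace.
    namespace, sep, title = target.partition(":")
    if not sep:
        namespace, title = "", target

    query = parse_qs(parts.query)
    return {
        "namespace": namespace,
        "title": title,
        # Defaults taken from the proposal above.
        "flow": query.get("flow", ["discussion"])[0],
        "name": query.get("name", [""])[0],
    }


full = parse_flow_url(
    "/wiki/Special:Flow/NameOfNamespace:Volcano?flow=discussion&name=foo"
)
bare = parse_flow_url("/wiki/Special:Flow/Volcano")
print(full)
print(bare)
```

Together with the wiki id, the four returned values line up with the columns of the proposed unique index, with the workflow name standing in for definitionId.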

Performance Considerations
How much data will flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have a talk page, but many will. Assuming they all have talk pages ranging from just a post or two to a couple thousand posts on the largest, we should expect a lower bound of perhaps 100M individual replies that will need to be stored in perhaps 20M separate discussion graphs. If each reply consumes 1kb of space, that puts a lower bound of at least 100GB on post content before we even get into metadata, indexes, etc.
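The arithmetic behind that lower bound, spelled out. The inputs are the rough guesses from the paragraph above; treating 1kb as 1,000 bytes and 1GB as 10^9 bytes is an assumption.

```python
replies = 100_000_000      # guessed lower bound on individual replies
discussions = 20_000_000   # guessed number of discussion graphs
bytes_per_reply = 1_000    # "1kb" per reply, taken as 1,000 bytes

total_bytes = replies * bytes_per_reply
total_gb = total_bytes / 1_000_000_000
replies_per_discussion = replies / discussions

print(f"{total_gb:.0f} GB of post content")
print(f"{replies_per_discussion:.0f} replies per discussion on average")
```

Under these guesses an average discussion holds 5 replies, so doubling either the reply count or the per-reply size doubles the estimate.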

And that is just the discussions, flow will need to handle many more workflows than just discussions.

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:


 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:


 * about 5 sections per talk page
 * about 2.4 signatures per section
 * about 12 signatures per talk page
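For reference, the averages follow directly from the three counts reported for the dump; the variable names are only illustrative.

```python
pages = 43952         # pages in Talk: or *_Talk: namespaces
sections = 211904     # individual section headers
signatures = 514916   # user signatures

sections_per_page = sections / pages
signatures_per_section = signatures / sections
signatures_per_page = signatures / pages

print(f"{sections_per_page:.1f} sections per talk page")
print(f"{signatures_per_section:.1f} signatures per section")
print(f"{signatures_per_page:.1f} signatures per talk page")
```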

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle. It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Caching possibilities
See Flow_Portal/Architecture/Memcache

Crazy Idea
One flow instance, many wikis. What if Flow was an independent wiki used as a service by other wikis? This is mostly brainstorming, and probably too complex to tackle effectively. Ideas here could possibly propagate into the main implementation ideas if viable.

Benefits:
 * The same flow can be referred from any wiki. For example, commons and en.wiki
 * The database can be independent from any specific wiki
 * All wikis share the benefit of the work the ops team has to do to provide sharded database support.
 * It could (possibly) replace the mapping of url -> page giving flow full control of its URL structure.

Downsides:
 * cross-site javascript requests?
 * Difficulty for the project to be used in general mediawiki installs outside wikimedia?
 * Many, many more that I probably don't know yet

Questions:
 * How would you refer to a flow from a different wiki via markup?
 * Use standard interwiki links?
 * What do the urls look like for those flows? Do user browsers communicate directly with the central flow instance, or do users talk to the wiki and the wiki talks to flow?
 * If flow is independent, how does flow notify echo to generate notifications?

Inconsequential Things to Consider (or not)

 * use php namespaces?

Reference Material

 * en.wiki info about flow
 * Tumblr mysql sharding presentation
 * Ruby scripts for handling operations side of tumblr sharding