Flow/Architecture

DISCLAIMER: All decisions, especially in regards to storage layout, are fluid. We fully expect to delete the data created by the prototype on extensiondb1 before deploying the final solution.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow – a very simple one.

Templates are often used to implement ad-hoc workflows.
In many cases local wikis use templates to encourage workflow within them. The goal for Flow's workflow models is to be dynamic enough to be managed by local wiki administrators to cover use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

Cross-wiki database
Flow will be vertically partitioned away from core MediaWiki into a single database shared by all wikis. In the future, if necessary, it may be sharded across multiple masters. While the Flow MVP ("Minimum viable project – our initial release) will store comments from all wikis in a single database, it will not implement display of a piece of data on wikis other than the wiki it was created on. There are checks in a variety of places to ensure cross-wiki data is not displayed until we have a chance to focus on the implications (user IDs, page IDs, configuration differences of the wiki and its extensions, and many things we don't even know yet).

ID Generation
Flow uses 128-bit timestamped identifiers that are unique across machines. It stores them in the database as  rows. Flow has a UUID model class to deal with ids in either binary or hex as necessary; internally this generates IDs using the  class in MediaWiki core.

Workflow
Globally unique identifier of an individual instance of a workflow. A discussion topic is a Workflow instance (for example, /wiki/User_talk:Maryana?workflow=0506af29cfae6e5b09a3fa163e68c4ac). A request for deletion of page Abc is a Workflow instance. etc.

To do (potentially) Use Case for lock state (or some other solution):
 * ID - A unique identifier of this workflow
 * wikiId - The wiki id of the owning Title
 * pageId - The article id of the owning Title
 * namespace - The numeric namespace of the owning Title
 * titleText - The db key of the owning Title
 * userId - The user id of the initial workflow creator
 * userText - A user name of the initial workflow creator
 * definitionId - Id of the flow Definition this workflow is a type of
 * contentLanguage
 * Workflows may occur in different languages, this can help act as a filter for workflows a user can understand.
 * lock State - Or some other method of marking workflows with a state, such as closed

If an open request for deletion exists on a specific article, cannot open a new one. If a closed RFD exists for a page it should be linked to the newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community.

Definition
Each wiki will be able to define its own types of workflows (discussion, consensus-voting workflows, etc.). We are calling a type of workflow a definition. The MVP hardcodes two workflow definitions to support peer-to-peer content discussions, "topic" and "discussion". To keep things simple, we are using a flow_definition table in sql, which primarily uses a wikiId + workflow name with a unique index for lookup. The options set within a workflow definition are intended to eventually range from something as simple as a list of voting options to LUA scripts that perform a bit of complex logic.
 * ID - A unique identifier of this definition
 * type - String identifying which implementation this definition configures. In the MVP either 'topic' or 'discussion'
 * wikiId - The wiki id this belongs to
 * name - A Unique(per wiki) name used to identify this definition on wiki
 * options - A key/value map (stored as serialized array currently, as its mostly unused)

Possible workflows to support (eventually) include but are not limited to:
 * two-way user conversation (user talk page owner<->talker)
 * three-way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

TopicListEntry
The topic list is an N to M relation between workflows. Initial use case is a parent discussion workflow is related to many topic workflows within the discussion. These topics can then be included into other discussions. (Considering renaming and adding a type field, to allow generic N to M relations if use cases arise)


 * topicListId - UID of the parent workflow
 * topicId - UID of the child workflow

AbstractRevision

 * revId - UIDGenerator::newTimestampedUID128
 * userId - Id of the user that created this (from the wiki id of the owning workflow, identified by the concrete revisions)
 * userText - The user name that created this
 * parentId - A revision id that this revision is based on
 * changeType - A string identifying the type of action that created this revision(not user generated)
 * type - A string identifying the concrete revision type
 * content - The content, or if ExternalStore is enabled a URL from ES.
 * flags - Array(comma separated in storage) of string flags that apply specifically to the content. Examples include utf-8, html, etc.
 * modState - String identifier of the revisions current moderation state
 * modUserId - Denormalized Id of the user that most recently moderated this revision (from the wiki id of the owning workflow)
 * modUserText - Denormalized user name of the moderating user
 * modTimestamp - Denormalized wfTimestampNow created when moderation most recently occurred
 * lastEditId - Denormalized UID of the revision that is the last content edit
 * lastEditUserId - Denormalized id of the user (from the wiki id of the owning workflow) that performed the last content edit
 * lastEditUserText - Denormalized name of the user that performed the last content edit

Notes:
 * The canonical source of who moderated what when is to look at the changeType of all related revisions and who created the revisions with moderation related change types. The denormalized moderation information is related to the most recent moderation action performed against a series of revisions without changing the canonical information they store.
 * The canonical source of who edited the content when is to check which revisions content does not match their parent. This is denormalized so the data model can expose the most recent content editor to the UI without performing extra lookups.

HeaderRevision
A very simple piece of revisionable content displayed at the top of every discussion with no custom aspects beyond AbstractRevision
 * workflowId - The workflow that this header belongs to

PostRevision
A revisionable piece of content related to other posts in a 1 to N parent/child relationship.

While not explicitly defined in each post, the owning workflow has the same UID as the postId of only parent with a null replyToId in a tree of posts. This post represents the title of the topic. This is cached elsewhere.
 * replyToId - UID of the posts parent
 * postId - UID of this post
 * revId - UID of the related AbstractRevision
 * origCreateTime - Denormalized creation time (extracted from UID) of the first revision of this post
 * origUserId - Denormalized user id (from the wiki id of the owning workflow) that created the first revision of this post
 * origUserText - Denormalized user name that created the first revision of this post

<!-- this section is only relevant to future developers that wish to edit flow code and understand why its arranged as it is

Ramblings on Workflows and Blocks

 * See also Flow Nomenclature for the user-facing terminology

A workflow will be composed of a set of Blocks (kind of like a widget in a dashboard). The workflow should be able to string together multiple blocks into a single result page.

For example, the Discussion workflow that you see in the MVP is a Header block at the top of the page and a TopicList block below that. The TopicList is a Block acting as a wrapper around a set of Topic Blocks. A Topic Block points to the root of a post tree, posts themselves are not blocks and are handled via the Topic block. (This may be too abstract). Block implementations heavily resemble Controllers in a standard MVC application. They accept user input, do something in the application layer with it, and then generate output for the user.

A user's Board, when viewed by the owner, will be a separate workflow. When viewed by other users, it will be a standard Discussion workflow. But when viewed by the owner, it is the standard Header block, but with a SubscriptionList(not implemented yet) block below that. A SubscriptionList is very similar to a TopicList but has different semantics for user interaction. A user by default is subscribed to their own board. This means they will be subscribed to all threads created on that board. Users will be able to subscribe to any Flow workflow(discussion, topic, etc) on any WMF wiki, and it will also appear within the same SubscriptionList.

Rather than attempting to make each piece of a block (the topics) subscribable/taggable/etc, Topics within Flow are their own workflow object with an associated Topic workflow definition. The workflow will be a single block, the Topic block. Subscriptions and tagging will be available to any workflow instance.
 * discussion workflow


 * 0 or 1 Header - A discussion can have or not have a header, same UID as the workflow
 * 0 to Many Topics: A discussion is related to many topics

Future:
 * topic
 * Exactly 1 root post with the same UID as the workflow
 * Root post is the revisionable topic title, replies to the topic are children of the root post
 * Currently uses a Closure table sql implementation to avoid recursive queries to find the root of posts multiple levels deep. This has a size complexity of N * M where N is the number of posts and M is the average depth. The average depth is estimated to be between 2 and 5 within Flow based on current user interface that encourages depth < 4. Some can still climb to 10 deep or more but on average we believe it will stay low.
 * We are still not sure the closure table is a reasonable implementation due to the size complexity. Within the application this is hidden behind an interface and we are considering trying out a materialized path implementation within sql, while retaining the same memcache structure.  The concern with materialized paths is that we would need to choose an index length within SQL for the concatenated root path.  We would then have to enforce a hard limit on nesting, but that isn't the end of the world a depth of 10 require an indexed length of 160 bytes.
 * Within memcache we store the path from every post to its root ( post->parent->parent->root ) as a serialized array. For the root we store a serialized associative array of child => parent relationships of all depths within the root. This is enough information to rebuild the tree and very easy to add posts to.  Loading posts is typically done by fetching the map for the root related to a workflow and ensuring the node is within the tree.  Whatever posts are needed for context are then loaded along with the post.


 * FlowRequestForDeletion
 * 1 Summary (Reason for request)
 * 1 to Many: FlowEnumeratedLines
 * FlowBlockNotice
 * 1 Summary (Block Reason)
 * Functional Elements
 * Button for 'appeal this block'
 * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular


 * And Many More

Types of definitions
Flow will need a variety of configurable definitions. One definition can refer to multiple other definitions to create more complex workflows. In the MVP these definitions are not configurable, beyond the references to other definitions.
 * Discussion
 * Refers to a Header definition
 * Refers to a Topic definition


 * Header
 * Header definitions are 1 to 1 with its owning workflow, allowing for arbitrary content to be displayed with workflows that desire a header
 * The content is revisioned

Future: -->
 * Topic
 * Contains a tree of posts. The root post in the tree has the same UID as the owning topic workflow and represents the revisionable title of the topic.
 * Voting
 * At a minimum we need a configurable voting type (probably an extended topic implemtation)
 * Minimally needs enum value (e.g. vote yes/no, etc.) and text content of user reason

Revisions
Posts, Headers, and possibly a wide variety of content within Flow needs to be revisioned much like normal wiki pages.
 * Flow revisions have a different use case than wiki Article revisions
 * Articles have hundreds to (tens of?) thousands of revisions
 * Flow currently stores a revision per post, most posts will only ever have a single revision. Posts that go through a few edits and moderation cycles will still only have 10's of revisions.
 * Other pieces of flow, such as discussion headers, will have perhaps dozens or hundreds (thousands wouldn't be unreasonable) of revisions, but still not nearly as many as an article.
 * Flow revisions need different metadata depending on where the piece is used (header, post, etc).
 * Content is stored in External Store (same as article revisions in core MW). You can configure the ExernalStore servers that Flow uses independent from core.
 * The actual content when rendering pages will typically come from memcache so not as worried about needing multiple queries or joins to get revision + content.
 * External Store has been extended to batch requests for multiple pieces of content such that only 1 query is issued per server.

Other thoughts

 * Content storage is a separate concern from the revision metadata, the revisions should continue to point somewhere else(text table/external store/revisions dont care) for the actual storage.
 * Currently we fetch the content and store it with the metadata in memcache. This is primarily fetched via multiGet, hopefully negating much of the performance penalty related to fetching the content separately.
 * We would like to transition to the content storage API that gwicke is working on for parsoid when it is ready.
 * See /Discussion_Storage (outdated)

Abuse Filter, and Patrolling (future)
Flow revisions are stored outside the core Revision scheme, but we still need to handle tasks like the Abuse Filter, Patrolling, and more. I met Timo while he was in SF and got a quick 20 minute rundown on how some of this works, and all I can say at this point is its an organically designed system for which a new solution would be significantly difficult. It would also be significantly difficult to integrate into the existing system.

For the MVP we are planning a small targeted release to only a few wiki projects. We are optimistic that such a small release will be amicable to moderation by users. We must integrate the with the existing infrastructure before any wide release.

Problems which need to be addressed:
 * Flow wants to store the HTML+RDFa directly from VE, essentially as an optimization to avoid the parser and parser cache entirely. The current system for auto-magically detecting abuse works by regex'ing against wiki text.  This basically means regardless of what the user sends us in an edit, we will have to use parsoid to either get wikitext for html to send to abuse filter, or use parsoid to get html for the wikitext for storage.
 * Bots are able to access Flow using its own API, which gives access equivalent to UI access. However, they will not be able to use standard APIs for editing, moving and protecting, because Flow is its own thing and does not use the MediaWiki page model.
 * Since Flow actions are sent to Recent Changes, integrating with IRC will work correctly.
 * What else do we need to say about bots?

A User's Board (future)
All flow objects will be subscribe-able by any number of users Generating the users feed is a sort on the subscribed workflows ordered by last updated date. SQL is going to hate us, mysql can't answer this with a single index.
 * Independent subscription table implementing a Many to One relationship between workflows and users

That's ok because we plan to always read it from the cache. We will pre-build this cache as the objects are updated. This will require maintaining two sets of lists. One a simple set, per object, of all users subscribed to that object, and the other a sorted set, per user, containing object ids they are subscribed to ordered by the last update time and capped to a reasonable size(100?). Redis sorted sets will never have contain duplicate ids; adding the same value twice will just update its score(timestamp). They sound like a great solution for this pre-caching.

In relation to above, we must investigate the performance implications of a workflow that has 50k subscriptions. I don't see any particular performance issues(beyond what watchlist already has when fetching the uncached copy) in this case for the user with 50k subs, although their usability may decline as the proposed 100 item list will only hold a short period of time. I don't think you can reasonably expect usability from 50k subscriptions anyways, unless you get into some fancy Facebook-like "does the user want to see this?" stuff which we probably don't want to go down.

Remembering what a user has already seen (future)
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know what Workflowl's have been previously seen by the user
 * Remembering a boolean true seen status for every model instance multiplied by all editors is a ton of data and seems sub-optimal.
 * Could remember the last date a user viewed items from that Workflow within the subscription and only display items newer than that
 * Each topic within a discussion is its own Workflow, so the memory is more granular than just the main page, but is it enough?

Suppressed Revisions
Content within flow must support user moderation with multiple levels of restriction. To accomplish this the Flow MVP has a moderation state field stored with the revision that identifies its current level of restriction. Typical accessor methods for user generated content in the revisions accept a user object and return either the content or an "access denied" message, as appropriate to the restriction level.

While all revisions use the same abstract revision and the same revision storage, the actions performed when applying moderation are allowed to vary between concrete revision types.

Current moderation states: hide (available to everyone as a click-through), delete (available only to admins), suppress (available only to "suppressors")

Headers - WIP, the history page that the interface for moderating headers goes on will soon be merged (read: not suppressible from user perspective, but the data model supports moderation)

Posts - Still refining, current patch moderates all historical revisions of the post with newest moderation state. It then inserts a new revision created by the moderating user with an appropriate change type to track who moderated what when. This patch also writes a username and id to the historical revisions to satisfy visual design, but this is not a canonical source and only represents the most recent moderator.

Search (future)
Handled user side in javascript, in the backend, or both (probably both)?
 * Elastic Search is on its way. We will be integrating with it as possible.
 * We think Elastic Search can likely handle our tagging searches as well.
 * What elements do the UI call for in search results?

Tags (future)
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
 * Conceptually it might be simpler to store them separately.
 * Storing the separately also provides a much stronger guarantee of not accidentally displaying private tags
 * We have not delved into Elastic Search yet, but it seems like it would be a strong contender for simplifying tagging

So you want to find a talk page
For the MVP Flow will only be enabled on the talk pages of a few Wikiprojects that have agreed to trial the software. We explicitly do not want to enable Flow onto just any user defined pages or entire talk namespaces, but instead to limit the release to these specific areas  To accomplish this the Flow PHP configuration includes a list of article titles it will be seen on. The existing page must be moved (we suggest to a /Archive subpage) before enabling Flow for it.

This is a targeted solution which specifically solves the needs of the MVP, it will need to change before considering wider release.

Mapping URLs to Flow workflows
There are two main types of workflow definitions: unique and everything else. A workflow of a unique definition type can be requested with only the name of the definition, no uid required, e.g.  /wiki/Wikipedia_Talk:Project_Volcanos?flow=discussion&action=edit-header


 * The flow query parameter refers to the name of a Workflow Definition. Every workflow definition must have a unique name (on a per-wiki basis).  As already explained Flow has a few Workflow Definitions pre-defined.  One of these is a 'discussion' workflow, and if the flow query parameter is not provided it will default to 'discussion'.
 * The action query parameter is also optional and defaults to 'view'

Non-unique workflows must be referenced with a workflow query parameter giving their full 128-bit uid, e.g.  /wiki/Wikipedia_Talk:Project_Volcanos?action=topic-history&workflow=0506f5dab48e6e5b09a3fa163e68c4ac


 * actions for non-unique workflows include edit-header for a discussion, edit-title for a topic, edit for a single post, moderation actions for a post, etc.

Performance considerations
How much data will flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have talk page, but many will. Assuming they all have talk pages ranging from just a post or two, to a couple thousand posts on the largest, we should expect a lower bound of perhaps 100M individual replies will need to be stored in perhaps 20M seperate discussion graphs (not today, or even within the first year, but an approximation of size within a few years). If each reply consumes 1kb of space that puts a lower bound of at least 100GB of post content before we even get into metadata, indexes, etc.

And that is just the discussions, Flow will eventually handle many more workflows than just discussions.

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:


 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:


 * 5 sections per talk page
 * 2.5 signatures per section
 * 12 signatures per talk page

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle.

It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Caching possibilities
See /Memcache

Reference Material

 * en.wiki info about flow