Flow/Architecture

Purpose of this Document
This document is an early brainstorming exercise for the first Flow prototype. The final implementation will likely resemble what you see here only tangentially. In other words, this is a pre-implementation brain dump which will be referenced while building the first prototype. In my experience, once the ideas for the internal architecture are laid out, the first implementation doesn't take long to actually convert into code.

DISCLAIMER: All decisions, especially with regard to storage layout, are fluid. We fully expect to delete the data created by the prototype on extensiondb1 before deploying the final solution.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are often used to implement ad-hoc workflows.
In many cases templates are used to encourage workflows within local wikis. The goal is for the various workflow models to be dynamic enough to be managed from the wiki (by local administrators), covering the use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

Cross wiki database
Flow will be vertically partitioned away from core MediaWiki into a single database shared by all wikis. In the future, if necessary, it may be sharded across multiple masters. While the Flow MVP will store all wikis in the same database, for the MVP we have not implemented display of data on wikis other than the one it was created on. There are currently checks in a variety of places to ensure cross-wiki data is not displayed until we have a chance to focus on the various implications this creates (user IDs, page IDs, configuration differences between wikis and their extensions, and many things I don't know about yet).

ID Generation
128-bit timestamped identifiers will be generated with MediaWiki's UIDGenerator. They are stored in the database as binary(16) values. Flow uses a UUID model class to deal with IDs in either binary or hex form as necessary.
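A minimal PHP sketch of what such a model class might look like; it assumes MediaWiki's UIDGenerator, but the class itself is illustrative rather than Flow's actual UUID API:

    <?php
    // Illustrative wrapper around a 128-bit timestamped id. Only
    // UIDGenerator::newTimestampedUID128 comes from MediaWiki; the rest
    // is a hypothetical sketch.
    class FlowUUIDSketch {
        /** @var string 16-byte binary form, matching the binary(16) column */
        private $binary;

        public static function create() {
            // newTimestampedUID128() can emit the id in a chosen base;
            // base 16 gives hex, zero-padded to the full 32 characters.
            $hex = str_pad( UIDGenerator::newTimestampedUID128( 16 ), 32, '0', STR_PAD_LEFT );
            return self::fromHex( $hex );
        }

        public static function fromHex( $hex ) {
            $uuid = new self();
            $uuid->binary = pack( 'H*', $hex ); // hex -> 16 raw bytes
            return $uuid;
        }

        public static function fromBinary( $binary ) {
            $uuid = new self();
            $uuid->binary = $binary;
            return $uuid;
        }

        public function getBinary() { return $this->binary; }   // for SQL
        public function getHex() { return bin2hex( $this->binary ); } // for URLs
    }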

Workflow
Globally unique identifier of an individual instance of a workflow. A discussion topic is a Workflow instance. A request for deletion of page Abc is a Workflow instance. etc.

To do (potentially): a use case for lock state (or some other solution). Fields:
 * ID - A unique identifier of this workflow
 * wikiId - The wiki id of the owning Title
 * pageId - The article id of the owning Title
 * namespace - The numeric namespace of the owning Title
 * titleText - The db key of the owning Title
 * userId - The user id of the initial workflow creator
 * userText - A user name of the initial workflow creator
 * definitionId - Id of the flow Definition this workflow is a type of
 * contentLanguage - Workflows may occur in different languages; this can act as a filter for workflows a user can understand
 * lockState - Or some other method of marking workflows with a state, such as closed

If an open request for deletion exists for a specific article, a new one cannot be opened. If a closed RFD exists for a page, it should be linked from any newly opened RFD for context. Various restrictions like this need to be definable in a generic way by the wiki community.
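As a concrete example, the "only one open RFD" rule could reduce to a pre-insert lookup against the workflow fields above. A minimal PHP sketch, assuming hypothetical table and column names (the 'open' lock-state value is likewise invented):

    <?php
    // Hypothetical guard: refuse a new RFD workflow when an open one already
    // exists for the same page. Assumes a flow_workflow table mirroring the
    // field list above and a MediaWiki database handle in $dbr.
    function canCreateWorkflow( $dbr, $wikiId, $pageId, $definitionId ) {
        $existing = $dbr->selectField(
            'flow_workflow',
            'workflow_id',
            array(
                'workflow_wiki_id' => $wikiId,
                'workflow_page_id' => $pageId,
                'workflow_definition_id' => $definitionId,
                'workflow_lock_state' => 'open', // invented state value
            ),
            __METHOD__
        );
        return $existing === false; // no open instance -> creation allowed
    }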

Definition
Each wiki will be able to define its own types of workflows (discussion, various consensus-voting workflows, etc.). We are calling a type of workflow a definition. In the MVP, the discussion model is a hard-coded implementation. To keep things simple, we are using an SQL definition table, which primarily uses a wikiId + workflow name with a unique index for lookup (sketched after the field list below). The options set within a workflow definition are intended to eventually range from something as simple as a list of voting options to Lua scripts that perform a bit of complex logic.
 * ID - A unique identifier of this definition
 * type - String identifying which implementation this definition configures. Currently 'topic' or 'discussion'
 * wikiId - The wiki ID this belongs to
 * name - A unique (per-wiki) name used to identify this definition on-wiki
 * options - A key/value map (currently stored as a serialized array, as it's mostly unused)
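A sketch of the name-based lookup that the unique index enables; the table and column names are assumptions derived from the field list:

    <?php
    // Hypothetical lookup of a definition by its per-wiki unique name,
    // e.g. loadDefinition( $dbr, 'enwiki', 'discussion' ).
    function loadDefinition( $dbr, $wikiId, $name ) {
        $row = $dbr->selectRow(
            'flow_definition',
            array( 'definition_id', 'definition_type', 'definition_options' ),
            array(
                'definition_wiki_id' => $wikiId, // the unique index covers
                'definition_name' => $name,      // (wiki id, name)
            ),
            __METHOD__
        );
        if ( $row === false ) {
            return null; // no such definition on this wiki
        }
        // options are currently a serialized array and mostly unused
        $row->definition_options = unserialize( $row->definition_options );
        return $row;
    }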

Possible workflows to support (eventually) include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

TopicListEntry
The topic list is an N-to-M relation between workflows. The initial use case is a parent discussion workflow related to many topic workflows within the discussion. These topics can then be included in other discussions. (Considering renaming this and adding a type field to allow generic N-to-M relations if use cases arise.) A storage sketch follows the fields below.


 * topicListId - UID of the parent workflow
 * topicId - UID of the child workflow
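In storage terms this is just a two-column join table. A hedged PHP sketch of relating a topic to a discussion (table and column names are invented):

    <?php
    // Hypothetical: relate a topic workflow to a parent discussion workflow.
    // Both ids are the binary(16) form described under ID Generation.
    function addTopicToList( $dbw, $topicListId, $topicId ) {
        $dbw->insert(
            'flow_topic_list',
            array(
                'topic_list_id' => $topicListId, // parent workflow UID
                'topic_id' => $topicId,          // child workflow UID
            ),
            __METHOD__,
            array( 'IGNORE' ) // the relation is a set; re-adding is a no-op
        );
    }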

AbstractRevision

 * revId - UIDGenerator::newTimestampedUID128
 * userId - ID of the user that created this revision (from the wiki of the owning workflow, which the concrete revision types identify)
 * userText - The user name that created this revision
 * parentId - The revision ID this revision is based on
 * changeType - A string identifying the type of action that created this revision (not user-generated)
 * type - A string identifying the concrete revision type
 * content - The content, or, if ExternalStore is enabled, a URL into ES
 * flags - Array (comma-separated in storage) of string flags that apply specifically to the content. Examples include utf-8, html, etc.
 * modState - String identifier of the revision's moderation state
 * modUserId - ID of the user that moderated this revision (from the wiki of the owning workflow)
 * modUserText - The user name of the moderating user
 * modTimestamp - wfTimestampNow() result recorded when moderation occurred
 * lastEditId - Denormalized UID of the revision that is the last content edit
 * lastEditUserId - Denormalized ID of the user (from the wiki of the owning workflow) that performed the last content edit
 * lastEditUserText - Denormalized name of the user that performed the last content edit

HeaderRevision
A very simple piece of revisionable content displayed at the top of every discussion, with no custom aspects beyond AbstractRevision:
 * workflowId - The workflow that this header belongs to

PostRevision
A revisionable piece of content related to other posts in a 1 to N parent/child relationship.

While not explicitly defined in each post, the owning workflow has the same UID as the postId of the one post in the tree with a null replyToId. That root post represents the title of the topic. This relationship is cached elsewhere.
 * replyToId - UID of the post's parent
 * postId - UID of this post
 * revId - UID of the related AbstractRevision
 * origCreateTime - Denormalized creation time (extracted from the UID) of the first revision of this post
 * origUserId - Denormalized user ID (from the wiki of the owning workflow) that created the first revision of this post
 * origUserText - Denormalized user name that created the first revision of this post

Ramblings on Workflows and Blocks
A workflow will be composed of a set of Blocks (kind of like a widget in a dashboard). The workflow should be able to string together multiple blocks into a single result page.

For example, the Discussion workflow is a Summary block at the top of the page and a TopicList block below that. The TopicList is a Block acting as a wrapper around a set of Topic Blocks. A Topic Block is a set of SinglePost Blocks arranged in a tree. (This may be too abstract)

A User's Board, when viewed by its owner, will be a separate workflow; when viewed by most users, it will be a standard Discussion workflow. For the owner, it is the standard Summary block with a SubscriptionList block below it. A user is subscribed to their own board by default, which means they will be subscribed to all threads created on that board. They can also subscribe to any thread or user discussion page (any Flow object) on any WMF wiki, and it will also appear within the same SubscriptionList.

Rather than attempting to make each piece of a block (the topics) subscribable/taggable/etc., topics within Flow will be their own workflow objects with an associated Topic workflow definition. Each such workflow will be a single block, the Topic block. Subscriptions and tagging will be available to anything with an underlying flow_object.
 * discussion workflow
 * 0 or 1 Header - A discussion can have or not have a header; same UID as the workflow
 * 0 to Many Topics - A discussion is related to many topics

Future:
 * topic
 * Exactly 1 root post with the same UID as the workflow
 * The root post is the revisionable topic title; replies to the topic are children of the root post
 * Currently uses a closure-table SQL implementation, with cached copies of the query results in memcache, to answer lookups like "given post X, what is its root post?" (see the sketch below)
 * Likely not scalable; needs a better solution
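A sketch of that root-post lookup. A closure table stores one row per (ancestor, descendant) pair with the distance between them, so the root of a post is simply its most distant ancestor. Table and column names here are assumptions:

    <?php
    // Hypothetical closure-table query; in practice the result would be
    // served from memcache rather than queried per request.
    function findRootPost( $dbr, $postId ) {
        return $dbr->selectField(
            'flow_tree_node',
            'tree_ancestor_id',
            array( 'tree_descendant_id' => $postId ),
            __METHOD__,
            // the ancestor at the greatest depth from this post is the root
            array( 'ORDER BY' => 'tree_depth DESC', 'LIMIT' => 1 )
        );
    }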


 * FlowRequestForDeletion
 * 1 Summary (Reason for request)
 * 1 to Many: FlowEnumeratedLines
 * FlowBlockNotice
 * 1 Summary (Block Reason)
 * Functional Elements
 * Button for 'appeal this block'
 * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular


 * And Many More

Types of Definitions
Flow will need a variety of configurable definitions. One definition can refer to multiple other definitions to create more complex workflows. In the MVP these definitions are not configurable, beyond the references to other definitions.
 * Discussion
 * Refers to a Header definition
 * Refers to a Topic definition


 * Header
 * A header definition is 1 to 1 with its owning workflow, allowing arbitrary content to be displayed with workflows that want a header
 * The content is revisioned

Future:
 * Topic
 * Contains a tree of posts. The root post in the tree has the same UID as the owning topic workflow and represents the revisionable title of the topic.
 * Voting
 * At a minimum we need a configurable voting type (probably an extended topic implementation)
 * Minimally needs an enum value (e.g. vote yes/no, etc.) and the text content of the user's reason

Revisions
Posts, headers, and possibly a wide variety of other content within Flow need to be revisioned, much like normal wiki pages.
 * Flow revisions have a different use case than article revisions
 * Articles have hundreds to (tens of?) thousands of revisions
 * Flow currently stores a revision per post; most posts will only ever have a single revision. Posts that go through a few edits and moderation cycles will still only have tens of revisions.
 * Other pieces of Flow, like discussion headers, will have perhaps dozens or hundreds (thousands wouldn't be unreasonable) of revisions, but still not nearly as many as the article they relate to.
 * Flow revisions need different metadata depending on where they are used (header, post, etc.).
 * Content is (like core) stored in External Store. The ES servers used are configurable independently of core.
 * The actual content when rendering pages will typically come from memcache, so we are not as worried about needing multiple queries or joins to get revision + content.
 * External Store has been extended to batch requests for multiple pieces of content such that only one query is issued per server (sketched below).
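A sketch of the batching idea: group the DB://cluster/id URLs by cluster so each External Store server sees a single query. fetchBatchFromCluster() is a stand-in for whatever the extended ExternalStore actually exposes:

    <?php
    // Hypothetical batch fetch. Only the DB://cluster/id URL shape comes
    // from MediaWiki's External Store; the helper is invented.
    function batchFetchContent( array $urls ) {
        $byCluster = array();
        foreach ( $urls as $url ) {
            // URLs look like DB://cluster25/12345
            $parts = explode( '/', $url );
            $byCluster[$parts[2]][] = $url;
        }
        $content = array();
        foreach ( $byCluster as $cluster => $clusterUrls ) {
            // one query per server rather than one query per revision
            $content += fetchBatchFromCluster( $cluster, $clusterUrls );
        }
        return $content; // map of url => content
    }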

Other thoughts

 * Content storage is a separate concern from the revision metadata; the revisions should continue to point somewhere else (text table, External Store; the revisions don't care) for the actual storage.
 * Currently we fetch the content and store it with the metadata in memcache. This is primarily fetched via multiGet, hopefully negating much of the performance penalty of fetching the content separately.
 * We would like to transition to the content storage API gwicke is working on in relation to Parsoid, when it's ready.
 * See Flow_Portal/Architecture/Discussion_Storage (outdated)

Abuse Filter, and Patrolling (future)
If we move outside the core Revision scheme, we still need to handle tasks like the Abuse Filter, Patrolling, and more. I met Timo while he was in SF and got a quick 20-minute rundown on how some of this works, and all I can say at this point is that it's an organically designed system; it would be significantly difficult to design a replacement for it, and also significantly difficult to integrate with the existing system.

We are, for the MVP, planning a small targeted release to only a few WikiProjects. We are optimistic that such a small release will be amenable to moderation by users. We must integrate with the existing infrastructure before any wide release.

Problems which need to be addressed:
 * Flow wants to store the HTML+RDFa directly from VE, essentially as an optimization to avoid the parser and parser cache entirely. The current system for auto-magically detecting abuse works by regexing against wikitext. This basically means that regardless of what the user sends us in an edit, we will have to use Parsoid: either to convert the HTML to wikitext to send to the abuse filter, or to convert the wikitext to HTML for storage.
 * Bots are able to access Flow via its own API, which gives access equivalent to the UI. However, they will not be able to use the standard APIs for editing, moving, and protecting, because Flow is its own thing and does not use the MediaWiki page model.
 * Since Flow actions are sent to Recent Changes, integrating with IRC will work correctly.
 * What else do we need to say about bots?

A User's Board (future)
All flow objects will be subscribable by any number of users. Generating a user's feed is a sort of the subscribed flow objects, ordered by last-updated date. SQL is going to hate us; MySQL can't answer this with a single index.
 * Independent subscription table implementing a many-to-many relationship between flow objects and users

That's OK, because we plan to always read it from the cache. We will pre-build this cache as the objects are updated. This requires maintaining two sets of lists: a simple set, per object, of all users subscribed to that object, and a sorted set, per user, containing the object IDs they are subscribed to, ordered by last update time and capped to a reasonable size (100?). Redis sorted sets never contain duplicate IDs; adding the same value twice just updates its score (the timestamp). They sound like a great solution for this pre-caching (sketched below).
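A sketch of the write path using the phpredis client; the key names and the 100-entry cap are assumptions:

    <?php
    // Hypothetical pre-cache update when a flow object changes: fan out to
    // every subscriber's sorted-set feed, then trim each feed to ~100 entries.
    function onFlowObjectUpdated( Redis $redis, $objectId, $timestamp ) {
        // the per-object simple set of subscribed users
        $subscribers = $redis->sMembers( "flow:subs:$objectId" );
        foreach ( $subscribers as $userId ) {
            $feedKey = "flow:feed:$userId";
            // zAdd updates the score if the member already exists, so a
            // feed never contains duplicate object ids
            $redis->zAdd( $feedKey, $timestamp, $objectId );
            // drop everything except the 100 most recently updated objects
            $redis->zRemRangeByRank( $feedKey, 0, -101 );
        }
    }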

Related to the above, we must investigate the performance implications of a flow object that has 50k subscriptions. I don't see any particular performance issue in this case for the user with 50k subscriptions, although their usability may decline, as the proposed 100-item list will only cover a short period of time. I don't think you can reasonably expect usability from 50k subscriptions anyway, unless you get into fancy Facebook-like "does the user want to see this?" filtering, a path we probably don't want to go down.

Remembering what a user has already seen (future)
In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know which WorkflowModels the user has previously seen.
 * Remembering a boolean "seen" status for every model instance, multiplied by all editors, is a ton of data and seems sub-optimal.
 * We could instead remember the last date a user viewed items from a WorkflowObject within the subscription, and only display items newer than that (sketched below).
 * Each topic within a discussion is its own WorkflowObject, so the memory is more granular than just the main page. But is it enough?
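A minimal sketch of the last-viewed-date approach from the second bullet; the storage behind getLastSeen() is deliberately left open, and the function names are invented:

    <?php
    // Hypothetical filter: keep only items updated after the user's recorded
    // last view of this WorkflowObject. $itemTimestamps maps item id to the
    // item's last-updated timestamp.
    function filterUnseen( $userId, $workflowId, array $itemTimestamps ) {
        $lastSeen = getLastSeen( $userId, $workflowId ); // invented lookup
        return array_filter( $itemTimestamps, function ( $ts ) use ( $lastSeen ) {
            return $ts > $lastSeen;
        } );
    }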

Suppressed Revisions
Content within Flow must support user moderation with multiple levels of restriction. To accomplish this, the Flow MVP stores a moderation state field with each revision that identifies its current level of restriction. Typical accessor methods on the revisions accept a user object and return either the user-generated content or an "access denied" message, as appropriate to the restriction level.

While all revisions use the same abstract revision and the same revision storage, the actions performed when applying moderation vary between revision types.

Current moderation states: hide (available to everyone as a click-through), delete (available only to admins), suppress (available only to "suppressors")
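A sketch of such an accessor; the mapping from moderation state to user rights is a guess, not Flow's actual permission model:

    <?php
    // Hypothetical permission-aware content accessor, per the description
    // above. Assumes a MediaWiki User object.
    class ModeratedRevisionSketch {
        private $modState; // null, 'hide', 'delete' or 'suppress'
        private $content;

        public function getContent( User $user ) {
            if ( $this->modState === null ) {
                return $this->content; // unmoderated
            }
            $requiredRight = array(
                'hide' => null,                   // click-through; visible to all
                'delete' => 'deletedtext',        // guessed admin right
                'suppress' => 'suppressrevision', // guessed suppressor right
            );
            $right = $requiredRight[$this->modState];
            if ( $right === null || $user->isAllowed( $right ) ) {
                return $this->content;
            }
            return wfMessage( 'flow-moderated-content' )->text(); // invented message key
        }
    }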

Headers - WIP; the history page that will host the interface for moderating headers will be merged soon (read: not yet suppressible from the user's perspective, but the data model supports moderation)

Posts - Still refining; the current patch applies the newest moderation state to all historical revisions of the post. It then inserts a new revision, created by the moderating user, with an appropriate change type to track who moderated what, and when. The patch also writes a user name and ID to the historical revisions to satisfy the visual design, but this is not a canonical source; it only reflects the most recent moderator.

Search (future)
Handled client-side in JavaScript, in the backend, or both (probably both)?
 * SolrCloud is on the horizon; we will likely want to hook into that
 * Is it possible SolrCloud can handle our tagging searches as well? It sounds like exactly the kind of thing it's suited for
 * What elements does the UI call for in search results?

Tags (future)
Individual discussion topics must be taggable. How should the tagging implementation work?
 * Tags can be public or private
 * Should public and private tags be stored in the same table?
 * Conceptually it might be simpler to store them separately.
 * Storing them separately also provides a much stronger guarantee of not accidentally displaying private tags
 * I don't know much about SolrCloud yet, but it seems like it would be a strong contender for simplifying tagging

So you want to find a talk page
For the MVP, Flow is deploying to only a few WikiProjects. We explicitly do not want to allow Flow onto just any user-defined pages; we want to limit the release to very specific areas that have agreed to demo the software. To accomplish this, the Flow configuration includes a list of article titles it will be enabled on. The existing page must be moved (a subpage is suggested) before Flow takes over the article.

This is a targeted solution which specifically solves the needs of the MVP; it will need to change before we consider a wider release.

Mapping URLs to Flow workflows
There are two main types of workflow definitions: unique and everything else. A workflow of a unique definition type can be requested with /wiki/Wikipedia_Talk:Project_Volcanos?flow=discussion&action=new-topic


 * The flow query parameter refers to the name of a Workflow Definition. Every workflow definition must have a unique name (on a per-wiki basis). Flow will have a few Workflow Definitions pre-defined by default. One of these will be a 'discussion' workflow; if the flow query parameter is not provided, it defaults to 'discussion'.
 * The action query parameter is optional and defaults to 'view'

Non-unique workflows must be referenced by their full 128-bit UID: /wiki/Wikipedia_Talk:Project_Volcanos?action=topic-history&workflow=0506f5dab48e6e5b09a3fa163e68c4ac
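A sketch of how these parameters might be resolved; only the parameter semantics come from the examples above, and the helper functions are invented:

    <?php
    // Hypothetical request-to-workflow resolution using MediaWiki's
    // WebRequest. loadWorkflowById() and loadUniqueWorkflow() are stand-ins;
    // the 'action' parameter (default 'view') would select what to do with
    // the resolved workflow afterwards.
    function resolveWorkflow( WebRequest $request, Title $title ) {
        $workflowId = $request->getVal( 'workflow' );
        if ( $workflowId !== null ) {
            // non-unique workflows are addressed by their full 128-bit uid
            return loadWorkflowById( $title, FlowUUIDSketch::fromHex( $workflowId ) );
        }
        // unique workflows are addressed by definition name, defaulting to
        // the pre-defined 'discussion' definition
        $definitionName = $request->getVal( 'flow', 'discussion' );
        return loadUniqueWorkflow( $title, $definitionName );
    }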

Performance Considerations
How much data will Flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have a talk page, but many will. Assuming they all have talk pages, ranging from just a post or two up to a couple thousand posts on the largest, we should expect a lower bound of perhaps 100M individual replies stored in perhaps 20M separate discussion graphs (not today, or even within the first year, but an approximation of size within a few years). If each reply consumes 1kB of space, that puts a lower bound of at least 100GB of post content before we even get into metadata, indexes, etc.

And that is just the discussions; Flow will need to handle many more workflows than just discussions.

To help get an idea of the space required, I applied the EchoDiscussionParser to one of the enwiki database dumps (enwiki-latest-pages-meta-current10.xml-p000925001p001325000). Within this file it detected:


 * 43952 pages in either Talk: or *_Talk: namespaces
 * 211904 individual section headers
 * 514916 user signatures

That works out to an average of:


 * 5 sections per talk page
 * 2.5 signatures per section
 * 12 signatures per talk page

These pages were of course built up over time, but they give us a general idea of the size of the problem we need to handle.

It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Caching possibilities
See /Memcache

Reference Material

 * en.wiki info about Flow
 * Tumblr MySQL sharding presentation
 * Ruby scripts for handling the operations side of Tumblr sharding