Flow/Architecture

Purpose of this Document
This document is an early brainstorming session for the first Flow prototype. The final implementation will likely bear only a passing resemblance to what is described here. In other words, this is a pre-implementation brain dump to be referenced while building the first prototype. In my experience, once the ideas for the internal architecture are laid out, the first implementation doesn't take long to convert into code.

Big Ideas
Flow is about workflow management. A "discussion" is a type of workflow - a very simple one.

Templates are bad, mmkay
In many cases templates are used to encourage workflow within local wikis. The goal for the various workflow models is to be dynamic enough to be managed from the wiki (by local administrators) to cover use cases currently handled by workflow suggestions inside templates.

WorkflowObject
A WorkflowObject (also called a FlowObject below) is an individual instance of a workflow, identified by a globally unique identifier. A discussion topic is a FlowObject instance. A request for deletion of page Abc is a FlowObject instance, etc.
 * GUID
 * Home wiki ?
 * Only needed if we decide to have a single combined flow that supports multiple wikis.
 * May not be needed even then.
 * Home page
 * Essentially where is this flow accessed from/was created
 * createDate
 * lastUpdateTime
 * creatorID
 * title/name
 * Possibly tied into wikidata for automatic localization?
 * summaryText
 * lockState
 * Enumeration, not boolean. Allows for multiple states beyond locked/unlocked.
 * workflowModel
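
The field list above could be sketched as a plain record type. This is only an illustration: the Python names, the `LockState` members and the sample values are assumptions, not decided parts of the design.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class LockState(Enum):
    # Hypothetical states; the notes only ask for "more than locked/unlocked".
    OPEN = "open"
    LOCKED = "locked"
    ARCHIVED = "archived"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class WorkflowObject:
    home_page: str                    # where the flow is accessed from / was created
    creator_id: int
    title: str
    workflow_model: str               # selects the metadata/model type, e.g. "discussion"
    summary_text: str = ""
    home_wiki: Optional[str] = None   # only needed for a combined multi-wiki Flow
    lock_state: LockState = LockState.OPEN
    guid: uuid.UUID = field(default_factory=uuid.uuid4)
    create_date: datetime = field(default_factory=_now)
    last_update_time: datetime = field(default_factory=_now)


# Made-up example: a discussion topic created from a talk page.
topic = WorkflowObject(home_page="Talk:Abc", creator_id=42,
                       title="Rename this page?", workflow_model="discussion")
```

Generating the GUID in the application (here with `uuid.uuid4`) is what removes the dependence on a central id sequence.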

Considerations
For example, if an open (based on lockState?) request for deletion already exists for a specific article, a new one cannot be opened. If a closed RFD exists for a page, it should possibly be linked to a newly opened RFD for context.
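
A minimal sketch of that rule, assuming an in-memory registry keyed by page title. `Rfd`, `RfdRegistry` and `open_rfd` are hypothetical names used only for illustration; a real implementation would enforce this in the database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rfd:
    page: str
    is_open: bool = True
    related: List["Rfd"] = field(default_factory=list)  # earlier, closed RFDs


class RfdRegistry:
    def __init__(self) -> None:
        self._by_page: Dict[str, List[Rfd]] = {}

    def open_rfd(self, page: str) -> Rfd:
        existing = self._by_page.setdefault(page, [])
        # Rule 1: at most one open RFD per page.
        if any(r.is_open for r in existing):
            raise ValueError(f"an open RFD already exists for {page!r}")
        # Rule 2: link previously closed RFDs to the new one for context.
        new = Rfd(page=page, related=list(existing))
        existing.append(new)
        return new
```

Usage: opening a second RFD for "Abc" while the first is still open raises `ValueError`; once the first is closed, the new RFD carries it in `related`.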

Why GUID?

 * First step to enabling horizontal scalability by removing dependence on a central sequence to give out id numbers
 * Assuming a somewhat even distribution of GUIDs, we can bucket and horizontally shard by GUID (more on this towards the end of the document)
 * The flow object and flow metadata will be used more like a key/value store mapping from GUID to the appropriate hash of information. The relational aspect is limited to the WorkflowModels
 * Not decided yet, could certainly use standard auto-increment ids

WorkflowMetadata
Each flow object maps to some type of workflow metadata. The metadata is specific to the type of workflow that is being worked with. The specific type of metadata to expect (and therefore the type of Metadata and Model objects to use) is defined by the workflowModel field of the FlowObject.

The various models need to be programmable such that local wikis can use them as they need and not be locked into pre-programmed ideas. Templates currently allow a great deal of flexibility with no automated enforcement. These models must represent a middle ground between template flexibility and pre-programmed strict workflows.


 * 1 to 1 mapping between FlowObject and the Workflow Metadata
 * Use same GUID for both?
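
One way the workflowModel field could select the metadata type is a small registry. This is a sketch under assumptions: the model names ("discussion", "request-for-deletion"), the `register` decorator and `metadata_for` are invented for illustration, and sharing the GUID between the two objects is still an open question above.

```python
from dataclasses import dataclass
from typing import Dict

# workflowModel value -> metadata class (hypothetical registry)
METADATA_TYPES: Dict[str, type] = {}


def register(model_name: str):
    """Class decorator that records which class handles a workflowModel."""
    def wrap(cls):
        METADATA_TYPES[model_name] = cls
        return cls
    return wrap


@register("discussion")
@dataclass
class FlowDiscussion:
    flow_guid: str          # same GUID as the FlowObject (1 to 1 mapping)


@register("request-for-deletion")
@dataclass
class FlowRequestForDeletion:
    flow_guid: str
    reason: str = ""


def metadata_for(workflow_model: str, flow_guid: str, **fields):
    """Build the metadata object matching a FlowObject's workflowModel."""
    try:
        cls = METADATA_TYPES[workflow_model]
    except KeyError:
        raise ValueError(f"unknown workflowModel {workflow_model!r}") from None
    return cls(flow_guid=flow_guid, **fields)
```

A registry like this only covers pre-programmed models; wiki-managed workflows would need the registry entries themselves to be data rather than classes.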

Possible workflows to support include but are not limited to:
 * 2 way user conversation (user talk page owner<->talker)
 * 3 way user conversation (talk page any<->any)
 * Request for deletion
 * Request for adminship
 * General consensus discussions
 * Village Pump, Forum, etc.
 * AN/I
 * Help desk
 * Barnstars/Wikilove (and other templates of this variety)
 * Block Notices (you've been blocked, click this button to appeal)
 * moar

FlowDiscussion

 * 1 to Many: FlowPostSingular

FlowRequestForDeletion

 * Reason for request
 * 1 to Many: FlowEnumeratedLines

FlowBlockNotice

 * Block Reason
 * Functional Elements
 * Button for 'appeal this block'
 * Completely dynamic and described by a 'Workflow Description Language' but not a part of the initial prototype implementation.
 * 1 to Many: FlowPostSingular

WorkflowModel
The models represent an action performed by a single user. This could be a reply to a message or a vote in a consensus discussion, etc.

There will typically be a 1 to Many relationship between a WorkflowObject and WorkflowSupportObjects. Support objects model actions by individual users like discussing a topic, voting on an RFD, etc.

FlowEnumeratedLines

 * flowGUID
 * Enum value (e.g. vote yes/no, etc.)
 * text
 * comment

FlowPostSingular

 * flowGUID
 * createdByUserId
 * replyToFlowPostSingularId?
 * revision
 * content
 * summary
 * etc.

FlowPostSingular is really a DAG (directed acyclic graph). Possible methods of storing graphs in SQL:


 * Naive. Use an explicit flowPostSingularParentId or equivalent in the FlowPostSingular table.
 * Benefits:
 * Simple. The average SQL user can immediately understand how it works.
 * Encodes the requirement that each post have exactly one parent directly into the data model
 * Downsides
 * More complex to fetch parts of the graph without fetching all of them
 * The right covering indexes may provide a middle ground between
 * Edge-adjacency lists. Encodes the relationship between parent/child in an independent table. E.g. a nodes table for data and an edges table for relationships
 * Possibly allows for a single post to have multiple parents and be displayed in multiple places (e.g. after a topic split)
 * Alternatively, can use unique constraints to enforce a single parent if desired.
 * Normalizing the edge relationships into a table that only contains edges and no other unrelated information can reduce IO and memory requirements inside the DB server when computing the graph.
 * Best performance will be achieved using stored procedures for the recursive queries, but stored procedures are very very far from being compatible between databases. Would almost need to write stored procedures for every db we want to support (just mysql?) and fallback SQL for alternative databases.
 * I think we can get away with just solving for MySQL - bh
 * Probably many more
 * Probably many more
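
The edge-table option can be sketched with SQLite standing in for the real database. The table and column names are assumptions, and the recursive common table expression is used here only because SQLite supports it; it is not a claim about what the target MySQL deployment can run.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
    -- Edge table: one row per parent/child link and nothing else.
    -- UNIQUE on child_id enforces a single parent; drop it to let a
    -- post appear under several parents (e.g. after a topic split).
    CREATE TABLE post_edge (
        parent_id INTEGER NOT NULL REFERENCES post(id),
        child_id  INTEGER NOT NULL UNIQUE REFERENCES post(id)
    );
    CREATE INDEX post_edge_parent ON post_edge (parent_id, child_id);
""")
db.executemany("INSERT INTO post VALUES (?, ?)",
               [(1, "topic"), (2, "reply"), (3, "reply to reply"), (4, "other reply")])
db.executemany("INSERT INTO post_edge VALUES (?, ?)", [(1, 2), (2, 3), (1, 4)])


def descendants(root_id):
    """Return (post id, depth) for every post below root_id."""
    rows = db.execute("""
        WITH RECURSIVE tree(id, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT e.child_id, tree.depth + 1
            FROM post_edge e JOIN tree ON e.parent_id = tree.id
        )
        SELECT id, depth FROM tree WHERE depth > 0 ORDER BY depth, id
    """, (root_id,))
    return rows.fetchall()
```

The recursion only touches `post_edge`, which is the point made above about keeping unrelated columns out of the table used to compute the graph.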

To really decide on a model for storing graphs, we need to first know what questions must be answered:


 * User X posts a reply to FlowPostSingular 123. Who needs to be notified?
 * All children of the same parent?
 * All parents up to the top level?
 * User Z starts viewing a conversation starting at FlowPostSingular 444. What posts need to be loaded?
 * All children of the same parent?
 * Recursively select all children and children of children up to max XXX items
 * Which items get pruned when limiting fetch to XXX items?
 * By depth from original
 * By timestamp
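
Two of these questions can be made concrete with a small in-memory sketch. The post ids, authors and function names are made up for illustration, and the choices shown (notify every ancestor's author, prune by depth) are just one possible answer each.

```python
from collections import deque
from typing import Dict, List, Optional

# post id -> parent id (None for a top-level post); sample data is invented
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 1, 5: 4}
AUTHOR: Dict[int, str] = {1: "A", 2: "B", 3: "C", 4: "D", 5: "B"}


def children_of(post_id: int) -> List[int]:
    return sorted(c for c, p in PARENT.items() if p == post_id)


def authors_up_to_root(post_id: int) -> List[str]:
    """'All parents up to the top level': authors of the post being
    replied to and of each of its ancestors, without duplicates."""
    seen: List[str] = []
    current: Optional[int] = post_id
    while current is not None:
        if AUTHOR[current] not in seen:
            seen.append(AUTHOR[current])
        current = PARENT[current]
    return seen


def load_subtree(start_id: int, max_items: int) -> List[int]:
    """Breadth-first walk from start_id, so that when the limit is hit
    it is the deepest posts that get pruned."""
    loaded: List[int] = []
    queue = deque([start_id])
    while queue and len(loaded) < max_items:
        post_id = queue.popleft()
        loaded.append(post_id)
        queue.extend(children_of(post_id))
    return loaded
```

Pruning by timestamp instead would mean ordering candidates by post time rather than by queue position, which changes which index the storage layer needs.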

Where does a post's content actually get stored?
FlowPostSingular should probably just be metadata about the post. The literal post content should likely be stored in either wiki markup or the parsoid DOM representation of the wiki markup. It should be stored in a separate table perhaps comparable to revisions?


 * In LiquidThreads the posts are stored as literal wiki pages within the main wiki master database
 * This is not scalable. Loading more and more things onto a singular master database just spells doom and gloom for everyone.
 * Horribly slow. Pages are parsed on the way out every time


 * Duplicate the wiki page table structure into an independent master
 * Likely impossible from a code perspective. The literal names of tables are referenced, and the choice of which database connection to use is made, hundreds of times throughout the code base; neither can easily be swapped around depending on where we want to load pages from


 * Storing post data as pre-parsed HTML may be faster to assemble on the fly

A User's Flow (or feed of interesting things)
All flow objects will be subscribe-able by any number of users. Generating a user's flow is a sort on the subscribed flow objects, ordered by last-updated date.
 * SQL is going to hate you; MySQL can't answer this with a single index
 * Could flip the logic around and have background tasks that locate all interested users and push the newest 'interesting item' into their queue. Without any history limitations this could result in a ridiculous explosion of row counts though. This 'flipped logic' is, from 10 miles up, essentially the same as what Echo does when generating notifications.
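
The "flipped logic" might look roughly like the following. Everything here is a sketch: the in-memory dictionaries stand in for real tables, `MAX_FEED_ITEMS` is a hypothetical history limit, and the function names are invented.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set

MAX_FEED_ITEMS = 3  # hypothetical history limit; keeps per-user rows bounded

subscribers: Dict[str, Set[str]] = defaultdict(set)   # flow GUID -> user names
feeds: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=MAX_FEED_ITEMS))


def subscribe(user: str, flow_guid: str) -> None:
    subscribers[flow_guid].add(user)


def on_flow_updated(flow_guid: str) -> None:
    """Would run as a background task: push the update to every subscriber."""
    for user in subscribers[flow_guid]:
        feed = feeds[user]
        if flow_guid in feed:
            feed.remove(flow_guid)    # move to the front instead of duplicating
        feed.appendleft(flow_guid)    # a full deque drops its oldest entry


def user_flow(user: str) -> List[str]:
    # Already ordered newest first, so no sort is needed at read time.
    return list(feeds[user])
```

The write cost grows with the number of subscribers per flow, which is the row-count explosion mentioned above; the cap only bounds how much each user retains.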

Suppressed Revisions
Need comparable functionality
 * Could use real wiki pages ala LiquidThreads
 * Could implement a 'work alike' along with some sort of interface to generalize the current suppressed revision code (likely rather difficult)

Possible caching ideas / issues to keep in mind

 * In an ideal world can fetch the rendered HTML fragments without having to ask the main db about the actual content
 * Can store the adjacency list for each top level post (FlowPostSingular with no parent id) in memcache to prevent the recursive query inside db
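
A sketch of the second idea, using a plain dictionary as a stand-in for memcache. The key format, the class name and the `load_from_db` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List

AdjacencyList = Dict[int, List[int]]   # parent post id -> child post ids


class TopicTreeCache:
    """Caches one adjacency list per top-level post."""

    def __init__(self, load_from_db: Callable[[int], AdjacencyList]) -> None:
        self._load_from_db = load_from_db             # the expensive recursive query
        self._store: Dict[str, AdjacencyList] = {}    # stand-in for memcache

    @staticmethod
    def _key(root_post_id: int) -> str:
        return f"flow:tree:{root_post_id}"

    def get(self, root_post_id: int) -> AdjacencyList:
        key = self._key(root_post_id)
        if key not in self._store:
            self._store[key] = self._load_from_db(root_post_id)
        return self._store[key]

    def invalidate(self, root_post_id: int) -> None:
        # Call whenever a reply is added anywhere under this top-level post.
        self._store.pop(self._key(root_post_id), None)
```

Keying by top-level post means one reply invalidates one entry, at the cost of needing to know a reply's root before writing it.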

So you want to find a user talk page
Currently you take the title of the page and look it up within the NS_TALK namespace. With Flow we need something different, but conceptually similar?

Mapping URLs to Flow objects
???

Horizontally sharding the master database
With GUIDs we could, for example, divide the GUID key space into some future-proof number of buckets (4096?). At first all buckets could be assigned to a single master db, but as things grow buckets could be distributed across multiple masters.
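
The bucket scheme could be sketched like this. The master names and helper functions are hypothetical, and taking the GUID modulo the bucket count assumes randomly generated (version 4) GUIDs so that the low bits are evenly distributed.

```python
import uuid

NUM_BUCKETS = 4096  # the figure floated above

# bucket number -> master db; initially every bucket lives on one master
bucket_to_master = {b: "flow-master-1" for b in range(NUM_BUCKETS)}


def bucket_for(guid: uuid.UUID) -> int:
    """Map a GUID to a fixed bucket; this mapping never changes."""
    return guid.int % NUM_BUCKETS


def master_for(guid: uuid.UUID) -> str:
    return bucket_to_master[bucket_for(guid)]


def move_buckets(buckets, new_master: str) -> None:
    """Reassign whole buckets as things grow (the data copy is not shown)."""
    for b in buckets:
        bucket_to_master[b] = new_master
```

Because GUIDs map to buckets rather than directly to masters, adding a master only means reassigning some buckets; no GUID ever changes bucket.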

If we horizontally shard by GUID, a user's flow (timeline essentially?) cannot be easily built without querying all possible masters. Long-tail distribution basically says this will always be slow, especially since PHP does not support async queries.
 * Would likely need to pre-build the timelines as updates happen (like Twitter?) rather than running a join query between the user's subscribed flows and the active flows, sorted by each flow's most recent update.

Caching possibilities
Too early to really know anything about what this will look like. Profile, Review, Cache as required.

Crazy Idea
One Flow instance, many wikis. What if Flow were an independent wiki used as a service by other wikis? This is mostly brainstorming, and probably too complex to tackle effectively. Ideas here could possibly propagate into the main implementation ideas if viable.

Benefits:
 * The same flow can be referenced from any wiki. For example, commons and en.wiki
 * The database can be independent from any specific wiki
 * All wikis share the benefit of the ops team's work to provide sharded database support.
 * It could (possibly) replace the mapping of url -> page giving flow full control of its URL structure.

Downsides:
 * cross-site javascript requests?
 * Difficulty for the project to be used in general mediawiki installs outside wikimedia?
 * Many, many more that I probably don't know yet

Inconsequential Things to Consider (or not)

 * use php namespaces?
 * one class per file?

Reference Material

 * en.wiki info about flow
 * Flow Portal
 * Experienced User Responses