Flow/Architecture/Discussion Storage

= Questions the data model must answer =

To decide on a model for storing graphs, we first need to know which questions must be answered:


 * User X posts a reply to FlowPostSingular 123. Who needs to be notified?
 ** All children of the same parent?
 ** All parents up to the top level?
 * User Z starts viewing a conversation starting at FlowPostSingular 444. What posts need to be loaded?
 ** All children of the same parent?
 ** Recursively select all children and children of children, up to a maximum of XXX items
 * Which items get pruned when limiting a fetch to XXX items?
 ** By depth from the original
 ** By timestamp
 *** Allowing a limit by timestamp means we will either need to include the last-updated timestamp in the node data or join against the revisions table (a bad idea, I think) to get the timestamp.
 ** Other prune options?
 * Given a single post id, which FlowObject (GUID) does it map to?
 ** We shouldn't need to keep a copy of the GUID in every node; it should be sufficient to link from the metadata to the root node. Determining which flow object a post maps to involves looking up the root post linked to that post and then finding the metadata with that root post.
 ** Not storing the GUID in every node should noticeably reduce the size of the required indexes. Smaller indexes are almost always a win.
 ** Some things won't naturally have a root post; for simplicity's sake I think we should use a placeholder root with no content in those cases.

Important properties that the final storage solution must have:
 * Must be able to remember which posts a user has seen, and filter the result based on that
 ** Possibly this will be better served outside the proposed storage interface. Potentially the interface just returns the structure of the discussion, without any of the discussion-specific information. Figuring out which parts of the tree need to be displayed is then pushed to a higher level.
 * Please expand as we know more

= Other Concerns =


 * Should we impose an artificial limit on the depth of a discussion tree?
 * Each topic within a discussion is its own FlowObject. Most likely the root node of the tree should be a pseudo-node that represents the topic itself.
 * Sort order for posts at the same depth?
 * How to handle topic splits?

= How deep is the tree? =

The depth of the discussion tree is important to consider, as the various solutions have different performance characteristics depending on how deep the tree is: some require multiple recursive queries, some store more data, etc. User interfaces intended to show similar things sometimes set a small limit, as display and user comprehension get difficult with unlimited nesting. Discussions within Flow are not directly comparable to other systems, but it is likely relevant to consider what other systems that allow hierarchical replies (as opposed to forums, which use quoting extensively) do.


 * The current UI prototype for Flow commonly shows posts 5 levels deep, and the Jimmy Wales page shows a few posts up to 10 levels deep.
 * Reddit allows unlimited nesting but only displays to a certain depth, hiding further responses behind either an AJAX call that fills in more posts or, for deeply nested posts, a forced load of a separate page.

Other considerations:
 * A brief discussion with design indicates they are open to allowing unlimited nesting.
 * If we enforce a maximum depth, users will likely compensate with more topic splits
 ** That might be preferred? Not sure.

= Possible Interface =

The goal here is to design an interface that can answer all our questions. If we decide we need to change our storage method mid-development, we should be able to change the implementation of this interface without changing the interface itself, which hopefully also means not changing the code that uses the interface. Sadly, that may be a pie-in-the-sky dream: different implementations will have different performance properties and might be better served by adjusting the calling code to take advantage of which operations are faster or slower.

This interface is just a quick guess at what operations will be required. It might be better to take an "as if" approach: write the front-end "as if" you already had a back-end implementation, in whichever way is most convenient. Then go back, look at what implementation you wish you had, and implement that.

 interface Something {
     // Attach the new node $descendant as a child of $ancestor. If $ancestor
     // is null then $descendant is a new root node.
     function insert( $descendant, $ancestor = null );
 
     // Given any node in storage, return its parent's id, or null if it is a
     // root element.
     function findParent( $descendant );
 
     // Given any node in storage, find the path from that node to the root of
     // the tree. The root of the tree must be the first element of the
     // resulting array, and $descendant must be the last.
     function findRootPath( $descendant );
 
     // Given any node in storage, find the root node of its tree. A very naive
     // implementation of this function would be:
     //     return reset( $this->findRootPath( $descendant ) );
     // But there are likely much more optimal implementations depending on the
     // storage method chosen.
     function findRoot( $descendant );
 
     // Given any node in storage, return a multi-dimensional php array
     // representing the node and all its descendants. If $maxDepth is non-null
     // then only fetch up to the provided depth.
     function findSubTree( $root, $maxDepth = null );
 }
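As a sanity check on the shape of this interface, here is a minimal in-memory sketch in Python. The method names mirror the PHP interface (in snake_case); the class name and the child-to-parent map representation are purely illustrative, not a proposed implementation:

```python
# Minimal in-memory sketch of the proposed interface, backed by a simple
# child -> parent map. Any real backend would replace this storage.
class InMemoryTreeRepo:
    def __init__(self):
        self.parent = {}  # node id -> parent id (None for roots)

    def insert(self, descendant, ancestor=None):
        # A null ancestor makes $descendant a new root node.
        self.parent[descendant] = ancestor

    def find_parent(self, descendant):
        return self.parent[descendant]

    def find_root_path(self, descendant):
        # Walk up to the root, then reverse so the root comes first
        # and descendant comes last, as the interface requires.
        path = [descendant]
        while self.parent[path[-1]] is not None:
            path.append(self.parent[path[-1]])
        return list(reversed(path))

    def find_root(self, descendant):
        # The naive implementation mentioned in the interface comment.
        return self.find_root_path(descendant)[0]

    def find_subtree(self, root, max_depth=None):
        # Nested dict {node: {child: {...}}} down to max_depth levels.
        if max_depth == 0:
            return {root: {}}
        next_depth = None if max_depth is None else max_depth - 1
        subtree = {}
        for child in (c for c, p in self.parent.items() if p == root):
            subtree.update(self.find_subtree(child, next_depth))
        return {root: subtree}
```

Even a toy like this makes the expensive operations visible: find_subtree scans every node per level here, which is exactly the cost the storage choice below needs to optimize away.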

= Possible Solutions =

Below are several possible solutions for storing hierarchical data within MySQL, with some benefits/downsides of each in relation to our query requirements. I am currently leaning towards the key/value approach with JSON stored inside the revision table, but more input from other developers is likely needed. Possibly we will want to throw together a simple implementation of the solutions we think have the most merit and stuff a test database with a few million posts to see what the indexes/queries/etc. look like when analyzed.

== Simple ==
Use an explicit flowPostSingularParentId or equivalent in the FlowPostSingular table.


 * Benefits:
 ** Simple. The average SQL user can immediately understand how it works.
 ** Encodes the requirement that each post have exactly one parent directly into the data model
 * Downsides:
 ** More complex to fetch parts of the graph without fetching all of them
 ** Requires either a recursive stored procedure or a join for every level of depth you want to query.
 ** The right covering indexes may provide a middle ground with regard to the memory/IO requirements of building the graphs
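To make the "join per level" cost concrete, here is a sketch using SQLite stand-in tables; the table and column names (flow_post, post_id, parent_id) are illustrative, not the actual Flow schema:

```python
# Sketch of the parent-id model: fetching each extra level of depth
# requires another self-join against the same table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flow_post (post_id INTEGER PRIMARY KEY, parent_id INTEGER)")
db.executemany(
    "INSERT INTO flow_post VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 3), (5, 1)],
)

# Two levels below post 1: one self-join. Three levels would need another
# join (or a recursive stored procedure), and so on for every level.
rows = db.execute("""
    SELECT l1.post_id, l2.post_id
    FROM flow_post l1
    LEFT JOIN flow_post l2 ON l2.parent_id = l1.post_id
    WHERE l1.parent_id = 1
""").fetchall()
```

The query grows linearly with the depth being fetched, which is exactly the "more complex to fetch parts of the graph" downside noted above.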

== Edge-adjacency lists ==
Encodes the relationship between parent and child in an independent table, e.g. a nodes table for the data and an edges table for the relationships.

 * Possibly allows a single post to have multiple parents and be displayed in multiple places (e.g. after a topic split)
 ** Alternatively, unique constraints can be used to enforce a single parent if desired.
 * Normalizing the edge relationships into a table that contains only edges, and no other unrelated information, can reduce the IO and memory requirements inside the DB server when computing the graph.
 * Best performance will be achieved by using stored procedures for the recursive queries, but stored procedures are very far from being compatible between databases. We would almost need to write stored procedures for every DB we want to support (just MySQL?) plus fallback SQL for alternative databases.
 ** I think we can get away with just solving for MySQL - bh
 * Probably many more

== Closure table ==
This is a denormalization of the edge-adjacency list. Where an edge-adjacency list stores individual edges as (parentId, childId), a closure table stores 3 values (parentId, descendantId, distance). Every element, when inserted, is linked to every node its parent is linked to, plus itself.


 * Benefits
 ** No recursive queries
 ** Queries are easily indexed
 ** Of the options presented this is by far the simplest one, but it trades storage space for that simplicity.
 *** It seems possible that the indexes become so large and spread out that the DB server always has to hit the disk. May not be a real concern.
 ** The SQL for the queries actually looks reasonable instead of massively complex.
 *** Selecting all descendants of a specific node only requires querying for parentId = XxXxX.
 **** Actually turning that into a graph requires locating the parent node for every node. Basically this means a query for descendantId IN (list, of, nodes, in, tree) AND distance = 1.
 *** Limiting to a specific depth is an additional condition on the distance field.
 *** Assuming an index on (parentId, distance), a breadth-first search for 100 nodes is a query with parentId = XxXxX ORDER BY distance LIMIT YyY.
 * Downsides
 ** Much larger table. Requires n*m space, with n = number of nodes and m = average depth, which works out to n^2 in the worst case. It is likely, though, that any discussion with large n will have a depth much less than n. Some special casing could also artificially limit the depth.
 ** Moving a sub-tree (for a topic split) is a pretty expensive operation
 ** Denormalization means the guarantee of always-correct data is thrown out the window; programming errors can cause data errors.
 *** A unique index on (descendantId, distance) will ensure we don't have multiple paths to a node. There are likely plenty of other possible data errors though.
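A small SQLite sketch of the closure-table mechanics described above; the table and column names are illustrative:

```python
# Closure-table sketch: on insert, a node is linked to itself (distance 0)
# plus every ancestor of its parent, each with distance bumped by one.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE closure (parent_id INTEGER, descendant_id INTEGER, distance INTEGER)"
)

def insert(db, node, parent=None):
    # Self-link at distance 0.
    db.execute("INSERT INTO closure VALUES (?, ?, 0)", (node, node))
    if parent is not None:
        # Copy every (ancestor -> parent) row, with distance + 1.
        db.execute("""
            INSERT INTO closure (parent_id, descendant_id, distance)
            SELECT parent_id, ?, distance + 1 FROM closure WHERE descendant_id = ?
        """, (node, parent))

insert(db, 1)
insert(db, 2, 1)
insert(db, 3, 2)

# All descendants of node 1 in one indexed query, no recursion.
desc = db.execute(
    "SELECT descendant_id, distance FROM closure WHERE parent_id = 1 ORDER BY distance"
).fetchall()
```

Note how a three-node chain already produces six closure rows: this is the n*m space cost traded for the recursion-free queries.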

== Materialized Path ==
Uses a single table with two fields, id and path. Rather than storing an adjacency list of edges, this encodes the simple path from a node to its root as a string.


 * Benefits
 ** Subtree queries are a text prefix match; very simple
 ** Subtree query depth can be limited by testing the string length of the path
 *** String length is not indexable in MySQL (I think)
 *** For this to work, all path nodes must be 0-prefixed in a future-proof manner
 ** Inserts are really simple: take the parent's path and append the parent's id to it.
 ** Locating the parent/grandparent/etc. is very simple


 * Downsides
 ** 0-prefixing the keys will likely mean 12 or 13 bytes per node in the path.
 *** Storing integer ids as a text string is very inefficient.
 *** Arrays as a type were defined in SQL:99; why can't MySQL support SQL:99? :-(
 *** With 12-digit ids, a post 3 levels down has a path size of 38 bytes (not the end of the world).
 ** Indexing path strings may produce a much larger index than indexing integers, although it is only one path string per node instead of many integers.
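A sketch of the path encoding, assuming the 12-digit zero-padded width estimated above; the helper names and separator choice are illustrative:

```python
# Materialized-path sketch: each node stores the zero-padded path of its
# ancestor ids, so a subtree fetch is a string prefix match and the depth
# is just the number of separators.
WIDTH = 12  # assumed id width, per the 12-digit estimate

def child_path(parent_path, parent_id):
    # Zero-padding keeps every segment the same width, so string prefix
    # comparisons remain correct as ids grow.
    return parent_path + str(parent_id).zfill(WIDTH) + "/"

def depth(path):
    return path.count("/")

paths = {
    101: "",                                    # root: empty path
    102: child_path("", 101),
    103: child_path(child_path("", 101), 102),
}

def subtree_ids(paths, root_id):
    # All descendants of root_id share the prefix "<root path><root id>/".
    # In SQL this would be a LIKE 'prefix%' query against an indexed column.
    prefix = child_path(paths[root_id], root_id)
    return sorted(n for n, p in paths.items() if p.startswith(prefix))
```

The prefix match maps directly onto a left-anchored LIKE query, which MySQL can serve from an index on the path column.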

== Nested Set ==

 * Benefits
 ** Locating a sub-tree is dead simple.
 * Downsides
 ** Inserts are complex. In the worst case you update every node in the tree.
 ** Selecting a subtree is easy, but selecting only a specific level or levels of that subtree is not. You have to select the whole sub-tree down to the bottom.
 ** Not yet sure how to efficiently locate the parent/grandparent/etc. of a node with nested sets.
 ** To store multiple trees in the same table, every node needs a tree identifier in addition to the left/right numbers
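For reference, a sketch of how nested-set queries work. The (left, right) numbers below are assigned by hand for illustration; one common way to locate the parent (not necessarily efficient in SQL) is the tightest enclosing interval:

```python
# Nested-set sketch: each node gets (left, right) numbers from a depth-first
# walk; a node's subtree is every node whose interval falls strictly inside
# its own.
#
#   1 (1, 8)
#   |- 2 (2, 5)
#   |   \- 3 (3, 4)
#   \- 4 (6, 7)
nodes = {1: (1, 8), 2: (2, 5), 3: (3, 4), 4: (6, 7)}

def subtree(nodes, root):
    # Dead simple: one range condition, no recursion.
    lft, rgt = nodes[root]
    return sorted(n for n, (l, r) in nodes.items() if lft < l and r < rgt)

def parent(nodes, child):
    # Ancestors are all enclosing intervals; the parent is the tightest one,
    # i.e. the ancestor with the largest left value.
    lft, rgt = nodes[child]
    ancestors = [(l, n) for n, (l, r) in nodes.items() if l < lft and rgt < r]
    return max(ancestors)[1] if ancestors else None
```

The subtree query is a single range scan, while the parent lookup has to consider every enclosing interval, which illustrates why parent location is the awkward operation here.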

== Use a graph database ==

 * I don't know anything about graph databases.
 * Has anyone on the team used graph databases?
 * Ops hasn't supported any graph databases in the past.

== Key/Value store ==
Completely denormalize into a JSON/XML/something structure that is unqueryable from SQL, and use the existing Revisions table as a key/value store.
 * "Simplest thing that could possibly work"
 * This is basically what wikidata does(store JSON in revisions and push all the work into the application layer)
 * Requires creating fake pages for everything in flow.
 * Must prevent anyone from directly editing these pages
 * Can't re-use the existing parser cache because it works at revision level but we would be storing JSON/XML at that level, with the wikitext nested inside it.
 * Could store both the wiki text and the rendered html directly in the data structure
 * The plan for Flow is to use the VE and parsoid from the get-go, perhaps can store the parsoid html directly(would need to know more about parsoid to make a decision here).
 * Updating any particular graph would require row level locking between fetching the tree and pushing back the updated tree.
 * The revision table already has code for this use case (preventing multiple people from editing the same document)
 * Flow would need to extend that though, because it would be rather inconvenient to tell someone to resubmit their comment, because somewhere else in the page someone else inserted a comment.
 * Avoids the mismatch between SQL's set based approach and a tree's graph based approach.
 * Doing operations on a tree of any reasonable size will possibly be quite slow from an interpreted language
 * XML is already a tree, if we store the tree structure in XML then the C libraries for xpath/etc can do some of that work for you
 * Anything that requires escaping (xml/json) in the content and full parsing of bulk strings is inefficient to parse
 * With a length-prefixed format it is just a matter of seeking around
 * As long as things easily map into JSON we should be good.
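A sketch of what the key/value approach looks like in practice: the whole reply tree for a topic is serialized as one JSON blob, and all traversal happens in the application layer. The field names ("id", "replies") are illustrative, not a proposed schema:

```python
# Key/value sketch: one topic = one JSON blob in the revision store;
# the application does every tree operation itself.
import json

topic = {
    "id": 1,
    "replies": [
        {"id": 2, "replies": [{"id": 3, "replies": []}]},
        {"id": 4, "replies": []},
    ],
}

blob = json.dumps(topic)  # what would be stored as a single revision row

def walk(node, depth=0, max_depth=None):
    # Depth-limited flatten of the tree, entirely in the application layer;
    # there is no SQL equivalent of this against the blob.
    yield node["id"], depth
    if max_depth is None or depth < max_depth:
        for child in node["replies"]:
            yield from walk(child, depth + 1, max_depth)

tree = json.loads(blob)
```

Reads and writes are a single fetch/store of the blob, which is the appeal; the cost is that even a depth-limited render requires deserializing and walking the whole structure in PHP (or whatever the application language is).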

= Content Storage =

== What is stored in the content? ==
Flow will be using the visual editor from the start rather than wikitext. Using the visual editor should allow us to store the marked-up HTML + RDFa and skip the parsing step completely. The Parsoid team already has a bug filed regarding HTML + RDFa storage. It's a low-priority bug which they don't expect to start working on until July.

== Where is the content stored? ==
The actual content of the discussion has to be stored somewhere. Of the various options above, only one, the key/value store, lends itself to storing the content in the same record as the reply tree. The other tree-storage options would need a separate solution. LiquidThreads does something very similar to the current text storage of the Revisions table.

There are scalability issues related to the current Revision store, although I'm not sure what they are. For review, here is how revisions store their content:
 * The revision table holds a foreign key to the text table
 * In production, ExternalStore is used, with the DB backend
 ** It saves a URL to the text table, which is then fetched through the external store
 ** It randomly distributes new blob content if multiple DB clusters are provided
 *** Currently the mediawiki-config repo shows 2 clusters configured in $wgDefaultExternalStore for eqiad, which is used when inserting new Revision content.
 *** Inspecting the revision history: were these switched out of the write pool but still serving reads?
 **** Sep 2012 - switched from cluster23 to cluster24 and cluster25
 **** May 2012 - switched from cluster22 to cluster23
 **** cluster22 had been running as blob storage since at least Feb 2012

This seems to be a reasonably scalable system, but asher will have more relevant information. A few possible issues:
 * If each post is stored as its own blob, rendering a full topic or discussion will pull from all possible clusters. Async requests are difficult in PHP (mysqlnd has support), so they would likely be done serially, which is not acceptable.
 * If the entire topic is stored as a single blob (JSON/BSON/XML/etc.), a noticeable amount of duplicate content will be stored as the topic grows. If the history of servers in the write pool is any indication, it would be best not to fill them with duplicate content.
 * It is not easily possible with this system to split a database server into two; the modus operandi appears to be to move servers in/out of the list of servers accepting new content as they fill up, and archive the older data as possible.

The next step would likely be a sharded solution. The sharded solution must work for two primary request patterns:

 * Individual flow objects will be related to other flow objects about the same article.
 * Individual user flows will be composed of flow objects from a wide variety of articles

These requirements naturally oppose each other when choosing a sharding key, but we must choose the lesser of two evils.
 * The wiki+ns+title combination offers a natural key that should always be available in a request to hash into a logical shard
 ** This keeps related FlowObjects on the same database server, preventing the need for fan-out queries when rendering a given FlowObject.
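The wiki+ns+title sharding idea can be sketched as a simple deterministic hash; the key layout, delimiter, and shard count here are illustrative assumptions:

```python
# Sketch of hashing the wiki+ns+title natural key to a logical shard.
# Any request that knows the page can compute its shard locally, with no
# lookup table, and all FlowObjects for one page land on the same shard.
import hashlib

def shard_for(wiki, ns, title, num_shards=16):
    key = "%s:%d:%s" % (wiki, ns, title)
    # A stable hash (not PHP/Python's per-process hash) so every server
    # computes the same shard for the same page.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

The trade-off described above falls out directly: everything for one article is co-located, but a user's cross-article flow hashes to many different shards.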

User subscriptions to flows present an interesting challenge in this respect. A user's flow is composed of many different articles and may necessitate visiting multiple MySQL servers.
 * A user's flow will primarily be composed of recent items (are we sure?).
 * It may be plausible to keep view fragments of the recent items in memcached
 ** How do we measure and/or estimate the memcached requirements for this?
 ** A single multiGet operation can collect everything we need (in the best-case scenario).
 ** In the worst case we may need to connect to a variety of servers, which will be slow
 *** Can we investigate mysqlnd with async queries for this specialized use case?

= See Also =
 * A presentation from Percona about hierarchical data in MySQL
 * An excellent article on nested sets and adjacency-list storage within MySQL
 * Parsoid HTML + RDFa storage bug