Flow/Architecture/Discussion Storage

= Questions the data model must answer =

To really decide on a model for storing graphs, we need to first know what questions must be answered:


 * User X posts a reply to FlowPostSingular 123. Who needs to be notified?
 * All children of the same parent?
 * All parents up to the top level?
 * User Z starts viewing a conversation starting at FlowPostSingular 444. What posts need to be loaded?
 * All children of the same parent?
 * Recursively select all children and children of children up to max XXX items
 * Which items get pruned when limiting fetch to XXX items?
 * By depth from original
 * By timestamp
 * Allowing limit by timestamp means we will either need to include the last updated timestamp in the node data or join against the revisions table(bad idea i think) to get the timestamp.
 * other prune options???
 * Given a single post id, what FlowObject(GUID) does it map to ?
 * We shouldn't need to keep a copy of the GUID in every node, it should be sufficient to link from the metadata to the root node. Determining what flow object a post maps to involves looking up the root post linked to that post and then finding the metadata with that root post.
 * Not storing the GUID in every node should noticeably reduce the size of indexes required. Smaller indexes are almost always a win.
 * Some things won't naturally have a root post, for simplicity sake i think we should use a placeholder root with no content in those cases.

Important properties that the final storage solution must have:
 * Must be able to remember which posts a user has seen, and filter the result based on that
 * Possibly this will be better served happening outside the proposed storage interface. Potentially the interface is just returning the structure of the discussion, but without any of the discussion specific information.  Figuring out which parts of the tree need to be displayed is then pushed to a higher level.
 * please expand as we know more

=Other Concerns=


 * Should we impose an artificial limit on the depth of a discussion tree?
 * Are the root nodes of the tree the individual topics, or are those topics the second level with the first level being a pseudo-node representing the specific FlowObject?
 * The pseudo-node makes linking from the WorkflowObject to all its topics fairly simple
 * If the topics aren't linked inside the tree then a separate join table must be used to hold the One to Many relationship between a WorkflowObject and its topics
 * Alternatively, each topic could be a FlowObject?
 * Would topics benefit from having their own metadata?
 * Leaning towards each topic being its own WorkflowObject.
 * Sort order for posts at the same depth?
 * How to handle topic splits?

= How deep is the tree?=

The depth of the discussion tree is important to consider, as the various solutions have different performance characteristics depending on how deep the tree is. Some require multiple recursive queries, some store more data, etc. User interfaces intended to show similar things sometimes set a small limit as display and user comprehension gets tough with unlimited nesting. Discussions within flow are not directly comparable to other systems, but it is likely relevant to consider what other systems which allow hierarchical replies(as opposed to forums which use quoting extensively) do.


 * The current UI prototype for flow commonly shows posts 5 levels deep and the jimmy wales page shows a few posts up to 10 levels deep.
 * Reddit does unlimited nesting, but only displays a certain depth hiding further responses behind either an ajax call to fill in more posts, or for deeply nested posts forces the user to load a separate page.

Other considerations
 * A brief discussion with design indicates they are open to allowing unlimited nesting.
 * If we enforce a maximum depth users will likely compensate with more topic splits
 * That might be prefered? not sure

= Possible Interface =

The goal here is to design an interface that can answer all our questions. If we decide we need to change our storage methods mid-development we should be able to change the implementation of this interface without changing the interface itself, which hopefully also means not changing the code that uses the interface. Sadly that is possibly a pie in the sky type dream, different implementations will have different performance properties and might be better served by adjusting the calling code to take advantage of which operations are faster/slower.

This interface is just a quick guess at what operations will be required. Might be better to take an "as if" approach. That is, write the front-end "as if" you already had a back-end implementation. Write the code whichever way is the most convenient. Then go back and look at what implementation you wish you had, and implement that.

interface Something { // Attach the new node $descendant as a child of $ancestor. If $ancestor is null then $descendant // is a new root node. function insert( $descendant, $ancestor = null ); // Given any node in storage, return its parents id or null if it is a root element function findParent( $descendant ); // Given any node in storage, find the path from that node // to the root of the tree. The root of the tree must be the first // element of the resulting array, and $descendant must be the last. function findRootPath( $descendant ); // Given any node in storage, find the root node of its tree. // A very naive implementation of this function would be: //    return reset( $this->findRootPath( $descendant ) ); // But there are likely much more optimal implementations depending on the storage method chosen. function findRoot( $descendant ); // Given any node in storage, return a multi-dimensional php array representing // the node and all its descendant's. If $maxDepth is non null then only fetch // up to the provided depth. function findSubTree( $root, $maxDepth = null ); }

= Possible Solutions =

Below are several possible solutions for storing hierarchical data within mysql and some benefits/downsides to each in relation to our query requirements. I am currently leaning towards either Closure Tables or Materialized Path, but more input from other developers is likely needed. Both of these solutions denormalize the data a bit so that queries can be performed without needing as many complex join operations. Because they are denormalized they take more storage space and by proxy cant keep as many posts in memory. Possibly we will want to throw together a simple implementation for the solutions we think have the most merit and stuff a test database with a few million posts to see what the indexes/queries/etc look like when analyzed.

Naive
Use an explicit flowPostSingularParentId or equivalent in the FlowPostSingular table.


 * Benefits:
 * Simple. Average SQL user can immediatly understand how it works.
 * Encodes the requirement that each post have exactly one parent directly into the data model
 * Downsides
 * More complex to fetch parts of the graph without fetching all of them
 * Requires either a recursive stored procedure or a join for every level of depth you want to query.
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs

Edge-adjacency lists
Encodes the relationship between parent/child in an independent table. E.x. nodes table for data and edges table for relationships

(e.g. after a topic split )
 * Possibly allows for a single post to have multiple parents and be display in multiple places
 * Alternatively can use unique constraints to enforce single parent if desired.
 * Normalizing the edge relationships into a table that only contains edges and no other unrelated information can reduce IO and memory requirements inside the DB server when computing the graph.
 * Best performance will be achieved using stored procedures for the recursive queries, but stored procedures are very very far from being compatible between databases. Would almost need to write stored procedures for every db we want to support (just mysql?) and fallback SQL for alternative databases.
 * I think we can get away with just solving for MySQL - bh
 * Probably many more

Closure table
This is a denormalization of the edge-adjacency list. Where an edge-adjacency list stores individual edges as (parentId, childId), a Closure Table contains 3 values (parentId, descendentId, distance). Every element when inserted is linked to every node its parent is linked to, plus itself.


 * Benefits
 * No recursive queries
 * Queries are easily indexed
 * Of the options presented this is by far the simplest one but it trades off storage space for that simplicity.
 * Seems possible that the indexes are so large and spread out that the db server always has to hit the disk. May not be a real concern.
 * The sql to do the queries actually looks reasonable instead of massively complex.
 * Selecting all descendents of a specific node only requires querying for parentId = XxXxX.
 * Actually turning that into a graph requires locating the parent node for all nodes. Basically this means a query for id IN (list, of, nodes, in, tree) AND depth=1
 * Limiting to a specific depth is an additional condition against the distance field.
 * Assuming you had an index of (parentId, distance) a breadth first search for 100 nodes is a query with parentId=XxXxX order by depth limit YyY.
 * Downsides
 * Much larger table. Requires n*m space usage with n = number of nodes and m = average depth.  That works out to n^2 for worst case.  Its likely though that any discussion with large n will have a depth much much less than n. Some sort of special casing could also be done to artificially limit the depth.
 * Moving a sub-tree (for a topic split) is a pretty expensive operation
 * Denormalization means that the guarantees of always correct data are thrown out the window. Programming errors can cause data errors.
 * Unique index on (descendentId, distance) will ensure we dont have multiple paths to a node. There are likely plenty of other possible data errors though.

Materialized Path
Uses a single table with a two fields, id and path. Rather than storing an adjacency list of edges encodes the simple path from a node to its root node as a string.


 * Benefits
 * Subtree queries are a text prefix match, very simple
 * Subtree query depth can be limited by testing the string length of the path b
 * String length is not indexable in mysql ( i think )
 * For this to work all path nodes must be 0-prefixed in a future proof maner
 * Inserts are really simple, take the parent path and append the parents id to it.
 * Locating the parents/grandparent/etc is very simple


 * Downsides
 * 0 prefixing the keys will likely mean 12 or 13 bytes per node in the path.
 * Storing integer ids as a text string is very inefficient.
 * Arrays as a type were defined in sql 99, why cant mysql support sql 99 :-(
 * with 12 digit id's that means a post 3 levels down has a path size of 38 bytes (not the end of the world).
 * Indexing path strings may be a much larger index than indexing integers. Although its only one path string per node instead of many integers.

Nested Set

 * Benefits
 * Locating a sub tree is dead simple.
 * Downsides
 * Inserts are complex. In the worst case scenario you update every node in the tree.
 * Selecting a subtree is easy, but selecting a specific level or levels of that subtree is not. You have to select the whole sub-tree down to the bottom.
 * Not yet sure how to locate the parent/grandparent/etc of a node efficiently with nested sets.
 * To store multiple trees in the same table every node will need a tree identifier in addition to the left/right numbers

Use a graph database

 * I don't know anything about graph databases,
 * Has anyone on the team used graph databases?
 * Ops hasn't supported any graph databases in the past.

Key/Value store
Completely denormalize into an unqueryable(from SQL) json/xml/something structure and use the existing Revisions table as a key/value store.
 * "Simplest thing that could possibly work"
 * This is basically what wikidata does(store JSON in revisions and push all the work into the application layer)
 * Requires creating fake pages for everything in flow.
 * Must prevent anyone from directly editing these pages
 * Can't re-use the existing parser cache because it works at revision level but we would be storing JSON/XML at that level, with the wikitext nested inside it.
 * Could store both the wiki text and the rendered html directly in the data structure
 * The plan for Flow is to use the VE and parsoid from the get-go, perhaps can store the parsoid html directly(would need to know more about parsoid to make a decision here).
 * Updating any particular graph would require row level locking between fetching the tree and pushing back the updated tree.
 * The revision table already has code for this use case (preventing multiple people from editing the same document)
 * Flow would need to extend that though, because it would be rather inconvenient to tell someone to resubmit their comment, because somewhere else in the page someone else inserted a comment.
 * Avoids the mismatch between SQL's set based approach and a tree's graph based approach.
 * Doing operations on a tree of any reasonable size will possibly be quite slow from an interpreted language
 * XML is already a tree, if we store the tree structure in XML then the C libraries for xpath/etc can do some of that work for you
 * Anything that requires escaping (xml/json) in the content and full parsing of bulk strings is inefficient to parse
 * With a length-prefixed format it is just a matter of seeking around
 * As long as things easily map into JSON we should be good.