Flow/Architecture/Discussion Storage

= Questions the data model must answer =

To really decide on a model for storing graphs, we need to first know what questions must be answered:


 * User X posts a reply to FlowPostSingular 123. Who needs to be notified?
 * All children of the same parent?
 * All parents up to the top level?
 * User Z starts viewing a conversation starting at FlowPostSingular 444. What posts need to be loaded?
 * All children of the same parent?
 * Recursively select all children and children of children up to max XXX items
 * Which items get pruned when limiting fetch to XXX items?
 * By depth from origional
 * By timestamp
 * Allowing limit by timestamp means we will need to include the timestamp in the node data rather than in a revisions type table.
 * Given a single post id, what FlowObject(GUID) does it map to ?
 * We shouldn't need to keep a copy of the GUID in every node, it should be sufficient to link from the metadata to the root node. Determining what flow object a post maps to involves looking up the root post linked to that post and then finding the metadata with that root post.
 * Some things wont naturally have a root post, for simplicity sake i think we should use a placeholder root with no content in those cases.
 * Some things wont naturally have a root post, for simplicity sake i think we should use a placeholder root with no content in those cases.

Important properties that the final storage solution must have:
 * please expand as we know more

= Possible Interface =

The goal here is to design an interface that can answer all our questions. If we decide we need to change our storage methods mid-development we should be able to change the implementation of this interface without changing the interface itself, which hopefully also means not changing the code that uses the interface. Sadly that is possibly a pie in the sky type dream, different implementations will have different performance properties and might be better served by adjusting the calling code to take advantage of which operations are faster/slower.

This interface is just a quick guess at what operations will be required. Might be better to take an sicp style approach, That is, write the front-end "as if" you already had a back-end implementation. Write the code whichever way is the most convenient. Then go back and look at what implementation you wish you had, and implement that.

Another possiblity, revolving around caching: Make a complementary object implementing this interface that wraps a single tree(rather than all possible trees). Provide it the full graph from the root node(pulled from some sort of cache) and perform these operations on that graph rather than inside the database. Dont do this immediatly, probably premature optimization.

interface Something { // Attach the new node $descendant as a child of $ancestor. If $ancestor is null then $descendant // is a new root node. function insert( $descendant, $ancestor = null ); // Given any node in storage, return its parents id or null if it is a root element function findParent( $descendant ); // Given any node in storage, find the path from that node // to the root of the tree. The root of the tree must be the first // element of the resulting array, and $descendant must be the last. function findRootPath( $descendant ); // Given any node in storage, find the root node of its tree. // A very naive implementation of this function would be: //    return reset( $this->findRootPath( $descendant ) ); // But there are likely much more optimal implementations depending on the storage method chosen. function findRoot( $descendant ); // Given any node in storage, return a multi-dimensional php array representing // the node and all its descendant's. If $maxDepth is non null then only fetch // up to the provided depth. function findSubTree( $root, $maxDepth = null ); }

= Possible Solutions =

Below are several possible solutions for storing hierarchical data within mysql and some benefits/downsides to each in relation to our query requirements. I am currently leaning twords either Closure Tables or Materialized Path, but more input from other developers is likely needed. Both of these solutions denormalize the data a bit so that querys can be preformed against indexes without needing to join tables. Possibly we will want to throw together a simple implementation for the solutions we think have the most merit and stuff a test database with a few million posts to see what the indexes/queries/etc look like when analyzed.

Naive
Use an explicit flowPostSingularParentId or equivalent in the FlowPostSingular table.


 * Benefits:
 * Simple. Average SQL user can immediatly understand how it works.
 * Encodes the requirement that each post have exactly one parent directly into the data model
 * Downsides
 * More complex to fetch parts of the graph without fetching all of them
 * Requires either a recursive stored procedure or a join for every level of depth you want to query.
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs

Edge-adjacency lists
Encodes the relationship between parent/child in an independent table. E.x. nodes table for data and edges table for relationships

(e.g. after a topic split )
 * Possibly allows for a single post to have multiple parents and be display in multiple places
 * Alternatively can use unique constraints to enforce single parent if desired.
 * Normalizing the edge relationships into a table that only contains edges and no other unrelated information can reduce IO and memory requirements inside the DB server when computing the graph.
 * Best performance will be achieved using stored procedures for the recursive queries, but stored procedures are very very far from being compatible between databases. Would almost need to write stored procedures for every db we want to support (just mysql?) and fallback SQL for alternative databases.
 * I think we can get away with just solving for MySQL - bh
 * Probably many more

Closure table
This is a denormalization of the edge-adjacency list. Where an edge-adjacency list stores individual edges as (parentId, childId), a Closure Table contains 3 values (parentId, descendentId, distance). Every element when inserted is linked to every node its parent is linked to, plus itself.


 * Benefits
 * No recursive queries
 * Queries are easily indexed
 * Of the options presented this is by far the simplest one but it trades off storage space for that simplicity.
 * Seems possible that the indexes are so large and spread out that the db server always has to hit the disk. May not be a real concern.
 * The sql to do the queries actually looks reasonable instead of massively complex.
 * Selecting all descendents of a specific node only requires querying for parentId = XxXxX.
 * Actually turning that into a graph requires locating the parent node for all nodes. Basically this means a query for id IN (list, of, nodes, in, tree) AND depth=1
 * Limiting to a specific depth is an additional condition against the distance field.
 * Assuming you had an index of (parentId, distance) a breadth first search for 100 nodes is a query with parentId=XxXxX order by depth limit YyY.
 * Downsides
 * Much larger table. Requires n*m space usage with n = number of nodes and m = average depth.  That works out to n^2 for worst case.  Its likely though that any discussion with large n will have a depth much much less than n. Some sort of special casing could also be done to artificially limit the depth.
 * Moving a sub-tree (for a topic split) is a pretty expensive operation
 * Denormalization means that the guarantees of always correct data are thrown out the window. Programming errors can cause data errors.
 * Unique index on (descendentId, distance) will ensure we dont have multiple paths to a node. There are likely plenty of other possible data errors though.

Materialized Path
Uses a single table with a two fields, id and path. Rather than storing an adjacency list of edges encodes the simple path from a node to its root node as a string.


 * Benefits
 * Subtree queries are a text prefix match, very simple
 * Subtree query depth can be limited by testing the string length of the path b
 * String length is not indexable in mysql ( i think )
 * For this to work all path nodes must be 0-prefixed in a future proof maner
 * Inserts are really simple, take the parent path and append the parents id to it.
 * Locating the parents/grandparent/etc is very simple


 * Downsides
 * 0 prefixing the keys will likely mean 12 or 13 bytes per node in the path.
 * Storing integer ids as a text string is very inefficient.
 * Arrays as a type were defined in sql 99, why cant mysql support sql 99 :-(
 * with 12 digit id's that means a post 3 levels down has a path size of 38 bytes (not the end of the world).
 * Indexing path strings may be a much larger index than indexing integers. Although its only one path string per node instead of many integers.

Nested Set

 * Benefits
 * Locating a sub tree is dead simple.
 * Downsides
 * Inserts are complex. In the worst case scenario you update every node in the tree.
 * Selecting a subtree is easy, but selecting a specific level or levels of that subtree is not. You have to select the whole sub-tree down to the bottom.
 * Not yet sure how to locate the parent/grandparent/etc of a node efficiently with nested sets.

Use a graph database

 * I don't know anything about graph databases,
 * Has anyone on the team used graph databases?
 * Ops hasn't supported any graph databases in the past.

Key/Value store
Completely denormalize into an unqueryable json/xml/something structure and use mysql as a key/value store
 * Updating any particular graph would require row level locking between fetching the tree and pushing back the updated tree. :-(
 * Avoids the mismatch between SQL's set based approach and a tree's graph based approach.