Flow/Architecture/Discussion Storage

= Questions the data model must answer =

To really decide on a model for storing graphs, we need to first know what questions must be answered:


 * User X posts a reply to FlowPostSingular 123. Who needs to be notified?
 * All children of the same parent?
 * All parents up to the top level?
 * User Z starts viewing a conversation starting at FlowPostSingular 444. What posts need to be loaded?
 * All children of the same parent?
 * Recursively select all children and children of children up to max XXX items
 * Which items get pruned when limiting fetch to XXX items?
 * By depth from origional
 * By timestamp

= Possible Solutions = asdf

Naive
Use an explicit flowPostSingularParentId or equivalent in the FlowPostSingular table.


 * Benefits:
 * Simple. Average SQL user can immediatly understand how it works.
 * Encodes the requirement that each post have exactly one parent directly into the data model
 * Downsides
 * More complex to fetch parts of the graph without fetching all of them
 * Requires either a recursive stored procedure or a join for every level of depth you want to query.
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs
 * The right covering indexes may provide a middle ground with regards to memory/IO requirements to build the graphs

Edge-adjacency lists
Encodes the relationship between parent/child in an independent table. E.x. nodes table for data and edges table for relationships

(e.g. after a topic split )
 * Possibly allows for a single post to have multiple parents and be display in multiple places
 * Alternatively can use unique constraints to enforce single parent if desired.
 * Normalizing the edge relationships into a table that only contains edges and no other unrelated information can reduce IO and memory requirements inside the DB server when computing the graph.
 * Best performance will be achieved using stored procedures for the recursive queries, but stored procedures are very very far from being compatible between databases. Would almost need to write stored procedures for every db we want to support (just mysql?) and fallback SQL for alternative databases.
 * I think we can get away with just solving for MySQL - bh
 * Probably many more

Closure table
This is a denormalization of the edge-adjacency list. Where an edge-adjacency list stores individual edges as (parentId, childId), a Closure Table contains 3 values (parentId, descendentId, distance). Every element when inserted is linked to every node its parent is linked to, plus itself.


 * Benefits
 * No recursive queries
 * Queries are easily indexed
 * Of the options presented this is by far the simplest one but it trades off storage space for that simplicity.
 * Seems possible that the indexes are so large and spread out that the db server always has to hit the disk. May not be a real concern.
 * The sql to do the queries actually looks reasonable instead of massively complex.
 * Selecting all descendents of a specific node only requires querying for parentId = XxXxX.
 * Limiting to a specific depth is an additional condition against the distance field.
 * Assuming you had an index of (parentId, distance) a breadth first search for 100 nodes is a query with parentId=XxXxX order by depth limit YyY.
 * Downsides
 * Much larger table. Requires n*m space usage with n = number of nodes and m = average depth.  That works out to n^2 for worst case.  Its likely though that any discussion with large n will have a depth much much less than n. Some sort of special casing could also be done to artificially limit the depth.
 * Moving a sub-tree (for a topic split) is a pretty expensive operation
 * Denormalization means that the guarantees of always correct data are thrown out the window. Programming errors can cause data errors.

Materialized Path
Uses a single table with a two fields, id and path. Rather than storing an adjacency list of edges encodes the simple path from a node to its root node as a string.

Nested Set

 * Benefits
 * Downsides
 * Inserts are complex. In the worst case scenario you update every node in the tree.
 * Selecting a subtree is easy, but selecting a specific level or levels of that subtree is not.

Use a graph database

 * I don't know anything about graph databases,
 * Has anyone on the team used graph databases?
 * Ops hasn't supported any graph databases in the past.

Key/Value store
Completely denormalize into an unqueryable json/xml/something structure and use mysql as a key/value store
 * Updating any particular graph would require row level locking between fetching the tree and pushing back the updated tree. :-(
 * Avoids the mismatch between SQL's set based approach and a tree's graph based approach.