User:Dantman/Code Ideas/Text dump algorithm


 * 1) For {page} in {pages} where {page_id} % {totalThreads} = {threadIndex}
 * 2) Let {Q} be a SQL query for {rev_page} = {page->id} ordered in reverse by {rev_timestamp}
 * 3) Let {RBUF} be an empty Queue
 * 4) Let {SHEAP} be a MaxHeap that sorts by {I->Size}
 * 5) Begin reading from {Q}. For each row:
 * 6) Let {Text} be the contents found by fetching the row fetched from {Q}
 * 7) Let {R} be an object
 * 8) Let {R->Text} be {Text}
 * 9) Let {R->Size} be the strlen of {Text}
 * 10) Let {R->Inserted} be false
 * 11) Push {R} onto the end of {RBUF}
 * 12) Insert {R} into {SHEAP}
 * 13) If {RBUF} has {queueLength} items or {Q} has no more items break from this loop
 * 14) Let {R} be the first item in {SHEAP}
 * 15) Let {} be a MinHeap that sorts by {->Size}
 * 16) Iterate over the next {comparisonLookahead} items in {SHEAP} as {}:
 * 17) Compute a delta between {R->Text} and {->Text} as {Delta}
 * 18) Calculate the storage space of {Delta} as {DeltaSize} # Note: Removals will be optimized and will only contain ranges needed for removal, not the text itself
 * 19) Let {->Rev} be {}
 * 20) Let {->Delta} be {Delta}
 * 21) Let {->Size} be {DeltaSize}
 * 22) Insert {} into {}


 * Dumping is run in parallel where each thread is responsible for pages with a MOD {totalThreads} that is equal to the thread's own index. (If three threads are being run then thread i=2 is responsible for the page id's (2, 5, 8, etc...)).
 * Because revisions within a page are most likely to be similar we break up tasks by pages and make all revisions in one page handled by a single thread.
 * Revision text may be stored in Raw, Delta, or Equal formats. Raw being a blob of text. Delta being a space optimized diff and a reference to another revision. And Equal being a reference to another revision indicating they are the same.
 * Deltas are stored with preference going towards storage of removals
 * We iterate in reverse as newer revisions are more likely to have more text.
 * We handle the largest revisions in a chunk of revisions first to make it most likely that we will store a removal.
 * Storage:
 * Use binary storage patterns.
 * Use fseek to return to the location previously unknown index pointers and write to them
 * Use tempnam to create a temporary file, write to it, and then move it to the final location.