User:Dantman/Code Ideas/Text dump algorithm

From mediawiki.org
  1. For {page} in {pages} where {page_id} % {totalThreads} = {threadIndex}
    1. Let {Q} be a SQL query for {rev_page} = {page->id} ordered in reverse by {rev_timestamp}
    2. Let {RBUF} be an empty Queue
    3. Let {SHEAP} be a MaxHeap that sorts by {I->Size}
    4. Begin reading from {Q}. For each row:
      1. Let {Text} be the contents found by fetching the row fetched from {Q}
      2. Let {R} be an object
      3. Let {R->Text} be {Text}
      4. Let {R->Size} be the strlen() of {Text}
      5. Let {R->Inserted} be false
      6. Push {R} onto the end of {RBUF}
      7. Insert {R} into {SHEAP}
      8. If {RBUF} has {queueLength} items or {Q} has no more items break from this loop
    5. Let {R} be the first item in {SHEAP}
      1. Let {} be a MinHeap that sorts by {->Size}
      2. Iterate over the next {comparisonLookahead} items in {SHEAP} as {}:
        1. Compute a delta between {R->Text} and {->Text} as {Delta}
        2. Calculate the storage space of {Delta} as {DeltaSize} # Note: Removals will be optimized and will only contain ranges needed for removal, not the text itself
        3. Let {->Rev} be {}
        4. Let {->Delta} be {Delta}
        5. Let {->Size} be {DeltaSize}
        6. Insert {} into {}
  • Dumping is run in parallel where each thread is responsible for pages with a MOD {totalThreads} that is equal to the thread's own index. (If three threads are being run then thread i=2 is responsible for the page id's (2, 5, 8, etc...)).
  • Because revisions within a page are most likely to be similar we break up tasks by pages and make all revisions in one page handled by a single thread.
  • Revision text may be stored in Raw, Delta, or Equal formats. Raw being a blob of text. Delta being a space optimized diff and a reference to another revision. And Equal being a reference to another revision indicating they are the same.
  • Deltas are stored with preference going towards storage of removals
    • We iterate in reverse as newer revisions are more likely to have more text.
    • We handle the largest revisions in a chunk of revisions first to make it most likely that we will store a removal.
  • Storage:
    • Use binary storage patterns.
    • Use fseek to return to the location previously unknown index pointers and write to them
    • Use tempnam to create a temporary file, write to it, and then move it to the final location.