User:Dantman/Code Ideas/Text dump algorithm

Dumping is run in parallel where each thread is responsible for pages with a MOD {totalThreads} that is equal to the thread's own index. (If three threads are being run then thread i=2 is responsible for the page id's (2, 5, 8, etc...)).
Because revisions within a page are most likely to be similar we break up tasks by pages and make all revisions in one page handled by a single thread.
Revision text may be stored in Raw, Delta, or Equal formats. Raw being a blob of text. Delta being a space optimized diff and a reference to another revision. And Equal being a reference to another revision indicating they are the same.
Deltas are stored with preference going towards storage of removals
- We iterate in reverse as newer revisions are more likely to have more text.
- We handle the largest revisions in a chunk of revisions first to make it most likely that we will store a removal.
Storage:
- Use binary storage patterns.
- Use fseek to return to the location previously unknown index pointers and write to them
- Use tempnam to create a temporary file, write to it, and then move it to the final location.