User:ArielGlenn/Dumps new format (deltas, changesets)

Open issues on GSOC 2013 project

easy todos

  • did the benchmarks for Randall's idea get committed anywhere?
  • mail to the list announcing format and code

required for production

  • need these to be monitorable: output that shows up in index.html so people can watch the progress, etc.
  • make sure that reading from stdin is enough for people using XML
    • what do we do about folks who need two streams, i.e. stubs plus pages?
    • what about creating the multistream dumps?
  • adding new fields to the XML: let's document how that's going to work so people will know in advance
    • XML spec file will always be in the same public location (point those out)
    • need a spec file for the dumps that is always kept in sync, likewise in a public location
    • how much of a pita will it be to add new fields?
  • length or sha1 checks when retrieving metadata? (not mentioned yet but on my list of 'needs')
    • if we rely on the maintenance scripts, these do it already
  • be able to run in pieces as we do today (if speed demands it, which sooner or later it will)
    • allows us to make use of multiple cores, speed up a given run too
    • allows file to be in chunks that are downloadable
    • allows us to recover by rerunning one piece (checkpoint) instead of the whole thing
  • be able to easily recombine pieces as needed
    • we do this for stubs and everything else but history already
  • be able to recover from the middle if something dies or is corrupt (instead of losing the whole file): this is something bz2 gives us (see the sketch after this list)
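
A rough sketch of the sort of thing bz2's layout buys us, using the multistream arrangement (many separately compressed bz2 streams concatenated into one file); the file name and chunk size are made up, and this is python for illustration, not the real tooling:

  import bz2

  def iter_bz2_streams(path, chunk_size=1 << 20):
      # walk a multistream bz2 file one stream at a time; each stream
      # decompresses independently, so damage is confined to one stream
      # (skipping past a corrupt stream would mean scanning ahead for the
      # next 'BZh' header, not shown here)
      with open(path, "rb") as infile:
          leftover = b""
          while True:
              decomp = bz2.BZ2Decompressor()
              pieces = []
              data = leftover or infile.read(chunk_size)
              while data:
                  pieces.append(decomp.decompress(data))
                  if decomp.eof:
                      break
                  data = infile.read(chunk_size)
              if not pieces:
                  break
              yield b"".join(pieces)
              leftover = decomp.unused_data   # start of the next stream, if any
              if not decomp.eof:
                  break                       # truncated final stream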

check with users about these

  • have an optional text id index, because maybe we want that or some people want it
  • do people want the <restrictions> tag?

delta compression

  • size comparisons: how good is delta compression compared to: bz2, 7z?
  • so we are at maybe 20% slower than gzip with delta compression, what do we think about that?
  • are we going to have memory issues?
  • consider the worst case for some of these pages with ginormous numbers of revisions:
    Wikipedia:Administrator intervention against vandalism -- 931589 revisions. So huge.
  • how long does uncompression take?
  • when implemented for the dumps, we will use it for new revisions of an existing page (?)
  • what happens to current page dumps? how will compression be used for these?
  • which compression library will be used? what properties will it have?
  • retrieval of old revisions is expensive if we start with the first one as the reference one? (see the sketch after this list)
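
To make the retrieval-cost worry concrete, here's a toy delta-chain sketch (plain python difflib, not whichever delta/compression library we end up choosing): each revision is stored as a delta against the previous one, so rebuilding revision N means replaying N deltas starting from the reference text.

  import difflib

  def make_delta(old, new):
      # keep only copy ranges from `old` plus literally inserted text
      delta = []
      for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
          if tag == "equal":
              delta.append(("copy", i1, i2))
          else:
              delta.append(("ins", new[j1:j2]))
      return delta

  def apply_delta(old, delta):
      out = []
      for op in delta:
          out.append(old[op[1]:op[2]] if op[0] == "copy" else op[1])
      return "".join(out)

  def get_revision(reference_text, deltas, n):
      # retrieval cost grows with the length of the chain we have to replay
      text = reference_text
      for delta in deltas[:n]:
          text = apply_delta(text, delta)
      return text

Chaining the other way (newest revision as the reference, deltas pointing backwards) just flips which end of the history is cheap to get at; either way somebody pays for long chains.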

pending

  • deletions, suppressions, undeletions, moves...
  • "free space" management

timing tests

  • how long does it take to merge an incremental into a previous "full"?
  • how long does it take to write out a gz or bz2 or 7z compressed or whatever file from XML output from the new full?
  • how long does it take to write out an uncompressed XML file from a new format 'full'?

distant future

  • alt formats like sqlite?
  • unit testing

out of curiosity

  • do we have any cases where namespaces were removed on wmf projects?
  • what does import.php do with attributes that are labeled as deleted in the xml file?
  • see if there are any comments with non-utf8 in the middle on any project

Brain dump before GSOC 2013 started

Here are some notes that might be useful for Dumps v2.0, a New Hope^WFormat.

The idea was proposed here and is now on our GSOC list here. A discussion was started on the xml data dumps mailing list here. The bug report is here.

These notes are just my brain dump of 'random thoughts I had about this project', not a definitive guideline in any way.



Let's say we wanted to move to a new output format to make things easier on the endusers, so they could download 'only the deltas'. What would that look like?

Don't download the entire bz2 file; download just the bits that are new, plus a list of records to delete (and apply the deletions).

So this is the ideal. Now here are the real-world constraints:

  • Compression is best when done over a large pile of data (900K let's say, but certainly greater than 1k). This means that compressing each revision separately is out of the question: while it would make this format much easier, it would be unworkable as far as size.
  • This means we are looking at compressing a number of revisions together, either grouped by page or grouped by revision number (age).
  • We need to alter the contents of a member in the archive when a revision contained in the member goes away (is deleted). We think an old revision would never be altered. Errrr... weren't we thinking that we would rewrite old revisions to make the length be right? Or were we going to just update the length... we were going to do something, I forget what. We'd better have a look at that.
  • It is possible that a bot comes through and edits every page or many many of the pages in a month, say adding a category or something like that. In this case we would need to update every single compressed member if we compressed per page, and this would mean uncompression and recompression of the entire archive. Obviously we don't gain anything with this approach.
  • If we compress every X revisions, depending on size when they are glommed together, we're probably better off as far as how much we have to uncompress for a rewrite (see the sketch after this list). A page might get deleted, which would mean we uncompress a block, remove the revisions that have to do with the specified page, and then compress it and put it back. Additionally we might have a specific revision get hidden (how often would this happen for an old revision? Rarely, I think), so we would uncompress, remove the offending revision, and then recompress. This would affect older revisions as well as newer ones, but perhaps the older ones less often.
  • We would need to write cross-platform libraries/scripts to convert from this format to the format of the user's choice (bz2, 7z)
  • It would be nice to have a library that would let folks do the equivalent of a grep for some string
  • The convenience library should let folks retrieve all revisions in order by rev id for a given page (this means finding the first block with some of its revs, writing those out, finding the next such block, etc. until we get to the last such block; a chain of these, so rather a bit of random access).
  • If someone wanted to generate output that had each page with their revisions in order, it would be a bit slower than reading from a file where the reads could all be serial. How much? Maybe if we are lucky the overhead of the decompression would be larger than the cost of the random access. But given that we won't compress revisions as separate items, nor even pages as separate items, we'll wind up uncompressing a number of pages together in order to get to a new item. meh.
  • If someone restores a page after deletion, it gets a new page id (we don't care so much about that), but those revisions with their rev ids also come back. So it might be nice to hold space for them in the file. What does it mean to 'hold space': hold markers? Have an index with 0 entries for those rev ids, and when/if those entries are restored, put a pointer in there to a block and offset? I guess they could be stored at the end of the archive along with new pages etc. Does this leave gaps in the archive that have to be filled with zeros? Potentially. If there is already a gap with zeros we could put the revision there? Well... once again we would be uncompressing some block in the middle of the archive, adding the revision, recompressing and hoping it fits. We would only know it fits if we had previously removed it; in that case we probably do just want to 'put it back', because otherwise over time, as more and more old things are deleted and restored, we wind up with a bunch of unused space in there. I dunno how much; maybe if it only ever reaches 1/10 of the archive size we can live with it, for en wp. Any smaller wiki's archive won't be that large, so even if someone adds two million articles, some real pages get added, and then the two million articles are deleted, we wind up with space for those two million... hmm. That's really a lot, and it argues that we want to be smart about deletions somehow. Ugh.
  • Clearly we want to be able to aggregate and re-use space within the archive. This might be a sort of 'defrag' that could be done if the dead space in blocks N through M of the archive is greater than X% or Y absolute size. In this case we would uncompress blocks N through M, re-combine the contents skipping over zeroed stuff, write out filled blocks, and leave the zeroed stuff contiguous. This zeroed space would be added to a free list which could then be reused by the archive on the next round. So what does this mean: that we would write new revisions into that region? If we did that, we would potentially have (depending on when those two million articles were discovered and removed) a bunch of new data before a bunch of older data. Is this bad? Well, if we write in rev id order I still think it's ok... the new revisions are more likely to be 'disturbed' than the old ones; we just avoid mixing them too much.
  • A case where we would have a bunch of corrections to a lot of data under this scheme: a bot goes through and edits a bunch of pages, then needs to revert those because of a mistake. In this case, assume there are other edits going on simultaneously (other pages etc). So we have, say, half of the revisions for a day being bad, and the other half being fine. So we might have... well, if they *delete* the revisions, which is unlikely, we have to go through the blocks with the newest X days of revisions and toss a bunch of stuff. I think this is ok actually.
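
A minimal sketch of the "compress every X revisions together" idea from the list above; the ~900K target, the delimiter-based serialization and the bz2 calls are placeholders for whatever the real format ends up using:

  import bz2

  BLOCK_TARGET = 900 * 1024   # roughly how much revision text to glom into one block

  def compress_block(revs):
      # crude text serialization just for the sketch; a real format would be binary
      raw = "".join(f"{rev_id}\x01{text}\x02" for rev_id, text in revs)
      return bz2.compress(raw.encode("utf-8"))

  def decompress_block(blob):
      raw = bz2.decompress(blob).decode("utf-8")
      return [tuple(item.split("\x01", 1)) for item in raw.split("\x02") if item]

  def build_blocks(revisions):
      # revisions: iterable of (rev_id, text) in the order we want them stored
      blocks, current, size = [], [], 0
      for rev_id, text in revisions:
          current.append((rev_id, text))
          size += len(text)
          if size >= BLOCK_TARGET:
              blocks.append(compress_block(current))
              current, size = [], 0
      if current:
          blocks.append(compress_block(current))
      return blocks

  def delete_revision(blocks, block_no, rev_id):
      # only the block holding the revision gets uncompressed and rewritten;
      # the rest of the archive is untouched
      kept = [(r, t) for r, t in decompress_block(blocks[block_no]) if r != str(rev_id)]
      blocks[block_no] = compress_block(kept)

Grouping by revision number (age) rather than by page is what keeps a site-wide bot run from forcing a rewrite of every block in the archive.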

OK so the questions will be:

  • how many revisions in a packet?
  • how do we store the rest of the metadata about the revision?
  • how do we permit retrieval of all revisions for a page easily?
    (i.e. the list of revisions per page must be stored someplace and updated each time we mess about with the archive)
  • given that we will treat each revision as an 'item' even though we compress more than one item together, what does the toc for finding an item in the file look like? (see the sketch after this list)
  • what compression do we use? want:
    • really efficient for large and small amounts of data
    • fast uncompress
    • better compress than bz2
    • does NOT need to be block oriented, we do blocks by having each collection of items compressed separately
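
One possible answer to the toc question, assuming revisions are stored in rev id order and the toc has one row per compressed block rather than one per revision; the (first_rev_id, offset, length) rows and the block layout (same toy format as the sketch above) are made up for illustration:

  import bz2
  from bisect import bisect_right

  # toc: list of (first_rev_id_in_block, byte_offset_in_archive, compressed_length),
  # sorted by first_rev_id

  def get_revision(archive_path, toc, rev_id):
      firsts = [row[0] for row in toc]
      i = bisect_right(firsts, rev_id) - 1      # last block starting at or before rev_id
      if i < 0:
          raise KeyError(rev_id)
      _, offset, length = toc[i]
      with open(archive_path, "rb") as archive:
          archive.seek(offset)
          blob = archive.read(length)
      for item in bz2.decompress(blob).decode("utf-8").split("\x02"):
          if item:
              r, text = item.split("\x01", 1)
              if int(r) == rev_id:
                  return text
      raise KeyError(rev_id)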

Storing the list of revisions per page:

Stub files in gz format are fast to read and not too huge, so we could write those in the usual way, I claim. This would also be our list of revisions per page. But it's not block-formatted, so someone wanting the list of revisions for a given page would suddenly be at a disadvantage. So we need a list stored in a format that permits random access, and either we use binary search or we ...

How bad is sqlite for 100 million records?? :-D :-D Well, it seems to say that we could do it and be ok. I wonder if that's a good format or not.
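
For scale, a quick way to poke at the sqlite question would be something along these lines (made-up schema, one row per revision): sqlite can store that many rows; the real questions are file size and how fast updates are.

  import sqlite3

  conn = sqlite3.connect("revlist.db")           # throwaway test db
  conn.execute("""CREATE TABLE IF NOT EXISTS revs (
                      page_id INTEGER,
                      rev_id  INTEGER,
                      PRIMARY KEY (page_id, rev_id)
                  ) WITHOUT ROWID""")
  conn.executemany("INSERT OR IGNORE INTO revs VALUES (?, ?)",
                   [(1, 10), (1, 11), (2, 12)])  # stand-in rows
  conn.commit()
  # the list of revisions for a page is then one indexed lookup
  revs = [r for (r,) in conn.execute(
      "SELECT rev_id FROM revs WHERE page_id = ? ORDER BY rev_id", (1,))]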

I'd prefer to have bz2, with an index into blocks so that someone could get those lists... have them written in order (wait, can we guarantee that? Yes), and let someone retrieve either randomly with a tiny delay (= binary search or consult an index), or serially with no delay, except always for the decompression, which might have extra overhead in the random access case, as we will have multiple items compressed together and it might take a little time to dig the user's entry out of the compressed block.

This file will be written in page id order, so binary search is possible and a fine substitute for having someone use sqlite. So in other words we could use a bz2 multistream file for this, which would... hmm. This would allow normal scripting tools to work on the file, which we want. Need to see how long compressing and uncompressing would take for bz2.
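
Roughly what "binary search plus bz2 multistream" retrieval could look like; the index format here (one "first_page_id offset" line per stream, sorted by page id) is an assumption for the sketch, not a description of the existing multistream index:

  import bz2
  from bisect import bisect_right

  def load_index(index_path):
      # assumed format: "first_page_id_in_stream offset", one line per stream, sorted
      page_ids, offsets = [], []
      with open(index_path) as index:
          for line in index:
              pid, off = line.split()
              page_ids.append(int(pid))
              offsets.append(int(off))
      return page_ids, offsets

  def stream_for_page(data_path, page_ids, offsets, page_id):
      i = bisect_right(page_ids, page_id) - 1    # stream starting at or before this page
      if i < 0:
          raise KeyError(page_id)
      with open(data_path, "rb") as data:
          data.seek(offsets[i])
          decomp = bz2.BZ2Decompressor()
          pieces = []
          while not decomp.eof:
              chunk = data.read(1 << 20)
              if not chunk:
                  break
              pieces.append(decomp.decompress(chunk))
      return b"".join(pieces)   # the caller digs the page's entry out of this one stream

The "tiny delay" for random access is the bisect plus decompressing one stream's worth of entries; serial readers just read the whole file as usual.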

Here's a bit of timing info:

root@dataset1001:/data/xmldatadumps/public/enwiki/20130304# date; zcat enwiki-20130304-stub-meta-history27.xml.gz | bzip2 > /root/enwiki-20130304-stub-meta-history27.xml.bz2; date
Mon Mar 25 09:16:14 UTC 2013
Mon Mar 25 10:15:51 UTC 2013

ugh too long. don't want to do that more than once.

So this tells us how long it takes to wade through the gz file and write out the bz2 file; maybe not too bad. I'd rather do this and produce these files: reading through the gz file serially, for a person trying to get the revision list for their page, is probably going to be prohibitively slow.

If we have 100 million pages, then compressing together the info (page number, revision numbers) for, oh, say 1k of them at a time means we have 100k entries; this isn't bad at all. Bear in mind that some pages have a lot of revisions... really lots and lots. As in several hundred thousand.