Offline content generator/Bundle format

We're still working on documenting this.
 * metabook.json
 * Contains some version of the "metabook" data. For multi-wiki zips, the per-wiki file contains just the "items" key, with the articles for that wiki. This file (together with nfo.json) is essentially the input to the spidering process.
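A minimal sketch of what a per-wiki metabook.json might contain. Only the "items" key is documented above; the "type" and "title" fields shown for each item are assumptions based on the mwlib metabook format, and the titles are made up.

```python
import json

# Hypothetical per-wiki metabook.json content. Only "items" is documented
# on this page; the per-item fields here are assumptions.
metabook = {
    "items": [
        {"type": "article", "title": "Foo"},
        {"type": "article", "title": "Foobar"},
    ]
}

print(json.dumps(metabook, indent=2))
```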


 * nfo.json
 * A JSON object with three key-value pairs: "format", which is "nuwiki"; and "base_url" and "script_extension", copied from the data POSTed to mw-serve. This is the metadata the spider needs in order to make API requests against the appropriate wiki.
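A sketch of how the spider could derive an API entry point from nfo.json. The field names come from the description above; the concrete values and the "api" path segment are assumptions for illustration.

```python
# Hypothetical nfo.json content; the values are examples, not from the spec.
nfo = {
    "format": "nuwiki",
    "base_url": "https://en.wikipedia.org/w/",
    "script_extension": ".php",
}

# Combine base_url and script_extension into the conventional API entry point.
api_url = nfo["base_url"] + "api" + nfo["script_extension"]
print(api_url)  # https://en.wikipedia.org/w/api.php
```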


 * siteinfo.json
 * The output of the API query action=query&meta=siteinfo&siprop=general|namespaces|interwikimap|namespacealiases|magicwords|rightsinfo.
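A sketch of building that siteinfo query URL with the standard library. The endpoint URL is a placeholder; note that urlencode escapes the pipe characters as %7C, which the MediaWiki API accepts, and the "format" parameter is an assumption (the page does not say which serialization was requested).

```python
from urllib.parse import urlencode

params = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "general|namespaces|interwikimap|namespacealiases"
              "|magicwords|rightsinfo",
    "format": "json",  # assumption: JSON output
}

# urlencode escapes '|' as %7C; the MediaWiki API accepts either form.
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```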


 * licenses.json
 * Contains JSON license data: an array of objects, each corresponding to an entry in metabook.json's "licenses" field. Each object has the fields "type", "title", and "wikitext". "type" is always "license", and "title" is taken from the 'name' key of the metabook.json entry. As for "wikitext": if the field 'mw_license_url' is present, that URL is fetched (typically a URL along the lines of index.php?title=???&action=raw&templates=expand; "&templates=expand" is appended if the URL matches the expected pattern and does not already contain "templates=expand"). Otherwise, "wikitext" is whichever of 'mw_rights_text', 'mw_rights_page' (wrapped in markup), and 'mw_rights_url' exist, concatenated with two newlines as a separator.
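The fallback branch above can be sketched as follows. This is a sketch only: fetching 'mw_license_url' is omitted, and the exact markup wrapped around 'mw_rights_page' is not documented here, so it is passed through unwrapped.

```python
def license_wikitext(lic):
    """Assemble the "wikitext" field from a metabook license entry,
    following the fallback order described above. Sketch only: the
    'mw_license_url' fetch path is omitted, and whatever wrapping is
    applied to 'mw_rights_page' is not reproduced here."""
    parts = []
    if lic.get("mw_rights_text"):
        parts.append(lic["mw_rights_text"])
    if lic.get("mw_rights_page"):
        # The real code wraps this value in markup (not documented here).
        parts.append(lic["mw_rights_page"])
    if lic.get("mw_rights_url"):
        parts.append(lic["mw_rights_url"])
    # Fields are concatenated with two newlines as separator.
    return "\n\n".join(parts)
```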


 * redirects.json
 * Contains information on redirects. The content appears to be a map from original title to redirect target, used to bypass redirects when fetching page content from the bundle. For example, if the book contains "Foo", which redirects to "Foobar", the bundle may store only the content of "Foobar" plus an entry in redirects.json mapping "Foo" to "Foobar"; when the bundle fetcher is later asked for "Foo", it returns the content of "Foobar".
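The lookup described above amounts to a one-step indirection through the map. A minimal sketch, using the "Foo"/"Foobar" example from the text:

```python
# Parsed from redirects.json: original title -> redirect target.
redirects = {"Foo": "Foobar"}

def resolve(title, redirects):
    # Follow at most one level of redirection: the bundle stores only
    # the target's content, keyed by the target title.
    return redirects.get(title, title)

print(resolve("Foo", redirects))  # Foobar
```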


 * authors.db
 * An sqlite database containing author information. Keys are MediaWiki titles and values are JSON-encoded arrays of MediaWiki usernames. Note the possible presence of an "ANONIPEDITS:<number>" entry, which records how many anonymous editors' IP addresses have been elided from the list.
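A sketch of reading authors.db, using an in-memory stand-in built with the kv_table schema given at the end of this page. The title and usernames are made up for illustration.

```python
import json
import sqlite3

# Build a tiny in-memory stand-in for authors.db; real bundles ship a
# file on disk with the same single-table schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv_table (key TEXT PRIMARY KEY, val TEXT)")
db.execute("INSERT INTO kv_table VALUES (?, ?)",
           ("Foobar", json.dumps(["Alice", "Bob", "ANONIPEDITS:3"])))

(val,) = db.execute(
    "SELECT val FROM kv_table WHERE key = ?", ("Foobar",)).fetchone()
authors = json.loads(val)

# Separate named editors from the elided-anonymous-editor count.
named = [a for a in authors if not a.startswith("ANONIPEDITS:")]
anon = sum(int(a.split(":", 1)[1])
           for a in authors if a.startswith("ANONIPEDITS:"))
print(named, anon)  # ['Alice', 'Bob'] 3
```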


 * html.db
 * An sqlite database containing output from action=parse. Keys are revision ids; values are the output as a JSON structure.


 * parsoid.db
 * An experimental addition: Parsoid parser output, equivalent to html.db.


 * imageinfo.db
 * An sqlite database containing image information from the MediaWiki API query prop=imageinfo&iiprop=url|user|comment|sha1|size&iiurlwidth=$width. Keys are MediaWiki titles; values are JSON-encoded objects.


 * revisions-1.txt
 * A file containing multiple records of JSON data: the output of action=expandtemplates for all pages in the book, some other API queries for pages in the book, and image pages for the book's images, possibly among other things. There appears to be no indication of the original queries, just the returned data.
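Since the framing of these records is not documented, one way to read such a file is to treat it as concatenated JSON values and split them incrementally. This sketch assumes the records are simply concatenated, optionally separated by whitespace; if the real file uses a different delimiter, the splitting would need to change.

```python
import json

def iter_records(text):
    """Yield successive JSON values from a string of concatenated
    records. Assumption: records are back-to-back, possibly separated
    by whitespace; the actual framing of revisions-1.txt is not
    documented on this page."""
    dec = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace between records.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = dec.raw_decode(text, pos)
        yield obj

sample = '{"a": 1}\n{"b": 2}'
print(list(iter_records(sample)))  # [{'a': 1}, {'b': 2}]
```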


 * images
 * A directory containing images. Filenames come from MediaWiki, with a localized "File:" prefix; tildes are replaced with "", and all non-ASCII characters, plus slash and backslash, are replaced with "~%d~", where %d is the Unicode codepoint of the character.
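The codepoint escaping can be sketched as below. This covers only the "~%d~" rule for non-ASCII characters, slash, and backslash; what exactly replaces tildes in the original name is not spelled out above, so that step is omitted.

```python
def escape_image_name(name):
    """Escape an image filename as described above: non-ASCII
    characters, slash, and backslash become "~%d~" with %d the
    character's Unicode codepoint. The handling of tildes in the
    original name is not reproduced here (it is not fully documented)."""
    out = []
    for ch in name:
        if ord(ch) > 127 or ch in "/\\":
            out.append("~%d~" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(escape_image_name("File:Ä.jpg"))  # File:~196~.jpg
```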

All sqlite databases have a single table, kv_table, with the following schema:
 CREATE TABLE kv_table (key TEXT PRIMARY KEY, val TEXT);
That is, they are simple key/value maps; the keys and values are described above.
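Because every .db file in the bundle shares this schema, a single helper can inspect any of them. A sketch using the standard sqlite3 module:

```python
import sqlite3

def dump_bundle_db(path):
    """Yield all (key, val) pairs from any of the bundle's .db files,
    which all share the kv_table schema shown above."""
    db = sqlite3.connect(path)
    try:
        yield from db.execute("SELECT key, val FROM kv_table ORDER BY key")
    finally:
        db.close()
```

For authors.db and imageinfo.db the values would then be decoded with json.loads before use.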