User:Dantman/User:Svick/Incremental dumps/File format/Varints

From mediawiki.org

With fixed storage an unsigned 32 bit int stores every number 0–4,294,967,295 with 4 bytes.

With varint storage an unsigned 32 bit int stores:

  • 0–128 with 1 byte
  • 129 – 16,384 with 2 bytes
  • 16,385 – 2,097,152 with 3 bytes
  • 2,097,153 – 268,435,456 with 4 bytes
  • and 268,435,456 – 4,294,967,295 with 5 bytes (though if you expect to have a lot in this range you can use (s)fixed32 which will use fixed-storage and hence 4 bytes for everything; personally I think if Protocol Buffers wanted to they could make it so that (u/s)int32 would never use 5 bytes if they felt like it simply by varying the wire type they use based on the value)

Taking a few extreme revisions from Wikipedia (EN):

  • A) rev_id = 5, parent_id = 0, timestamp = 2002-01-26T14:32:30Z (1,012,055,550)
  • B) rev_id = 16,000, parent_id = 15,562, timestamp = 2002-02-22T11:30:30Z (1,014,377,430)
  • C) rev_id = 2,000,001, parent_id = 1,933,236, timestamp = 2003-12-11T18:48:43Z (1,071,168,523)
  • D) rev_id = 4,000,000, parent_id = 3,945,385, timestamp = 2004-06-05T02:42:21Z (1,086,403,341)
  • E) rev_id = 562,467,155, parent_id = 562,466,994, timestamp = 2013-07-01T23:43:52Z (1,372,722,232)

The storage requirements for,

  • a.1) A proprietary format using rev_id = uint32, parent_id = uint32, timestamp = int32:
    • A,B,C,D,E) 4B + 4B + 4B = 12B
  • a.2) A proprietary format using rev_id = uint32, parent_id = uint32, timestamp = int64:
    • A,B,C,D,E) 4B + 4B + 8B = 16B
  • b.1) A Protocol Buffer format using rev_id = uint32, parent_id = uint32, timestamp = int32/int64:
    • A) 3B + 1B + 1B + 5B = 10B
    • B) 3B + 2B + 2B + 5B = 12B
    • C) 3B + 3B + 3B + 5B = 14B
    • D) 3B + 4B + 4B + 5B = 16B
    • E) 3B + 4B + 4B + 5B = 16B
  • b.2) A Protocol Buffer format using rev_id = uint32, parent_id = uint32, timestamp = sfixed32:
    • A) 3B + 1B + 1B + 4B = 9B
    • B) 3B + 2B + 2B + 4B = 11B
    • C) 3B + 3B + 3B + 4B = 13B
    • D) 3B + 4B + 4B + 4B = 15B
    • E) 3B + 4B + 4B + 4B = 15B

Using a dump metadata stored time epoch of (unix epoch + 1,010,000,000) instead of unixtime's epoch those end up being:

  • b.3) A Protocol Buffer format using rev_id = uint32, parent_id = uint32, timestamp = int32/int64:
    • A) 3B + 1B + 1B + 3B = 8B
    • B) 3B + 2B + 2B + 4B = 11B
    • C) 3B + 3B + 3B + 4B = 13B
    • D) 3B + 4B + 4B + 4B = 15B
    • E) 3B + 4B + 4B + 5B = 16B

Note that b.3.E only happens because the difference between the file epoch (1,010,000,000) and rev E's timestamp is over 8 years apart and we assume that every revision is stored in the same file. If the actual storage of revisions starts dividing/sharding revisions into different files then the epochs could actually be different for each file meaning the time deltas could end up smaller and make 5B timestamps less likely and reduce the probability that 16B (rev_id, parent_id, timestamp) data would ever show up.