Wikimedia Maps/2015-2017/Tile server implementation

At this point, this is an incomplete list of ideas and technologies that the Maps team is considering.

Subsystems

 * Database: OpenStreetMap data gets imported into PostgreSQL with PostGIS 2.1+ spatial extensions.
 * Renderer: A Node.js service that renders tiles. It renders all the tiles initially, then processes updates from the rendering queue.
 * Rendering queue: Currently being implemented, most likely using Redis. It will probably be the same software as the tile server, but with different settings.
 * Tile storage: Currently we're toying with Cassandra. All tile servers should hold all the vector tiles because there's sufficient space for it. See considerations and ramblings below.
 * Tile server: A Node.js service that serves tiles, both vector (served verbatim) and raster (rendered to PNG or JPEG on the fly, relying on Varnish for caching). This is Kartotherian, implemented by us using open source components from Mapbox.
 * HTTP cache: Varnish.

Server roles

 * Database server: Postgres, renderer, rendering queue. For performance, it is essential that tiles get generated on the same machine as Postgres. TBD: whether it should have its own copy of tiles or it should just write to tileservers.
 * Tileserver: Tile storage and Kartotherian.
 * Caches: TBD whether it should be on tileservers or use the overall caching infrastructure.

Workflow
We render all the vector tiles for zoom levels 0 to 14 and re-render affected tiles when underlying OSM data changes. For higher zoom levels (we plan to stop at z18 at this point), serve pieces of z14 tiles.

Storage
The table below summarizes our tile storage needs for one style. For the first iteration, we plan to store up to level 14 (total ~360 million tiles). There are two ways to significantly reduce this number: de-duplication and over-zooming.
 * Over-zooming is when we know that a tile does not contain more information than the corresponding piece of a tile in the lower zoom level, so instead of storing a tile, we simply extract a part of the tile above when needed.
 * De-duplication is when all files with the same content are stored once. Zoom level 9 contains 70% duplicates, and the share is likely to grow for higher zooms. We have not yet found any off-the-shelf storage that can de-duplicate data.
 * Yes we have: MBTiles ;) --Max Semenik (talk) 20:33, 17 June 2015 (UTC)
 * Even if such storage were available, storing only level 15 (2^30 tiles) would theoretically need a 32-bit number per tile position for tile identification (tile position -> unique tile id), and thus 4 GB (4 * 2^30 bytes). Assuming 99% duplicates and 1.5 KB average per tile, we need 16+ GB (2^30 / 100 * 1024 * 1.5 bytes) to store the actual tiles. Once we add the per-item storage overhead, the numbers will be significantly higher.
 * It's better to have a good storage system with overhead than a crappy one without it. I would recommend concentrating on the former rather than the latter ;) --Max Semenik (talk) 20:33, 17 June 2015 (UTC)
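The figures quoted in this section can be re-derived with a few lines of arithmetic. This is only a sketch: the 99% duplicate rate and the 1.5 KB average tile size are the assumptions stated above, and the function and variable names are made up for illustration.

```python
def tiles_at_zoom(zoom: int) -> int:
    """The world is a 2^zoom x 2^zoom grid, so one zoom level has 4^zoom tiles."""
    return 4 ** zoom


def tiles_up_to_zoom(max_zoom: int) -> int:
    """Total number of tiles for zoom levels 0..max_zoom inclusive."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))


# Tiles we plan to store in the first iteration (zoom 0 to 14).
total_z0_z14 = tiles_up_to_zoom(14)

# De-duplication estimate for zoom 15 alone.
level15_tiles = tiles_at_zoom(15)               # 2^30 tile positions
index_bytes = 4 * level15_tiles                 # one 32-bit id per position
unique_tiles = level15_tiles // 100             # assumes 99% duplicates
tile_bytes = int(unique_tiles * 1.5 * 1024)     # assumes 1.5 KB per tile

print(f"tiles z0-z14:     {total_z0_z14:,}")
print(f"z15 index size:   {index_bytes / 2**30:.1f} GiB")
print(f"z15 unique tiles: {tile_bytes / 10**9:.1f} GB")
```

Neither number includes per-item storage overhead, which the text above expects to be significant.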

Tile states
In theory, each tile can be in four states: uninitialized, stored, stored-but-dirty, and use-over-zoom. When a tile is requested, the server has to decide either to use the stored vector tile, or to go to the lower zooms until it finds a tile that exists and extract a piece of it (over-zoom). To decide this, the server could attempt to get the tile from storage and, if it is missing, proceed to over-zoom; or it could store a large binary blob, with each bit showing whether a tile exists. It could even store 2 or 3 bits per tile to minimize the subsequent checks, with the value (0-7 for 3 bits) indicating how many levels to zoom out (0=tile exists, 1=zoom out one level, ...).
 * We don't need to store the uninitialized status because we don't need to expose a style as it's being generated. Just run a script that generates all the tiles (as in render or determine that it should be overzoomed) and then publish. Way saner, and no feet shot off.
 * The concept of marking tiles as dirty just borrows the worst parts of mod_tile/renderd user experience: oh, you want a dirty tile? Sure thing, let us spend a second on desperately trying to render it and then maybe serve the old tile anyway due to a timeout. A much smoother solution would be to keep a queue of tiles to be refreshed and serve old tiles meanwhile, maintaining a good latency regardless of what's going on internally.
 * Therefore, I feel like the only part that's needed is actually the overzoom status. Max Semenik (talk) 20:33, 17 June 2015 (UTC)

The no-extra-storage approach makes the system simpler, but requires progressive zoom-out one level at a time: each tile request could result in up to N requests to the storage system, each of which could be relatively expensive.
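A minimal sketch of that progressive zoom-out, assuming tiles are addressed as (zoom, x, y) and that the parent of a tile is found by halving x and y once per level. The `stored` set is a stand-in for whatever storage backend ends up being used; each membership test corresponds to one storage request.

```python
from typing import Optional, Set, Tuple

Tile = Tuple[int, int, int]  # (zoom, x, y)


def parent(tile: Tile, levels: int = 1) -> Tile:
    """Return the ancestor tile `levels` zoom levels above `tile`."""
    zoom, x, y = tile
    return (zoom - levels, x >> levels, y >> levels)


def resolve_overzoom(tile: Tile, stored: Set[Tile]) -> Optional[Tuple[Tile, int]]:
    """Walk up one zoom level at a time until a stored tile is found.

    Returns the stored ancestor and how many levels we zoomed out,
    or None if nothing is stored anywhere above the requested tile.
    """
    zoom = tile[0]
    for levels in range(zoom + 1):
        candidate = parent(tile, levels)
        if candidate in stored:  # one storage lookup per iteration
            return candidate, levels
    return None


# Hypothetical data: only one z14 tile is stored, and a z16 tile is requested.
stored_tiles = {(14, 8192, 5461)}
print(resolve_overzoom((16, 32770, 21845), stored_tiles))
```

In the worst case the loop issues one lookup per zoom level, which is the cost this paragraph is concerned about.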

Storing over-zoom bits in Redis, which supports bit operations and is considerably faster, would reduce the number of storage requests to just one.
 * I would still like to investigate storing overzoom information directly in whatever storage system we will come up with. Max Semenik (talk) 20:33, 17 June 2015 (UTC)
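The 3-bits-per-tile idea can be sketched as a packed table. This is an illustration only: a Python `bytearray` stands in for the Redis value, the class name is invented, and the row-major position index (y * 2^zoom + x) is an assumption that the text does not specify.

```python
class OverzoomTable:
    """Packed table holding a 3-bit 'levels to zoom out' value per tile."""

    BITS = 3  # values 0..7; 0 means the tile itself is stored

    def __init__(self, zoom: int) -> None:
        self.side = 1 << zoom
        total_bits = self.side * self.side * self.BITS
        self.data = bytearray((total_bits + 7) // 8)

    def _first_bit(self, x: int, y: int) -> int:
        # Assumed layout: row-major order within one zoom level.
        return (y * self.side + x) * self.BITS

    def set(self, x: int, y: int, levels: int) -> None:
        if not 0 <= levels < (1 << self.BITS):
            raise ValueError("levels must fit in 3 bits")
        start = self._first_bit(x, y)
        for i in range(self.BITS):
            byte, bit = divmod(start + i, 8)
            if (levels >> i) & 1:
                self.data[byte] |= 1 << bit
            else:
                self.data[byte] &= ~(1 << bit) & 0xFF

    def get(self, x: int, y: int) -> int:
        start = self._first_bit(x, y)
        value = 0
        for i in range(self.BITS):
            byte, bit = divmod(start + i, 8)
            value |= ((self.data[byte] >> bit) & 1) << i
        return value


table = OverzoomTable(zoom=4)
table.set(3, 5, 2)          # tile (3, 5) should be served from two levels up
print(table.get(3, 5), len(table.data))
```

With this layout a single read answers "how far to zoom out", at a cost of 3 * 4^zoom bits per zoom level (384 MiB for zoom 15).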

Dirty tiles
Some tiles could get marked as dirty during the OSM import, or as part of the data layer SQL adjustments. There should be a background job going through all dirty tiles and regenerating them. Additionally, the server could regenerate dirty tiles on the fly if a user requests them. This is especially relevant to OSM itself, where the result of editing should be immediately visible, but could benefit WMF as we increase the OSM pull frequency.

Also, OSM servers might benefit from the mod_tile approach, where the server attempts to update a dirty tile on the fly, and only returns the dirty tile if the result is not received within a certain timeout. The server could use Redis flags (bits) to indicate if the tile is dirty, and request tile regeneration ahead of the job queue. WMF servers would probably avoid this, as WMF prizes speed over result freshness.
 * Per my comments above, this is just horrible. Max Semenik (talk) 20:33, 17 June 2015 (UTC)

Tile ID
Each tile is identified by four parameters: style version, zoom level, and x-y coordinates. For storage, one 32-bit value should be enough to encode zoom levels 0 to 14 and 8 different styles. If we reduce the number of available styles to just 2, we could encode level 15 as well (for more styles, we could create a separate storage instance).
 zoom  zoomId  styleId 0-7  X-coordinate       Y-coordinate       extra bits
 14:   1       000          0000 0000 0000 00  0000 0000 0000 00
 13:   01      000          0000 0000 0000 0   0000 0000 0000 0   0
 12:   001     000          0000 0000 0000     0000 0000 0000     00
 11:   0001    000          0000 0000 000      0000 0000 000      000
 10:   00001   000          0000 0000 00       0000 0000 00       0000
 ...


 * What is this for? Sounds very rigid. Why put style into IDs at all? Not like one style will need to fall back on another style or anything like that. A much easier and saner way would be to completely separate styles from each other by means of Cassandra keyspaces/redis key names, etc. Max Semenik (talk) 19:45, 17 June 2015 (UTC)
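For reference, the 32-bit layout described in this section can be sketched as follows. The function names are made up, and reading the zoomId column as a unary prefix (a run of zeros terminated by a single 1 bit) is an interpretation of the table rather than something the text states outright.

```python
MAX_ZOOM = 14
STYLE_BITS = 3  # 8 styles


def encode_tile_id(style: int, zoom: int, x: int, y: int) -> int:
    """Pack (style, zoom, x, y) into one 32-bit integer."""
    if not 0 <= zoom <= MAX_ZOOM:
        raise ValueError("zoom out of range")
    if not 0 <= style < (1 << STYLE_BITS):
        raise ValueError("style out of range")
    if not (0 <= x < (1 << zoom) and 0 <= y < (1 << zoom)):
        raise ValueError("x/y out of range for this zoom")
    value = 1                                # marker bit ending the zoom prefix
    value = (value << STYLE_BITS) | style
    value = (value << zoom) | x
    value = (value << zoom) | y
    return value << (MAX_ZOOM - zoom)        # unused "extra bits" stay zero


def decode_tile_id(tile_id: int):
    """Inverse of encode_tile_id; returns (style, zoom, x, y)."""
    # The marker bit sits at position 17 + zoom, so its position gives the zoom.
    zoom = tile_id.bit_length() - 1 - STYLE_BITS - MAX_ZOOM
    value = tile_id >> (MAX_ZOOM - zoom)
    mask = (1 << zoom) - 1
    y = value & mask
    x = (value >> zoom) & mask
    style = (value >> (2 * zoom)) & ((1 << STYLE_BITS) - 1)
    return style, zoom, x, y


tile_id = encode_tile_id(style=5, zoom=10, x=513, y=1000)
print(f"{tile_id:032b}", decode_tile_id(tile_id))
```

Every id fits in 32 bits because the prefix grows by one bit for each zoom level that X and Y together shrink by two, with the difference padded by the extra bits.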

Store one tile per file
This is the simplest approach: each tile is stored as an individual file. We have been told of an existing implementation that stores tiles as files up to level 14 (max 358 million) by increasing the inode count. By default, an 800 GB partition has 51 million inodes. We are worried about the overhead here, especially since we will almost certainly need to store tiles for more than one data source later.

Store multiple tiles per file
There is an existing implementation called MBTiles that stores multiple vector tiles together in one SQLite file. From what we heard, even Mapbox itself sees this as a dead-end approach for server-side tile storage, even though it is used heavily by Mapbox Studio for storage and uploading. We might however use some ideas from it or even fork'n'rewrite it to our needs.

Store tiles in NoSQL
This approach might offer the most beneficial path forward. Depending on future performance studies, we could use Cassandra, Redis, or one of a large number of other NoSQL implementations.

External links: nosql review, Cassandra vs Redis.


 * I think that Redis' in-memory only storage means that we can't use it. If you have doubts about SSD size, you should have waaaay more doubts about RAM for Redis. "Let's buy a TB per machine in case we need it"? Max Semenik (talk) 21:29, 17 June 2015 (UTC)

Optimized tile iteration
When generating tiles, it is often good to iterate in such a way as to minimize reloading of the lower-zoom tiles. Iterating an entire row of tiles before moving to the next row is therefore not very efficient. A much more efficient method is to iterate over the 4 tiles of zoom Z that all belong to 1 tile of zoom Z-1. Afterwards, the iteration should exhaust the next set of 4 tiles, such that they reuse the same Z-2 tile as the set before. The resulting pattern is (0,0), (0,1), (1,0), (1,1), in (row, column) order, on each zoom level:
 00 01 04 05 16 17 20 21
 02 03 06 07 18 19 22 23
 08 09 12 13 24 25 28 29
 10 11 14 15 26 27 30 31
 32 33 36 37 48 49 52 53
 34 35 38 39 50 51 54 55
 40 41 44 45 56 57 60 61
 42 43 46 47 58 59 62 63
One way to implement this algorithm is to take a 64-bit integer and iterate it from 0 to 4^Z - 1. For each value, extract the X value by combining every odd bit and the Y value from the even bits, counting bits from the right starting at 1. So if we look at the above iteration sequence, the value 35 would be:
 decimal 35 => binary 0010 0011
 X is every odd bit:   0 0 0 1 => 1 (decimal)
 Y is every even bit:  0 1 0 1 => 5 (decimal)
 Result: (1,5) tile coordinates


 * Not clear why this is 8x8 instead of 2x2? Max Semenik (talk) 20:35, 17 June 2015 (UTC)
 * The algorithm is the same for any square - I used 8x8 to show how iteration will happen with a bigger block - 2x2 might not have been obvious. --Yurik (talk) 16:16, 18 June 2015 (UTC)
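The bit extraction described in this section can be sketched as below. The function names are invented for the example; X is taken from bit positions 0, 2, 4, ... and Y from positions 1, 3, 5, ... (counting from the least significant bit), which matches the worked example for the value 35.

```python
from typing import Iterator, Tuple


def deinterleave(index: int, zoom: int) -> Tuple[int, int]:
    """Split an iteration index into (x, y) tile coordinates."""
    x = y = 0
    for bit in range(zoom):
        x |= ((index >> (2 * bit)) & 1) << bit        # 1st, 3rd, 5th... bit
        y |= ((index >> (2 * bit + 1)) & 1) << bit    # 2nd, 4th, 6th... bit
    return x, y


def iterate_tiles(zoom: int) -> Iterator[Tuple[int, int]]:
    """Yield all tiles of one zoom level, finishing each 2x2 block first."""
    for index in range(4 ** zoom):
        yield deinterleave(index, zoom)


# Rebuild the 8x8 grid shown above: grid[y][x] holds the iteration index.
grid = [[0] * 8 for _ in range(8)]
for i, (x, y) in enumerate(iterate_tiles(3)):
    grid[y][x] = i
for row in grid:
    print(" ".join(f"{value:02d}" for value in row))
```

Because consecutive indexes share their high bits, any run of 4^k consecutive tiles aligned to a multiple of 4^k falls under the same ancestor k levels up, which is what keeps lower-zoom reloads to a minimum.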

Tile storage statistics
This snippet generates a two-column list, where the second column is the "group size": how many identical tiles were found. E.g. 1 means that this tile is unique, 2 means there exists exactly one other tile with the same content, etc. The first column is how many groups of that size were found. E.g. for groupsize=1 this is the number of unique tiles, and for groupsize=2 it is the number of pairs found. The total number of stored tiles is the sum, over all rows, of the two columns multiplied together.
 $ cd /srv/kartotherian/vectors
 $ find 5 -type f -name '*' -exec md5sum {} \; | cut -f 1 -d ' ' | sort | uniq -c | cut -c1-8 | sort | uniq -c
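The same statistic can be computed without the shell pipeline. The sketch below is a rough Python equivalent, not the script the team used; the function names are made up, and it reads each tile fully into memory, which is fine for small vector tiles.

```python
import hashlib
from collections import Counter
from pathlib import Path


def group_size_histogram(root: str) -> Counter:
    """Map group size -> number of groups of identical files under `root`."""
    copies_per_digest = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            copies_per_digest[digest] += 1
    # Count how many distinct contents occur once, twice, three times, ...
    return Counter(copies_per_digest.values())


def total_tiles(histogram: Counter) -> int:
    """Sum of (group size * number of groups) over all rows."""
    return sum(size * groups for size, groups in histogram.items())


if __name__ == "__main__":
    # Same directory as in the shell example: zoom level 5 of the vector tiles.
    histogram = group_size_histogram("/srv/kartotherian/vectors/5")
    for size, groups in sorted(histogram.items()):
        print(f"{groups:8d} {size}")
    print("total tiles:", total_tiles(histogram))
```

The number of tiles a de-duplicating store would actually keep is the sum of the first column alone, i.e. `sum(histogram.values())`.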

Tile (re)generation
The task of tile generation might take considerable time, depending on the zoom level (number of tiles), SQL server speed, and the number of servers doing the rendering. In order to manage this process, we plan to use a priority queue with persistent storage to mitigate crashes.
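To make the queue semantics concrete, here is a small in-memory sketch. It is only an illustration of prioritized, de-duplicated tile jobs: the class name is invented, a lower number means higher priority by assumption, and it has none of the persistence that the real queue would need.

```python
import heapq
import itertools
from typing import List, Set, Tuple

Tile = Tuple[int, int, int]  # (zoom, x, y)


class TileQueue:
    """In-memory priority queue of tiles waiting to be (re)generated."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Tile]] = []
        self._pending: Set[Tile] = set()
        self._order = itertools.count()  # keeps FIFO order among equal priorities

    def push(self, tile: Tile, priority: int) -> bool:
        """Queue a tile; returns False if it is already waiting."""
        if tile in self._pending:
            return False
        self._pending.add(tile)
        heapq.heappush(self._heap, (priority, next(self._order), tile))
        return True

    def pop(self) -> Tile:
        """Remove and return the most urgent tile."""
        _, _, tile = heapq.heappop(self._heap)
        self._pending.discard(tile)
        return tile

    def __len__(self) -> int:
        return len(self._heap)


queue = TileQueue()
queue.push((14, 8192, 5461), priority=5)   # background regeneration
queue.push((10, 512, 340), priority=1)     # more urgent job
queue.push((14, 8192, 5461), priority=5)   # duplicate request, ignored
print(queue.pop(), len(queue))
```

A crash loses everything in this version, which is the gap the persistent storage mentioned above is meant to close.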

Priority queue implementation
There are several libraries available for Node.js, all with Redis storage:
 * Kue (npm)
 * Bull (npm)
 * Barbeque (npm)