Wikimedia Maps/2015-2017/Tile server implementation



At this point, this is an incomplete list of the ideas and technologies the Maps team is considering.

Introduction
Our current prototype implementation imports data from OpenStreetMap into a PostgreSQL database with the PostGIS 2.1+ spatial extensions. From there, vector tiles will be created using open source components from Mapbox and the Mapnik 3 engine, and stored in local storage (TBD). Upon request, a vector tile will be converted to a PNG image on the fly and served by the Kartotherian service behind a Varnish cache.

= Storage =
The table below summarizes our tile storage needs for one style. For the first iteration, we plan to store up to level 14 (total ~360 million tiles). There are two ways to significantly reduce this number - de-duplication and over-zooming.
 * Over-zooming is when we know that a tile does not contain more information than the corresponding piece of a tile in the lower zoom level, so instead of storing the tile, we simply extract the relevant part of the lower-zoom tile when needed.
 * De-duplication is when all tiles with identical content are stored only once. Zoom level 9 contains 70% duplicates, and that share is likely to grow at higher zooms. We have not yet found any off-the-shelf storage that can de-duplicate data. Even if one were available, storing only level 15 (2^30 tiles) would theoretically need a 32-bit number per tile for identification (tile position -> unique tile ID), and thus 4 GB for the index alone (4 * 2^30 bytes). Assuming 99% duplicates and 1.5 KB average per tile, we would need a further 16+ GB (2^30 / 100 * 1024 * 1.5 bytes) to store the actual tiles. Once we add the per-item storage overhead, the numbers will be significantly higher.
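The figures above can be reproduced with a few lines of arithmetic. The following is only a sketch: the function names are made up for illustration, and the 99% duplicate share and 1.5 KB average tile size are the assumptions stated in the text, not measurements.

```python
def tiles_at_zoom(zoom: int) -> int:
    """A zoom level is a 2**zoom by 2**zoom grid, i.e. 4**zoom tiles."""
    return 4 ** zoom


def tiles_up_to(max_zoom: int) -> int:
    """Total number of tiles for zoom levels 0..max_zoom inclusive."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))


def dedup_estimate(zoom, unique_per_hundred, avg_tile_bytes, id_bytes=4):
    """Return (index_bytes, blob_bytes) for one de-duplicated zoom level.

    index_bytes: one fixed-size tile ID per tile position.
    blob_bytes:  storage for the unique tile contents only.
    """
    n = tiles_at_zoom(zoom)
    index_bytes = n * id_bytes
    blob_bytes = n * unique_per_hundred / 100 * avg_tile_bytes
    return index_bytes, blob_bytes


total_tiles = tiles_up_to(14)                      # 357913941, i.e. ~360 million
index_bytes, blob_bytes = dedup_estimate(15, 1, 1.5 * 1024)
print(total_tiles)
print(index_bytes / 2**30, "GiB for the position -> ID index")
print(round(blob_bytes / 1e9, 1), "GB for unique tile contents")
```

Neither number includes per-item overhead of the storage engine, which is the part the text expects to dominate.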

Tile ID
Each tile is identified by four parameters: style version, zoom level, and x-y coordinates. For storage, one 32 bit value should be enough to encode 14 zoom levels and 8 different styles. If we reduce the number of available styles to just 2, we could encode level 15 as well (for more styles, we could create a separate storage instance).

 zoom  zoomId  styleId 0-7  X-coordinate       Y-coordinate       extra bits
 14:   1       000          0000 0000 0000 00  0000 0000 0000 00
 13:   01      000          0000 0000 0000 0   0000 0000 0000 0   0
 12:   001     000          0000 0000 0000     0000 0000 0000     00
 11:   0001    000          0000 0000 000      0000 0000 000      000
 10:   00001   000          0000 0000 00       0000 0000 00       0000
 ...
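One way to read the bit layout above is as a variable-length zoom prefix (zeros followed by a single 1), then three style bits, then X and Y with one bit per zoom level each, then zero padding. The sketch below packs and unpacks IDs under that reading; the function names and the choice to leave the extra bits at zero are assumptions for illustration.

```python
MAX_ZOOM = 14      # highest zoom level that fits with 3 style bits
STYLE_BITS = 3     # 8 styles
ID_BITS = 32


def encode_tile_id(zoom: int, style: int, x: int, y: int) -> int:
    """Pack (zoom, style, x, y) into one 32-bit integer."""
    if not 0 <= zoom <= MAX_ZOOM:
        raise ValueError("zoom out of range")
    if not 0 <= style < (1 << STYLE_BITS):
        raise ValueError("style out of range")
    if not (0 <= x < (1 << zoom) and 0 <= y < (1 << zoom)):
        raise ValueError("x/y out of range for this zoom")
    pad = MAX_ZOOM - zoom                  # leading zeros == trailing extra bits
    value = 1                              # the 1 that terminates the zoom prefix
    value = (value << STYLE_BITS) | style
    value = (value << zoom) | x
    value = (value << zoom) | y
    return value << pad                    # extra bits are left as zero


def decode_tile_id(tile_id: int):
    """Unpack a 32-bit tile ID back into (zoom, style, x, y)."""
    if not 0 < tile_id < (1 << ID_BITS):
        raise ValueError("not a valid 32-bit tile ID")
    pad = ID_BITS - tile_id.bit_length()   # number of leading zero bits
    zoom = MAX_ZOOM - pad
    if zoom < 0:
        raise ValueError("zoom prefix too long")
    value = tile_id >> pad                 # drop the extra bits
    mask = (1 << zoom) - 1
    y = value & mask
    x = (value >> zoom) & mask
    style = (value >> (2 * zoom)) & ((1 << STYLE_BITS) - 1)
    return zoom, style, x, y


tile_id = encode_tile_id(10, 5, 1023, 512)
print(f"{tile_id:032b}", decode_tile_id(tile_id))
```

Because the prefix length alone determines the zoom level, no separate zoom field is needed, which is what frees enough bits to fit level 14 into 32 bits.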

Tile states
In theory, each tile can be in one of four states: not yet initialized, generated, dirty, or not needed because of over-zoom. A dirty tile can still be served as-is if regenerating it would take longer than a certain threshold.
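As a rough illustration, the four states and the serving decision could be modelled as below. This is a sketch under assumptions: the action strings, the `plan_response` name, and the reading of the threshold as "estimated regeneration time versus the longest wait we accept" are not taken from any existing implementation.

```python
from enum import Enum


class TileState(Enum):
    UNINITIALIZED = "not yet initialized"
    GENERATED = "generated"
    DIRTY = "dirty"
    OVERZOOM = "not needed because of over-zoom"


def plan_response(state, est_regen_seconds, max_wait_seconds):
    """Decide what to do with a request for a tile in the given state."""
    if state is TileState.GENERATED:
        return "serve stored tile"
    if state is TileState.OVERZOOM:
        return "extract from lower-zoom tile"
    if state is TileState.UNINITIALIZED:
        return "generate, then serve"
    # Remaining case: DIRTY. Serve the stale copy when regeneration is too slow.
    if est_regen_seconds > max_wait_seconds:
        return "serve stale tile, queue regeneration"
    return "regenerate, then serve"


print(plan_response(TileState.DIRTY, est_regen_seconds=3.0, max_wait_seconds=0.5))
```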

One Tile per File
This is the simplest approach - each tile is stored as an individual file. We have been told of an existing implementation that stores tiles as files up to level 14 (max 358 million) by increasing the inode count. By default, an 800 GB partition has 51 million inodes. We are worried about the overhead here.
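To make the inode concern concrete, the sketch below counts the inodes a full pyramid would need and compares that with the default quoted above. The `zoom/x/y` directory layout, the root path, the style name, and the file extension are hypothetical choices for illustration only.

```python
from pathlib import PurePosixPath


def tile_path(root, style, zoom, x, y, ext="pbf"):
    """Hypothetical on-disk layout: <root>/<style>/<zoom>/<x>/<y>.<ext>"""
    return PurePosixPath(root) / style / str(zoom) / str(x) / f"{y}.{ext}"


def inodes_needed(max_zoom):
    """Inodes for one style: one per tile file plus one per directory."""
    files = sum(4 ** z for z in range(max_zoom + 1))
    x_dirs = sum(2 ** z for z in range(max_zoom + 1))   # one directory per x column
    zoom_dirs = max_zoom + 1
    return files + x_dirs + zoom_dirs


DEFAULT_INODES = 51_000_000       # default for an 800 GB partition, per the text

needed = inodes_needed(14)
print(tile_path("/srv/tiles", "osm", 14, 8192, 5461))
print(needed, "inodes,", round(needed / DEFAULT_INODES, 1), "times the default")
```

Under these assumptions a single style needs roughly seven times the default inode count, before any over-zoom or de-duplication savings.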

Batch of Tiles per File
There is an existing implementation called MBTiles that stores multiple vector tiles together in one SQLite file. From what we have heard, even Mapbox itself sees this as a dead-end approach for server-side tile storage, even though Mapbox Studio uses it heavily for storage and uploading.
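For reference, an MBTiles file is an ordinary SQLite database with a `metadata` table and a `tiles` table, where `tile_row` uses the TMS convention (row 0 at the bottom). The sketch below creates such a database in memory with Python's standard `sqlite3` module; the helper names are invented for illustration, and the tile payload is a placeholder rather than a real vector tile.

```python
import sqlite3


def open_mbtiles(path=":memory:"):
    """Create (or open) a SQLite database with the MBTiles tables."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS metadata (name TEXT, value TEXT)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        "zoom_level INTEGER, tile_column INTEGER, tile_row INTEGER, tile_data BLOB)"
    )
    db.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS tile_index "
        "ON tiles (zoom_level, tile_column, tile_row)"
    )
    return db


def tms_row(zoom, y):
    """Convert an XYZ (top-origin) row to the TMS (bottom-origin) row."""
    return (1 << zoom) - 1 - y


def put_tile(db, zoom, x, y, data):
    db.execute(
        "INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
        (zoom, x, tms_row(zoom, y), data),
    )


def get_tile(db, zoom, x, y):
    row = db.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, x, tms_row(zoom, y)),
    ).fetchone()
    return row[0] if row else None


db = open_mbtiles()
put_tile(db, 2, 1, 0, b"placeholder tile bytes")
print(get_tile(db, 2, 1, 0), get_tile(db, 2, 3, 3))
```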

NoSQL Storage
This approach might offer the most beneficial path forward. Depending on future performance studies, we could use Cassandra, Redis, or one of a large number of other NoSQL implementations.
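Since no off-the-shelf store was found that de-duplicates by itself, one option is to layer de-duplication on top of any key-value store by addressing tile contents by their hash. The sketch below uses two in-memory dictionaries as a stand-in for the real store; the class name, the SHA-256 choice, and the key format are assumptions, and nothing here is specific to Cassandra or Redis.

```python
import hashlib


class DedupTileStore:
    """Content-addressed tile storage over two key-value maps."""

    def __init__(self):
        self.index = {}   # tile key -> content digest
        self.blobs = {}   # content digest -> tile bytes, stored once

    def put(self, key, data: bytes):
        digest = hashlib.sha256(data).digest()
        self.index[key] = digest
        self.blobs.setdefault(digest, data)   # identical content is not stored again

    def get(self, key):
        digest = self.index.get(key)
        return None if digest is None else self.blobs[digest]


store = DedupTileStore()
ocean = b"empty ocean tile"
store.put((9, 10, 200), ocean)
store.put((9, 11, 200), ocean)          # duplicate content, new position
store.put((9, 300, 170), b"coastline")
print(len(store.index), "positions,", len(store.blobs), "unique blobs")
```

A real deployment would also need to reclaim blobs that no position references any more; this sketch does not attempt that.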

External links: nosql review, Cassandra vs Redis.

Tile Invalidation Approaches
We have heard of implementations where a missing file indicates that over-zoom should be used, while initial tile generation and dirty tiles are handled separately by job queues.
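The "missing tile means over-zoom" convention can be sketched as a walk up the tile pyramid: halve the coordinates until a stored tile is found, then work out which part of that tile covers the request. The set of stored keys stands in for the real storage, and both function names are hypothetical.

```python
def find_source_tile(stored, zoom, x, y):
    """Return the nearest stored ancestor (or the tile itself), else None."""
    z, tx, ty = zoom, x, y
    while z >= 0:
        if (z, tx, ty) in stored:
            return z, tx, ty
        # The parent tile at zoom z-1 covers a 2x2 block of tiles at zoom z.
        z, tx, ty = z - 1, tx >> 1, ty >> 1
    return None


def sub_extent(zoom, x, y, src_zoom, src_x, src_y):
    """Part of the source tile covering the request, as (left, top, size) in 0..1."""
    scale = 1 << (zoom - src_zoom)
    left = (x - src_x * scale) / scale
    top = (y - src_y * scale) / scale
    return left, top, 1 / scale


stored = {(9, 100, 200)}                       # only the zoom-9 tile exists
source = find_source_tile(stored, 11, 401, 803)
print(source, sub_extent(11, 401, 803, *source))
```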