Wikimedia Maps/Reports/OSM replication disk usage 2020

Starting from December 24, 2019, disk usage on the Maps servers began to increase, especially on PostgreSQL. Further investigation found a correlation between OSM replication and the disk space increase. The issue was identified on Jan 24, 2020. At that moment, replication was configured to consume hourly diffs, executed every 2 hours.

Solution
The most straightforward solution found by the team was to re-import OSM and re-enable OSM replication at a more spaced-out interval, as part of T137939. The new replication interval is 12 hours; this initial work was tracked at T249086. To make sure the disk space issue would not happen again, we performed a set of experiments in the codfw cluster.
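Assuming the replication job is driven by a periodic scheduler (in production this is managed by Puppet; the entry below is an illustrative sketch, not the actual configuration), a 12-hour schedule could look like:

```shell
# Illustrative crontab entry: apply OSM diffs every 12 hours.
# The user, script name, and paths are hypothetical placeholders.
0 */12 * * * osmupdater /usr/local/bin/replicate-osm >> /var/log/osm-replicate.log 2>&1
```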

Experiment timeline
 * Run the osm-initial-import script
 * After the re-import operation, DB size was equal to 793GB (measured at 04/06/2020 13:04)
 * Monitor disk-space usage using the following commands:

   sudo -u postgres bash
   du -sh /srv/postgresql/9.6/main/base/*
   psql -d gis -t -c "
     SELECT table_name, index_bytes, toast_bytes,
            total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
     FROM (
       SELECT relname AS table_name,
              pg_total_relation_size(c.oid) AS total_bytes,
              pg_indexes_size(c.oid) AS index_bytes,
              pg_total_relation_size(reltoastrelid) AS toast_bytes
       FROM pg_class c
       LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
       WHERE relkind = 'r' AND nspname = 'public'
     ) a;"

 * After collecting enough data we verified that the DB was still growing at roughly 1GB/day
 * The DB size increased to 817GB in 11 days (measured at 04/17/2020 14:50)
 * We performed VACUUM on the PostgreSQL database to see how much of the extra disk space could be recovered
 * We then performed VACUUM FULL to see how much of the extra disk space could be recovered
 * After this step the database size dropped back to 793GB (measured at 04/28/2020 14:35), the same as the initial size after the data re-import operation.
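The periodic size measurements above can be automated with a small helper along these lines (the helper, paths, and log location are illustrative, not the script we actually ran), so that the growth rate can be plotted later:

```shell
#!/bin/sh
# Append a timestamped size sample for a directory to a TSV log,
# e.g. for the PostgreSQL data directory measured in the steps above.
log_dir_size() {
    dir="$1"
    logfile="$2"
    # du -sk reports the total size in 1K blocks; keep only the number
    size_kb=$(du -sk "$dir" | cut -f1)
    printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$size_kb" >> "$logfile"
}

# Example (path taken from the commands above; run as the postgres user):
# log_dir_size /srv/postgresql/9.6/main/base /var/log/osm-db-size.tsv
```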

Conclusion
We were able to better understand how the OSM DB behaves in our infrastructure and can apply better solutions in the future to prevent the problem from happening again. Our findings are:


 * 1) OSM replication causes an increase of disk usage in the PostgreSQL DB, mostly in indexes
 * 2) The solution of re-importing the OSM DB worked in the short term, but then the disk usage started increasing again
 * 3) The interval between replications seems to be decisive for the growth rate of disk usage
 * 4) VACUUM is not enough to clean the dangling data in the database; VACUUM FULL is more effective
 * 5) VACUUM FULL blocks the database for read/write operations, which forces us to depool the machines before running it
 * 6) REINDEX blocks the database for write operations only, requiring us to only disable OSM replication for a short period of time
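Findings 4-6 concern standard PostgreSQL maintenance commands and their locking behaviour. A sketch of how each would be invoked (the planet_osm_point table name matches the default osm2pgsql schema and is used here only as an example; adjust for the actual tables in the gis database):

```sql
-- Reclaims dead tuples for reuse, but does not shrink the files on disk;
-- runs without blocking reads or writes.
VACUUM ANALYZE planet_osm_point;

-- Rewrites the table and returns space to the OS, but takes an
-- ACCESS EXCLUSIVE lock: blocks both reads and writes (depool first).
VACUUM FULL planet_osm_point;

-- Rebuilds bloated indexes; blocks writes to the table while generally
-- still allowing reads, so only replication needs to be paused.
REINDEX TABLE planet_osm_point;
```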