Many of the analytics services at WMF rely upon geocoding the IP addresses associated with web requests. To do this, we use the MaxMind service, which provides updated mappings between IP addresses and city-, country-, region-, and coordinate-level information. These mappings live in simple binary database files which can be downloaded from the MaxMind website with the appropriate credentials. Until recently, the details of this process were relatively unexamined, which often led to the use of a binary database of unknown origin that could have been out of date by several years. Following some unexpected changes in per-country page view counts, we decided to examine the possible effects of using an out-of-date binary database.
Overall Mismatches As a Function of Months Out of Date
For this analysis, we took a sampled squid log file from February 1, 2013 (/a/squid/archive/sampled/sampled-1000.log-20130101.gz) and geocoded it repeatedly, each time with the GeoIP database that was the latest available at a given point in time: once for each month of the most recent 6 months, and then at 6-month intervals going back 4 years. On the suspicion that certain countries might be more prone to this sort of error, we also filtered the results by the country from which the most recent database considers the request to have originated.
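The comparison described above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function name `mismatch_rates`, the `Lookup` type, and the version labels are all hypothetical, and each database version is represented by a plain callable that maps an IP string to a country code, so that any GeoIP reader could be plugged in.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical type: a function mapping an IP string to a country code,
# or None when the database has no entry for that address.
Lookup = Callable[[str], Optional[str]]


def mismatch_rates(
    ips: Iterable[str],
    reference: Lookup,
    older: Dict[str, Lookup],
    country: Optional[str] = None,
) -> Dict[str, float]:
    """Fraction of requests whose country differs from the reference lookup.

    `reference` stands in for the most recent database, and `older` maps a
    label (e.g. "6 months") to a lookup backed by an older database.
    If `country` is given, only requests that the reference database
    places in that country are considered.
    """
    pairs = [(ip, reference(ip)) for ip in ips]
    if country is not None:
        pairs = [(ip, ref) for ip, ref in pairs if ref == country]

    rates = {}
    for label, lookup in older.items():
        mismatches = sum(1 for ip, ref in pairs if lookup(ip) != ref)
        # Avoid dividing by zero when the filter leaves no requests.
        rates[label] = mismatches / len(pairs) if pairs else 0.0
    return rates
```

For example, with two toy dictionaries standing in for database versions, `mismatch_rates(new.keys(), new.get, {"6 months": old.get})` returns the share of addresses that the older version places in a different country, or cannot place at all.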
High Traffic Countries
A natural question to ask is whether this has a significant effect on high traffic countries. We counted the number of page requests which each version of the MaxMind database tagged as originating from each country, and then selected the countries with the most page requests according to the most recent database.
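A minimal sketch of that counting step, assuming the requests have already been geocoded once per database version. The function name `top_countries` and the version labels are illustrative only and do not come from the original analysis.

```python
from collections import Counter
from typing import Dict, List


def top_countries(
    labels_by_version: Dict[str, List[str]],
    latest: str,
    n: int = 10,
) -> Dict[str, Dict[str, int]]:
    """Per-version request counts for the top-n countries of `latest`.

    `labels_by_version` maps a database version label to the list of
    country codes that version assigned to the sampled requests.
    """
    counts = {v: Counter(labels) for v, labels in labels_by_version.items()}
    # Rank countries using only the most recent database version.
    top = [country for country, _ in counts[latest].most_common(n)]
    # Counter returns 0 for countries a given version never assigned.
    return {c: {v: counts[v][c] for v in labels_by_version} for c in top}
```

The result is keyed by country, so each entry shows how the request count for one high traffic country shifts from one database version to the next.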
Month over Month Change in Geocoded Requests
We also wanted to get a sense of how dramatic the changes in geocoded requests from a particular country might be. To do this in a way which was agnostic to the total number of requests coming from a country, we computed the month over month percentage change and selected the top 10 countries according to their maximum month over month percentage change. The graph below is rather concerning because it suggests that, for many countries, the reported traffic will vary wildly depending on which database version was used. Even a relatively large country like Canada shows up with a 100% month over month change.
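The ranking described above can be sketched as follows. The helper names `pct_changes` and `most_volatile` are hypothetical, and two details are assumptions not stated in the text: the maximum is taken over the absolute value of the change, and steps that start from a count of zero are skipped because the percentage change is undefined there.

```python
from typing import Dict, List, Tuple


def pct_changes(series: List[int]) -> List[float]:
    """Percentage change between consecutive counts in `series`."""
    changes = []
    for prev, cur in zip(series, series[1:]):
        if prev == 0:
            continue  # assumption: skip steps where the change is undefined
        changes.append(100.0 * (cur - prev) / prev)
    return changes


def most_volatile(
    counts_by_country: Dict[str, List[int]],
    n: int = 10,
) -> List[Tuple[str, float]]:
    """Top-n countries by largest absolute step-to-step percentage change.

    Each list holds one country's request counts, ordered by database
    version from oldest to newest.
    """
    peak = {
        country: max((abs(c) for c in pct_changes(series)), default=0.0)
        for country, series in counts_by_country.items()
    }
    return sorted(peak.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Because the measure is relative, a country whose count doubles from 100 to 200 requests ranks above one that moves from 1000 to 1010, which is the intended behavior for a comparison that ignores total traffic.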