Technical decision making/Decision records/T275063

WHAT?
What is the problem or opportunity?
 * Maps infrastructure is in a critical state where the engineers responsible for it struggle to keep up with maintenance. The challenge exists in multiple domains (resource availability versus service deterioration speed) and is reflected in multiple production incidents:
 * 1) Confusing ownership and level of support
 * 2) Limited expertise about the current maps technology, which we are trying to mitigate with proper documentation
 * 3) Maps still use components that differ from common MediaWiki infrastructure, making it harder to find the resources to execute maintenance work. We have the opportunity to change that by re-architecting and evolving the maps platform.
 * There is also an opportunity to give back to Wikimedia users a reliable and consistent experience contributing to and learning about geo-information.
 * 1) List of Maps incidents
 * 2) Maps is not a first-class service due to its confusing ownership and level of support
 * 3) Report on Maps Technical Debt by Product Infrastructure Team
 * 4) Empower maps support by providing better documentation
 * 5) Maps Infrastructure diagrams

What does the future look like if this is achieved?

 * Maps technology gets up to speed with industry standards and becomes a first-class service in the WMF infrastructure: more maintainable with clear separation of concerns regarding ownership of the infrastructure between SRE and the Product Infrastructure team (Product). PI will own the application side of it with SRE providing normal operations support (unlike the current maps stack).

What happens if we do nothing?

 * Maps quality degradation is evident: know-how is not spread through the Foundation
 * Maintenance is hard and just a few people have the knowledge to do it (Lottery factor), e.g., during outages just a few people might be able to fix it.
 * Technology is outdated, and technical debt keeps accumulating and is impacting both the user experience and maintenance
 * Service is not paged by SREs anymore due to the constant instability it faces
 * The lack of SLI/SLO is also a factor that was taken into consideration when maps was demoted to not-paged service
 * There isn’t a clear health indicator
 * Current resourcing can’t properly maintain the current state which affects the user perception when outages and issues happen
 * As of 2019, Maps were used on ~22K wikivoyage projects and ~10K wikipedia pages. Varnish received ~300 tile requests/s and Kartotherian received ~150 tile requests/s.
 * Currently, Maps are used on ~40k wikivoyage and ~40k wikipedia pages. From turnilo we can infer that Varnish received ~700 tile req/s and Kartotherian ~105 tile req/s. Note also that we no longer have 3rd party usage accessing our maps.
 * Overall, the request rate indicates a non-trivial usage of maps.
 * Any additional background or context to provide? Planning documentation. Main Phabricator task.
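
As a rough illustration of what the request rates above imply, the following sketch computes the share of tile requests that reach Kartotherian for each period. It relies on one assumption that the source does not state: that every Kartotherian tile request corresponds to a Varnish cache miss. Under that assumption the backend share drops from about 50% in 2019 to about 15% now, even though edge traffic has more than doubled.

```python
# Tile request rates (requests/second) quoted in the bullets above.
RATES = {
    "2019": {"varnish": 300, "kartotherian": 150},
    "now": {"varnish": 700, "kartotherian": 105},
}


def backend_share(varnish_rps, kartotherian_rps):
    """Fraction of edge (Varnish) tile requests that also reach Kartotherian."""
    return kartotherian_rps / varnish_rps


if __name__ == "__main__":
    for label, rates in RATES.items():
        share = backend_share(rates["varnish"], rates["kartotherian"])
        # ASSUMPTION (not from the source): each Kartotherian request is a
        # Varnish miss, so 1 - share approximates the cache hit ratio.
        print(f"{label}: backend share {share:.0%}, implied hit ratio {1 - share:.0%}")
```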

Why?: Why is approaching the problem/opportunity valuable? What is the most valuable thing? Does it align to the MTP/2030 Strategy or Annual Plan? Rank values in order of importance. Make it explicit who this benefits and where the value is. Objective it supports and How
 * Wikimedia users will have a reliable and consistent experience contributing to and learning about geo-information. By rearchitecting the stack to use standard OSS maps technologies, we open up the possibility of improving the Maps product experience down the line for both new and existing consumers of Wikimedia content. The Maps stack will be technically accessible to the OSS maps community and to the engineers who maintain the Maps infrastructure Foundation-wide. To achieve these goals, there are a few short-term objectives that we plan to address, as well as some longer-term objectives that we need to explore further in order to truly understand how to resolve the problems that we have today.
 * Short-term Objectives: Reduce map latency with OpenStreetMap (OSM)
 * Empower SRE to support maps related incidents and maintenance
 * Longer-term Objectives: Reduce SRE dependency, empower client-side autonomy
 * BW link
 * Hypothesis: We believe that modernizing the maps infrastructure will reduce complexity, enable monitoring capabilities, and better empower SRE to resolve issues quickly and intuitively.
 * OKR: Reduce maintenance burden and improve user uptime for the maps infrastructure. Effort on incidents is 2% or less of 5 FTE (i.e., no more than 10% of an FTE) per quarter by EOFY. Maps tiles are not lagging by more than 3 days 99% of the time by EOFY.
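
The two OKR targets above can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the idea of evaluating lag from a list of periodic samples (in days) are assumptions, not something the OKR specifies.

```python
FTE_POOL = 5           # FTEs referenced in the OKR
INCIDENT_SHARE = 0.02  # "2% or less of 5 FTE"
MAX_LAG_DAYS = 3       # tiles must not lag by more than this
LAG_TARGET = 0.99      # ...for this fraction of the time


def incident_budget_fte(pool=FTE_POOL, share=INCIDENT_SHARE):
    """Incident effort budget per quarter, expressed in FTEs (5 * 0.02 = 0.1)."""
    return pool * share


def lag_target_met(lag_days_samples, max_lag=MAX_LAG_DAYS, target=LAG_TARGET):
    """True if the fraction of samples within max_lag reaches the target.

    ASSUMPTION: lag is sampled at regular intervals, so the fraction of
    samples approximates the fraction of time.
    """
    if not lag_days_samples:
        raise ValueError("need at least one lag sample")
    within = sum(1 for lag in lag_days_samples if lag <= max_lag)
    return within / len(lag_days_samples) >= target
```

For example, `lag_target_met([1] * 199 + [5])` holds because only 0.5% of the samples exceed three days, while `lag_target_met([1] * 97 + [5] * 3)` does not.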

Responsible

 * Product Infrastructure
 * Development and planning
 * Platform Engineering Team (with a caveat)
 * Deployment
 * PS: resources on the SRE/PET side are tight and also depend on multiple teams' decisions on how to carry on the work needed to put the required infrastructure in place. The point of contact responsible for that role might also change at some point for the same reason.

Accountable
Engineering Manager for Maps Project

Consulted

 * SRE (ServiceOps): This project will change maps architecture and its infrastructure. We have been consulting multiple members of the SRE team in order to reach an architectural decision that is beneficial for everyone who will be involved in maintenance work. One of the main goals is to reduce the complexity of the work so SRE can be more confident when maintaining part of the infrastructure for Maps. This means that the decisions will impact future work for the SRE team. We expect that our plan is aligned with SRE expectations to get Maps back to a first-class service, and we have Q3/Q4 to do the work. We also expect that SRE will be able to finish reviewing our architectural proposals by the end of the quarter.
 * SRE (Data Persistence): Maps infrastructure is going to use Swift as the storage solution for vector tiles and this team might have input into how that storage engine is used and might be interested in monitoring usage and performance characteristics.
 * Performance: Given that this is meant to be a non-user-facing replacement of the maps stack in the backend, Performance Team may have input about acceptance criteria for production deployments.
 * Security: Since we are deploying new libraries in a production setting, this consultation may just be in the form of a customary risk assessment to make sure we haven’t overlooked anything.
 * Product Leadership: This project and work has proceeded thus far with the backing of management. But, it would be useful to get additional commitments for continued maintenance of the maps stack. This has been included based on feedback from the Technical Forum members.

Informed:

 * Architecture: The proposed change here is not a radical change from how maps stacks are typically stood up, and it brings the Maps stack in line with standard maps stack practices. So, while it might seem this team might need to be consulted, we feel it is sufficient that they get a heads up in case they want to weigh in on this.
 * WMDE: The work here is of interest for WMDE because it impacts some of their own objectives regarding maps
 * Product: Product department upper-level management
 * Legal
 * Product Feature Teams at Large (Editing, Language, Apps, etc.): While this part of the Maps 2.0 work is strictly about the backend infrastructure and leaves the front-end user experience untouched, there are changes to the components used in the front-end user experience which might improve any future Maps product offerings in which these feature teams might have a stake. Given that context, it would be useful to make sure all the product teams are aware of these infrastructure changes in case they want to weigh in about any of the choices.

Maps 2.0 Decision Record

 * What are your constraints? Many times we have implicit constraints, based on time, resources, performance, security and other aspects. This is a place to make them explicit and share with your team and stakeholders.


 * General Assumptions and Requirements: Use a new line for each assumption or requirement which you are using to constrain your proposed solutions.
 * Source: Maps Product is used heavily on the wikis and turning it off is not a realistic option.
 * Leaving the Maps stack in its broken state is not a realistic option either given the maintenance burdens associated with the “current” stack.
 * Maps Team: There cannot be any user-facing feature changes as part of any infrastructure maintenance or upgrade work.
 * Product Mgmt
 * Security Requirements: Nothing new beyond what is currently in place for the maps stack.
 * Privacy Requirements: Nothing new beyond what is currently in place for the maps stack.

Maps Team Important Question
 * Who can answer? Resolution, answer or action
 * What is the backend storage engine for vector tiles? SRE (ServiceOps), SRE (Data Persistence)
 * Swift (answered with input from SRE teams). As a point of comparison, Cassandra is being used with the current stack.
 * Should we continue pre-generating tiles or use a true cache where tiles are generated on demand and cached or some other intermediate strategy?
 * Maps Team: We are going to continue with the existing strategy for now to minimize design changes needed for rollout and will pick this up after the new implementation is live in production and stable.
 * Will it have impact in UI/UX and important features like i18n?
 * Maps Team: No, the work will not change the vector-tile schema or the ETL pipeline that loads OSM data. Therefore the data won’t change, only the architecture of its pipeline. The rasterization component remains the same.
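
The two tile-generation strategies weighed in the questions above (pre-generating tiles versus generating them on demand and caching the result) can be contrasted with a minimal sketch. Everything here is hypothetical: the `TileKey` shape, the `render` callable, and the in-memory dictionary standing in for the storage engine are illustrative assumptions and do not describe the actual Maps implementation.

```python
from typing import Callable, Dict, Iterable, Tuple

TileKey = Tuple[int, int, int]  # hypothetical (zoom, x, y) identifier
TileStore = Dict[TileKey, bytes]  # stand-in for the real storage engine


def pregenerate(keys: Iterable[TileKey],
                render: Callable[[TileKey], bytes],
                store: TileStore) -> None:
    """Pre-generation: render every tile up front; reads never render."""
    for key in keys:
        store[key] = render(key)


def get_or_render(key: TileKey,
                  render: Callable[[TileKey], bytes],
                  store: TileStore) -> bytes:
    """On-demand cache: render on the first request, then serve from the store."""
    tile = store.get(key)
    if tile is None:
        tile = render(key)
        store[key] = tile
    return tile
```

The trade-off in this toy model is where the rendering cost lands: pre-generation pays it for every tile before any request arrives, while the on-demand variant pays it only for tiles that are actually requested, at the price of a slower first response.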

Selected Option: Option 4
We think options 1 and 2 are not serious candidates and border on being strawman options. This is also reflected in the assumptions piece of this document. Turning off a widely used product (especially on a sister project) feels like a non-starter. And, given the state of the maps stack, doing nothing and letting it rot and get worse is also a non-starter for the same reason.
 * Rationale

So, only options 3 and 4 were potentially viable candidates. Option 3 was seriously pursued at one point over the last couple of years, and no feasible partnership plan emerged that we could rely on. We made attempts at partnership with other companies and they didn’t work out, which might be an indicator of the market status.

That left us with a modified version of option 1 (do the least amount of work that will keep the hobbled maps stack functional) and option 4. The Maps team seriously examined these two possibilities and, after a number of work spikes, came up with a fully fleshed-out plan that makes a strong case for upgrading the backend maps infrastructure rather than continuing to invest in the ongoing maintenance of a dead-end solution that we have relied on.

So, that leaves us with Option 4. Upgrading the maps stack to be more in line with good architectural practices and modern components opens up the possibility of enhancing the product offerings at a future time without painting ourselves into a corner because of large-scale technical debt on the backend that needs to be paid back first. Doing this work now lets us ramp up product work quickly in the future.
 * Data: Link to any data you used to support your decision
 * Informing: This work has been going on for a while. We have product management buy-in for this decision and we’ve also already made announcements more broadly.
 * Who: Engineering Manager for Maps Project
 * Date: Work on this project started in 2020 and in that timeframe, the Maps Team consulted with a number of stakeholders and arrived at a possible proposal (Maps 2.0). In 2021, we put that proposal through the TDMP to ensure we haven’t missed anything significant (even though the work from 2020 and other work in previous years on Options 1 and 3 had convinced us that this was most likely the only viable option available to us). So, while the decision hasn’t been finalized yet, realistically, given what we know about all the options and the feedback we’ve received so far via the TDMP, we have proceeded with the full implementation of Option #4. But, strictly speaking, the decision will only be considered finalized once we formally complete the consultations and reviews in this process.

Technical Forum Chair review
 * Are the options detailed enough to make an informed decision? Yes, they were researched and tried over the course of multiple years. Solution satisfies requirements. No major show-stopping feedback from the consulted parties.
 * Were all parties identified in the RACI consulted? Yes, all were consulted. Performance was moved to informed during the process of reviewing options.
 * Does this decision require C-level review? Why or why not? No. Uncontentious decision. Pathways were researched for multiple years. Solution does not make significant UX changes. Has a clear signoff from SRE (the team bearing the brunt of operational issues).

Option 1: Do Nothing
Consultations
 * Description: This was the default option where the Maps stack receives bare minimum maintenance to keep it operational. In reality, we consider this a strawman proposal because this option is what brought us to this current state where we are having to undertake a more significant effort to upgrade the stack and where SRE turned off paging on this service.
 * Benefits: The ostensible benefit is that it will require very little developer and other resources.
 * Risks: The libraries and components that make up the stack go unmaintained, lose support, and the effort needed to recover from this is worse / much higher than what is needed to take proactive steps to address the problems.
 * Effort: Depends on what “maintenance” and “operational” would mean. But, at the most conservative, probably no more than 1 developer.
 * Costs: Ongoing costs will probably be the costs associated with dealing with breakages from unmaintained and out-of-support / end-of-life components.
 * Testing: None that we are aware of beyond whatever is needed to ensure the service is operational.
 * Performance & Scaling: When the product needs more performance or scale, the answer for the current stack is always “buying more hardware”, because there is too much technical debt to pay down before performance and maintainability of the services can be addressed.
 * Deployment: Not applicable
 * Rollback and reversibility: Not applicable
 * Operations & Monitoring: Nothing beyond what is already in place. If necessary, additional monitoring regarding usage, performance, uptime. But, SRE had turned off paging for this status quo option.
 * Additional References: Wikitech project page https://wikitech.wikimedia.org/wiki/Maps
 * Product Management: Not a realistic option given complaints from SRE. The amount of pain will only increase over time. Doing nothing is not a realistic option. Note that the current situation is already doing more than nothing and is not sufficient to keep up.
 * SRE (ServiceOps): SRE had turned off paging alerts on the old service and they were unhappy with the situation where this service had reached this stage. So, doing nothing isn’t a realistic option.
 * SRE (Data Persistence): Not consulted since their input here is not relevant for this option.
 * Performance: Not consulted since their input here is not relevant for this option.
 * Security: Not consulted since their input here is not relevant for this option.

Option 2: Deprecate and remove usage on the wikis and decommission all maps services in production
Consultations
 * Product Management: Product Management does not want to turn off maps. While this is a strawdog proposal, I think it is a better option than option 1. While there is a vocal community around maps, and it is extremely important to WikiVoyage, most wikis can probably do what they need without Maps.
 * Description: What it says on the tin. In reality, we consider this also a strawman proposal not worthy of serious consideration. However, this option needs to be addressed since this is not as unreasonable as it sounds and has been offered as a possibility in the first round of TDMP feedback.
 * Benefits: We eliminate the costs and burdens of maintaining and supporting a technology stack that has been a source of headaches for the Product & Technology departments. We are then able to divert those resources to other work streams.
 * Risks: The Foundation has received criticism over the years for not paying sufficient attention to (a) what the editor community actually wants and (b) projects that are not wikipedias. Maps are used more heavily on the wikivoyage projects and maps feature requests have shown up in Community Tech Wishlists over the years. So, the risk of turning off maps is not just running counter to these currents, but actively breaking useful functionality on wikivoyage, which will likely lead to a lot of serious heartburn.
 * Effort: Probably a not insignificant amount of time, since it will probably need discussions at the product and community relations levels to identify strategies for doing this with minimal backlash. Then there is the effort involved in feature deprecation and eventual decommissioning.
 * Costs: Nothing beyond the costs involved in supporting the effort outlined above
 * Testing: Probably float some minimal proposals on some wikis to gauge community reaction?
 * Performance & Scaling: Not applicable
 * Deployment: Not applicable
 * Rollback and reversibility: I suppose the technical decisions are reversible. But, the community backlash and loss of social capital are not reversible. Even the technical decisions aren’t reversible after a point.
 * Operations & Monitoring: Surveys / other ways of monitoring reception.
 * Additional References: None at this time since we don’t consider this a real serious option.
 * SRE (ServiceOps): Not consulted since their input here is not relevant for this option - this is primarily a product management decision.
 * SRE (Data Persistence): Not consulted since their input here is not relevant for this option - this is primarily a product management decision.
 * Performance: Not consulted since their input here is not relevant for this option - this is primarily a product management decision.
 * Security: Not consulted since their input here is not relevant for this option - this is primarily a product management decision.

Option 3: Switch the backend maps support to an industry partner who has better experience / technical infrastructure related to the maps experience
Consultations
 * Description: Rather than WMF maintaining the maps infrastructure in-house, the plan here is to identify an external entity to partner with that has robust maps infrastructure that we can leverage (by adding suitable internal components to connect it with wikis).
 * Benefits: Reduces the in-house technical expertise needed for maps. Simplifies the technical stack and reduces maintenance costs. Automatically benefits from technical upgrades and improvements without needing dedicated investments.
 * Risks: Dependency on an external partner for key aspects of our product experience. Changes in the availability, terms of use, licensing, or other aspects of the user experience can negatively impact user experience on wikis. Potential privacy concerns related to accessing third-party services for on-wiki experiences. Unclear if this will have larger Wikimedia community buy-in. Not owning the data sources can make things complicated for controversial topics like disputed borders between countries for example.
 * Effort: One-time upfront work researching 3rd party services. One-time upfront work getting organization and community buy-in. One-time upfront partnership work related to negotiating and setting up a partnership with a selected service. Technical work related to transitioning to this new setup. So overall, probably a good part of a year or more across all pieces.
 * Costs: There will still be some maintenance work related to any interface components between the 3rd party services and Wikimedia usage.
 * Testing: Establishing privacy, performance, and other product usage considerations are satisfied.
 * Performance & Scaling: Testing and benchmarking to ensure that the 3rd party service meets latency targets/metrics and can also scale to different demand profiles on wikis.
 * Deployment: Roll out plans to different wikis (including any rollback options in case of problems) that are reviewed by SRE. We also need to establish adequate monitoring of performance metrics to ensure we aren’t impacting the on-wiki experience. We also need to have in place appropriate processes and avenues to flag and escalate issues with the partner before any rollout.
 * Rollback and reversibility: As long as the in-house technical stack and services have not been fully decommissioned, the 3rd party usage can be rolled back. But, after a point, it is not reversible short of building a new in-house stack.
 * Operations & Monitoring: Once the new solution has been deployed what data needs to be collected to monitor activity, provide feedback or report on system/service health?
 * Product Management: Product department had pursued this option seriously and it turned out to not be a feasible option
 * SRE (ServiceOps): Not consulted since their input here is not relevant for this option given the result of early explorations by Product Engineering & Mgmt.
 * SRE (Data Persistence): Not consulted since their input here is not relevant for this option given the result of early explorations by Product Engineering & Mgmt.
 * Performance: Not consulted since their input here is not relevant for this option given the result of early explorations by Product Engineering & Mgmt.
 * Security: Not consulted since their input here is not relevant for this option given the result of early explorations by Product Engineering & Mgmt.

Option 4: Rearchitect the backend maps infrastructure (to use more industry-standard practices and rely on existing open-source maps components)
Consultations
 * Description: This Future of maps document has the full proposal, but the TLDR is to redo the backend infrastructure to use more open source components for the maps stack (vector tile server, mapbox styles and front-end components) and architect the stack more in line with industry-standard practices.
 * Benefits: Maps functionality is definitely not a wiki-specific product feature. Maps are used widely on the internet. So, there is really very little reason to do bespoke work on this product offering. Anything we can do to align our technology stack with best practices and components that are widely in use will put us on a path of increased stability, available expertise, and wide community support. Fewer in-house components for map tile serving, map rendering components and greater reliance on more widely-used and maintained open-source components. Reduced maintenance burdens by simplifying and standardizing the maps stack. Ability to leverage external partners (ex: consultants, contractors) for augmenting internal capacity and expertise. Ability to modernize the maps front-end. Open up a path for future maps product feature work.
 * Risks: Not insignificant amount of work doing the upgrade work. Potential for scope creep. Insufficient management buy-in and related management and organization fatigue around technical debt work, and a push towards other seemingly simpler options.
 * Effort: Two FTEs working for about 3 months planning, prototyping, and exploratory work (note that this work has already been done in 2020). Two FTEs putting in 6-9 months of work to upgrade the tile serving infrastructure to use off-the-shelf open source components.
 * Costs: Upfront costs related to researching, planning, and exploratory efforts in fully fleshing out the plan. This should also include rollout and rollback plans, and any prototyping work to establish end-to-end viability of the change.
 * Testing: Most of this has been done in early exploratory and prototyping efforts in 2020. In addition, end-to-end tests of the new infrastructure need to be completed. This is being done on the cloud VM infrastructure and we expect much of it to be done by the first week of May.
 * Performance & Scaling: The criterion we are working with is that response / latency metrics of the upgraded infrastructure can be no worse than the current stack. Performance benchmarking would need to be in place to establish that. There also needs to be sufficient capacity planning to ensure that the replacement infrastructure has enough storage and computational capacity to match or exceed current usage. We are currently going through the performance benchmarking.
 * Deployment: A detailed rollout plan for wikis including rollback strategies if we run into problems. This needs to specify how the deployment will proceed (wiki-by-wiki, group-by-group, by slowly scaling traffic, some combination of them) in the production context. There also needs to be adequate planning to ensure there is enough capacity in the cluster in the time when both the old and new infrastructure might be live at the same time.
 * Rollback and reversibility: The deployment plan doesn’t call for a complete turning off of the old stack right away. But after some period of time (to be decided), the old stack will be completely turned off. At that point, this plan is no longer reversible since one of the reasons we are undertaking this work is to address maintenance and EOL issues with components of the old stack.
 * Operations & Monitoring: Response and other metrics that SRE needs to be able to adequately monitor and ensure uptime of the service. I am skipping the details here but we have detailed information about the metrics that SRE mandated for the new infrastructure.
 * Additional References: A couple of documents have been linked in responses in the table. But, collecting them all here again for ease of reference:
 * Wikimedia Maps: Master Planning Document
 * Future of maps infra (this is more of an internal and earlier version of document #1 and is mostly a duplicate of #1, but this one has more information about problems with the old / current stack)
 * [DRAFT] Rollout plan for Maps 2: https://phabricator.wikimedia.org/T263854 is the tracking phab task for Iteration #1 of the fuller maps modernization plan elaborated in documents 1 & 2.
 * Product Management: Product Management have signed off on this. This work is primarily infrastructural and doesn’t involve any product / user-facing feature changes as one of the requirements / constraints provided.
 * SRE (ServiceOps): SRE (ServiceOps) has reviewed the plans and the current version of the plan incorporates changes recommended by them. This plan has been signed off by them.
 * SRE (Data Persistence): SRE (Data Persistence) has reviewed the plans and the current version of the proposed solution incorporates their feedback. This plan has been signed off by them.
 * Performance: Performance benchmarking results indicate that there is no performance degradation and, if anything, performance might improve. As such, we might mostly inform the Performance team as a heads-up and not really consult them.
 * Security: Security Team has tasks to review the new components that we are introducing as part of this implementation.
 * Architecture: While not listed as a consulted party for the problem in general, architecture was involved in early consultations with the Maps Team in 2020 as someone with prior experience with the Maps products and Maps ecosystem and contributed to the decision-making.
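
The Performance & Scaling criterion for Option 4 (latency of the upgraded infrastructure can be no worse than the current stack) can be expressed as a simple acceptance check over benchmark samples. This is a sketch under stated assumptions: the choice of percentiles, the nearest-rank percentile method, and the function names are illustrative and are not the metrics SRE mandated.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def no_worse_than(current_ms, candidate_ms, percentiles=(50, 95, 99)):
    """True if the candidate stack is at least as fast at every percentile.

    ASSUMPTION: both lists hold response times in milliseconds measured
    under comparable load; the percentiles are illustrative defaults.
    """
    return all(
        percentile(candidate_ms, p) <= percentile(current_ms, p)
        for p in percentiles
    )
```

Comparing several percentiles rather than a single average keeps a faster median from hiding a regression in tail latency.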