Core Platform Team/Initiative/Enable Multi-DC Session Storage/Initiative Description

Project Lead
Eric Evans

Current state
In development (alpha). The service and its Kubernetes image are both expected to be ready for deployment at the end of March.

Expected start
Started in October 2018

Summary
Develop a multi-master replicated key-value storage service, the semantics of which permit session access from MediaWiki in an active-active, multi-datacenter configuration. Secondarily, the service decouples MediaWiki from storage, creating additional isolation of sensitive data.
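To make the intended semantics concrete, here is a minimal in-memory sketch of the kind of key-value interface such a service might expose to MediaWiki. The class name `SessionStore`, its method names, and the store-wide TTL are illustrative assumptions, not the service's actual API. The real service is replicated across data centers, which this single-process sketch does not attempt to model.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """Hypothetical in-memory stand-in for a session key-value service."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # One store-wide TTL is an assumption made for this sketch.
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (value, absolute expiry time on the injected clock).
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def set(self, key: str, value: bytes) -> None:
        """Store or overwrite a session value and restart its TTL."""
        self._data[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[bytes]:
        """Return the value, or None if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Drop expired entries lazily, on read.
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> bool:
        """Remove a key; return True if an entry was removed."""
        return self._data.pop(key, None) is not None
```

The clock is injected so that expiry can be exercised in tests without sleeping. Because callers only see `get`, `set`, and `delete`, the storage backend behind such an interface can change without touching MediaWiki, which is the decoupling the summary describes.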

Significance and motivation
This initiative is a blocker for running the data centers active-active: it enables session access from multiple data centers, which makes the system more fault tolerant and resilient. Secondarily, it isolates session data.

Milestones and major tasks

 * Hardware request and setup
 * RFC for the session storage API
 * Investigate use of Redis session storage to see if there is extra work
 * Design implementation (storage, replication semantics, performance)
 * Test and prototype in multiple languages to understand performance/latency/throughput
 * Implementation
 * Figure out deployment method
 * CI for build, testing, and Docker image creation
 * Cassandra cluster configuration
 * Beta deployment
 * Develop migration plan
 * Integrate with MediaWiki
 * Determine if “Set if not exist” functionality is needed (implement if needed)
 * Determine if per-operation TTLs are needed (implement if needed)
 * Enable functional testing (set up and tear down of Cassandra)
 * Security review
 * Implementing service-checker functionality (endpoint monitoring)
 * Figure out the Kubernetes deployment (Helm charts)
 * Deploy according to migration plan (test wikis, etc…)

Outcome
Increase the scalability of the platform for future applications and new types of content, as well as for a growing user base and a growing amount of content.

Baseline

 * Sessions are accessed from 1 Data Center

Target

 * Sessions can be accessed from 2 Data Centers

Methodology and rationale
The key result here is allowing sessions to be accessed from more than 1 data center.

Time and resource estimate
2 FTE for 6 months (FY1819 Q2-Q3)

2 part time engineers for 3 months during deployment (FY1819 Q4)

Dependencies
Setup Kubernetes security zone (SRE)

Security review (Security - 30 day lead time)

Collaborators
Core Platform

SRE

Security

Stakeholders
SRE

Performance

Open questions
Should CentralAuth metadata be stored in the same Kask instance or a different one?

Is “Set if not exist” functionality needed?

Are per-operation TTLs needed?
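As a rough illustration of what the last two questions would add to the API, the sketch below shows a store with an optional per-operation TTL and a “Set if not exist” write. The class `TtlStore` and its method names are hypothetical and exist only for this example. It runs in a single process, so it does not capture the harder part of the question: in a multi-master setup, a conditional write needs coordination between replicas. If the backing store is Cassandra, as the milestones suggest, CQL offers `INSERT ... IF NOT EXISTS` and `USING TTL`; whether the service would rely on them is not decided here.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlStore:
    """Hypothetical store illustrating the two optional features."""

    def __init__(
        self,
        default_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._default_ttl = default_ttl
        self._clock = clock
        # Maps key -> (value, absolute expiry time on the injected clock).
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def get(self, key: str) -> Optional[bytes]:
        """Return the value, or None if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: bytes, ttl: Optional[float] = None) -> None:
        """Unconditional write; `ttl` overrides the default for this call only."""
        lifetime = self._default_ttl if ttl is None else ttl
        self._data[key] = (value, self._clock() + lifetime)

    def set_if_not_exists(
        self, key: str, value: bytes, ttl: Optional[float] = None
    ) -> bool:
        """Write only if no live entry exists; return True if written."""
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl)
        return True
```

With this shape, a caller that omits `ttl` gets the store-wide default, so adding per-operation TTLs later would not break existing callers.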

Phabricator
https://phabricator.wikimedia.org/T206016 (master ticket)

Relevant materials, plans and RFCs
Requests for comment/SessionStorageAPI