Requests for comment/Multi datacenter strategy for MediaWiki/Progress

References: Multi DC strategy RFC:
 * Multi-DC master tracking task https://phabricator.wikimedia.org/T88445
 * https://phabricator.wikimedia.org/T88666
 * Requests for comment/Multi datacenter strategy for MediaWiki

Multi-DC sync-up meeting regular attendees:
 * Aaron
 * Stas
 * Gabriel
 * Brandon
 * Giuseppe
 * Filippo
 * Gilles
 * Timo
 * JaimeC

2016-08-17

MediaWiki:
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * patch reverted for now (user JS breakage); patch to be tweaked
 * needs user input; ask comm liaisons, ask Design/Reading?
 * [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
 * Subtask created: https://phabricator.wikimedia.org/T138376
 * First patch: https://gerrit.wikimedia.org/r/#/c/302199/1

Configuration:
 * [unstarted] Switch parts of config to something like etcd.
 * https://phabricator.wikimedia.org/T119641

Databases:
 * [done] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * [in progress] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
 * Make sure new cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (1 continuous connection) with no visible overhead.
 * Certificate management???
 * I need to coordinate with Performance and Availability to standardize all MySQL services on the same HA solution. That may require MediaWiki changes so that most of https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php gets simplified to a single IP + port per "micro-service". Also, those 2 files should probably disappear, leaving only db.php, given that we will have a single active-active setup (?) T141547

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
 * swiftrepl/MediaWiki cross-DC writes use HTTP now. Let's clean this up before doing active/active though.
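
The TLS note in the database items above says that opening a TLS connection costs roughly 10x a plain one, while a single long-lived replication connection shows no visible overhead. A minimal sketch of that arithmetic follows. The 1 ms baseline and the function name are hypothetical and are not measurements from production:

```python
def amortized_connect_cost_ms(connect_ms: float, queries_per_connection: int) -> float:
    """Connection-setup cost spread over every query sent on that connection."""
    if queries_per_connection < 1:
        raise ValueError("need at least one query per connection")
    return connect_ms / queries_per_connection


# Hypothetical figures: 1 ms to open a plain connection, 10x that with TLS.
PLAIN_CONNECT_MS = 1.0
TLS_CONNECT_MS = 10 * PLAIN_CONNECT_MS

# A new TLS connection for every query pays the full setup cost each time.
per_query = amortized_connect_cost_ms(TLS_CONNECT_MS, 1)
# A long-lived connection (like replication) spreads it over many queries.
long_lived = amortized_connect_cost_ms(TLS_CONNECT_MS, 10_000)

print(per_query, long_lived)
```

Under these assumed numbers the setup cost only matters when connections are opened often, which is why the item asks for new cross-DB TLS connections to stay rare.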

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 * What is the advantage of using restbase vs. direct cassandra?
 * RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
 * https://phabricator.wikimedia.org/T140813
 * Last meeting affirmed cautious support for cassandra/hyperswitch
 * Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)

CDN / traffic:
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * Patch to distinguish callback updates deployed
 * Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
 * Mostly logging, parsercache updates, spreadAnyEditBlock is 20/minute
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820

Services:
 * [in progress] look into mcrouter to see if it can work for WANCache
 * Either email some people or use a GitHub question
 * initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 * Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 * ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™:
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Workboard: https://phabricator.wikimedia.org/tag/wikimedia-multiple-active-datacenters/
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-08-03

MediaWiki:
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * patch reverted for now (user JS breakage); patch to be tweaked
 * needs user input; ask comm liaisons, ask Design/Reading?
 * [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
 * Subtask created: https://phabricator.wikimedia.org/T138376
 * First patch: https://gerrit.wikimedia.org/r/#/c/302199/1

Configuration:
 * [unstarted] Switch parts of config to something like etcd.
 * https://phabricator.wikimedia.org/T119641

Databases:
 * [unblocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Config patch at https://gerrit.wikimedia.org/r/#/c/243116/
 * datacenter column now present \o/
 * [unblocked] Deploy MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Patch merged in core
 * Config patch at https://gerrit.wikimedia.org/r/#/c/302635/ (might do testwiki first though)
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
 * Make sure cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (1 continuous connection) with no visible overhead.
 * Certificate management???
 * [status?] ES compression...blocker?
 * https://phabricator.wikimedia.org/T106386
 * Not a blocker
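
For context on the MASTER_GTID_WAIT item above: MariaDB's `MASTER_GTID_WAIT(gtid_list, timeout)` blocks until the replica has applied the given GTID position, returning 0 on success and -1 on timeout. A minimal sketch of how a client might build the call and read the result follows. The helper names are hypothetical, and this is not the code from the linked MediaWiki patch:

```python
import re
from typing import Optional, Tuple

# MariaDB GTIDs have the form domain-server-sequence, e.g. "0-1-100".
_GTID_RE = re.compile(r"\d+-\d+-\d+")


def build_gtid_wait(gtid: str, timeout_s: float) -> Tuple[str, tuple]:
    """Return (sql, params) for a parameterized MASTER_GTID_WAIT call."""
    if not _GTID_RE.fullmatch(gtid):
        raise ValueError(f"not a MariaDB GTID: {gtid!r}")
    return "SELECT MASTER_GTID_WAIT(%s, %s)", (gtid, timeout_s)


def replica_caught_up(result: Optional[int]) -> bool:
    """MASTER_GTID_WAIT returns 0 once the position is reached, -1 on timeout."""
    return result == 0
```

The `(sql, params)` pair is meant to be passed to whatever DB-API style cursor the caller already has. Treating anything other than 0 as "not caught up" keeps the timeout case and unexpected results on the cautious side.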

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
 * swiftrepl uses HTTP now. Do we want to add MediaWiki to this?
 * [HARD BLOCKER] let's do SSL first

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 * What is the advantage of using restbase vs. direct cassandra?
 * RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
 * https://phabricator.wikimedia.org/T140813
 * Last meeting affirmed cautious support for cassandra/hyperswitch
 * Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)

CDN / traffic:
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * Patch to distinguish callback updates deployed
 * Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
 * Mostly logging, parsercache updates, spreadAnyEditBlock is 20/minute
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820

Services:
 * [unstarted] look into mcrouter to see if it can work for WANCache
 * Either email some people or use a GitHub question
 * initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 * Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 * ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™:
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-07-20

MediaWiki:
 * [done] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * patch reverted for now (user JS breakage); patch to be tweaked
 * needs user input; ask comm liaisons, ask Design/Reading?
 * [unstarted] wikidata master queries (T110399)
 * Subtask created: T138376
 * [done] notify users to use POST for rollback/markpatrolled/purge tools

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [done] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Initial version done, maybe test in betalabs with mariadb next?
 * [done] talk to RE about mariadb version (https://phabricator.wikimedia.org/T138778)
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
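
The pt-heartbeat item above relies on a simple idea: the master writes a timestamp row periodically, and a replica's lag is the current time minus the newest replicated timestamp. With the 'datacenter' column mentioned above, a reader can pick the heartbeat written in the master's datacenter. A minimal sketch follows, with rows modeled as plain dicts. The row layout and function name are assumptions, not the schema or code actually deployed:

```python
from datetime import datetime, timezone
from typing import Iterable, Mapping, Optional


def estimate_lag_seconds(
    rows: Iterable[Mapping[str, object]],
    master_datacenter: str,
    now: datetime,
) -> Optional[float]:
    """Lag = now minus the newest heartbeat written in the master's datacenter.

    Returns None when no heartbeat row exists for that datacenter, so callers
    can treat "unknown" differently from "zero lag".
    """
    newest = None
    for row in rows:
        if row["datacenter"] != master_datacenter:
            continue
        ts = row["ts"]
        if newest is None or ts > newest:
            newest = ts
    if newest is None:
        return None
    # Clamp at zero: clock skew can make a heartbeat look like it is in the future.
    return max(0.0, (now - newest).total_seconds())
```

Returning None rather than 0 for a missing heartbeat is a deliberate choice in this sketch, so a replica with no heartbeat data is never mistaken for one that is fully caught up.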

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 * Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 * What is the advantage of using restbase vs. direct cassandra?
 * RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
 * https://phabricator.wikimedia.org/T140813

CDN / traffic:
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * T137326: done
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820

Services:
 * [unstarted] change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 * initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * [unstarted] develop xkey purge strategy: Brandon to set up initial brainstorm meeting
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
 * new librdkafka based node client looking good, starting beta testing; adds Kafka 0.9/0.10 support
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 * Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 * ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™:
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
 * Aaron: I'd rather use a tag and board, TODO
 * Blocking tasks are all in etherpad now
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-06-22

MediaWiki:
 * [under review] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)
 * Subtask created: T138376
 * [in progress] notify users to use POST for rollback/markpatrolled/purge tools

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [in progress] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Initial version done, maybe test in betalabs with mariadb next?
 * [ACTION] talk to RE about mariadb version
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * [done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 * done and left on; no noticeable effect on api entry points
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [in progress] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 * Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * T137326: done
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 * initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-06-08

MediaWiki:
 * [unassigned] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [wip] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Initial version done, maybe test in betalabs with mariadb next?
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * [unstarted] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 * statsd graphs finally fixed (at https://grafana.wikimedia.org/dashboard/db/api-requests)
 * use 'sync' if not too slow (little upload API speed change per statsd) (https://gerrit.wikimedia.org/r/293272)
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [unassigned] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 * Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * VCL or Apache proxying?
 * Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 * https://phabricator.wikimedia.org/T92357 tracks master queries on GET/HEAD
 * [ACTION] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
 * [status?] General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
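
The "basic protection" idea above (refuse a write triggered by GET/HEAD/OPTIONS on a non-primary datacenter and answer with a 500) can be sketched as a small guard. The names below are hypothetical and this is not MediaWiki code; how the application learns whether it runs in the primary datacenter is left out:

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class WriteOnSafeRequestError(Exception):
    """Raised when a safe request tries to write in a non-primary datacenter."""

    http_status = 500


def guard_write(method: str, in_primary_dc: bool) -> None:
    """Refuse a database write triggered by a safe method outside the primary DC."""
    if method.upper() in SAFE_METHODS and not in_primary_dc:
        raise WriteOnSafeRequestError(
            f"{method} attempted a write outside the primary datacenter"
        )
```

A caller would invoke `guard_write()` just before opening a master connection and map the exception to its `http_status`. Failing loudly like this makes stray writes visible instead of silently sending them cross-DC.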

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-05-25

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 * https://gerrit.wikimedia.org/r/#/c/288319/
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * Reduce cross DC wiki DB queries
 * action=purge and wikidata (T110399)

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * We need both decent HA and correct lag estimates in all DCs
 * MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate) ( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 * Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
 * Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455
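
The upload/thumbnail race noted above comes from a read reaching the secondary store before an 'async' write has landed there. The toy model below illustrates both options from the notes: 'sync' replication, and pinning the read to the master backend. It is a hypothetical sketch and does not mirror FileBackendMultiWrite's actual API:

```python
from typing import Dict, List, Optional, Tuple


class ToyMultiWriteBackend:
    """Toy model only: a primary store plus one replica store."""

    def __init__(self, replication: str) -> None:
        if replication not in ("sync", "async"):
            raise ValueError("replication must be 'sync' or 'async'")
        self.replication = replication
        self.primary: Dict[str, bytes] = {}
        self.replica: Dict[str, bytes] = {}
        self._pending: List[Tuple[str, bytes]] = []

    def store(self, path: str, data: bytes) -> None:
        self.primary[path] = data
        if self.replication == "sync":
            self.replica[path] = data  # replica updated before store() returns
        else:
            self._pending.append((path, data))  # replica updated later

    def flush(self) -> None:
        """Apply queued replica writes (stands in for whatever step runs them later)."""
        for path, data in self._pending:
            self.replica[path] = data
        self._pending.clear()

    def read(self, path: str, force_primary: bool = False) -> Optional[bytes]:
        """Read from the replica unless the caller pins the read to the primary."""
        source = self.primary if force_primary else self.replica
        return source.get(path)
```

In 'async' mode a `read()` issued between `store()` and `flush()` misses the file unless `force_primary=True` is passed, which is the role Option 2 gives to ChronologyProtector. In 'sync' mode the window does not exist, at the cost of a slower `store()`.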

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 * Use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 * Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 * Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges

2016-05-11

ACTION ITEMS:

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * ?action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that it needs to work for "regular/simple" non-WMF MediaWiki setups.
 * MASTER_POS_WAIT does not work cross-DC with current file/coords [Jaime will file a task]
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
 * Parsercache (not really DBs): General consensus on replacing the datastore from MySQL to something else with mult (which should eventually be done). Jaime proposes to do a couple of fixes to have something quickly.

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 * Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
 * Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 )
 * Blocked on TLS/SSL for apaches <=> redis (http://redis.io/topics/encryption not supported)
 * Maybe use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

ElasticSearch:
 * Basically ready

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 * Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 * Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges