Requests for comment/Multi datacenter strategy for MediaWiki/Progress

References: Multi DC strategy RFC:
 * Multi-DC master tracking task https://phabricator.wikimedia.org/T88445
 * https://phabricator.wikimedia.org/T88666
 * https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki

Multi-DC sync-up meeting regular attendees:
 * Aaron
 * Stas
 * Gabriel
 * Brandon
 * Giuseppe
 * Filippo
 * Gilles
 * Timo
 * JaimeC

2016-08-17

MediaWiki:
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** patch reverted for now (user JS breakage); patch to be tweaked
 ** needs user input; ask community liaisons, ask Design/Reading?
 * [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
 ** Subtask created: https://phabricator.wikimedia.org/T138376
 ** First patch: https://gerrit.wikimedia.org/r/#/c/302199/1

Configuration:
 * [unstarted] Switch parts of config to something like etcd.
 ** https://phabricator.wikimedia.org/T119641

Databases:
 * [done] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * [in progress] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
 ** Make sure new cross-DB TLS connections are rare (10x worse latency for opening connections vs non-SSL). We already use it for replication (1 continuous connection) with no visible overhead
 ** Certificate management???
 ** I need to coordinate with Performance and Availability to standardize all MySQL services on the same HA solution. That may require MediaWiki changes so that most of https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php gets simplified to a single IP + port per "micro-service". Also, those 2 files should probably disappear, leaving only db.php, given that we will have a single active-active setup (?) T141547

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
 ** swiftrepl/MediaWiki cross-DC writes use HTTP now. Let's clean this up before doing active/active though.

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 ** What is the advantage of using restbase vs. direct cassandra?
 *** RestBase allows us to narrow the public interface (no way to drop & list all data etc.); independence from backend
 *** Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
 *** https://phabricator.wikimedia.org/T140813
 ** Last meeting affirmed cautious support for cassandra/hyperswitch
 *** Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * Patch to distinguish callback updates deployed
 * Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
 * Mostly logging, parsercache updates, spreadAnyEditBlock is 20/minute

Services:
 * [in progress] look into mcrouter to see if it can work for WANCache
 ** Either email some people or use a GitHub question
 ** initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 ** Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 ** ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Workboard: https://phabricator.wikimedia.org/tag/wikimedia-multiple-active-datacenters/
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-08-03

MediaWiki:
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** patch reverted for now (user JS breakage); patch to be tweaked
 ** needs user input; ask community liaisons, ask Design/Reading?
 * [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
 ** Subtask created: https://phabricator.wikimedia.org/T138376
 ** First patch: https://gerrit.wikimedia.org/r/#/c/302199/1

Configuration:
 * [unstarted] Switch parts of config to something like etcd.
 ** https://phabricator.wikimedia.org/T119641

Databases:
 * [unblocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch at https://gerrit.wikimedia.org/r/#/c/243116/
 ** datacenter column now present \o/
 * [unblocked] Deploy MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Patch merged in core
 ** Config patch at https://gerrit.wikimedia.org/r/#/c/302635/ (might do testwiki first though)
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
 ** Make sure cross-DB TLS connections are rare (10x worse latency for opening connections vs non-SSL). We already use it for replication (1 continuous connection) with no visible overhead
 ** Certificate management???
 * [status?] ES compression...blocker?
 ** https://phabricator.wikimedia.org/T106386
 ** Not a blocker

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
 ** swiftrepl uses HTTP now. Do we want to add MediaWiki to this?
 *** [HARD BLOCKER] let's do SSL first

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 ** What is the advantage of using restbase vs. direct cassandra?
 *** RestBase allows us to narrow the public interface (no way to drop & list all data etc.); independence from backend
 *** Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
 *** https://phabricator.wikimedia.org/T140813
 ** Last meeting affirmed cautious support for cassandra/hyperswitch
 *** Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * Patch to distinguish callback updates deployed
 * Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
 * Mostly logging, parsercache updates, spreadAnyEditBlock is 20/minute

Services:
 * [unstarted] look into mcrouter to see if it can work for WANCache
 ** Either email some people or use a GitHub question
 ** initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 ** Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 ** ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-07-20

MediaWiki:
 * [done] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
 * [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** patch reverted for now (user JS breakage); patch to be tweaked
 ** needs user input; ask community liaisons, ask Design/Reading?
 * [unstarted] wikidata master queries (T110399)
 ** Subtask created: T138376
 * [done] notify users to use POST for rollback/markpatrolled/purge tools

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [done] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Initial version done, maybe test in betalabs with mariadb next?
 ** [done] talk to RE about mariadb version (https://phabricator.wikimedia.org/T138778)
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
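
Illustrative sketch only (not the MediaWiki implementation): pt-heartbeat has the master write a timestamp row at a regular interval, and a replica's lag can be estimated as the age of the newest replicated row. With the 'datacenter' column mentioned above, the estimate can be restricted to heartbeats originating from one DC. The row shape and the function name `estimate_lag` are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def estimate_lag(rows: Iterable[dict], datacenter: str, now: datetime) -> Optional[float]:
    """Estimate replication lag in seconds from replicated heartbeat rows.

    Each row is a hypothetical dict with a 'datacenter' name and a 'ts'
    datetime written by the master in that datacenter.
    """
    stamps = [row["ts"] for row in rows if row["datacenter"] == datacenter]
    if not stamps:
        return None  # no heartbeat seen from that datacenter
    # Lag is the age of the newest heartbeat; clamp small clock skew to zero.
    return max(0.0, (now - max(stamps)).total_seconds())


now = datetime(2016, 7, 20, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    {"datacenter": "eqiad", "ts": now - timedelta(seconds=3)},
    {"datacenter": "eqiad", "ts": now - timedelta(seconds=1)},
    {"datacenter": "codfw", "ts": now - timedelta(seconds=40)},
]
print(estimate_lag(rows, "eqiad", now))
```

Filtering by datacenter matters because a replica may hold heartbeat rows from more than one master; taking the newest row overall could hide lag on the replication path that a given request actually depends on.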

Media storage / Swift:
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 ** What is the advantage of using restbase vs. direct cassandra?
 *** RestBase allows us to narrow the public interface (no way to drop & list all data etc.); independence from backend
 *** Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
 *** https://phabricator.wikimedia.org/T140813

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * T137326: done

Services:
 * [unstarted] change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 ** initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * [unstarted] develop xkey purge strategy: Brandon to set up initial brainstorm meeting
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
 ** new librdkafka based node client looking good, starting beta testing; adds Kafka 0.9/0.10 support
 * Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
 ** Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
 ** ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™
 * When to call it out / how far away are we from starting active-active operation?
 * What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
 * Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
 ** Aaron: I'd rather use a tag and board, TODO
 * Blocking tasks are all in etherpad now
 * Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-06-22

MediaWiki:
 * [under review] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)
 ** Subtask created: T138376
 * [in progress] notify users to use POST for rollback/markpatrolled/purge tools

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [in progress] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Initial version done, maybe test in betalabs with mariadb next?
 ** [ACTION] talk to RE about mariadb version
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * [done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 ** done and left on; no noticeable effect on api entry points
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [in progress] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * T137326: done

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 ** initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-06-08

MediaWiki:
 * [unassigned] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [wip] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Initial version done, maybe test in betalabs with mariadb next?
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * [unstarted] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 ** statsd graphs finally fixed (at https://grafana.wikimedia.org/dashboard/db/api-requests)
 ** use 'sync' if not too slow (little upload API speed change per statsd) (https://gerrit.wikimedia.org/r/293272)
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [unassigned] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** VCL or Apache proxying?
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes; is there a task for this?
 *** https://phabricator.wikimedia.org/T92357 tracks master queries on GET/HEAD
 *** [ACTION] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 *** Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
 * [status?] General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
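
Illustrative sketch only: the "basic protection" idea floated above (fail a safe-method request with a 500 if it tries to write while being served from a non-primary DC) could look roughly like the following. All names here (`RequestContext`, `check_write_allowed`, `handle`) are hypothetical and do not correspond to MediaWiki code; the notes also point out that deferred updates and a few sync exceptions make a blanket rule too strict in practice.

```python
from dataclasses import dataclass

# HTTP methods that are expected not to cause writes.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class UnsafeWriteError(Exception):
    """Raised when a safe-method request attempts a write off the primary DC."""


@dataclass
class RequestContext:
    method: str          # HTTP method of the current request
    is_primary_dc: bool  # True when served from the primary datacenter


def check_write_allowed(ctx: RequestContext) -> None:
    """Refuse writes from safe-method requests outside the primary DC."""
    if ctx.method.upper() in SAFE_METHODS and not ctx.is_primary_dc:
        raise UnsafeWriteError(
            f"{ctx.method} request attempted a write in a non-primary datacenter"
        )


def handle(ctx: RequestContext, wants_write: bool) -> int:
    """Return the HTTP status the app layer would send for this request."""
    try:
        if wants_write:
            check_write_allowed(ctx)  # guard sits in front of the actual write
        return 200
    except UnsafeWriteError:
        return 500


print(handle(RequestContext("GET", is_primary_dc=False), wants_write=True))
```

A POST routed to the primary DC, or a GET that performs no write, passes through unchanged; only the combination of safe method, write attempt, and non-primary DC is rejected.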

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-05-25

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 ** https://gerrit.wikimedia.org/r/#/c/288319/
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * Reduce cross DC wiki DB queries
 ** action=purge and wikidata (T110399)

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** We need both decent HA and correct lag estimates in all DCs
 * MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate) ( https://phabricator.wikimedia.org/T134809 )
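
Illustrative sketch only: binlog file/offset coordinates are specific to each server, whereas MariaDB GTIDs (written as domain_id-server_id-sequence) keep their meaning across servers, which is why waiting on a GTID position is attractive for cross-DC chronology checks. The check below mimics in plain Python the condition such a wait is waiting for, namely that the replica has applied at least the target sequence number in every replication domain of the target position. The function names are hypothetical, and a real client would issue MASTER_GTID_WAIT on the replica rather than compare strings itself.

```python
def parse_gtid_pos(pos: str) -> dict[int, int]:
    """Map each replication domain in a GTID position to its sequence number."""
    result: dict[int, int] = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain, _server, seq = (int(field) for field in gtid.split("-"))
        # Keep the highest sequence number seen for a domain.
        result[domain] = max(seq, result.get(domain, 0))
    return result


def has_reached(replica_pos: str, target_pos: str) -> bool:
    """True if the replica has applied every domain of target_pos up to its seq."""
    replica = parse_gtid_pos(replica_pos)
    target = parse_gtid_pos(target_pos)
    # A domain missing on the replica counts as "not reached".
    return all(replica.get(domain, -1) >= seq for domain, seq in target.items())


print(has_reached("0-171966669-500", "0-171966669-480"))
```

The server id is deliberately ignored in the comparison: after a master switch within a domain, the sequence number keeps increasing while the server id changes.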

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 ** Option 1: use 'sync'
 ** Option 2: plug into ChronologyProtector to force 'master' backend
 ** Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455
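
Illustrative sketch only: the race mentioned above can be shown with a toy two-backend store. In 'async' mode the secondary copy is written later, so a thumbnail request that reads from the secondary right after an upload can miss the file; in 'sync' mode both copies exist before the write returns. The class `MultiWriteStore` and its methods are hypothetical stand-ins, not the FileBackendMultiWrite API.

```python
class MultiWriteStore:
    """Toy multi-write store with one primary and one secondary backend."""

    def __init__(self, replication: str = "sync") -> None:
        if replication not in ("sync", "async"):
            raise ValueError("replication must be 'sync' or 'async'")
        self.replication = replication
        self.primary: dict[str, bytes] = {}
        self.secondary: dict[str, bytes] = {}
        self._pending: list[tuple[str, bytes]] = []

    def write(self, key: str, data: bytes) -> None:
        self.primary[key] = data
        if self.replication == "sync":
            # Secondary is updated before the caller gets control back.
            self.secondary[key] = data
        else:
            # Secondary write is deferred until flush() runs.
            self._pending.append((key, data))

    def flush(self) -> None:
        """Apply deferred secondary writes (e.g. after the response is sent)."""
        for key, data in self._pending:
            self.secondary[key] = data
        self._pending.clear()


store = MultiWriteStore(replication="async")
store.write("Example.jpg", b"original")
print("Example.jpg" in store.secondary)  # read issued before the deferred write
store.flush()
print("Example.jpg" in store.secondary)
```

Option 2 in the list corresponds to routing the follow-up read to `primary` for clients that have just written, rather than paying the latency of the synchronous secondary write on every upload.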

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 ** Use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes; is there a task for this?
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges

2016-05-11

ACTION ITEMS:

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * ?action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that this needs to work for "regular/simple" non-WMF MediaWiki setups.
 ** MASTER_POS_WAIT does not work cross-DC with current file/coords [Jaime will file a task]
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
 * Parsercache (not really DBs): General consensus on replacing the datastore from MySQL to something else with mult (which should eventually be done). Jaime proposes to do a couple of fixes to have something quickly.

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 ** Option 1: use 'sync'
 ** Option 2: plug into ChronologyProtector to force 'master' backend
 ** Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 )
 * Blocked on TLS/SSL for apaches <=> redis (http://redis.io/topics/encryption not supported)
 * Maybe use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

ElasticSearch:
 * Basically ready

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes; is there a task for this?
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges