Requests for comment/Multi datacenter strategy for MediaWiki/Progress

References: Multi DC strategy RFC:
 * Multi-DC master tracking task https://phabricator.wikimedia.org/T88445
 * https://phabricator.wikimedia.org/T88666
 * https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki

Multi-DC sync-up meeting regular attendees:
 * Aaron
 * Stas
 * Gabriel
 * Brandon
 * Giuseppe
 * Filippo
 * Gilles
 * Timo
 * JaimeC

2016-06-22

MediaWiki:
 * [under review] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)
 ** Subtask created: T138376
 * [in progress] notify users to use POST for rollback/markpatrolled/purge tools

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [in progress] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Initial version done, maybe test in betalabs with mariadb next?
 ** [ACTION] talk to RE about mariadb version
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
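
A minimal sketch of why the 'datacenter' column matters for lag detection: with heartbeat rows written in more than one DC, a replica should compare the clock only against the newest heartbeat from the DC that currently holds the master. The row shape (a dict with `datacenter` and `ts` keys), the function name, and the DC name "eqiad" are assumptions for illustration, not the actual pt-heartbeat schema or the MediaWiki config patch.

```python
from datetime import datetime, timezone
from typing import Iterable, Mapping, Optional


def replication_lag_seconds(
    rows: Iterable[Mapping[str, object]],
    master_dc: str,
    now: datetime,
) -> Optional[float]:
    """Estimate lag from heartbeat rows (hypothetical row shape).

    Only heartbeats written in `master_dc` are considered, so a stale
    heartbeat from another datacenter cannot skew the estimate.
    Returns None when no heartbeat from `master_dc` has replicated yet.
    """
    newest = max(
        (row["ts"] for row in rows if row["datacenter"] == master_dc),
        default=None,
    )
    if newest is None:
        return None
    # Clamp at zero: small clock skew can make the heartbeat look "newer".
    return max(0.0, (now - newest).total_seconds())


if __name__ == "__main__":
    # "codfw" appears in these notes; "eqiad" is an assumed second DC name.
    sample = [
        {"datacenter": "eqiad", "ts": datetime(2016, 6, 22, 12, 0, 7, tzinfo=timezone.utc)},
        {"datacenter": "codfw", "ts": datetime(2016, 6, 22, 12, 0, 9, tzinfo=timezone.utc)},
    ]
    current = datetime(2016, 6, 22, 12, 0, 10, tzinfo=timezone.utc)
    print(replication_lag_seconds(sample, "eqiad", current))
```

Returning None rather than 0 when no matching heartbeat exists keeps "unknown lag" distinct from "no lag", which callers would likely want to treat differently.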

Media storage / Swift:
 * [done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 ** done and left on; no noticeable effect on api entry points
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [in progress] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 * [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 * T137326: done

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 ** initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-06-08

MediaWiki:
 * [unassigned] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
 * [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 ** reverted for now (user JS breakage); patch to be tweaked
 * [unstarted] wikidata master queries (T110399)

Databases:
 * [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
 * [wip] MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 ** Initial version done, maybe test in betalabs with mariadb next?
 * [unstarted] mariadb clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * [unstarted] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
 ** statsd graphs finally fixed (at https://grafana.wikimedia.org/dashboard/db/api-requests)
 ** use 'sync' if not too slow (little upload API speed change per statsd) (https://gerrit.wikimedia.org/r/293272)
 * [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * [unassigned] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
 ** Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
 ** Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed

CDN / traffic:
 * [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** VCL or Apache proxying?
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 *** https://phabricator.wikimedia.org/T92357 tracks master queries on GET/HEAD
 *** [ACTION] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 *** Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
 * [status?] General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
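
The "basic protection" idea floated above could look roughly like the following sketch: before any write is attempted, check whether the request used a safe method while being served from a non-primary DC, and fail with a 500 instead of writing. All names here (`SAFE_METHODS`, `UnsafeWriteError`, `check_write_allowed`) are hypothetical and not MediaWiki APIs; the notes do not say where such a check would live.

```python
# Methods the notes treat as idempotent/safe (GET/HEAD/OPTIONS).
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class UnsafeWriteError(Exception):
    """A safe-method request tried to write outside the primary DC."""

    status_code = 500  # the notes suggest answering with a 500


def check_write_allowed(method: str, is_primary_dc: bool) -> None:
    """Raise instead of performing a write that would have to cross DCs.

    Hypothetical guard: intended to be called by the application layer
    just before a database write is attempted.
    """
    if method.upper() in SAFE_METHODS and not is_primary_dc:
        raise UnsafeWriteError(
            f"{method.upper()} request attempted a write in a non-primary datacenter"
        )


if __name__ == "__main__":
    check_write_allowed("POST", is_primary_dc=True)  # allowed, returns None
    try:
        check_write_allowed("GET", is_primary_dc=False)
    except UnsafeWriteError as err:
        print(err.status_code, err)
```

As the sub-item above notes, there are currently too many deferred updates and a few sync exceptions for a hard failure like this to be enabled outright, so logging the offending requests first would be the gentler variant.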

Services:
 * change_propagation module for CDN cache purges
 * [unstarted] look into mcrouter to see if it can work
 * looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718

2016-05-25

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 ** https://gerrit.wikimedia.org/r/#/c/288319/
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
 * Reduce cross DC wiki DB queries
 ** action=purge and wikidata (T110399)

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 ** We need both decent HA and correct lag estimates in all DCs
 * MASTER_GTID_WAIT support (https://gerrit.wikimedia.org/r/#/c/289985/)
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate) ( https://phabricator.wikimedia.org/T134809 )

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 ** Option 1: use 'sync'
 ** Option 2: plug into ChronologyProtector to force 'master' backend
 ** Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
 ** Use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges

2016-05-11

ACTION ITEMS:

MediaWiki:
 * EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
 * Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
 * ?action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )

Databases:
 * pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
 * Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that this needs to work for "regular/simple" non-WMF mediawiki setups.
 ** MASTER_POS_WAIT does not work cross-DC with current file/coords [Jaime will file a task]
 * Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
 * Parsercache (not really DBs): General consensus on replacing the datastore from MySQL to something else with mult (which should eventually be done). Jaime proposes to do a couple of fixes to have something quickly.

Media storage / Swift:
 * FileBackendMultiWrite 'async' upload/thumbnail race conditions
 ** Option 1: use 'sync'
 ** Option 2: plug into ChronologyProtector to force 'master' backend
 ** Experiment with sync/async and watch statsd for api entry point
 * HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:
 * Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 )
 * Blocked on TLS/SSL for apaches <=> redis (http://redis.io/topics/encryption not supported)
 * Maybe use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )

ElasticSearch:
 * Basically ready

CDN / traffic:
 * VCL routing logic: https://phabricator.wikimedia.org/T91820
 ** Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
 ** Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
 * General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
 * Experiment with % of traffic to codfw (avoid loops?)

Services:
 * change_propagation module for WAN cache purges