Requests for comment/Multi datacenter strategy for MediaWiki/Progress

From mediawiki.org

References:

Multi DC strategy RFC:

Multi-DC sync-up meeting regular attendees:

  • Aaron
  • Stas
  • Gabriel
  • Brandon
  • Giuseppe
  • Filippo
  • Gilles
  • Timo
  • JaimeC


2016-08-17

MediaWiki:

Configuration:

Databases:

  • [done] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
  • [in progress] MariaDB clients (MediaWiki) to use TLS/SSL ( https://phabricator.wikimedia.org/T134809 )
    • Make sure new cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (1 continuous connection) with no visible overhead.
    • Certificate management???
    • I need to coordinate with Performance and Availability to standardize all MySQL services on the same HA solution. That may require MediaWiki changes so that most of https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php gets simplified to a single IP + port per "micro-service". Also, those 2 files should probably disappear in favor of a single db.php, given that we will have a single active-active setup (?) T141547
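
As an illustration of the pt-heartbeat item above, lag detection can be sketched as comparing the newest replicated heartbeat timestamp against the current time. This is a minimal sketch, not MediaWiki's implementation: the function names, the 5-second threshold, and the ISO-8601 UTC timestamp format are assumptions made for the example.

```python
from datetime import datetime, timezone


def replication_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Return seconds between the last replicated heartbeat and `now`.

    `heartbeat_ts` is assumed to be an ISO-8601 UTC timestamp written on
    the master by a heartbeat process and read back from the replica.
    """
    last_beat = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    # Clamp at zero so small clock skew never yields negative lag.
    return max(0.0, (now - last_beat).total_seconds())


def usable_replicas(heartbeats: dict, now: datetime, max_lag: float = 5.0) -> list:
    """Return names of replicas within `max_lag` seconds (hypothetical threshold)."""
    return sorted(
        name
        for name, ts in heartbeats.items()
        if replication_lag_seconds(ts, now) <= max_lag
    )
```

Because the timestamp originates on the master, the same comparison works no matter how many replication hops sit between the master and the replica being checked, which is what makes it attractive across datacenters.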

Media storage / Swift:

Session storage / redis:

    • [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
      • Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
      • What is the advantage of using restbase vs. direct cassandra?
        • RestBase allows us to narrow the public interface (no way to drop & list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
        • https://phabricator.wikimedia.org/T140813
      • Last meeting affirmed cautious support for cassandra/hyperswitch
        • Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)
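
The "narrow the public interface" argument above can be sketched as a wrapper that exposes only per-key operations. This is a hypothetical illustration, not RESTBase or HyperSwitch code: the class name, the method names, and the in-memory dict standing in for the datastore are all assumptions.

```python
import time
from typing import Dict, Optional, Tuple


class NarrowSessionStore:
    """Expose only get/set/delete for single keys.

    There is deliberately no method to list or drop all data; the backing
    store (a dict here, standing in for a real datastore) stays private.
    """

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # Expired sessions are removed lazily on read.
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)
```

Callers only ever see the three methods, so swapping the dict for another backend would not change their code, which is the backend-independence point raised in the notes.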

CDN / traffic:

Services:

  • [in progress] look into mcrouter to see if it can work for WANCache
  • Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
    • Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
    • ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™

  • When to call it out / how far away are we from starting active-active operation?
  • What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
  • Workboard: https://phabricator.wikimedia.org/tag/wikimedia-multiple-active-datacenters/
  • Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-08-03

MediaWiki:

Configuration:

  • [unstarted] Switch parts of config to something like etcd.

Databases:

Media storage / Swift:

Session storage / redis:

    • [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
      • Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
      • What is the advantage of using restbase vs. direct cassandra?
        • RestBase allows us to narrow the public interface (no way to drop & list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
        • https://phabricator.wikimedia.org/T140813
      • Last meeting affirmed cautious support for cassandra/hyperswitch
        • Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)

CDN / traffic:

Services:

  • [unstarted] look into mcrouter to see if it can work for WANCache
  • Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
    • Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
    • ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™

  • When to call it out / how far away are we from starting active-active operation?
  • What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
  • Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
  • Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?


2016-07-20

MediaWiki:

Databases:

Media storage / Swift:

Session storage / redis:

CDN / traffic:

Services:

  • [unstarted] change_propagation module for CDN cache purges
  • [unstarted] look into mcrouter to see if it can work
  • [unstarted] develop xkey purge strategy: Brandon to set up initial brainstorm meeting
  • looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
    • new librdkafka based node client looking good, starting beta testing; adds Kafka 0.9/0.10 support
  • Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
    • Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
    • ACTION: Gabriel to set up meeting for session storage next week.

The Big Active / Active Goal™

  • When to call it out / how far away are we from starting active-active operation?
  • What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
  • Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
    • Aaron: I'd rather use a tag and board, TODO
    • Blocking tasks are all in etherpad now
  • Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?

2016-06-22

MediaWiki:

Databases:

Media storage / Swift:

  • [done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
    • done and left on; no noticeable effect on api entry points
  • [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:

CDN / traffic:

Services:

2016-06-08

MediaWiki:

Databases:

Media storage / Swift:

Session storage / redis:

CDN / traffic:

  • [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
      • VCL or Apache proxying?
      • Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
      • Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
        • Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
  • [status?] General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
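
The routing rule and the "basic protection" idea discussed above can be sketched as follows. This is an illustration only, not the VCL or MediaWiki logic: the function and exception names are hypothetical, and treating eqiad as primary and codfw as secondary in the usage below is an assumption.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class UnexpectedWriteError(Exception):
    """Raised when a safe request tries to write in a non-primary DC."""


def choose_datacenter(method: str, local_dc: str, primary_dc: str) -> str:
    """Serve safe methods locally; send everything else to the primary DC."""
    return local_dc if method.upper() in SAFE_METHODS else primary_dc


def guard_write(method: str, current_dc: str, primary_dc: str) -> None:
    """Refuse a write triggered by a safe method outside the primary DC.

    A caller would translate the exception into an HTTP 500.
    """
    if method.upper() in SAFE_METHODS and current_dc != primary_dc:
        raise UnexpectedWriteError(f"{method} attempted a write in {current_dc}")
```

For example, `choose_datacenter("GET", "codfw", "eqiad")` keeps the request in codfw, while a POST arriving in codfw is sent to eqiad. The guard only makes sense once deferred updates and the known sync exceptions noted above are accounted for, since those would otherwise trip it.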

Services:

  • change_propagation module for CDN cache purges
  • [unstarted] look into mcrouter to see if it can work
  • looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718


2016-05-25

MediaWiki:

Databases:

Media storage / Swift:

  • FileBackendMultiWrite 'async' upload/thumbnail race conditions
    • Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
    • Experiment with sync/async and watch statsd for api entry point
  • HTTPS for swift: https://phabricator.wikimedia.org/T127455
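
The race behind the two options above can be sketched with a toy multi-write backend. This is not FileBackendMultiWrite itself: the class, its `flush()` method, and the `force_primary` flag are hypothetical stand-ins for the 'sync'/'async' setting and for forcing the 'master' backend.

```python
from collections import deque


class MultiWriteBackend:
    """Write to a primary store and replicate to a secondary store.

    mode="sync": the secondary is written before store() returns.
    mode="async": the secondary write is queued until flush() runs.
    """

    def __init__(self, mode: str) -> None:
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.primary = {}
        self.secondary = {}
        self._pending = deque()

    def store(self, path: str, data: bytes) -> None:
        self.primary[path] = data
        if self.mode == "sync":
            self.secondary[path] = data
        else:
            self._pending.append((path, data))

    def flush(self) -> None:
        """Apply queued secondary writes (stands in for a deferred job)."""
        while self._pending:
            path, data = self._pending.popleft()
            self.secondary[path] = data

    def read(self, path: str, force_primary: bool = False):
        """Read from the secondary unless the caller forces the primary."""
        source = self.primary if force_primary else self.secondary
        return source.get(path)
```

In 'async' mode a thumbnail request that reads the secondary right after an upload finds nothing until `flush()` has run. Option 1 removes the window by writing both stores up front; Option 2 keeps 'async' but routes the reads of a client that just wrote to the primary.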

Session storage / redis:

CDN / traffic:

  • VCL routing logic: https://phabricator.wikimedia.org/T91820
      • Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
      • Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
  • General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
  • Experiment with % of traffic to codfw (avoid loops?)

Services:

  • change_propagation module for WAN cache purges

2016-05-11

ACTION ITEMS:


MediaWiki:

Databases:

  • pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
  • Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that it needs to work for "regular/simple" non-WMF MediaWiki setups.
    • MASTER_POS_WAIT() does not work cross-DC with current file/coords [Jaime will file a task]
  • Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
  • Parsercache (not really DBs): General consensus on moving the datastore from MySQL to something else with mult (which should eventually be done). Jaime proposes a couple of fixes to have something quickly.
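
For the GTID option raised above: binary-log file/coordinate positions are local to the server that wrote them, whereas a GTID identifies a transaction across the whole replication topology. A minimal sketch of a "has the replica caught up" check, assuming MariaDB-style GTIDs of the form domain-server-sequence and sequence numbers that only increase within a domain; the helper names are hypothetical and this is not MediaWiki code.

```python
from typing import Dict, Tuple


def parse_gtid(gtid: str) -> Tuple[int, int, int]:
    """Split a MariaDB-style GTID 'domain-server-sequence' into integers."""
    domain, server, seq = (int(part) for part in gtid.split("-"))
    return domain, server, seq


def parse_gtid_list(gtid_list: str) -> Dict[int, int]:
    """Map each replication domain to the sequence number reached in it."""
    positions = {}
    for gtid in filter(None, (g.strip() for g in gtid_list.split(","))):
        domain, _server, seq = parse_gtid(gtid)
        positions[domain] = seq
    return positions


def has_reached(replica_pos: str, target: str) -> bool:
    """True if the replica has applied at least `target` in that GTID's domain."""
    domain, _server, target_seq = parse_gtid(target)
    return parse_gtid_list(replica_pos).get(domain, -1) >= target_seq
```

A chronology check could record the master's GTID after a write and later test `has_reached(replica_position, recorded_gtid)` on whichever replica serves the next request, regardless of datacenter.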

Media storage / Swift:

  • FileBackendMultiWrite 'async' upload/thumbnail race conditions
    • Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
    • Experiment with sync/async and watch statsd for api entry point
  • HTTPS for swift: https://phabricator.wikimedia.org/T127455

Session storage / redis:

ElasticSearch:

  • Basically ready

CDN / traffic:

  • VCL routing logic: https://phabricator.wikimedia.org/T91820
      • Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
      • Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
  • General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
  • Experiment with % of traffic to codfw (avoid loops?)

Services:

  • change_propagation module for WAN cache purges