Wikimedia Developer Summit/2017/Integrating MediaWiki (and other services) with dynamic configuration/Full notes

Giuseppe is laying out the problems we currently have with the live state of a cluster and how it differs from the cluster's configuration: for example, active/inactive hosts for services, how to switch datacenter, and so on. Those things should not be defined in the configuration and/or changed with a commit to the configuration repository.

The common configuration repo that we are currently using is not the answer. The workflow is inconvenient and in some ways wrong.

Configuration vs state

  • config is which hosts exist, how they are set up, how many HHVM threads are provisioned
  • state is which cluster is active, which hosts are online, is the db read-only, weight of server in load balancing layer

The test of failing over from the primary DC to the backup took a lot of people and a lot of coordination. We can do better.

Giuseppe is now going through possible solutions that can help solve this problem. The tool currently used at the WMF is Conftool (<https://wikitech.wikimedia.org/wiki/Conftool>). Conftool uses etcd (<https://wikitech.wikimedia.org/wiki/Etcd>) as its backend; etcd is a strongly consistent distributed key-value store that additionally allows a client to watch for modifications of values in a set of keys.
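
As a rough illustration, this is what reading and watching conftool-managed state can look like with the etcd v2 command-line client (the key path below is hypothetical, not the actual production layout):

$ etcdctl get /conftool/v1/pools/eqiad/appserver/apache/mw1017.eqiad.wmnet
{"pooled": "yes", "weight": 10}
$ etcdctl watch --recursive /conftool/v1/pools/eqiad/appserver

Conftool wraps this in a higher-level CLI, so that depooling a host is a one-liner along the lines of confctl select 'name=mw1017.eqiad.wmnet' set/pooled=no (syntax approximate).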

Question from the audience: is this not in conflict with PuppetDB (<https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB>)?

Giuseppe's answer: this is completely different, because PuppetDB is not suitable for storing state due to inherent limitations in both performance and usability, given its not-so-friendly API.

Another tool that comes in handy when using etcd is confd (<https://wikitech.wikimedia.org/wiki/Confd>), in particular for watching keys and for applications that cannot easily be modified to talk to etcd directly. An example in the current production environment is Varnish, which uses confd to connect to etcd and fetch state from it.

DNS polling could be used for some simple state changes, e.g. pointing a service name to another host.

We also need to keep the small-scale installations of MediaWiki in mind: whatever solution is chosen must also work for very simple installations that have no need for these complex tools.

Service discovery (SOA address book)

  • what is the active URL of a specific service for R/W access
  • what are the servers that are part of a given cluster or service

DNS seems to be a natural candidate for the SOA discovery use case, via CNAME, TXT, or URI records. It would need very short TTLs to allow quick changes, which in turn requires a performant DNS service.

PHP has no concept of caching DNS records; for example, the mediawiki-config repository contains hard-coded IPs of services and hosts. HHVM does some caching, currently around 5 minutes.

Question from bd808: Wouldn't the normal PHP solution be resolver level caching on the host?

Answer: that could be a solution, but it introduces another level of caching.

Question from Brandon: we have the capability to fail down dynamically with different timings so it's important to have the clients honor the TTLs in order to be sure that they will follow whatever policy we choose for each of them.

Giuseppe's answer: yes, TTLs need to be honored; for example, before switching a service the TTL will probably be lowered to a very small value to allow a quicker switch.

Examples:

$ dig +short -t TXT api.ro.discovery.wmnet
"https://api.svc.eqiad.wmnet/w/api.php"
$ dig +short -t TXT api.rw.discovery.wmnet
"https://api.svc.codfw.wmnet/w/api.php"

State management

Complex data that doesn't fit into the DNS paradigm

Question from Brandon: why don't those have a specific URL, like a specific role of some hosts in a cluster?

Giuseppe's answer: there are cases in which that will not be enough; they are too complex to fit into the simple data structures that DNS can return.

  • Some changes require predictable coordination (e.g. a database switch)
  • Latency demands may mean that changes are needed faster than confd changes propagate
  • Third-party apps

Suggestion from Brandon: why not fix those services or languages that don't properly support all the DNS features, and use DNS for everything? It's basically a load balancer configuration.

Complex data structures (e.g. JSON blobs) do not fit well into DNS as TXT records

For example, a simplified database configuration could be: {name: db1085, shard: s2, role: slave, api: false, vslow: false}
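
With etcd v2, such an object could simply be stored as a JSON value under a per-shard key, for instance (hypothetical key layout):

$ etcdctl set /conftool/v1/dbconfig/s2/db1085 '{"shard": "s2", "role": "slave", "api": false, "vslow": false}'
$ etcdctl ls /conftool/v1/dbconfig/s2
/conftool/v1/dbconfig/s2/db1085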

Confd takes on average less than a second to propagate changes across the cluster.

Comment/question from Brandon: it's still not atomic, so for the RW/RO separation, for example, we need to pass through a state in which there is no RW, let that propagate across the cluster, and then move to the new RW. So what kind of value will we set to notify the application that RW is not available? Something like 127.0.0.1?

Answer: yes, that is a case that needs to be taken into account to have proper failovers.

Latency: doing DNS resolution every time might introduce latency if done for each request, for example in a PHP website.

Solutions

  • Confd

It can have multiple backends, can watch for changes, can create files based on templates, can run a validation script on the generated file (and NOT switch to it if invalid), and can run a custom post-creation script that could, for example, prompt the app to notice the changed config.
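
To make that concrete, a confd template resource tying these pieces together could look roughly like this (the file paths, keys, and commands are hypothetical):

# /etc/confd/conf.d/dbconfig.toml
[template]
src = "dbconfig.json.tmpl"
dest = "/etc/mediawiki/dbconfig.json"
keys = ["/conftool/v1/dbconfig"]
# Refuse to install the new file if validation fails;
# confd passes the rendered candidate file as {{.src}}.
check_cmd = "/usr/local/bin/check_dbconfig.sh {{.src}}"
# Prompt the application to notice the change once the file is live.
reload_cmd = "curl -s http://localhost/reload-config"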

For simple data structures DNS is the safer approach, because there are fewer places where the data is cached and it is easier to predict the current real applied state of a cluster; with confd there might be corner cases if something fails on specific hosts.

In code: watch an etcd URL, generate a JSON file, validate it, and upload it to the APC cache via a special URL. This does, however, add another layer of caching in the HHVM APC.
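
On the MediaWiki side, a sketch of that flow might look like the following (the file path and cache key are assumptions, not the real setup):

<?php
// Load the confd-generated JSON config, caching the parsed result in
// APCu keyed on the file's mtime so a new file causes a cache miss.
function loadDbConfig( $path = '/etc/mediawiki/dbconfig.json' ) {
    $cacheKey = 'dbconfig:' . filemtime( $path );
    $config = apcu_fetch( $cacheKey, $found );
    if ( $found ) {
        return $config;
    }
    $config = json_decode( file_get_contents( $path ), true );
    if ( $config === null ) {
        // Invalid JSON: fail hard rather than run with a broken config.
        throw new RuntimeException( "Cannot parse $path" );
    }
    apcu_store( $cacheKey, $config );
    return $config;
}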

Action items:

  • Stop relying on the configuration repository to store the live state of clusters; it needs to be treated separately.
  • Use DNS for service location/discovery and for simple data structures, keeping its limitations in mind.
  • Use confd with templates and scripts to manage more complex data structures.
  • For the MediaWiki-specific case there are a bunch of options:
    •  generate a JSON file that is read by the application, parsed by MediaWiki and cached into APC.
    •  generate a PHP file that will be included directly by the application.
  • Add safety measures to ensure that the configuration is consistent across the cluster and no host has a stale one (see the sketch below).
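
As a sketch of such a safety measure, the check_cmd from the confd example above could reject suspicious configurations before they are installed (the thresholds, paths, and the script itself are made up):

#!/bin/sh
# check_dbconfig.sh -- hypothetical validation hook run by confd,
# with the staged candidate file passed as $1 ({{.src}}).
NEW="$1"
LIVE="/etc/mediawiki/dbconfig.json"
# Reject files that are not valid JSON or that describe zero servers.
COUNT=$(jq -e 'length' "$NEW") || exit 1
[ "$COUNT" -ge 1 ] || exit 1
# Reject the change if more than 80% of the servers disappeared
# compared to the configuration currently live.
OLD=$(jq 'length' "$LIVE" 2>/dev/null || echo 0)
[ $((COUNT * 5)) -ge "$OLD" ] || exit 1
exit 0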

Q and A session:

  • bd808: Do we still have the "TC cache" issue with HHVM, where it can exhaust the cache and crash on a PHP file change?
  • Giuseppe's answer: yes, so writing a PHP file may still trigger that. There are some ideas about switching out the local cache that may help with that, but the more likely trigger is a branch deploy that changes a lot of code.
  • related to the format for MediaWiki, my preference would be the JSON format instead of a PHP file that needs to be ignored by Scap or might cause issues with deployments.
  • Giuseppe's answer: the generated file will be outside the code tree anyway. We could have a standard JSON format, but that might have consequences.
  • Filippo: where is the line that separates state from configuration? How frequently and how much do they change?
  • Giuseppe's answer: servers up/down is state, but it might be different for databases; consider, for example, the role of a DB (slow queries, etc.).

  • Jaime: for emergency cases in which you want to be fast and just push a button, that's state. I'm one of the people with the most commits in mediawiki-config, just for depooling/pooling database servers.
  • Ema: we do code reviews for configuration changes but not for state changes; should we review state changes too?
  • Giuseppe's answer: we should log state changes, and with conftool we already do, to the SAL. Same with PyBal: we run a script, which was itself code-reviewed, that makes the state change.
  • Brandon: another way of seeing it is the difference between what Puppet changes as configuration and what we execute on the server to change its state.
  • Jaime: I agree with the problem. We have two solutions (DNS and etcd), and since we don't trust etcd we have scripts to change and manage etcd state and to apply it to the hosts; it's a bit scary with all those moving parts.
  • Brandon: it's nice to be flexible; the inter-DC caching routing table, for example, is a custom configuration. For most things it will be a simple configuration, and DNS is fundamental to everything: it is the internet's service discovery system. The problem is that it's not properly implemented on the client side. We should go fix the language, not give up on DNS because a language doesn't support it well.
  • Giuseppe's answer: fixing PHP's DNS support is a complex task, but yes, DNS should be used as much as possible.
  • Mark: PyBal does it properly.
  • Giuseppe: Puppet doesn't, which we found out the hard way.
  • (Someone notes that a local resolver cache could be used as a band-aid for some languages.)
  • Brandon: we can have the logic in the DNS servers without re-implementing everything on the clients.

Conclusions

One of this quarter's goals is to perform another datacenter switchover (<https://wikitech.wikimedia.org/wiki/Switch_Datacenter>); we need to make that a less complex process. It's kind of embarrassing that we needed more than 20 minutes to switch datacenter. We want to improve and simplify the process. For most services it will be completely transparent, because it will be done via DNS. For MediaWiki it is much more complex, given all the moving parts. We have a variable that defines the active datacenter; in the short term we need to move to a setup that allows changing the state without having to commit to a configuration repository.

  • Marco: MediaWiki will need to support both ways.
  • Giuseppe: we don't want to run Puppet on a state change
  • bd808: every time I changed the DNS I had to commit to gerrit
  • Brandon: we have multiple ways of changing the DNS, and we can change that; it's a tiny text file. The other option we have is to write a plugin that watches etcd for changes.
  • Faidon: there are DNS servers out there that already support etcd: CoreDNS, SkyDNS.
  • Giuseppe: I've looked at them; they are quite limited in the data structures they support. I had to work hard to make that work in Kubernetes, so I would not go that way.
  • Brandon: we could use the GeoIP decision-making plugin to be datacenter-aware and just have etcd trigger that logic.
  • Giuseppe: I'm not sure about the reliability and scalability of those other DNS servers, and I'd prefer the one we're already using.
  • Filippo: what sort of checks will we do on the data? The most obvious ones: empty configuration, 80% of the servers disappearing, etc. What more?
  • Giuseppe: yes, we need to do that. It can be done on the conftool side, though that's probably not the best way, or on the client side, by refusing to apply a configuration considered invalid, like PyBal does.
  • Brandon: for Varnish we could write a plugin that watches keypaths on etcd and writes out the VCL configuration.
  • Faidon: daemons that write config files for other daemons are a workaround.
  • Giuseppe: yes, but it takes a bit of time and more than one iteration to get it right. Direct integration is surely the best option in the long term, but confd supports multiple backends, while a direct plugin would need to be rewritten if we move away from etcd.
  • Brandon: but the price to pay for that abstraction, to avoid being locked into etcd, is high. We might want to decide that etcd is the right tool and that we don't plan to change it in the next 5 years.
  • Marco: the semantics of the API are right, so moving away should not be that hard either.