Wikimedia Release Engineering Team/Book club/Continuous Delivery

= First meeting =
 * Discussion: 2019-03-06 17:00 UTC
 * See also: Martin Fowler doing a 20 minute presentation on the topic: https://www.youtube.com/watch?v=aoMfbgF2D_4

Personal notes

 * Lars's notes: https://files.liw.fi/temp/cdbook.mdwn
 * thcipriani really long notes: https://people.wikimedia.org/~thcipriani/continuous-delivery-book/continuous-delivery-book.html

Preface

 * “How long would it take your organization to deploy a change that involves just one single line of code? Do you do this on a repeatable, reliable basis?” p-xxiii
 * JR: I'm working from the premise that I'm reading this book to learn something. To that end, should we be looking to identify "nuggets" of knowledge from each chapter, and perhaps how we might implement that new learning?

Chapter 1: The problem of delivering software

 * Greg: single button deploy is a laudable goal, but I wonder how often that is reality?
 * Jeena: I wonder if they are talking about an "environment" that's already set up, so a single click may be less impossible
 * Brennen: I think it's possible to get pretty close to a single-button spin-up env
 * liw: for personal needs I use VMs rented from cloud providers, and have done that in a previous job: setting up DNS and the env is one command, deploying software is a second command, but it could be one command. It took a lot of work to get this done. Virtual machines make this a lot easier because you don't need to assemble a new computer to set up a new computer.
 * Tyler: We do configuration management entirely wrong. eg: We've had a db since 2001, so all of our puppet config assumes there is a db somewhere. Does this prevent us from spinning up an env with one click? Will that prevent us from doing the Deployment Pipeline?
 * Jeena: not sure it would prevent us, we can always try out things outside of production until we're satisfied that it's good enough then migrate
 * Brennen: agree we probably do everything wrong, but it's not impossible to get better (he says with a smile)
 * Lars: nothing here in current things/legacy that prevent us from building a deployment pipeline, maybe more work and not as clean as we'd like, but it shouldn't prevent us.
 * Lars: we do do everything wrong, including PHP use :), we also do many things right, we use VCS
 * Lars: compared to the world the book describes 10 years ago, we're light years ahead. Even at orgs with massively larger budgets, they have it worse. In 2019 they have people still editing live files in production.
 * Antoine: The book is too idealistic.
 * "Antipattern: Deploying Software Manually" p5
 * "Releasing software should be easy. " p24

Chapter 2: Configuration management

 * Tyler: We don't actually keep everything in VCS. MW as deployed currently, there is no exact version of what is deployed. Lots in .gitignore. Lots generated at deploy time. You can't spin up a deploy server at the current state it's in. There's maybe a dozen people who could rebuild a deploy server from scratch. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 * Tyler: our dev envs aren't able to be "production-like" because production-like in our case means built by hand <--- THIS
 * "It should always be cheaper to create a new environment than to repair an old one." p50
 * Lars: I agree when the book says that if something is painful, it should be done more often. Then you get rid of the pain. If the deployment server is hard to rebuild, then we should throw it away and rebuild it for every deployment
 * Lars: reproducible builds. See the Debian project, where they build everything twice and report an error if the results are not identical. That is unrealistic for us at this time. Maybe wait until Debian has solved this problem for us
 * https://wiki.debian.org/ReproducibleBuilds https://reproducible-builds.org/
 * CI view/dashboard https://tests.reproducible-builds.org/debian/reproducible.html
 * Brennen: good for some artifacts, but for some of what we're going to do maybe you can't have that requirement
 * Tyler: one of the benefits of the pipeline is that right now we don't have confidence in third-party dependencies, to build that confidence we wanted a progressive pipeline of tests to verify.
 * Lars: even if it's reproducible, the source code might be wrong
 * Tyler: Visibility into environments. Do we have good visibility of versions deployed into deployment-prep? "Don't break the build == don't break Beta".
 * Brennen: long timers know where to look. I can tell the info is available, but I have no idea where it is. We could work on discoverability/front-and-centerness for the holistic overview.
 * Greg: Dashboards dashboards dashboards.
 * Tyler: maybe too many dashboards
 * Greg: yes, we should make a better clearing house.
 * Brennen: if it's automated then it doesn't need documentation, wtf?!
 * Lars: I agree with the book, conditionally. Well written scripts can be documentation of how things are done. Most manual process descriptions are better shown as well written scripts. Checklists get out of date etc. Worse than not having a bad script.
 * Zeljko: especially true with installation/setup. There's no documentation for how Docker or Vagrant does that, the script is the documentation.
 * Brennen: true, but that doesn't mean we shouldn't document how the systems work. Even with self-documenting scripts, they sometimes fail and then you can't figure it out unless you already know how it's supposed to work. Context should be available to explain what's going on.
 * Jeena: I can't think of many times I've looked through the code and thought it was good documentation for itself.
 * JR: documentation should explain conceptually what it's trying to accomplish, not the step by step. Balance. Documentation is important.
 * Tyler: code tells you "how" not "why"
 * Lars: we need testable documentation.
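
Lars's description above of the Debian approach (build everything twice and report an error if the results differ) can be sketched as follows. This is a minimal sketch, not Debian's actual tooling: the names `tree_digest` and `builds_match` are hypothetical, and it assumes each build writes its output into its own directory.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Return one SHA-256 digest covering every file path and its contents under root."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def builds_match(first_build_dir, second_build_dir):
    """True if two build output directories are byte-for-byte identical."""
    return tree_digest(first_build_dir) == tree_digest(second_build_dir)
```

A CI job could run the build twice into separate directories and fail when `builds_match` returns False. As Lars notes, a match only shows the build is repeatable, not that the source code is right.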

Chapter 3: Continuous Integration
TODO: Look into speed of tests and potentially failing them when they take too long.
 * Jeena: you should never break the build when you check in to master (as opposed to their statement "just fix it after")
 * Greg: Good emphasis on various stages of testing, etc. Keeping commit stage tests to less than 10 min is a great goal. Also make a point to continually analyze how long tests are taking and optimize so your devs don't get annoyed. These are things we should be aware of / exposing.
 * Greg: Only in exceptional circumstances should you use shared environments for development.
 * Brennen: the best env I ever used was a bunch of VMs on a box, which is better than any local dev env I've ever been given
 * Jeena: I don't know why they'd expect a separate db/etc for everyone to be realistic
 * Lars: I think the background is if there's a db then it's too easy for one dev to break it for everyone else.
 * Lars: not opposed to having them, but understand where the book is coming from
 * Tyler: "is it beta?" Developers generally know when what they're doing is going to impact others (eg: destroying a db). If they do that then just make a copy for themselves to play with (paraphrased)
 * Brennen: "build scripts should be treated like your code base". Deployment is a process on a wiki page, we all know it's bad, the book keeps reminding us about this :)
 * Lars: it could be so much worse :)
 * Jeena: blamey culture "know who broke the build and why"
 * Zeljko: if things break, everyone's highest priority is to fix it
 * Jeena: could be mitigated by gating and not merging to master
 * Zeljko: I think the book is in an SVN world, not git
 * Lars: hints of blaming who made a mistake, which is an anti-pattern. Makes people really careful of making changes.
 * Greg: "Get your build process to perform analysis of your source code (test coverage, code duplication, adherence to coding standards, cyclomatic complexity, etc) "
 * Tyler: don't feel like the authors are doing a great job of highlighting the benefits of doing things. Good at highlighting best practices. Example is commit messages. You need the WHY in a commit message, not just the what. I can see the what in the code. Not to blame, but so I can understand.
 * Mukunda: Phab/Differential takes it a few steps beyond. The default config requires a test plan (verbal evidence that you tested) and revert plan.
 * Greg: Fail commit tests if they take too long?
 * JR: Z and I were talking about this earlier. Good idea to investigate.
 * Zeljko: some places don't use commit vs integration but instead fast vs slow.
 * JR: Google small, medium, large
 * Lars: speed is a good indicator, but there are other aspects, like being able to run without external things (db, fs, network). Can be enforced if necessary.
 * Lars: if we have a specific unit test phase, we should fail it if it takes too long even if all the tests pass. If the machine happens to be slow then too bad. I did this with my python unit test runner, it can be set to do this.
 * Tyler: We fail after 45 minutes.
 * commit = test stage
 * acceptance? == gate and submit
 * Zeljko: "Everyone associated with development—project managers, analysts, developers, testers—should have access to, and be accessible to, everyone else on IM and VoIP. It is essential for the smooth running of the delivery process to fly people back and forth periodically, so that each local group has personal contact with members from other groups." p75
 * but we have a communication problem with too many mediums of communication

Chapter 4: Implementing a Testing Strategy

 * Tyler: a lot of emphasis on cucumber-style testing, more so than anything else
 * JR: bluuuuguh
 * Antoine: cucumber was trendy
 * Lars: http://git.liw.fi/cgi-bin/cgit/cgit.cgi/cmdtest/tree/README.yarn is inspired by cucumber and explicitly aims to have test suites be documentation
 * "The design of a testing strategy is primarily a process of identifying and prioritizing project risks and deciding what actions to take to mitigate them." p84
 * JR: testing pyramid etc
 * Zeljko: I actually love cucumber. We tried here. It failed because developers used it. We didn't use it as a communication tool (which is what it is meant for). It was just overhead.
 * Zeljko: I don't think any team has a clear testing strategy. Finding project risks and deciding what to do about them in terms of testing. A valid strategy would be to have no tests *if* after reviewing the risk you decide that the risks are not going to be mitigated by tests.
 * Jeena: are devs responsible for writing them? (Yes)
 * Tyler: devs write them, but Antoine is responsible for how they run
 * "The best way to introduce automated testing is to begin with the most common, important, and high-value use cases of the application. This will require conversations with your customer to clearly identify where the real business value lies" p94

Chapter 5: Anatomy of the Deployment Pipeline

 * "In many organizations where automated functional testing is done at all, a common practice is to have a separate team dedicated to the production and maintenance of the test suite. As described at length in Chapter 4, “Implementing a Testing Strategy,” this is a bad idea. The most problematic outcome is that the developers don’t feel as if they own the acceptance tests." p125
 * "developers must be able to run automated acceptance tests on their development environments" p125
 * "While acceptance tests are extremely valuable, they can also be expensive to create and maintain." p126
 * "Automated acceptance testing is what frees up time for testers so they can concentrate on these high-value activities, instead of being human test-script execution machines." p128
 * Lars: we should start measuring cycle time, but maybe not as the book defines it. From the first push to gerrit to when it's in production.
 * Lars: should start measuring and graphing it now.
 * TODO: ^^^^
 * Tyler: we should build it into the pipeline itself
 * Tyler: the piece we're missing is metadata about deployments, see eg: https://gist.github.com/thcipriani/8c2dfc746591342c4bc332e5bccc9226
 * Lars: at All Hands, we wanted to brag about what we achieved last year; it'd be great if we could say we did 12,237 deploys with an average cycle time of so-and-so many seconds
 * Jeena: what's the bottleneck? Aka use metrics to identify the main bottleneck and address that first.
 * Antoine: discussion on the logspam. We spend a lot of time dealing with it during train deployment.
 * Antoine: logs are not really fed back to developers as a whole (but some do look at them)
 * Greg: Takeaways:
 * Main thing: Lack of ability for us to have a testing environment that we can easily create multiple of - push-button testing environment.  I'm sure Tyler's heartrate just increased.  We can't lead that work by ourselves.  We need help from SRE etc.  Keep on a tickler with SRE?
 * JR: Pain around testing environment (staging discussion) - those are hard discussions because the environment was growing in terms of requirements. Felt like initial discussion was about replacing beta with a staging env; but instead how do we satisfy a bunch of other requirements...  Some of this may not be as tightly coupled to SRE.  What are things besides staging environment that we could develop?
 * Greg: The solution isn't to make beta better.
 * Tyler: I disagree. We need to make beta better.  It's the only thing close to production that we have.
 * Jeena: Not sure I understand beta very well, but I think we could utilize pieces that are already there to be able to also make separate envs that are disposable.
 * Greg: Improving beta seems like fixing instead of rebuilding, is that feasible?
 * Tyler: I agree that it's a question of fixing rather than building a new one. I don't think it's possible to build a new production-like thing - production was smaller then. I think we should be building new things, new environments that approach production, but beta is as close to production-like as we will get without SRE building staging. I don't think we could ever have a staging that's production-like that we can tear down and spin back up - we _can't do that with production itself_. Because it's hard to build a prod-like environment, we have a bunch of hardware sitting idle.
 * From chat: This is a political problem.
 * Tyler: discussed during Thursday meeting. There will be a staging at some point. Will be supplanting some uses of Beta (mostly the pre-prod deploy). They don't want devs to ever touch staging. We also need a place where devs can touch and change things.
 * Jeena: how can we use production to make it reproducible?
 * Mukunda: we don't have the rights/political capital
 * Tyler: we can use Beta as this venue, it's the same puppet.
 * Tyler: Joe had a proposal on how to help beta from a while ago
 * TODO: Tyler dig that up.... ✅ https://phabricator.wikimedia.org/T161675
 * Lars: what are your overall opinions so far, has it been interesting/useful?
 * Greg: it's useful as a framework and a means to communicate and a reminder to do the things we know we need to do
 * Brennen: sometimes the most interesting bits are the parts I disagree with
 * Zeljko: reading it will be a good exercise for us to find out what we agree on etc (do we really want all libraries checked in, push button deploys, etc)
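
The cycle-time measurement Lars proposed in the Chapter 5 notes (from the first push to Gerrit until the change is in production) can be sketched as follows. This is a minimal sketch under stated assumptions: it presumes deployment metadata with two ISO 8601 timestamps per change, and the field names `first_push` and `deployed` are hypothetical, not taken from Gerrit or any existing tool.

```python
from datetime import datetime
from statistics import mean


def cycle_time_seconds(first_push, deployed):
    """Seconds between the first push of a change and its production deploy."""
    delta = datetime.fromisoformat(deployed) - datetime.fromisoformat(first_push)
    return delta.total_seconds()


def average_cycle_time(changes):
    """Mean cycle time in seconds over a list of change records (hypothetical schema)."""
    return mean(cycle_time_seconds(c["first_push"], c["deployed"]) for c in changes)
```

Given one record per deployed change, `len(changes)` and `average_cycle_time(changes)` would supply the two numbers Lars wanted to report: the deploy count and the average cycle time.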