Wikimedia Release Engineering Team/Book club/Continuous Delivery

First meeting (2019-03-07)

 * Discussion: 2019-03-06 17:00 UTC
 * See also: Martin Fowler doing a 20 minute presentation on the topic: https://www.youtube.com/watch?v=aoMfbgF2D_4

Personal notes

 * Lars's notes: https://files.liw.fi/temp/cdbook.mdwn
 * thcipriani really long notes: https://people.wikimedia.org/~thcipriani/continuous-delivery-book/continuous-delivery-book.html

Preface

 * “How long would it take your organization to deploy a change that involves just one single line of code? Do you do this on a repeatable, reliable basis?” p-xxiii
 * JR: I'm working from the premise that I'm reading this book to learn something. To that end, should we be looking to ID "nuggets" of knowledge from each chapter and perhaps how we might implement that new learning?

Chapter 1: The problem of delivering software

 * Greg: single button deploy is a laudable goal, but I wonder how often that is reality?
 * Jeena: I wonder if they are talking about an "environment" that's already set up, so a single click may be less impossible
 * Brennen: I think it's possible to get pretty close to a single-button spin-up env
 * liw: for personal needs I use VMs rented from cloud providers, and have done that in a previous job, setting up DNS and env in one command, deploying software is 2nd command, but it could be one command. It took a lot of work to get this done. Virtual Machines make this a lot easier because you don't need to assemble a new computer to set up a new computer.
 * Tyler: We do configuration management entirely wrong. eg: We've had a db since 2001, so all of our puppet config assumes there is a db somewhere. Does this prevent us from spinning up an env with one click? Will that prevent us from doing the Deployment Pipeline?
 * Jeena: not sure it would prevent us, we can always try out things outside of production until we're satisfied that it's good enough then migrate
 * Brennen: agree we probably do everything wrong, but it's not impossible to get better (he says with a smile)
 * Lars: nothing here in current things/legacy that prevent us from building a deployment pipeline, maybe more work and not as clean as we'd like, but it shouldn't prevent us.
 * Lars: we do do everything wrong, including PHP use :), we also do many things right, we use VCS
 * Lars: compared to the world the book describes 10 years ago, we're light years ahead. Even at orgs with massively larger budgets, they have it worse. In 2019 they have people still editing live files in production.
 * Antoine: The book is too idealistic.
 * "Antipattern: Deploying Software Manually" p5
 * "Releasing software should be easy. " p24

Chapter 2: Configuration management

 * Tyler: We don't actually keep everything in VCS. MW as deployed currently, there is no exact version of what is deployed. Lots in .gitignore. Lots generated at deploy time. You can't spin up a deploy server at the current state it's in. There's maybe a dozen people who could rebuild a deploy server from scratch. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 * Tyler: our dev envs aren't able to be "production-like" because production-like in our case means build by hand <--- THIS
 * "It should always be cheaper to create a new environment than to repair an old one." p50
 * Lars: I agree when the book says that if something is painful, it should be done more often. Then you get rid of the pain. If the deployment server is hard to rebuild, then we should throw it away and rebuild it for every deployment
 * Lars: reproducible builds. See the Debian project where they rebuild everything twice and if they're not identical it reports an error. That is unrealistic for us at this time. Maybe wait until Debian has solved this problem for us
 * https://wiki.debian.org/ReproducibleBuilds https://reproducible-builds.org/
 * CI view/dashboard https://tests.reproducible-builds.org/debian/reproducible.html
 * Brennen: good for some artifacts, but for some of what we're going to do maybe you can't have that requirement
 * Tyler: one of the benefits of the pipeline is that right now we don't have confidence in third-party dependencies, to build that confidence we wanted a progressive pipeline of tests to verify.
 * Lars: even if it's reproducible, the source code might be wrong
 * Tyler: Visibility into environments. Do we have a good visibility of versions deployed into deployment-prep? "Don't break the build == don't break Beta".
 * Brennen: long timers know where to look. I can tell info is available, but I have no idea where they are. We could work on discoverability/front and centerness for the holistic overview.
 * Greg: Dashboards dashboards dashboards.
 * Tyler: maybe too many dashboards
 * Greg: yes, we should make a better clearing house.
 * Brennen: if it's automated then it doesn't need documentation, wtf?!
 * Lars: I agree with the book, conditionally. Well written scripts can be documentation of how things are done. Most manual process descriptions are better shown as well written scripts. Checklists get out of date etc. Worse than not having a bad script.
 * Zeljko: especially true with installation/setup. There's no documentation for how Docker or Vagrant does that, the script is the documentation.
 * Brennen: true, but that doesn't mean we shouldn't document how the systems work. Even with self-documenting scripts, they sometimes fail and then you can't figure it out unless you already know how it's supposed to work. Context should be available to explain what's going on.
 * Jeena: I can't think of many times I'm looking through the code and think it's good documentation for itself.
 * JR: documentation should explain conceptually what it's trying to accomplish, not the step by step. Balance. Documentation is important.
 * Tyler: code tells you "how" not "why"
 * Lars: we need testable documentation.
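
Sketch (not from the meeting): Lars's description of building everything twice and reporting an error when the results differ could look roughly like this in Python. The function names and the choice to hash a whole output directory are assumptions made for illustration; the real Debian tooling does far more than this.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file under root (relative path plus contents) into one digest."""
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for path in files:
        # Include the relative path so renamed or moved files change the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def is_reproducible(first_build: Path, second_build: Path) -> bool:
    """Treat two build outputs as reproducible only if their trees hash identically."""
    return tree_digest(first_build) == tree_digest(second_build)


if __name__ == "__main__":
    # Two throwaway directories stand in for the outputs of two separate builds.
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        (Path(first) / "app.txt").write_text("same bytes")
        (Path(second) / "app.txt").write_text("same bytes")
        print(is_reproducible(Path(first), Path(second)))
```

A comparison like this only says whether two builds match. As Lars notes, it says nothing about whether the source that went in was right.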

Chapter 3: Continuous Integration
TODO: Look into speed of tests and potentially failing.
 * Jeena: you should never break the build when you check in to master (as opposed to their statement "just fix it after")
 * Greg: Good emphasis on various stages of testing, etc. Keeping commit stage tests to less than 10 min is a great goal. Also make a point to continually analyze how long tests are taking and optimize so your devs don't get annoyed. These are things we should be aware of / exposing.
 * Greg: Only in exceptional circumstances should you use shared environments for development.
 * Brennen: the best env I ever used was a bunch of VMs on a box, which is better than any local dev env I've ever given
 * Jeena: I don't know why they'd expect a separate db/etc for everyone to be realistic
 * Lars: I think the background is if there's a db then it's too easy for one dev to break it for everyone else.
 * Lars: not opposed to having them, but understand where the book is coming from
 * Tyler: "is it beta?" Developers generally know when what they're doing is going to impact others (eg: destroying a db). If they do that then just make a copy for themselves to play with (paraphrased)
 * Brennen: "build scripts should be treated like your code base". Deployment is a process on a wiki page, we all know it's bad, the book keeps reminding us about this :)
 * Lars: it could be so much worse :)
 * Jeena: blamey culture "know who broke the build and why"
 * Zeljko: if things break, everyone's highest priority is to fix it
 * Jeena: could be mitigated by gating and not merging to master
 * Zeljko: I think the book is in an SVN world, not git
 * Lars: hints of blaming who made a mistake, which is an anti-pattern. Makes people really careful of making changes.
 * Greg: "Get your build process to perform analysis of your source code (test coverage, code duplication, adherence to coding standards, cyclomatic complexity, etc) "
 * Tyler: don't feel like the authors are doing a great job of highlighting the benefits of doing things. Good at highlighting best practices. Example is commit messages. You need the WHY in a commit message, not just the what. I can see the what in the code. Not to blame, but so I can understand.
 * Mukunda: Phab/Differential takes it a few steps beyond. The default config requires a test plan (verbal evidence that you tested) and revert plan.
 * Greg: Fail commit tests if they take too long?
 * JR: Z and I were talking about this earlier. Good idea to investigate.
 * Zeljko: some places don't use commit vs integration but instead fast vs slow.
 * JR: Google small, medium, large
 * Lars: speed is a good indicator, but there's other aspects like be able to be run without external things (db, fs, network). Can be enforced if necessary.
 * Lars: if we have a specific unit test phase, we should fail it if it takes too long even if all the tests pass. If the machine happens to be slow then too bad. I did this with my python unit test runner, it can be set to do this.
 * Tyler: We fail after 45 minutes.
 * commit = test stage
 * acceptance? == gate and submit
 * Zeljko: "Everyone associated with development—project managers, analysts, developers, testers—should have access to, and be accessible to, everyone else on IM and VoIP. It is essential for the smooth running of the delivery process to fly people back and forth periodically, so that each local group has personal contact with members from other groups." p75
 * but we have a communication problem with too many mediums of communication
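
Sketch (not from the meeting): the idea of failing a unit test phase when it runs too long, even if every test passes, can be illustrated with the standard library's unittest. This is not Lars's test runner; `run_with_time_budget` and the two example test classes are hypothetical names, and the budgets are arbitrary.

```python
import time
import unittest


def run_with_time_budget(suite: unittest.TestSuite, budget_seconds: float) -> bool:
    """Return True only if every test passes AND the whole run fits in the budget."""
    started = time.monotonic()
    result = unittest.TestResult()
    suite.run(result)
    elapsed = time.monotonic() - started
    return result.wasSuccessful() and elapsed <= budget_seconds


class FastTest(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)


class SlowTest(unittest.TestCase):
    def test_sleepy(self):
        # Passes, but deliberately takes longer than a tight budget allows.
        time.sleep(0.2)


if __name__ == "__main__":
    loader = unittest.TestLoader()
    print(run_with_time_budget(loader.loadTestsFromTestCase(FastTest), budget_seconds=5.0))
    print(run_with_time_budget(loader.loadTestsFromTestCase(SlowTest), budget_seconds=0.05))
```

The budget is wall-clock time, so a slow machine can fail the phase. That matches the "too bad" stance in the notes, but it is a design choice rather than a requirement.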

Chapter 4: Implementing a Testing Strategy

 * Tyler: a lot of emphasis on cucumber style testing, more so than anything else
 * JR: bluuuuguh
 * Antoine: cucumber was trendy
 * Lars: http://git.liw.fi/cgi-bin/cgit/cgit.cgi/cmdtest/tree/README.yarn is inspired by cucumber and explicitly aims to have test suites be documentation
 * "The design of a testing strategy is primarily a process of identifying and prioritizing project risks and deciding what actions to take to mitigate them." p84
 * JR: testing pyramid etc
 * Zeljko: I actually love cucumber. We tried here. It failed because developers used it. We didn't use it as a communication tool (which is what it is meant for). It was just overhead.
 * Zeljko: I don't think any team has a clear testing strategy. Finding project risks and deciding what to do about them in terms of testing. A valid strategy would be to have no tests *if* after reviewing the risk you decide that the risks are not going to be mitigated by tests.
 * Jeena: are devs responsible for writing them? (Yes)
 * Tyler: devs write them, but Antoine is responsible for how they run
 * "The best way to introduce automated testing is to begin with the most common, important, and high-value use cases of the application. This will require conversations with your customer to clearly identify where the real business value lies" p94

Chapter 5: Anatomy of the Deployment Pipeline

 * "In many organizations where automated functional testing is done at all, a common practice is to have a separate team dedicated to the production and maintenance of the test suite. As described at length in Chapter 4, “Implementing a Testing Strategy,” this is a bad idea. The most problematic outcome is that the developers don’t feel as if they own the acceptance tests." p125
 * "developers must be able to run automated acceptance tests on their development environments" p125
 * "While acceptance tests are extremely valuable, they can also be expensive to create and maintain." p126
 * "Automated acceptance testing is what frees up time for testers so they can concentrate on these high-value activities, instead of being human test-script execution machines." p128
 * Lars: we should start measuring cycle time, but maybe not as the book defines it. From the first push to gerrit to when it's in production.
 * Lars: should start measuring and graphing it now.
 * TODO: ^^^^
 * Tyler: we should build it into the pipeline itself
 * Tyler: the piece we're missing is metadata about deployments, see eg: https://gist.github.com/thcipriani/8c2dfc746591342c4bc332e5bccc9226
 * Lars: at All Hands, we wanted to brag about what we achieved last year, it'd be great if we could say we did 12,237 deploys with an average cycle time of so and so many seconds
 * Jeena: what's the bottleneck? Aka use metrics to identify the main bottleneck and address that first.
 * Antoine: discussion on the logspam. We spend a lot of time dealing with them during train deployment.
 * Antoine: logs are not really fed back to developers as a whole (but some do look at them)
 * Greg: Takeaways:
 * Main thing: Lack of ability for us to have a testing environment that we can easily create multiple of - push-button testing environment.  I'm sure Tyler's heartrate just increased.  We can't lead that work by ourselves.  We need help from SRE etc.  Keep on a tickler with SRE?
 * JR: Pain around testing environment (staging discussion) - those are hard discussions because the environment was growing in terms of requirements. Felt like initial discussion was about replacing beta with a staging env; but instead how do we satisfy a bunch of other requirements...  Some of this may not be as tightly coupled to SRE.  What are things besides staging environment that we could develop?
 * Greg: The solution isn't to make beta better.
 * Tyler: I disagree. We need to make beta better.  It's the only thing close to production that we have.
 * Jeena: Not sure I understand beta very well, but I think we could utilize pieces that are already there to be able to also make separate envs that are disposable.
 * Greg: Improving beta seems like fixing instead of rebuilding, is that feasible?
 * Tyler: I agree that it's a question of fixing rather than building a new one. I don't think it's possible to build a new production like thing - production was smaller then.  I think we should be building new things, new environments that approach production, but beta is as close to production like as we will get without SRE building staging.  I don't think we could ever have a staging that's production-like that we can tear down and spin back up - we _can't do that with production itself_.  Because it's hard to build a prod-like environment, we have a bunch of hardware sitting idle.
 * From chat: This is a political problem.
 * Tyler: discussed during Thursday meeting. There will be a staging at some point. Will be supplanting some uses of Beta (mostly the pre prod deploy). They don't want devs to ever touch staging. We also need a place where devs can touch and change things.
 * Jeena: how can we use production to make it reproducible?
 * Mukunda: we don't have the rights/political capital
 * Tyler: we can use Beta as this venue, it's the same puppet.
 * Tyler: Joe had a proposal on how to help beta from a while ago
 * TODO: Tyler dig that up.... ✅ https://phabricator.wikimedia.org/T161675
 * Lars: what are your overall opinions so far, has it been interesting/useful?
 * Greg: it's useful as a framework and a means to communicate and a reminder to do the things we know we need to do
 * Brennen: sometimes the most interesting bits are the parts I disagree with
 * Zeljko: reading it will be a good exercise for us to find out what we agree on etc (do we really want all libraries checked in, push button deploys, etc)
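
Sketch (not from the meeting): Lars's working definition of cycle time, from the first push to Gerrit until the change is in production, can be computed once both timestamps are recorded per change. The `Change` record and its field names are hypothetical; where the timestamps would actually come from (Gerrit, deploy metadata) is left open, as it was in the discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Dict, List


@dataclass(frozen=True)
class Change:
    change_id: str
    first_pushed: datetime  # when the first patchset was pushed for review
    deployed: datetime      # when the change went live in production


def cycle_time(change: Change) -> timedelta:
    """Elapsed time between first push and production deploy."""
    return change.deployed - change.first_pushed


def summarize(changes: List[Change]) -> Dict[str, float]:
    """Aggregate numbers of the kind that could be graphed or quoted at All Hands."""
    seconds = [cycle_time(c).total_seconds() for c in changes]
    return {
        "deploys": len(seconds),
        "mean_seconds": mean(seconds),
        "median_seconds": median(seconds),
    }


if __name__ == "__main__":
    sample = [
        Change("example-1", datetime(2019, 3, 1, 12, 0), datetime(2019, 3, 1, 13, 0)),
        Change("example-2", datetime(2019, 3, 2, 12, 0), datetime(2019, 3, 2, 15, 0)),
    ]
    print(summarize(sample))
```

The median is reported next to the mean because a few changes that sit in review for weeks would otherwise dominate the average.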

Personal notes

 * Lars' notes for ch 6 and 7: https://files.liw.fi/temp/ch6-7.mdwn (will be deleted eventually)

Željko's quotes from the book

 * Chapter 6

p 153: Use the Same Scripts to Deploy to Every Environment

p 166: Test Targets Should Not Fail the Build In some build systems, the default behavior is that the build fails immediately when a task fails. This is almost always a Bad Thing—instead, record the fact that the activity has failed, and continue with the rest of the build process


 * Chapter 7

p 169 Introduction Ideally, a commit stage should take less than five minutes to run, and certainly no more than ten.

p170 Commit Stage Principles and Practices p171 Provide Fast, Useful Feedback p172 A common error early in the adoption of continuous integration is to take the doctrine of “fail fast” a little too literally and fail the build immediately when an error is found. We only stop the commit stage if an error prevents the rest of the stage from running—such as, for example, a compilation error. Otherwise, we run the commit stage to the end and present an aggregated report of all the errors and failures so they can all be fixed at once.
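
Sketch (not from the book): the p172 advice amounts to running every commit-stage check, collecting the failures into one report, and stopping early only when a failure makes the remaining checks meaningless. `run_commit_stage` and `StageBlocked` are hypothetical names used for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A check passes by returning normally and fails by raising an exception.
Check = Callable[[], None]


class StageBlocked(Exception):
    """Raised by a check whose failure prevents the rest of the stage from running."""


def run_commit_stage(checks: Dict[str, Check]) -> List[Tuple[str, str]]:
    """Run checks in order and return (name, message) for every failure."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except StageBlocked as error:
            # For example a compilation error: record it and stop.
            failures.append((name, str(error)))
            break
        except Exception as error:
            # Ordinary failure: record it and keep going.
            failures.append((name, str(error)))
    return failures
```

An empty list means the stage passed. Anything else is the aggregated report, so all the errors can be fixed in one go.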

p177 Commit Test Suite Principles and Practices The vast majority of your commit tests should be comprised of unit tests. Sometimes we fail the build if the suite isn’t sufficiently fast.

p178 Figure 7.3 Test automation pyramid

Chapter 6: Build and Deployment Scripting

 * Z: Thoughts on our deployment process - we changed from https to ssh for push...
 * Z: use the same scripts (p 153): we don't, eg for Beta, only way to test is to do it in production
 * Z: fixing documentation as I was doing it
 * Z: not all of us were using the same process, eg: mukunda was already using gerrit over ssh instead of the http auth
 * Greg: It's very easy for parallel deploy processes to spring up that only devs / ops use, we're at the point where we have multiple deploy scripts, effectively. Were there changes needed for deployments in Beta?
 * Tyler: It's really due to branch cutting.
 * Lars: A lot of this is obsolete (on technical specifics) - but I agree with the point that we should all do deployments the same way everywhere, and agree with the further point that we should use the same script for all deployments. We're not doing that partly because we can't set up a local dev environment that resembles production.
 * Dan: We do use scap to deploy to the beta cluster, so the tools are the same, but the environment is still drastically different.
 * Z: I think the point of this was you should test your deployment scripts from local -> intermediate -> production. If you deploy with the same script multiple times a day, things are tested thousands of times.
 * Dan: I think that's actually a false sense of security. At some point your processes have to be the same; using the same script to push to one place as another isn't enough, and it seems like there's more to this, especially in our envs. The process that we use to deploy to beta cluster is not aligned with the process for production. We don't actually stage - we have a jenkins script doing that deployment. The cadence of deployment differs so drastically that we don't get a good sense of how the prod deploy will go based on what's happening in beta.
 * Z: I think what the book says is that your environments should be similar.
 * B: emphasis on testing the deployment system/having the deployment system interrogate the environment
 * T: smoke tests?
 * L: not smoke tests, but checking that extensions are in-place and possibly doing a black-box test that verifies this
 * Z: I'm glad that scap has canaries
 * T: Things that scap checks:
 * that files exist on disk
 * runs a basic check that PHP exists
 * checks to see the app isn't throwing errors on stdout
 * does linting of JSON and PHP
 * does smoke tests at the end - deploys to canaries, then checks canaries
 * L: Seems like we're not doing things badly - perhaps we can do better in future in some ways.
 * D: That approach works in production because we have a lot of traffic. Not as much in beta.
 * T: Currently using service checker - https://en.wikipedia.org/spec.yaml
 * D: Service checker might be easier than Selenium etc.
 * T: We're currently not using it to its fullest, would like to get to the point of doing that before canaries
 * TODO: Look into running service-checker on the deployment host before pushing out to the canaries
 * L: Would like to bring up that we don't have ways of bringing up environments. We could do with fairly frequent reviews of what the deployment process looks like.  We should all be involved.
 * Z: All should deploy.
 * L: Yes, but also we should all be involved in automating the deployment.
 * Z: Two reasons everyone should deploy:
 * I don't think everybody hates deployment as much as they should.
 * different skill sets mean that we bring different improvements to building a robust deployment system
 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Skill_matrix
 * T: we rotate, but we don't do pairing on that, so we're just focused on doing it, not improving it, we don't have the capacity to even think about those improvements
 * B: Emphasis on pairing for deploys seems more important to me having watched this happen.
 * (Discussion of how bad or not the train experience is.)
 * J: I don't think a human should have to do it :)
 * L: I agree.
 * (Some more discussion of pairing, etc.)
 * G: To tie a bow on this - would be worth experimenting with pairing with one person focusing on automation for a quarter or something. Would be a good experiment to timebox and see what could be accomplished.  Tradeoff we have to remember is that whatever investment we put into scap is going to be lost in a year and a half.
 * Z: Thought I had about pairing was a new person + existing team member... Maybe we need a serious level up on scap, etc.
 * T: My thing about "it's not worth investing time now" - there are cultural shifts that are going to have to happen for continuous deployment. I think those are things that if we invest some time in scap with the view that our goal is continuous... That sort of work won't necessarily be lost.
 * J: Maybe we just work faster to get rid of it.
 * L: Backports?
 * M: backport specifically meaning cherry-pick to the deployed production branch - the thing that SWAT does a lot of
 * T: Fixing stuff right now.
 * G: Porting code changes is easier than porting the bullet points on a wiki page.
 * M: Don't think devs will push back on continuous deployment.
 * B: There's gotta be stuff to automate.
 * T: I think the list of bullet points on the wiki page means we ran out of stuff to fix. Automating the entire train should be possible.
 * D: The manual process right now is basically bouncing around from one disparate tool to another.
 * Z: One tool to rule them all?
 * D: Either port the other tools into scap or wrap them up in something.
 * (Points from chat about rolling branch cutting / image that contains /srv/mediawiki-staging/ into the pipeline.)
 * L: logspam being something we can't automate?
 * Z: yes, it's a pain to look through the logstash dashboard, hunting people down and getting them to fix it is a pain
 * L: anything that looks like an error halts the deployment and it goes to the developer to fix it
 * Z: that's basically the process now
 * L: if there's anything in logs we abort without question
 * Z: if there's anything I still block it from going forward (iow: not roll back, unless there's other issues)
 * B: could we start with aborting on any _increase_ in logspam? i feel like the book mentioned something like this at one point.
 * L: CDep will make this easier as the amount of change will be small (one change at a time) so we know what the cause is
 * B: can we make a concerted effort to get there?
 * Z:
 * M: we have tried for a while
 * G: Tooling is partly to blame for the current state of affairs.
 * T: MediaWiki known errors dashboard
 * JR: This is the same argument that I hear when talking about TechDebt. I think saying "stop everything" is a non-starter.  However, coming up with a plan to work towards "0" could encompass a "no new spam" and "remove existing log spam in small increments" approach...
 * M: what we need is statistical anomaly detection on the error log _rate_
 * T: There's a lot of pressure from the outside - chapter 1 - the process is subverted to meet the timeframe. That's what people mean when they say this isn't workable.
 * L: Other crazy idea:
 * Get a way to set up a production-like env
 * Smoke test / etc.
 * Continuous deployment there
 * Notify people who make changes if there are errors in the logs
 * T: There are Icinga alerts for production if error rate goes above... Something. And beta may mirror production there.  Beta is continuously deployed to, but it's really not production-like.  We'd have to create a more-production-like beta...
 * D: Say the word...
 * The greek chorus: STAGING
 * J: maybe if we make this new environment we could also work on automating DB creation
 * T: Mukunda proposed a long time ago the long-lived branches thing - merging to a branch rather than cutting a new branch every week might be a prereq for being able to deploy smaller groups of changes in a more continuous fashion.
 * M: The way I see it the best place we could get to in near future is to be always SWATing. That requires everything to go through a process.
 * T: That requires long-lived branches.
 * M: Sure - always swatting to the deployed branches.
 * D: A lot of what we're discussing in this are things we've considered for the pipeline design - incremental test gating, etc., so we could invest time on our existing process and tooling or we could invest that into implementing the timeline.
 * T: I'm curious how much of this is just necessary prerequisite work for the pipeline.


 * (Discussion of pairing goals, logspam as a long-term problem and whether we should make that a goal.)
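
Sketch (not from the meeting): Brennen's "abort on any increase in logspam" and Mukunda's "statistical anomaly detection on the error log rate" could start from something as simple as comparing the current error rate with a recent baseline. The function name, the three-standard-deviation default, and the idea of feeding it errors-per-minute samples are all assumptions. Nothing here reflects how scap, Logstash, or Icinga work today.

```python
from statistics import mean, stdev
from typing import Sequence


def error_rate_is_anomalous(
    baseline: Sequence[float], current: float, threshold: float = 3.0
) -> bool:
    """Flag `current` if it sits more than `threshold` standard deviations above the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any increase at all counts as anomalous.
        return current > baseline[0]
    return (current - mean(baseline)) / spread > threshold


if __name__ == "__main__":
    errors_per_minute = [10, 12, 11, 9, 10, 11, 12, 10]  # made-up pre-deploy samples
    print(error_rate_is_anomalous(errors_per_minute, current=11))
    print(error_rate_is_anomalous(errors_per_minute, current=40))
```

Only increases are flagged, which fits JR's "no new spam" framing: the existing noise becomes the baseline and a deploy is questioned only when it adds to it.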

Chapters 7, 8, and 9 for next time.

Prepared notes

 * Lars
 * ch 6-7: https://files.liw.fi/temp/ch6-7.mdwn
 * ch 8-9: https://files.liw.fi/temp/ch8-9.mdwn

Chapter 7: The commit stage

 * Greg: increasing unit-test coverage as you go/not letting it go lower/ratcheting would be a good thing for us to try
 * p170 Commit Stage Principles and Practices
 * p171 Provide Fast, Useful Feedback
 * p172 A common error early in the adoption of continuous integration is to take the doctrine of “fail fast” a little too literally and fail the build immediately when an error is found.
 * p172 We only stop the commit stage if an error prevents the rest of the stage from running—such as, for example, a compilation error. Otherwise, we run the commit stage to the end and present an aggregated report of all the errors and failures so they can all be fixed at once.
 * Greg: this is saying don't stop running unit tests just because one unit test fails
 * Lars: compilers, with long source code: even if you have a error early it keeps going, it does it all and gives you all errors
 * Greg: talked about this morning: fast vs slow feedback, not unit vs integration in the commit stage
 * Lars: I agree that this stage shouldn't be about integration vs unit tests since the more things you catch at this stage the better
 * JR: finding balance of execution speed, if a test suite takes too long, then people might not run those tests. There's not a definitive answer of what is "too long". People are different.
 * Lars: UX research gives us some thresholds for attention / response time 100ms break flow of concentration. between 5 seconds (can stop and wait) and 5 minutes. After 5 minutes they'll go for coffee/context switch.
 * p169 Introduction
 * p169 Ideally, a commit stage should take less than five minutes to run, and certainly no more than ten.
 * Zeljko: extensions take a long time so folks merge those first before config commit. This morning it took 41 minutes(!!!) to merge an extension
 * https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseLexeme/+/500237/
 * JR: 10 minutes still seems like a long time, especially now that things are faster
 * (Some back and forth on complexity vs. getting more stuff from testing than you once would have.)
 * JR:  Is it just a time thing, or is it what you would expect to get from that time?
 * Lars: My understanding is it's just a time thing, but... Things should be on the order of 5 or 10 seconds, 60 seconds is getting long (for commit stage stuff).
 * Greg: We need good metrics that we watch and keep an eye on.
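
Sketch (not from the meeting): the coverage "ratchet" Greg mentions can be expressed as a floor that only ever moves up. `check_coverage_ratchet` is a hypothetical helper; how the previous floor is stored and how coverage is measured are left out on purpose.

```python
def check_coverage_ratchet(
    previous_percent: float, current_percent: float, tolerance: float = 0.0
) -> float:
    """Return the new coverage floor, or raise if coverage fell below the old one.

    `tolerance` allows for small measurement noise (in percentage points).
    """
    if current_percent + tolerance < previous_percent:
        raise ValueError(
            f"coverage dropped from {previous_percent:.1f}% to {current_percent:.1f}%"
        )
    # The floor never goes down, even when a drop within tolerance is accepted.
    return max(previous_percent, current_percent)


if __name__ == "__main__":
    floor = 80.0
    floor = check_coverage_ratchet(floor, 82.5)  # coverage improved, floor rises
    print(floor)
```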

Artifact repositories:
 * Dan: Thought it was interesting these concepts matched to stuff in Tekton and Argo - discrete inputs and outputs for each stage.  Inputs could be artifact repo or git repo.
 * Lars: I think we should have one.
 * Dan: Yeah we really need a _general_ one (for binary artifacts).
 * (We have the Docker registry and Archiva for Java stuff.)
 * JR: What kind of artifacts do we see?  Besides Docker images?
 * Dan: kask wanted to build a go binary *somewhere*
 * localization cache
 * frontend build step (cf: https://phabricator.wikimedia.org/T199004 - RFC: Add a frontend build step to skins/extensions to our deploy process)
 * https://phabricator.wikimedia.org/T53447 Store CI builds output outside Jenkins (e.g. static storage)


 * p173-174 emphasis on self-serveness of things, sorta.
 * p173, "we consider it a failure if we get to the point where only those specialists can maintain the CI system"


 * p178 Figure 7.3 Test automation pyramid
 * Zeljko: it's not a pyramid, it's a triangle! ;P

Chapter 8: Automated acceptance testing

 * Dan: I'm confused as to what qualifies as acceptance testing.
 * Lars: Acceptance testing is testing for requirements that come from "business".  i.e., the movement in general, readers, the Foundation in general, etc.  This is true whether automated or not - automated is different from user acceptance testing.
 * Dan: smoke tests I think of as system functional tests, not acceptance tests
 * Lars: I felt that this was one of the most important chapters (although terminology is unclear) acceptance tests are user-facing as opposed to tests that developers need.
 * p188 Project managers and customers often think they are too expensive to create and maintain—which indeed, when done badly, they are.
 * Zeljko: pg188 people often think that acceptance tests are expensive to create and maintain, and when done badly they are -- this is what the last years of acceptance tests have taught me as well. Tools that autogenerate tests or inexperienced users create tests that are expensive to maintain
 * Dan: selenium tests at wmf are written by folks who know the code deeply or don't code. Using the UI to test very specific underlying behavior makes for flaky tests
 * Zeljko: when code base grows and they feel that they need tests, so they write selenium tests -- which is easy since it means you don't need to refactor code -- which leads to 100s of selenium tests and few other tests
 * Dan: TDD affected the way that I write code whereas testing via selenium doesn't add any constraints on the development process itself
 * JR: a lot of the perceived value of unit testing is "correctness", but people who write unit tests see that as a secondary effect; i.e., the value of unit tests is that it affects the way you write code. Unit tests are part of the design process.
 * JR: acceptance testing via the UI is difficult to do since the UI is a moving target -- difficult to keep the acceptance tests functioning -- this can sometimes be a lot of effort.
 * Zeljko: antipattern -- different people write test automation than those who write production code -- this jibes with what JR mentioned. antipattern: when acceptance tests don't run per commit. All changes pushed to core now run selenium tests. PageObject centralizes the page elements in one location (i.e., a page #id change doesn't require a lot of grepping) -- in the book it's ApplicationDriver
 * Lars: agree with 3 previous speakers. TDD entirely revolutionized the way I design my code. Not just that code became more modular, it became more testable. Code that is written to be tested tends to be of a much higher quality than code where tests are written afterwards. antipattern: other people writing tests than those who wrote the code, tests are of worse quality.
 * p191 Such tests are often the output of record-and-playback-style test automation products, which is one of the main reasons automated acceptance tests are perceived as expensive.
 * Zeljko: another antipattern: a recorder that writes tests. Scatters full xPaths all over. Makes for brittle tests. This is an antipattern of people who are not developers writing tests. These tests are brittle and expensive.
 * Dan: lyon hackathon -- pageobject related -- there has to be some degree of coupling between the test code and the application. The book is saying the coupling shouldn't occur on the ui-element level (relates to pageobject etc). We did experiment with having a library that can talk with ooui objects. You can interrogate JavaScript elements to determine behavior of the UI. We could have something that skips the UI entirely and just interrogates those OOUI objects.
 * Zeljko: There was a way to do this using the VisualEditor and the JavaScript api
 * Dan: I think that when you get into assessing visual outputs, i.e., screenshots, like a human would do, that tends to be fragile. Even at a level lower than that  -- on DOM elements -- that's still too much noise. BUT at a level lower than that you're dealing with a thing a developer is familiar with and is more reliable.
 * Zeljko: end-to-end tests are what the user sees so there's some value in that, e.g., if all the text is invisible, but the ooui tests still pass -- I think we still need a few e2e tests, but not a lot
 * JR (from chat): I think the key to these kinds of approaches is having minimal code (no business logic) above the layer you're interacting with.
 * Lars:  To get automated testing, you need collaboration of 3 kinds of people:  User/business owner, developer who writes code / implements tests, and a translator - the tester - who speaks user and developer speak.
 * Brennen: p189: "insufficient time is normally planned to fix the defects found as part of manual acceptance testing" - this rang pretty true.
 * Lars: BTDT, wasn't happy
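
A minimal sketch of the PageObject idea discussed above (what the book calls the application driver layer): the selectors live in one class, so a changed element id means one edit instead of a grep across every test. Everything here is hypothetical and for illustration only: FakeDriver stands in for a real browser driver, and the selectors and method names are made up rather than taken from MediaWiki or Selenium.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; it records actions
    instead of driving a real browser."""

    def __init__(self):
        self.actions = []

    def fill(self, selector, value):
        self.actions.append(("fill", selector, value))

    def click(self, selector):
        self.actions.append(("click", selector))


class LoginPage:
    """Page object: the only place that knows the login form's selectors.
    The selectors below are invented for this example."""

    USERNAME = "#username"
    PASSWORD = "#password"
    SUBMIT = "#login-button"

    def __init__(self, driver):
        self.driver = driver

    def log_in(self, username, password):
        # Tests call this behaviour-level method and never touch selectors.
        self.driver.fill(self.USERNAME, username)
        self.driver.fill(self.PASSWORD, password)
        self.driver.click(self.SUBMIT)


driver = FakeDriver()
LoginPage(driver).log_in("alice", "secret")
```

If the login button's id changes, only LoginPage.SUBMIT needs updating; tests that call log_in stay the same. This does not settle Dan's objection that the abstraction is still organized around page structure rather than behaviour.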

Next meeting

 * chapters X-Y OR we pause to let everyone catch up?
 * read through chapter 9, discuss at offsite
 * TODO: Greg to find time and invite everyone via calendar

Prepared notes

 * https://files.liw.fi/temp/ch8-9.mdwn

Željko's quotes:


 * Chapter 8: Automated Acceptance Testing

p188 Why Is Automated Acceptance Testing Essential? p188 Project managers and customers often think they are too expensive to create and maintain—which indeed, when done badly, they are.

p190 How to Create Maintainable Acceptance Test Suites p191 Such tests are often the output of record-and-playback-style test automation products, which is one of the main reasons automated acceptance tests are perceived as expensive.

p198 The Application Driver Layer p199 That value is randomized and will change every time the test is run because a new user is created each time. This has two benefits. First, it makes acceptance tests completely independent of each other. Thus you can easily run acceptance tests in parallel without worrying that they will step on each other’s data. p200 One of the consequences of a well-designed application driver layer is improved test reliability.

p204 Implementing Acceptance Tests p204 State in Acceptance Tests p205 The ideal test should be atomic. Having atomic tests means that the order in which they execute does not matter, eliminating a major cause of hard-to-track bugs. It also means that the tests can be run in parallel, which is essential to getting fast feedback once you have an application of any size. An atomic test creates all it needs to execute and then tidies up behind itself, leaving no trace except a record of whether it passed or failed.
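
The two quotes above (a randomized user per test, and atomic tests that tidy up after themselves) can be sketched roughly as follows. This is an illustration under assumptions, not how any existing test suite does it: InMemoryUserStore is a hypothetical stand-in for the application, and the function names are invented.

```python
import uuid


class InMemoryUserStore:
    """Hypothetical stand-in for the application under test."""

    def __init__(self):
        self.users = {}

    def create(self, name):
        if name in self.users:
            raise ValueError(f"user {name!r} already exists")
        self.users[name] = {"edits": 0}

    def record_edit(self, name):
        self.users[name]["edits"] += 1

    def delete(self, name):
        del self.users[name]


def unique_username(prefix="test-user"):
    # A fresh random name per test, so tests never share a user.
    return f"{prefix}-{uuid.uuid4().hex}"


def check_new_user_can_edit(store):
    """Atomic check: creates what it needs, then removes it again."""
    name = unique_username()
    store.create(name)
    try:
        store.record_edit(name)
        return store.users[name]["edits"] == 1
    finally:
        # Tidy up even if the check fails, leaving no trace behind.
        store.delete(name)


store = InMemoryUserStore()
results = [check_new_user_can_edit(store) for _ in range(3)]
```

Because each run uses its own user and deletes it afterwards, the checks can run in any order, or in parallel against a shared store, without stepping on each other's data.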

p213 The Acceptance Test Stage p213 We know from experience that without excellent automated acceptance test coverage, one of three things happens: Either a lot of time is spent trying to find and fix bugs at the end of the process when you thought you were done, or you spend a great deal of time and money on manual acceptance and regression testing, or you end up releasing poor-quality software.

p214 Keeping Acceptance Tests Green p215 Who Owns Acceptance Tests? p215 In order to keep your acceptance tests working and to maintain the focus of the developers on the behavior of the application, it is important that the acceptance tests are owned and maintained by the delivery team as a whole, not by a separate testing team. p216 It is essential to fix acceptance test breakages as soon as possible, otherwise the suite will deliver no real value. The most important step is to make the failure visible.

p217 Deployment Tests p217 The best acceptance tests are atomic—that is, they create their own start conditions and tidy up at their conclusion.


 * Chapter 9: Testing Nonfunctional Requirements

p234 The Capacity-Testing Environment p236 One obvious strategy to limit the test environment costs and to provide some sensibly accurate performance measures is available where the application is to be deployed into production on a farm of servers, as shown in Figure 9.1. Replicate one slice of the servers, as shown in Figure 9.2, not the whole farm.

Chapter 8: Automated Acceptance Testing

 * p188 Project managers and customers often think they are too expensive to create and maintain—which indeed, when done badly, they are.
 * Our experience here as well.
 * decrease costs by: reducing the number of acceptance tests and using unit/integration tests instead. And just following best practices (there's a lot of 'smelly' selenium tests)
 * eg: using page_object instead of rolling your own
 * Lars: I agree that writing tests can be very expensive; this is because it's code, not because it's test code specifically.  People write hacky / poor quality code.  You wind up needing a test suite for your test suite, etc.  The solution to this is to write test code that is better than anything else.  The things that affect quality of test code are the same that affect production code.
 * It's possible that we should be putting more effort into review of test code.
 * Ž: There's an anti-pattern of people trying to keep their test code too DRY.
 * JR: A lot of times end-to-end / browser driven stuff is left to people who aren't developers, who don't have a lot of experience writing code. Further, record-and-playback stuff gets used and creates inherently unmaintainable code.  Unmaintainable code is a nail in the coffin over the long term.  Trying to ensure that people writing test code have the necessary knowledge is crucial.
 * Devs are in the best positions to actually write this stuff, with the aid of test engineers to define what it is that needs to be tested.
 * One person can't support 20 devs here.
 * Dan: I actually think pageobject (sp?) is an antipattern itself.  Centered around the structure of the website rather than its behavior.
 * Dan: the interfaces you use in the tests should mirror the interfaces that users use
 * Lars: The point is to write something that provides a useful high-quality (high-level?) abstraction.
 * Jeena: What they're calling the window driver pattern in the book?
 * Lars: probably yes
 * JR: (paraphrasing grotesquely: software design needs to accommodate testing)
 * Dan: An issue these days is that nobody writes HTML any more. Coupling needs to happen at a level of abstraction where humans are working, not at the level of generated code, etc., things like the DOM.
 * Lars: One thing the book raises is test the API under the UI rather than the UI.  No business logic in the UI.  The layer underneath needs to be designed testable.  Reiterating that code needs to be designed to be tested.
 * Adding testing to code after the fact is hard.
 * Lars: Test code should be your best code, but also it should be obviously correct because you don't want to test your test code.
 * Dan: Like overmocking?
 * Lars: If mocking helps your test code be obviously correct, go for it, but I've tended to find that it makes things overly complex.

Chapter 9: Testing non-functional requirements

 * Lars: ch8 more important than ch9 for us, especially right now
 * L: perf and other capacity testing requires very specialized knowledge (considering the hardware shape, how you can extrapolate effectively and when you can't)
 * L: cyclomatic complexity testing, not always useful
 * Z: we're doing some of that with SonarQube, but it always needs a human to review and verify
 * JR: a lot of nonfun testing can be done at smaller scopes, re perf, could have them help with unit/int tests
 * JR: nonfun are not defined to the same extent as functional tests are (the quality of how, not the what it does). less testing and more of a measuring activity.
 * Greg: I see it more as metrics we measure over time, not a hard pass/fail
 * L: yes for Wikimedia, but maybe not for new things, eg making a race car has a real dependency on how fast it is
 * JR: Book: _Mastering the requirements process_ - Suzanne Robertson and James Robertson, about how we create things, examples are physical world things (the london tube)
 * B: lots of notes around the programming for capacity section. "communication across process/network boundaries is expensive and should be minimized" seems like something modern development has thrown out completely.
 * L: hardware has also left that behind: network i/o is faster than disk i/o, especially in datacenters
 * Z: you can just scale it down to have a prod-like env
 * L: never use the word "staging" again

Lars
Chapter 10: Deploying and Releasing Applications


 * automate deployment completely
 * the only difference between envs in what gets deployed should be the config
 * dev and ops need to collaborate - releng too, except our very existence may be an anti-pattern
 * being able to rollback is good; even better if the failed deployment can be kept (inert) for analysis and debugging
 * start planning deployments when project starts - we're late
 * same deployment process for every environment
 * deploy a lot so we get a lot of practise
 * infra, middleware, and application should be deployed using the same process
 * zero-downtime: blue/green, canaries
 * emergency fixes should use the same process

Chapter 11: Managing Infrastructure and Environments


 * a little obsolete about details
 * automate everything
 * keep everything in version control

Chapter 10: Deploying and Releasing Applications

 * Greg: It would have been great if, as an organization, we would have been on the same page about this stuff and had these discussions way back in the day.
 * Jeena: I was confused about the part where they talk about getting permission for every little thing.
 * Lars: Worth pointing out that this was written 10 years ago, and the world has changed - partly because of this book.  There was a time when if you wanted to make anything change in production you went in front of a change review board - or someone with power & experience - and got them on board.  Because obviously you need an expert to approve so that things are safe and you don't break the internet.  It turns out that, apart from places like telecom and banking and medicine, the pre-approval of changes for every little thing doesn't really work very well.
 * Greg: We've kind of grown as an organization - the mentality for deployments used to be kind of like a CAB. More correctly:  The Erik Moeller Change Advisory Person.  Heads of engineering would show up to meetings and talk about what was planned.  Does Erik have issues with it?  The benefit of that meeting was "Does Erik's spidey sense flare up?"
 * We don't really do any of this any more.
 * (Digression about superprotect.)


 * Željko: I'm not sure we have a production-like environment.
 * Tyler: Beta, sorta. Chicken counting.  Useful for MediaWiki deploys.
 * Greg: Beta is not as horrible as it's made out to be.
 * Mukunda: The horrible part is that maintaining it is hard.
 * Tyler: We kind of don't actually maintain beta.
 * Mukunda: A lot is manual that shouldn't be. (This is also true of production databases.)


 * Greg: On a positive note, emergency fixes do go through the same process as any change (except for rollbacks).
 * Željko: During SWAT I've had situations where something breaks and people are sending a second patch. This is pretty much testing in production.
 * Tyler: I was deploying code with people who should've known better; I assumed they knew what they were doing.


 * Tyler: I had the opposite reaction to Greg in reading chapter 10: Underlined a whole lot of things that we already do (well). Automated deployments, rollbacks, canaries.
 * Jeena: Elaborate on automated deployment?
 * Tyler: Was talking about testing - you can set up your deployment on the deployment server using git, it syncs to a set of a dozen canary servers some of which get prod traffic, and those servers are monitored in 2 ways: Set of smoke tests, and also monitoring of log levels, so that if there's a factor of 10 increase in errors, then we fail deployment.  Those things have worked to stop deployment.  Both automated, announces on IRC why deployment stopped.
 * Željko: Doesn't do the revert.
 * Željko: Do we have a line in our logs to show deployments?
 * Tyler: I don't know - we have them in grafana...
 * (Log discussion - making it clear when deploys happened.)
 * https://grafana.wikimedia.org/d/000000503/varnish-http-errors?refresh=5m&orgId=1
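
The canary gate Tyler describes (smoke tests plus a "factor of 10 increase in errors" check) could be sketched like this. It is a toy illustration, not the actual deployment tooling: the function names, the way error rates are passed in, and the floor used when the baseline is near zero are all assumptions made for the example.

```python
def canary_should_halt(baseline_rate, canary_rate, factor=10.0, floor=1.0):
    """Return True if the canary error rate is at least `factor` times
    the baseline. `floor` (an assumption of this sketch) keeps a zero or
    tiny baseline from making every single error look like a spike."""
    reference = max(baseline_rate, floor)
    return canary_rate >= factor * reference


def deployment_may_proceed(smoke_tests_passed, baseline_rate, canary_rate):
    # Both checks must be happy: smoke tests pass AND no error spike.
    return smoke_tests_passed and not canary_should_halt(
        baseline_rate, canary_rate
    )


# Errors per minute before the deploy vs. on the canaries afterwards.
spike = canary_should_halt(baseline_rate=5, canary_rate=60)
quiet = canary_should_halt(baseline_rate=5, canary_rate=20)
proceed = deployment_may_proceed(True, baseline_rate=5, canary_rate=20)
```

As Željko notes, a gate like this only stops the deployment; it does not do the revert.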


 * Lars: It seems to me from this discussion that deployments could use some work, but really aren't as bad as they could be. I've seen much worse.


 * Željko: "never do late at night, always pair with someone else".
 * We rarely do late-night deploys.
 * ...but we never really pair. There isn't really someone else in the same problem as you.
 * Tyler: With SWATs there's someone else with you, 'cause you're deploying someone else's code.
 * Ž: That's changed in the European window - people are deploying their own code.
 * T: Evening SWAT has definitely made me nervous. There aren't really a lot of people around.
 * Ž: Yeah and the other person is not necessarily the voice of reason - "oh I'll just patch this really quickly".
 * Ž: "every member of the team should know how to deploy"
 * Nobody else but RelEng does the train.
 * Though if they can SWAT, a train deployment shouldn't be a big jump.
 * Ž: Deployment script should include tests
 * Ž: p273 - production environments should be completely locked down. Not so much in our case.
 * (Datacenter security discussion.)


 * Ž: p274: "the most crucial part of release planning is assembling representatives from every part of your organization"
 * Lars: I had a different take - basically I think it's saying that we should not exist.
 * Greg: Right - you don't want to make building and releasing a specialized role within the organization.
 * Lars: On the other hand, that all came from a different kind of organization - where the devs are actively developing most of the application that goes into production. There's too many people for them all to be involved in deployment.
 * Tyler: Effectively us existing is "symptomatic of a bad relationship between the ops and development teams"
 * So in conclusion, we're scapegoats.

Chapter 11: Managing Infrastructure and Environments

 * This part of the book is very 10 years ago.
 * Not a lot of notes.
 * Lars: 2 takeaways
 * Keep everything in version control
 * Automate everything
 * Greg: And a good overview of cloud computing and VMs.
 * Larry Ellison: We're already doing cloud computing.
 * RMS: Cloud computing sux.


 * Lars: I would like to quote first sentence of criticisms of cloud computing - "we are convinced that cloud computing will continue to grow".
 * (Federated wiki digressions.)

We are more or less incapable of directly discussing this chapter because it's way out of date but it does lead to some amusing digressions.

Next meeting
Chapter 12 & 13 on the 28th.

Discussion

 * The way software is structured affects how it's deployed

Chapter 12: Managing Data

 * Software developers are responsible for ensuring there are DB migrations available for their code's needs -- they are the ones that understand the semantics and rules
 * irritated by the hand-waviness about the complexity and cost of forward/backward compatible migrations; what about multi-terabyte dbs and our scale?
 * we do manual DB migrations
 * MW's existing levels of DB abstraction means it's not horrible (eg: we don't have manual SQL queries; some queries have been carefully written for performance, but that's more secondary)
 * looking forward, we'll have to flag particular versions as blocked on schema-migration, and the reverse: this revision needs at least this schema version
 * how granular should that ^ be?
 * eg: single extension's table, could run in parallel with older/newer general DB; however, core revision table, affects almost all code, pretty much impossible to roll back schema changes
 * liw story time:
 * building a system for 8 years when he joined
 * were diligent in maintaining schema in vcs and forward/backward migrations
 * the DBA/sysadmin refused to use that, would change things under the devs without telling them
 * thus production DB, 500,000 people's personal data, no one other than the DBA knew what it contained but it worked "good enough"
 * he liked having job security, paid by the hour (double ugh)
 * all devs had root/dba access to prod DB (we don't), when I was hired it was said they're rewriting it all (language, arch, deploy process) and I said "devs wouldn't have DB root" and that caused a riot
 * OH GOD clear text passwords saved in the password field!
 * test design
 * isolation of tests from each other
 * mw unit test system sets up fake tables
 * db abstraction layer hard codes _ (underscore) in table prefixes, which means it doesn't work on postgres
 * test data should not be a dump from production
 * common to do, but you don't know what the data is like, so it's better to generate it yourself so you know what you're working with
 * A place for real-world data to show things you won't otherwise see, e.g. performance testing?
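
One way to picture the "this revision needs at least this schema version" idea from the notes above is a per-component check before deploying. This is only a sketch under assumptions: the SchemaRequirement type, integer schema versions, and the component names are all hypothetical, and per-component tracking is just one possible answer to the granularity question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaRequirement:
    component: str    # e.g. "core" or an extension name (hypothetical)
    min_version: int  # lowest schema version this code revision can run on


def blocked_requirements(requirements, deployed_versions):
    """Return the requirements that the deployed schema does not satisfy.
    A component missing from `deployed_versions` counts as version 0."""
    return [
        req
        for req in requirements
        if deployed_versions.get(req.component, 0) < req.min_version
    ]


# A code revision declares what it needs; the DB reports what it has.
requirements = [
    SchemaRequirement("core", 12),
    SchemaRequirement("ExampleExtension", 3),
]
deployed = {"core": 12, "ExampleExtension": 2}
blocked = blocked_requirements(requirements, deployed)
```

An empty result would mean the revision is not blocked on a schema migration; here the made-up extension's table still needs migrating before this revision can go out.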

Chapter 13: Managing components and dependencies

 * component vs library
 * component what you make
 * library is other people
 * PHP Composer combined with MediaWiki is complicated and a source of friction and toxicity in the project
 * There seem to be no good answers for handling dependencies in PHP
 * "warehouse of cans of worms"
 * dependency solving is a problem we need to solve for moving MW deployment to containers

Next meeting

 * Last two chapters, July 12th