Wikimedia Engineering Productivity Team/Read papers and talk/2020-11-23

Paper
Predicting Faults From Cached History https://people.csail.mit.edu/hunkim/images/3/37/Papers_kim_2007_bugcache.pdf

Tyler's presentation
https://docs.google.com/presentation/d/1THFsY5ca0yQqo3bTY266C20yP0Em1aEMIcC5zVxe7mE/edit#

Discussion

 * Gilles: the paper treats changes as uniform with respect to time; re: Google, you might not have that context if you're a volunteer vs. working at Google. I like the distance and coupling between files -- this is information we could present to others. We could extend that to "Depends-On" headers. Expose it as a comment: "usually when people make a change to this repo, they also change that repo." It all depends on how it's surfaced. I think the naive version from the paper might be noise for people.
 * Ahmon: if you're working on frequently buggy files, what's the action for this notification?
 * Gilles: maybe you need more than one reviewer if this file has been identified as buggy
 * Greg: If something is tagged as buggy then get a second set of eyes. I do think this is something that is meaningful for deployers. It could be useful to check areas of code known to be buggy when reviewing logspam. +1
 * Gilles: if you could code logs as "most likely to cause bugs" that would be useful
 * Brennen: re: unknown files -- they could be a proxy for files that only a few people work on, which might be an interesting avenue
 * Ahmon: similar heuristics applied to the developer
 * Tyler: *facebook ranting*
 * Greg: I was at a presentation that Chuck gave. You start at 5 points, and you lose points if you introduce bugs; if you get down to 2 points, we talk to your manager. Example: working on old code introduced bugs, which was indicative of deeper problems -- they needed help
 * Gilles: maybe there's some way this could be locked away, e.g. differential-privacy-style storage that only affects the commit risk score
 * Greg: we have this model in our heads
 * Brennen: I have a mental model of whose work is going to make train hard. this is not a reflection of anyone's competence, but a reflection of what problems are being worked on
 * Lars: a warehouse system at a previous job tracked people using forklifts. The employer wanted to fire people based on this information. The labor union was opposed. They released aggregate statistics and gave individual drivers their own info. People want to do good work, so they work on their shortcomings.
 * Zeljko: a bug fix marks a likely locus of future errors. There's some disconnect in the "why" of a bug -- why do developers create bugs? I like the way they said it.
 * Ahmon: +1
 * Tyler: this could identify people to have on hand during a deploy, e.g.: here are the changes likely to cause errors, here are the authors, subscribe them to the deployment blocker task.
 * Brennen: if you wind up just cc'ing Timo on everything that's not helpful. Maybe just a widget that deployers see.
 * Lars: invite them to be on IRC while doing the train.
 * Tyler: try this and see if it matches reality
 * Greg: could list this on the task rather than subscribe people. List patches that have touched code that's buggy. And we can always backtest it for past trains: see if it would have identified people that actually did need to be involved.
 * Gilles: ratio of how often a file is linked to bugs vs. how often it gets commits. A model like that might work better in a codebase where work across files is very uneven
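Gilles's ratio idea could be sketched as a simple per-file risk score -- a hypothetical illustration of the discussion, not something from the paper. The input format and names here are assumptions (a pre-parsed list of commits, each with the files touched and a bug-fix flag, e.g. derived from `git log` plus task tags):

```python
from collections import defaultdict

def bug_ratio(commits):
    """Score each file by the fraction of its commits that are bug fixes.

    `commits` is an iterable of (files_touched, is_bugfix) pairs --
    a hypothetical pre-parsed form of commit history.
    """
    total = defaultdict(int)
    buggy = defaultdict(int)
    for files, is_bugfix in commits:
        for f in files:
            total[f] += 1
            if is_bugfix:
                buggy[f] += 1
    return {f: buggy[f] / total[f] for f in total}

# Toy history: config.php is touched often and usually in bug fixes.
history = [
    (["config.php"], True),
    (["config.php", "utils.php"], False),
    (["config.php"], True),
]
print(bug_ratio(history))  # config.php scores 2/3, utils.php 0.0
```

A ratio like this normalizes for churn, so a hot file that rarely breaks scores low while a rarely-touched file that breaks every time scores high -- the unevenness Gilles mentions.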
 * Brennen: given that a patchset linked to a task might be a feature, not a bug, is "Bug" still relevant?
 * Ahmon: we mark production errors with a tag
 * Greg: there are tasks types, but not universally used
 * Greg: Does our expectation of how useful this is change in a world where we do deployments more frequently than once per week
 * Tyler: the outcome from google is that they (devs) already know the risky files. It needs a way to mark it as ack'd to not get the alert again. It was useful at first, then became noise.
 * Brennen: I think it'd become boilerplate.
 * Ahmon: this is useful during a problem, not necessarily before deploy. (here's a problem, here's some places to look)
 * Tyler: FB has a thing called Sandcastle to run tests efficiently based on code changes. They run E2E tests post-merge and do a bisect if the tests don't pass.
 * Greg: Sandcastle -- AI runs the tests that are needed based on the change (s/AI/regexes/)
 * Elena: Zeljko was working to find out which repos were causing problems
 * Zeljko: problematic repos: ops/puppet and mw/config. I think WMDE repos, wikibase, were in there based on analysis of incident reports. We don't have a lot of data and that data was hard to get. I was correlating phabricator tasks and wiki pages. I didn't dig into file or directory-level or people-level, just repo level.
 * Zeljko: in the paper they re-ran the algorithm with different parameters to find the optimum for each repo. The spatial distance calculation was brilliant.
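For reference, the cache mechanism the paper describes (BugCache/FixCache) can be sketched minimally: keep an LRU cache of files; a bug fix to an already-cached file counts as a predicted hit, a miss loads the file plus its coupled neighbours. This is a simplified sketch under assumed names -- the paper also pre-fetches by spatial and temporal locality and tunes cache size per repo, which is omitted here:

```python
from collections import OrderedDict

class FixCache:
    """Minimal sketch of the paper's FixCache idea, not a faithful
    reimplementation: an LRU cache of files where a bug fix to a
    cached file is a hit (the bug was 'predicted')."""

    def __init__(self, size=3):
        self.size = size
        self.cache = OrderedDict()  # file -> None, ordered by recency
        self.hits = 0
        self.misses = 0

    def _load(self, f):
        self.cache[f] = None
        self.cache.move_to_end(f)
        while len(self.cache) > self.size:
            self.cache.popitem(last=False)  # evict least recently used

    def bug_fix(self, f, co_changed=()):
        if f in self.cache:
            self.hits += 1          # cache predicted this bug
            self.cache.move_to_end(f)
        else:
            self.misses += 1
            self._load(f)
        for n in co_changed:        # coupling: pre-fetch nearby files
            self._load(n)

fc = FixCache(size=2)
fc.bug_fix("a.php", co_changed=["b.php"])  # miss; cache holds a, b
fc.bug_fix("b.php")                        # hit: b was pre-fetched
print(fc.hits, fc.misses)  # 1 1
```

The hit rate over a replayed history is the paper's accuracy measure, which is also what a backtest over past trains (as Greg suggests) would compute.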

Potential sources of data:
 * backports to deploy branches
 * production error tasks