Talk:Team Practices Group/Health check survey

Open questions

 * Where should data be stored - in a simple spreadsheet?
 * Talk to analytics about this [Arthur]
 * How will we generate and publish visualizations of the data?
 * What do we need to do to automate this process?
 * Are there impediments to or strong reasons against making the data publicly visible?
 * Do we do this with every engineering team at the WMF from the beginning?
 * If so, should the TPG have responsibility/accountability for teams we don’t work closely with (e.g. either through scrummastering or via structured workshops)?
 * Also, I think any set of survey questions should apply to TPG itself: e.g. can TPG say of its "product" that "Releasing is simple, safe, painless & mostly automated"?  Cmcmahon(WMF) (talk) 21:17, 12 August 2014 (UTC)
 * Should this intersect at all with annual performance reviews and if so how?
 * Should we translate the crappy -> awesome scale into numerical values for quantitative measurement (e.g. crappy = 1, awesome = 3, or is it better to use larger numbers, a logarithmic scale, etc.)?
 * Talk to analytics [Arthur]
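
To make the numerical-scale question concrete, here is a minimal sketch of the simplest option floated above: a linear mapping where crappy = 1 and awesome = 3. The mapping, the function names, and the sample ratings are illustrative assumptions, not a decided methodology.

```python
from statistics import mean

# Hypothetical linear mapping; only crappy = 1 and awesome = 3 are floated
# in the question above, and "meh" = 2 is assumed as the midpoint.
SCALE = {"crappy": 1, "meh": 2, "awesome": 3}


def to_scores(ratings):
    """Convert a list of scale labels into numeric scores."""
    return [SCALE[label] for label in ratings]


def average_score(ratings):
    """Average numeric score for one focus area across several teams."""
    return mean(to_scores(ratings))


# Made-up ratings for one focus area from four teams (not real data).
example_ratings = ["awesome", "meh", "meh", "crappy"]
print(average_score(example_ratings))
```

Swapping in larger numbers or a non-linear mapping only requires changing `SCALE`, which makes it easy to compare how each choice affects the resulting trends.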

A few thoughts and suggestions
Firstly, I think this is a great start. I like that you're trying to cover the range of topics from very process-oriented to the human side.

Here are a few suggestions based on experience conducting similar kinds of surveys:


 * the 3 point scale is appealing in terms of its simplicity but I suggest a 5 point scale. The additional nuance is useful because one of the objectives of this kind of survey is to spot trends early and then to scrutinize and intervene, if necessary.  Having midpoints between "awesome" and "meh" and between "meh" and "crappy" helps spot those trends early.
 * +1 Manybubbles (talk) 17:54, 12 August 2014 (UTC)


 * I've previously always framed the questions as "On a scale of 1-5 (5 being highest), how strongly do you agree with the following statements". It's good to have a mix of statements formulated in terms of strong agreement being good and strong agreement being bad (to avoid various cognitive biases).
 * +1 on the "how strongly do you agree" language. -1 on switching whether agreement is good or bad.  That tends to frustrate me while taking the survey. Manybubbles (talk) 17:54, 12 August 2014 (UTC)


 * I suggest doing this monthly rather than just quarterly - again, it's about spotting trends before it's too late.


 * I suggest some additional questions on the human side. e.g. (again, statements rated in terms of the degree of agreement):
 * "I feel challenged in my work currently"
 * "I am experiencing discord with one or more of my team members"
 * "My coworkers are acting in a way consistent with Wikimedia values and culture"
 * "Overall, morale on my team is good"
 * "I am frustrated currently"
 * "We have enough people to do what is expected"
 * "I feel proud to work on this project"
 * Why not stick these questions in the table? Manybubbles (talk) 17:54, 12 August 2014 (UTC)


 * Some of these questions are funky to ask about the team as a whole. The open source citizenry one, for example, would get odd answers.  We're active and welcomed with some upstreams and unwelcome with others.  Might make sense to ask the question in terms of min and max for all the projects we deal with.  Like "For the open source project for which we're the best citizens we're actually good citizens." and "For the open source project for which we're the worst citizens we're still good citizens." Manybubbles (talk) 17:54, 12 August 2014 (UTC)


 * Thank you both for the feedback. Your perspectives on the 5-point scale are really useful. I agree with Manybubbles that some of the additional questions posed do not make as much sense in a whole-team context. The idea is that this survey will be given to a team as a whole, and a facilitator will guide the team through coming to consensus on how the whole team feels about each focus area. There are a number of reasons for this, the biggest of which I think is that this helps the team gain a shared understanding of how they are doing, which would otherwise be lost if this were an individualized survey. That said, feel free to add focus areas you think would be useful to the table on the main article :) As for the frequency, I agree that it's important to identify trends as early as possible. However, I'm not sure we'd be able to support doing the survey monthly as we (the Team Practices Group) are currently staffed - particularly across all of engineering. Also, this will introduce additional meeting overhead for teams - many (if not all) of which are wary of additional meetings and feel like they already deal with too many of them. Quarterly feels like a nice compromise to me, particularly to start. If teams find this exercise useful and express a desire to do the survey more frequently, we should then figure out how to support that. Arthur Richards (talk) 20:53, 15 August 2014 (UTC)
 * 5-point scale and monthly would be easier if this was an actual survey; but the page describes it as a sort of focus group of each team, which means a) it takes more time, b) people won't feel as free to express themselves. I guess the aim here is to force teams to talk about the general things, identify areas in need of improvement, and work on them bottom-up; as opposed to collecting scores for some later centralised effort to go through. --Nemo 11:17, 19 August 2014 (UTC)

Release Planning (as a focus area)
I think this should be a focus area for teams too. It takes some maturity to have a backlog pointed out and a subset of stories prioritized and slotted into future sprints in order to declare a release date for a major update to a product. Teams could measure their effectiveness at delivering major releases by measuring [actual/expected number of sprints], [actual/expected number of points] and/or [actual/expected number of features]. KLeduc (WMF) (talk) 17:02, 12 August 2014 (UTC)
 * Kevin feel free to stick this into the table :) Arthur Richards (talk) 20:28, 15 August 2014 (UTC)
 * Any such criterion should take into account different teams' deliberate choices in how they do planning and how agile they are. This may even vary for the same team quarter-to-quarter.  For example, we're currently in a more agile mode, iterating based on A/B experiments, but we're considering going in depth on a feature (which would be the kind of thing we likely might plan more up front). Superm401 - Talk 22:04, 18 August 2014 (UTC)
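
As a rough illustration of the actual/expected measures proposed in this thread, the sketch below computes the three ratios for a single release. The function name and all of the numbers are hypothetical, and, per the caveat above, how a ratio should be read will depend on how a given team chooses to plan.

```python
def delivery_ratio(actual, expected):
    """Return actual / expected for one release-planning measure."""
    if expected <= 0:
        raise ValueError("expected must be a positive number")
    return actual / expected


# Made-up figures for one release: (actual, expected).
release = {
    "sprints": (5, 4),     # took one sprint longer than planned
    "points": (80, 100),   # delivered fewer points than planned
    "features": (6, 8),    # shipped six of eight planned features
}

ratios = {name: delivery_ratio(actual, expected)
          for name, (actual, expected) in release.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 for sprints means the release ran long, while a ratio below 1 for points or features means less was delivered than planned.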

Link to Spotify survey
Can some wonderful helpful person link to the Spotify survey? Manybubbles (talk) 18:05, 12 August 2014 (UTC)
 * Unfortunately not - at least not yet. The folks at Spotify are going to be publishing information about their survey and how they do all of this at some point in the near future; we'll have to wait til then to link to it publicly. Arthur Richards (talk) 20:30, 15 August 2014 (UTC)
 * Manybubbles, the folks at Spotify just published a blog post discussing their survey - 'Squad Health Check model – visualizing what to improve' Awjrichards (WMF) (talk) 00:13, 17 September 2014 (UTC)

Feedback
I don't know how much control teams have over "Easy to release". Mostly my team relies on the standard WMF-wide release procedure (which, admittedly, should be more automated, but that's not a focus of our team). An alternative one we're currently working on is having more automated testing, and in general strengthening our continuous integration (this is a sub-point of "Tech quality (code base health)").

The "User/customers" criterion should be rewritten to reflect that we're writing both for people we want to become editors (and users) and people who already are. If we write solely for our existing users, that will not encourage growth. Suggestion:
 * Good: We have a solid grasp on who our current editors and readers are, and how we can reach out to potential new ones. We understand what their needs are, what obstacles they face, and what motivates them.  When we don't know, we conduct research to find out.  We build features that encourage and satisfy them.
 * Bad: We don't know who our editors and readers are, or how we might encourage new ones. We don't know what they need, what obstacles are in their way, or what might motivate them.  Instead of finding out scientifically, we guess.  We have no idea if what we build is encouraging or satisfying anyone.


 * --Superm401 - Talk 22:36, 18 August 2014 (UTC)


 * What makes you think "users" excludes someone? Users of recruitment features are those targeted by them. --Nemo 11:17, 19 August 2014 (UTC)
 * That's a valid way to look at it, as long as it's understood that users does not mean long-time users. It may be someone who's using the registration form or clicking edit for the first time. Superm401 - Talk 21:56, 5 September 2014 (UTC)


 * Superm401 I think that this makes sense if we were only delivering this to feature engineering teams. However, since we hope to ultimately regularly conduct this survey with *all* of the teams in engineering, I think it's important to keep the language a little more generic. For instance, users/customers for the analytics teams will be different than users/customers for the core features team. That said, I think whoever is conducting the survey could certainly add a caveat articulating what you mentioned when delivering the survey to feature teams. Awjrichards (WMF) (talk) 21:08, 9 September 2014 (UTC)


 * There hasn't been any response to my point about "Easy to release"/"Releasing". I think this would make more sense as a product goal for the Release Engineering team, not as a focus area for every team. Superm401 - Talk 22:07, 5 September 2014 (UTC)


 * Superm401 the TPG has given this a lot of thought - we were seriously considering cutting this focus area. However, we came to the conclusion that even though there is a team completely dedicated to release engineering, different engineering teams still have widely varying experiences releasing their software. Some teams do not even use the release pipeline managed by the release engineering team (e.g. mobile apps or analytics in some cases), and even those that do have divergent levels of pain associated with this. As such, we felt that this is an important thing to keep an eye on. Particularly if a team is relying on the release train, but they are still losing a day or two of engineering time due to issues stemming from releasing (e.g. code freezing, battling bugs, crazy branch merging, etc.), this is problematic and indicative of the team potentially needing extra support. Long story short, given where we are at now, I think it's important to keep this in the survey - although I'd like to see a day where it is such a non-issue that we can remove it from the survey. Awjrichards (WMF) (talk) 21:08, 9 September 2014 (UTC)

2nd draft of the survey
Kristen Lans and I reviewed the edits to the survey and all of the comments on this talk page, and have taken a stab at putting together a 2nd draft of the survey. We realized that many of the draft 'focus areas' could be conceived of as 'influencers' and/or 'indicators' of some of the higher-level focus areas, so we broke them out as such. The idea here is that these influencers/indicators can serve as points of reference for framing conversation around the higher-level focus areas. Please take a look at the latest version, and of course provide any feedback either in the form of edits to the survey or responses on this talk page.

As a reminder, this survey is intended to be a living, breathing thing - we will iterate on the survey focus areas, contents, and methodology as we learn about its effectiveness through delivering the survey to the various WMF engineering teams. We will finalize the focus areas for this first iteration of the survey at midnight UTC on 5 September, 2014. As per the Team Practices Group's annual goals, we will deliver this survey to the teams the Team Practices Group is currently or immediately planning on working with - Mobile Web, Mobile Apps, MediaWiki Core, Analytics Engineering, and Analytics Research - by the end of Q1 2014/2015 (ending Oct 2014). We will take feedback and learnings from that initial delivery and incorporate them into the next iteration of the survey, which will be delivered to the same teams as well as a few more (TBD) for Q2 2014/2015. Arthur Richards (talk) 23:06, 28 August 2014 (UTC)

Feedback on 2nd draft

 * Data-Driven indicator misses the mark. It's not about prioritization.  Perhaps we should add prioritization as a distinct influencer/indicator.  Data-Driven is about validating assumptions using data.  Quoting Toby:
 * 1) Identify intended benefit from change and how it will be measured
 * 2) Ensure that metric is measurable and accurate
 * 3) Perform A/B test (or similar experiment) to see if change has desired impact and no side effects
 * 4) Based on results from #3, either roll-out or discard change
 * So the awesome example would go like: "We document our expected quantified impact and measure before and after release." Crappy example goes like this: "We have no idea how many people we are impacting or what we are assuming around our impact." KLeduc (WMF) (talk) 19:49, 29 August 2014 (UTC)


 * KLeduc (WMF) feel free to update the influencers/indicators to include this.


 * Code Quality does have some indicators. In gerrit, you could look at how much feedback and changes are made on a patch.  Perhaps you could also count the number of +2s.  However I do think the quality will be subjective and each team will have different expectations (the same way pointing stories is different from team to team). KLeduc (WMF) (talk) 19:49, 29 August 2014 (UTC)
 * Almost no changes have more than one +2. I don't think counting +2 or +1's is a valid indication of code quality. Superm401 - Talk 21:54, 5 September 2014 (UTC)


 * If anything, I think verifiable bugs (rather than enhancements/feature requests) would be a better candidate as an indicator for code quality. Feedback/changes on a patchset can be indicative of so many different things - and different individuals/teams have such varying workflows in gerrit (committing works in progress and iterating based on feedback, for instance) that I do not think pulling data from gerrit can be a reliable indicator. Awjrichards (WMF) (talk) 21:13, 9 September 2014 (UTC)

More feedback
I expected the indicators to be less fuzzy, like KLeduc's example above of counting gerrit +2s. Here are some other "hard" indicators:


 * Releasing
 * Team has Makefile or gruntfile automation producing all build artifacts.


 * Code quality
 * Jenkins runs voting PHP code sniffer and qunit test jobs before gerrit patches are merged. Automation generates useful sample code and documentation.  The count of TODO/fixme/XXX comments in code is going down, not up.


 * Learning
 * Every developer is able to review any other developer's gerrit patches; any developer could fill in for another developer.

-- S Page (WMF) (talk) 08:57, 4 September 2014 (UTC)

S Page (WMF), the influencers/indicators are fuzzy on purpose, and they may not be applicable to all teams all of the time. We can certainly add hard indicators/influencers too as examples of possible things to look at when evaluating a particular focus area, but we should not use them prescriptively (e.g. saying if a team doesn't use makefile or gruntfile automation for producing build artifacts, then they are doing poorly in the area of releasing). Different engineering teams work in different ways with different technologies - as such this survey is intended to be a little more meta, a little more generic. Long story short, feel free to add influencers/indicators to the list on the page if you'd like to see them included, but keep in mind that they are not intended to be at all prescriptive :) Awjrichards (WMF) (talk) 21:18, 9 September 2014 (UTC)
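
For teams that do want to track one of the "hard" indicators suggested above, the count of TODO/fixme/XXX comments could be gathered along these lines. This is a minimal sketch and not a prescribed check: the function names and the default file suffixes are assumptions, and the pattern matches the marker words anywhere in a file rather than only inside comments.

```python
import re
from pathlib import Path

# Matches TODO, FIXME or XXX as whole words, in any letter case.
MARKER = re.compile(r"\b(TODO|FIXME|XXX)\b", re.IGNORECASE)


def count_markers(source_text):
    """Count marker words in a single piece of source text."""
    return len(MARKER.findall(source_text))


def count_in_tree(root, suffixes=(".php", ".js")):
    """Sum marker counts over all files under root with the given suffixes."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += count_markers(path.read_text(errors="ignore"))
    return total


sample = "// TODO: handle errors\n# fixme: remove hack\nreturn methodology;\n"
print(count_markers(sample))
```

Recording the result of `count_in_tree` at regular intervals would show whether the count is going down or up, which is the trend the suggestion is really about.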