
Discovery Search Retrospective

2016-03-24

What has happened?
Covering what has happened related to the team since the last retro (2016-02-22)
 * Lila left; Katherine interim ED
 * Released completion suggester
 * Renamed Relevance Forge
 * Added automatic engine param optimization to RelForge
 * Oliver left the team
 * Quarterly planning
 * Annual planning
 * Ran phrase boosting A/B test

Review action items from before

 * Chris? Follow up on Yuri's idea on status updates for Discovery (à la Wikidata - https://www.wikidata.org/wiki/Wikidata:Status_updates)
   * DONE
 * Chris: we should blog post and announce this (TextCat)
   * Updating docs first
 * Chris: we should blog post and announce this +1 (popularity score)
   * Looking into it
 * Chris: Should we publicize usage numbers via blog post or other? (WDQS)
   * Did in the weekly update. If it should be in other channels, let Chris know
 * Chris: Let's ask Rachel how useful it is for other teams ATM +1 (office hours)
   * Has been scheduled!
 * Chris: This should be mentioned on the public mailing list (Nik's post)
   * Was in the weekly update.
 * Dan: Consider adding team information to the Discovery wiki page, but need a maintenance plan (who works on each project)
   * DONE - https://www.mediawiki.org/wiki/Wikimedia_Discovery#The_team
 * Kevin: Consider other note mechanisms (but maybe after we redo retros?)
   * Experimenting with Etherpad
 * Tomasz? Should we add more "document" tasks to onboarding docs?
   * Had discussions. No further action required right now

What went well?

 * Completion suggester rollout!
   * Dashboard graphs moved as a result!
 * Guillaume did not crash Wikipedia (yet), despite having had the chance to do so multiple times
 * Ran an A/B test using metrics other than the zero result rate (Yay!)
 * Added autocomplete information to satisfaction schema
 * Is the increase in the augmented clickthrough correlated with the completion suggester?
   * Unknown at this time; see "Limited analysis throughput" below.
 * Developed better ways of evaluating changes to search before deploying them
 * Upgraded Elasticsearch to 1.7.5 (and it was super straightforward because we have Guillaume to chip away at and own these issues!)
 * Talking to Chris about TextCat & generating a public-facing page
   * https://www.mediawiki.org/wiki/TextCat
 * Dashboards were down; Bryan D. fixed the Vagrant issue, so yay for folks outside the department helping us :)
   * bd808 is pretty great at appearing out of nowhere and fixing things

What could we improve?

 * Velocity? Many of our planning meetings lately are basically "we've seen these tickets and they are still there"
   * But maybe the problem is more that some tasks keep getting added and finished while other tasks remain on the board for multiple weeks?
   * (Dan:) I think this overall concern is correct, but there's also truth to the above bullet point. In one notable example, I added a bunch of tasks to the sprint which were fixed and deployed between two planning meetings!
   * Do we have a measurement of velocity somewhere?
     * (Post-meeting note: No, but Phlogiston might help.)
   * This is partly an effect of kanban vs. scrum, where we are allowing tasks to enter the flow at any time
   * If tasks are taking longer than expected, that may be a warning that they should be split into smaller tasks
   * As long as Dan has visibility into and agrees with the priorities of what's being done, this doesn't appear to be a problem
 * Limited analysis throughput, due to losing half the team. Solving this is already in progress, though.
   * It's possible some other team members could step in and perform less rigorous analysis to help
   * We are back-filling the position, so the issue should be temporary
   * Reminder to Mikhail to speak up if he feels overwhelmed or feels deadlines are unreasonable
     * Mikhail: We're OK for now
 * Guillaume needs a better understanding of our procurement process
   * RobH seems to be the contact. Guillaume will try to find that documented somewhere
 * A/B test sizing / length of time. We just kind of wing this, but I've been told there are more rigorous ways to decide the sample sizes needed to measure an expected change (see the post-meeting sizing sketch after this list)
   * Relatedly, with more rigorous sizing we could perhaps run multiple simple tests at the same time, e.g. measuring the effect of swapping results and the effect of slower results
   * Mikhail: We're unsure about the size of the entire population, so picking a sampling rate is difficult
     * Can we use the number of requests as our population (or a proxy for it)?
       * Depends on the test. Who is actually affected by the feature? Requests vs. sessions can be an issue.
 * gEdit returning NSFW results [disputed (that this is a "what could we improve")] +1 to the disputed
   * This seems to be a much bigger issue than we can handle in this team... (depends on the definition of "the problem")
   * (Dan) I basically think this entire issue has always existed, NSFW results and the like. We've just brought it to the forefront with the spelling correction.
   * (Dan) I'm more concerned about the fact that someone who types "gedit" almost certainly does not want anything to do with "genitals"
   * Risk of being accused of censorship if we focus too much on not returning NSFW results
   * Dan has a list of queries which have improved; this is one of the few that got worse
   * We were able to provide a workaround to the downstream customer very quickly, which was good
   * Two separate issues:
     * NSFW images being presented unexpectedly (in search results, and elsewhere)
     * "Bad" (for some definition of "bad") typo suggestions
 * https://meta.wikimedia.org/wiki/Discovery/Testing looks like it needs a refresh with currently planned/running/done tests
   * Mikhail: Check on this
 * We didn't announce the latest A/B test
   * We should probably have an automatic task to announce each test
   * Timeliness is not critical for most of these tests, so announcements will go in the weekly status report
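
(Post-meeting note: a minimal sketch of the kind of sample-size calculation mentioned in the A/B test sizing item above, using the standard two-proportion z-test formula. The 8% baseline clickthrough rate and the one-point lift are made-up numbers for illustration, not real Discovery metrics.)

    # Sample-size estimate for a two-proportion A/B test (e.g. clickthrough).
    # The inputs below (8% baseline, 1-point lift) are illustrative only.
    from scipy.stats import norm

    def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
        """Sessions needed per bucket to detect p_control -> p_treatment
        with a two-sided two-proportion z-test."""
        z_alpha = norm.ppf(1 - alpha / 2)  # critical value, two-sided test
        z_beta = norm.ppf(power)           # quantile for the desired power
        variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        delta = p_treatment - p_control
        return (z_alpha + z_beta) ** 2 * variance / delta ** 2

    n = sample_size_per_group(0.08, 0.09)
    print(f"~{round(n):,} sessions per bucket")  # ~12,206 per bucket

(Whether n counts requests or sessions is exactly the open question raised above; statsmodels' NormalIndPower solves the same equation if we'd rather not hand-roll it.)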

What else should be noted?

 * I've read much less controversy around Discovery on our mailing lists (are we improving our communication, or is the subject just getting old?)
   * IMHO, the controversy was a stick people were using to poke Lila / the board into action. Mission accomplished, so it's died down, so to speak?
   * Don't get me started here. :)

Retro of retro
This was the first-ever search-team retrospective, rather than a full Discovery department retrospective. It was an experiment. How was it?
 * Got more in depth on particular topics; more relevant to everyone here
   * Agree
     * But I miss hearing more about what everyone else is doing
 * Mikhail (who was in 3 team retros): I really like these focused retros and the ability to go more in depth
 * Compared to previous places: these retros go much less in depth than what I'm used to (e.g. they might spend an hour going deep on one topic)
 * On other teams at the WMF, we've picked 2-3 issues by voting, then spent 15 minutes per topic (not sure which way is better... this kind of works)
 * We could video record these and circulate them within the team (allowing viewing at 2x)

Action Items

 * Mikhail: Check whether meta Discovery/Testing page is up-to-date
 * Dan? Set up an automatic task to announce each test
 * Guillaume: Talk to RobH to understand how procurement works
 * Kevin: Think more about velocity question: Hire more? Change process? Is it OK as is? Start doing guesstimations?
 * Mikhail: Announce the past test(s)