Discovery/Retrospective 2015-11-30

=Review action items from previous retrospective:=
 * Erik: Brainstorm on language-related goal
 * DONE. Chose to move forward with Accept-Language headers (Erik) and training a language detector (Trey)
 * Kevin to take showcase feedback to Adam
 * DONE
 * Oliver to continue email thread about user satisfaction survey
 * DONE
 * Kevin to email about wiki page "categories"
 * Stas added categories; we thought there was an email conversation, but I [Kevin] can't find it right now.

=What has happened since the last retro? (2015-11-02)=
 * Portal shift to gerrit; event logging
 * Progress on relevance lab
 * Ongoing hiring processes
 * Ran multiple cirrus A/B tests
 * Worked out issues with avro schemas and analytics pipeline
 * Found a nasty bug in Blazegraph causing data corruption and developed a workaround (so it should stop now)
 * Improved WDQS GUI significantly (with WMDE team help)
 * Have monitoring dashboard for WDQS now: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
 * Maps are now available for ruwiki's Geohack (GPS links) and Wikivoyage (en & ru)
 * Dashboard for portal http://discovery.wmflabs.org/portal/

=What went well?=


 * Cirrus A/B testing goes from strength to strength (and we now have analysis redundancy!)
 * We are 99% of the way to achieving our primary search goal (https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q2_Goals#Search) ++
 * David's work with analytics to get the avro pipeline running has been much appreciated
 * Picking up the completion suggester work again; it was incredibly promising when we last ran tests on it!
 * Product manager hiring seems to be going extremely well! We have had a lot of really good candidates and interviews so far.
 * Progress on relevance lab
 * Maps are live in ru-geohack - thanks to an in-person meeting at a conference - and wikivoyage
 * Graphs are getting closer to being interactive
 * ruwiki reported significantly better satisfaction with tech side of WMF - possibly due to substantial participation in the community by Max and Yuri

=What could have gone better?=


 * We didn't get the Survey out (and won't be able to do so usably until next quarter. I don't trust data from late December, simple as) ++
 * We didn't get the Portal A/B test out (and won't be able to do so usably until next quarter) +++
 * NOTE: Follow-up conversations raised the possibility that we might still be able to test this month
 * Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work. ++++++++++
 * The common terms query A/B test ended up in limbo
 * Canceled to focus on quarterly goal tests; it was initially reverted because of performance issues, and once those were worked out we needed to move on.
 * We should try to pick it up again in January if we can; it was promising ++++
 * Language detection is hard to do well on short strings. Data gathering for retraining a model is hard. Progress is slow. +
 * Hard to show impact on inter-language search, the number of queries is just too small (per initial analysis by Trey, and backed up by our prod tests).
 * All the features for inter-language search have been implemented, but we should review Trey's analysis and fine-tune
 * Ops hiring has moved forward, but no signature on the dotted line yet
 * Realized we can't analyze the "did you mean" test results: the test wasn't collecting data properly due to changed CSS classes +

=Discussion=
TOPIC: "Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work."
 * Should we adjust analysis to capture effects within a small subset of all of the searches?
 * Probably, but this doesn't explain why we haven't had more impact
 * Inability to measure quality of results has hurt us
 * Long tail: each change won't have a big impact
 * Measure the impact of a change against the population of possibly affected searches
 * Measuring a change that affects a very small number of searches is hard and expensive
 * Would it make sense to identify "obvious bots"?
 * We didn't see substantial improvement even when we did exclude bots
 * Could shift focus away from ZRR (Zero Results Rate) and toward relevance
 * Should we split up into microteams, to make progress on more small changes at once?
 * Creates team and process problems
 * Biggest problem is not search results that fail--it is when search results are presented, but the user doesn't click
 * UX issues. Maybe split front-end engineers between portal and search results
 * We would like to keep people searching within our system, rather than bouncing out to other search engines from our content
 * Are we running A/B tests too soon? Should we do more internal analysis first?
 * For language, we knew the effects would probably be small, but it was our quarterly goal so we moved ahead
 * For portal page, tests are known to be small (common sense), but getting them out this quarter should be good
 * Not a lot of internal discussion was needed, but next quarter would probably make sense
 * Should we have validation process to make sure we are collecting the data we wanted?
 * We do actually have that. The CSS issue was older code (before our validation process).
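
The point above about measuring a change against the population of possibly affected searches can be sketched roughly as follows. This is a minimal illustration, not the team's actual analysis code: the `SearchEvent` record and its `affected` and `is_bot` flags are hypothetical, and the comparison uses a plain pooled two-proportion z statistic on the zero results rate.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchEvent:
    bucket: str         # "control" or "test"
    zero_results: bool  # True if the search returned no results
    affected: bool      # hypothetical flag: could the change have altered this search?
    is_bot: bool        # hypothetical flag from some bot heuristic


def zero_results_counts(events, bucket, affected_only=True, exclude_bots=True):
    """Return (zero-result searches, total searches) for one bucket after filtering."""
    selected = [
        e for e in events
        if e.bucket == bucket
        and (e.affected or not affected_only)
        and not (exclude_bots and e.is_bot)
    ]
    return sum(e.zero_results for e in selected), len(selected)


def two_proportion_z(zero_a, n_a, zero_b, n_b):
    """z statistic for the difference between two proportions (pooled standard error)."""
    if n_a == 0 or n_b == 0:
        raise ValueError("both buckets need at least one search")
    pooled = (zero_a + zero_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # identical, degenerate rates: no measurable difference
    return (zero_a / n_a - zero_b / n_b) / se
```

When the same absolute improvement is diluted by a large number of searches the change could not have touched, the z statistic shrinks, which is one way to see why restricting the comparison to the affected subset matters, and why a very small affected subset needs a long-running test to reach significance.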

=Action Items=
 * Dan: write a goal for improving the UX of the search page on-wiki
 * Dan: Discussion of improving the relevance/sorting of results rather than just zero results rate
 * Moiz: Talk about whether we really can run A/B tests on the portal, since it's not subject to a deployment freeze
 * Dan: Follow up on the common terms query A/B test
 * Mikhail: Look into listing features that affected the results set for a query (sister project to 'query categorizer UDF')