Review action items from previous retrospective:

Erik: Brainstorm on language-related goal
- DONE. Chose to move forward with Accept-Language headers (Erik) and training a language detector (Trey)
Kevin to take showcase feedback to Adam
- DONE
Oliver to continue email thread about user satisfaction survey
- DONE
Kevin to email about wiki page "categories"
- Stas added categories. We thought there was an email conversation, but I [Kevin] can't find it right now.

What has happened since the last retro? (2015-11-02)

Portal shift to Gerrit; event logging
Progress on relevance lab
Ongoing hiring processes
Ran multiple Cirrus A/B tests
Worked out issues with Avro schemas and analytics pipeline
Found a nasty bug in Blazegraph causing data corruption and developed a workaround (so it should stop now)
Improved WDQS GUI significantly (with WMDE team help)
Have monitoring dashboard for WDQS now: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
Maps are now available for ruwiki's Geohack (GPS links) and Wikivoyage (en & ru)
Dashboard for portal: http://discovery.wmflabs.org/portal/

What went well?

Cirrus A/B testing goes from strength to strength (and we now have analysis redundancy!)
We are 99% of the way to achieving our primary search goal (https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q2_Goals#Search) ++
David's work with analytics to get the Avro pipeline running has been much appreciated
Picking up the completion suggester work again; it was incredibly promising when we last ran tests on it!
Product manager hiring seems to be going extremely well! We have had a lot of really good candidates and interviews so far.
Progress on relevance lab
Maps are live on ru-geohack (thanks to in-person meetings at a conference) and on Wikivoyage
Graphs are getting closer to being interactive
ruwiki reported significantly better satisfaction with the tech side of WMF, possibly due to substantial participation in the community by Max and Yuri

What could have gone better?

We didn't get the survey out (and won't be able to do so usably until next quarter; I don't trust data from late December, simple as that) ++
We didn't get the Portal A/B test out (and won't be able to do so usably until next quarter) +++
- NOTE: Follow-up conversations raised the possibility that we might still be able to test this month
Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work. ++++++++++
The common terms query A/B test ended up in limbo
- Canceled to focus on quarterly goal tests. It was initially reverted because of performance issues, and once those were worked out we needed to move on.
- We should try to pick it up again in January if we can; it was promising ++++
Language detection is hard to do well on short strings. Data gathering for retraining a model is hard. Progress is slow. +
Hard to show impact on inter-language search: the number of queries is just too small (per initial analysis by Trey, and backed up by our prod tests).
All the features for inter-language search have been implemented, but we should review Trey's analysis and fine-tune
Ops hiring has moved forward, but no signature on the dotted line yet
Realized we can't analyze the "did you mean" test results; the test wasn't collecting data properly due to changed CSS classes +

Discussion

TOPIC: "Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work."

Should we adjust analysis to capture effects within a small subset of all of the searches?
- Probably, but this doesn't explain why we haven't had more impact
- Inability to measure quality of results has hurt us
- Long tail: each change won't have a big impact
- Measure the impact of a change against the population of possibly affected searches
  - Measuring a change that affects a very small number of searches is hard and expensive
Would it make sense to identify "obvious bots"?
- We didn't see substantial improvement even when we did exclude bots
Could shift focus away from ZRR (Zero Results Rate) and toward relevance
Should we split up into microteams, to make progress on more small changes at once?
- Creates team and process problems
Biggest problem is not search results that fail--it is when search results are presented, but the user doesn't click
- UX issues. Maybe split front-end engineers between portal and search results
We would like to keep people searching within our system, rather than bouncing out to other search engines from our content
Are we running A/B tests too soon? Should we do more internal analysis first?
- For language, we knew the effects would probably be small, but it was our quarterly goal so we moved ahead
For portal page, tests are known to be small (common sense), but getting them out this quarter should be good
- Not a lot of internal discussion was needed, but next quarter would probably make sense
Should we have a validation process to make sure we are collecting the data we want?
- We do actually have that. The CSS issue was older code (before our validation process).
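
The point above about measuring a change against the population of possibly affected searches can be illustrated with a rough sketch. All numbers here are made up for illustration (they are not from any of our tests), the function names are hypothetical, and the sample-size formula is the usual normal approximation for comparing two proportions at about 5% significance and 80% power, not necessarily what our analysts use. It also assumes the same baseline rate inside and outside the affected subset.

```python
import math


def diluted_lift(affected_share, lift_in_affected):
    """Overall change in a rate when only a share of searches can be affected."""
    return affected_share * lift_in_affected


def sample_size_per_group(baseline, lift, z_alpha=1.96, z_beta=0.84):
    """Approximate per-group sample size to detect `lift` over `baseline`.

    Normal approximation for a two-sided two-proportion test
    (z_alpha ~ 5% significance, z_beta ~ 80% power).
    """
    p1 = baseline
    p2 = baseline + lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


# Hypothetical inputs, chosen only to show the shape of the problem.
AFFECTED_SHARE = 0.01     # assume 1% of searches can be touched by the change
LIFT_IN_AFFECTED = 0.10   # assume a 10-point improvement within that 1%
BASELINE = 0.50           # assume a 50% baseline rate for the metric

overall_lift = diluted_lift(AFFECTED_SHARE, LIFT_IN_AFFECTED)
n_all = sample_size_per_group(BASELINE, overall_lift)
n_affected = sample_size_per_group(BASELINE, LIFT_IN_AFFECTED)

print(f"Lift seen across all searches: {overall_lift:.4f}")
print(f"Per-group sample needed, all searches: {n_all}")
print(f"Per-group sample needed, affected searches only: {n_affected}")
```

Under these assumptions the lift seen across all searches is the in-subset lift multiplied by the affected share, so the sample needed to detect it across all searches is far larger than the sample needed within the affected subset. The catch noted in the discussion still applies: when the affected subset is tiny, collecting enough affected searches is itself slow and expensive.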

Action Items

Dan: write a goal for improving the UX of the search page on-wiki
Dan: Discussion of improving the relevance/sorting of results rather than just zero results rate
Moiz: Talk about whether we really can run A/B tests on the portal, since it's not subject to a deployment freeze
Dan: Follow up on the common terms query A/B test
Mikhail: Look into listing features that affected the results set for a query (sister project to 'query categorizer UDF')