Wikimedia Discovery/Meetings/Analysis communication 2016-05-10


Discovery Analysis - Communicating changes

2016-05-10

Dan, Tomasz, Mikhail, Jan, Yuri, Erik

We have had some issues where our data was incorrect or misleading. 

The goal of this meeting is to discuss these issues. 

How can we change or augment our workflows to minimize these issues?

Can understand Mikhail's frustration in discovering these issues.

Ideally, loop Mikhail in as soon as we can.

Yuri: We should have multiple ways to look at the data. If we had a parallel source of data via Grafana, as a completely different alternative, we might catch issues like these.

This is different. What triggered this meeting was the portal clickthrough rate, which had been presented incorrectly for a long time.

The sessions had a 15-minute inactivity expiration (so, with continued activity, a session could go on indefinitely), but that wasn't documented anywhere.
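For illustration, a minimal sketch of how a rolling 15-minute inactivity window behaves (the timestamps and helper names here are hypothetical; the actual portal instrumentation is JavaScript and differs):

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=15)

    def sessionize(timestamps):
        """Group sorted event timestamps into sessions.

        A session only ends after 15 minutes of inactivity, so a
        session with steady activity can run indefinitely.
        """
        sessions, current = [], []
        for ts in timestamps:
            if current and ts - current[-1] > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(ts)
        if current:
            sessions.append(current)
        return sessions

    # Events 10 minutes apart never hit the timeout: one 50-minute session.
    events = [datetime(2016, 5, 10, 9, 0) + timedelta(minutes=10 * i) for i in range(6)]
    assert len(sessionize(events)) == 1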

Reviewing the schema docs didn't help, because they didn't include this detail. 

How did this decision come about? Design/front-end decided verbally, but it didn't get documented. (Oliver might have been there too.)

That's certainly a documentation issue. Where would be the right place to document that?

It's in the code, but that wouldn't help non-coders.

I see you put it in the schema, which I suppose is the right place, even though it's very detailed. 

I think we should have documentation refer to the code more. They should be less separate. 

Docs should point to git repos. Point to specific lines of code. This would be more up-to-date. 

Click on a link in the doc and it would show you the specific code. 

Allows you to check at any time whether the docs are up to date with the code. 

Docs should be more like a table of contents to point to the code (but with more details than the comments provide).
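As an illustration (the project, commit, file, and line placeholders below are all hypothetical), a doc entry could pin a Gitiles-style permalink to a specific commit, so the doc and the code it describes can be compared at any time:

    Session expiry: 15 minutes of inactivity (rolling window).
    Code: https://gerrit.wikimedia.org/g/<project>/+/<commit>/<file>#<line>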

Analytics is hard. We have had similar issues in search. 

The problem is often that the search engineers don't know the answer until Mikhail asks and the devs go look at the code. 

Search is different: the portal behavior was a conscious decision that simply wasn't documented.

It was great to see the demo thingy that Erik created where we could view the events. It was really cool. 

That is still unmerged in Gerrit.

Should analysis team be asked to +1 code changes that affect analytics?

Should developers review and +1 analysis code to make sure it aligns with the actual code?

We don't want to go overboard with additional processes, but we want to improve this.

With a new data analyst on the way, how can we get them onboarded and successful?

That will be hard. Having the data sets (MySQL + Hadoop) as documented as possible definitely helps. 

Would it be possible to have checkpoints where simple processes could help avoid problems? Not whole new workflows, ideally. 

I like the idea of having Mikhail review the code. 

With the portal, I run regression tests to make sure we still output the same numbers after changes have been made. 
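A minimal sketch of that kind of regression check (the file names and two-column CSV format are assumptions for illustration; the real portal analysis scripts differ):

    import csv

    def load_metrics(path):
        """Load a {metric_name: value} mapping from a two-column CSV."""
        with open(path, newline="") as f:
            return {name: float(value) for name, value in csv.reader(f)}

    def check_regression(before_path, after_path, tolerance=1e-9):
        """Fail loudly if the pipeline's numbers changed after a code change."""
        before = load_metrics(before_path)
        after = load_metrics(after_path)
        assert before.keys() == after.keys(), "metric set changed"
        for name, old in before.items():
            assert abs(old - after[name]) <= tolerance, (
                f"{name}: {old} -> {after[name]}"
            )

    # e.g. check_regression("portal_metrics_before.csv", "portal_metrics_after.csv")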

I'm not sure where a new analysis person would look to know what the data would look like. 

We should probably set up a wiki page with all the details that would be relevant to analytics. 

That's a good idea. I wouldn't want to insult them by implying "we didn't think you could understand the data itself, so we set up this extra doc."

But I think we could work with that. 

It's about the little things we overlook. 

Yes, we had a doc issue. That's a concern. In accounting, you balance debits and credits to guarantee detecting errors; the system is self-checking. 

In analytics, there are so many ways to screw up. It's not that we're bad at it. It's hard. 

Is it possible for us to have two different ways to arrive at the same results? That should catch doc bugs and implementation bugs. 

Are you talking about two analysts performing the same work in parallel?

No. That would be awesome but impractical. Just looking at data from different sources, like Grafana. 

I can view hits from Varnish, and on the dashboard I can see the analytics numbers. 

They are roughly the same, so I can be comfortable that there are no glaring errors. Not an exact match, but close. 

Are all the wrinkles (e.g. 15-minute window, JavaScript) exposed through both paths? (Or most of them?)

In some cases, the queries would be the same, so it wouldn't really be a separate channel. 

As an example, zero results rate could be determined via both back end and front end. 

In theory, they would be identical. In practice, they should be nearly identical. 
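A minimal sketch of that cross-check, with hypothetical counts standing in for the two channels (backend request logs vs. front-end event logging):

    def zero_results_rate(zero_result_count, total_count):
        return zero_result_count / total_count

    def cross_check(backend_zrr, frontend_zrr, rel_tolerance=0.05):
        """Flag the metric if the two channels disagree by more than 5%."""
        diff = abs(backend_zrr - frontend_zrr)
        if diff > rel_tolerance * max(backend_zrr, frontend_zrr):
            raise ValueError(
                f"ZRR mismatch: backend={backend_zrr:.4f}, frontend={frontend_zrr:.4f}"
            )

    # Hypothetical numbers: nearly identical, so no alarm is raised.
    backend = zero_results_rate(18400, 100000)  # from backend request logs
    frontend = zero_results_rate(9300, 50000)   # from sampled front-end events
    cross_check(backend, frontend)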

That would require extra work, but so would documentation. 

And it's probably not possible for everything, but possible for most things. 

Yes, possible for most things. 

ZRR comes from request logs. Page stats also have that, but only for desktop. 

But one of the channels is more likely to have bot traffic. 

I like that idea. It's something we should look into. We need to make sure all 3 mobile platforms are represented.

Mobile schemas should already include that information.

Main takeaways:

  • Improve documentation
  • Where possible, record 2 sources of data to verify each other
  • Increase awareness of communication issues
  • Look into moving Erik's demo thingy into production

A lack of awareness was the root cause, so even if we have multiple data sources, we might not talk more, and might still have issues. 

Tools are unlikely to solve all the problems.