This page is obsolete. It is kept for historical interest only. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics
Plan to collect session-level metrics for the regular, beta, and alpha mobile sites
Next steps for the data collection phase
- Start storing cookies so that we can tell the alpha and beta sites apart. This requires a change to the logging format of the Varnish servers. It would also be the first time that we store cookies server-side, so we need to loop in Legal and get clearance. I have already started a conversation with Legal.
- Migrate from the space character to the tab character as the field delimiter so that every log line has a consistent number of fields (another change to the Varnish logging format). Currently the number of fields per log line varies, which significantly complicates the analysis phase. We have the patchsets ready and are aiming to deploy on February 1st. This will require support from Ops. We will also need to patch some parts of our legacy infrastructure (webstatscollector) and update the udp2log filters.
- Fix the 'not reading fast enough from incoming socket' problem for mobile traffic when storing data in Kraken. Because we cannot read fast enough, we drop log lines and end up with incomplete web traffic data. Our current proposal is to have the mobile Varnish servers send traffic to a separate udp2log instance, which would significantly reduce the traffic we need to consume: instead of consuming traffic from all cache servers and filtering for mobile, we would listen on a separate port that carries only mobile traffic. This is the one big unknown that could cause delays. The fallback strategy is to use sampled mobile log data, but that would rule out per-session analysis and we would have to settle for raw counts per day.
- Append a deduplication identifier to the query string to prevent double counting of pageview requests. Brion has already kicked off a discussion on a separate thread about different ways of implementing this. Hopefully the mobile team can quickly reach consensus on a solution and implement it.
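The delimiter change described above can be illustrated with a small sketch. The records, field layout, and the helper name `split_fields` are hypothetical and are not taken from the actual Varnish log format; the sketch only assumes that some field (here, a user agent string) may contain spaces, which is one plausible way a space-delimited line ends up with a variable field count.

```python
def split_fields(line, delimiter):
    """Split one log line into fields on the given delimiter."""
    return line.rstrip("\n").split(delimiter)


# Hypothetical records: hostname, timestamp, URL, user agent.
# The real log format has different fields; these are for illustration only.
records = [
    ["cp1044", "2013-02-01T00:00:00",
     "http://en.m.wikipedia.org/wiki/Main_Page",
     "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0)"],
    ["cp1045", "2013-02-01T00:00:01",
     "http://en.m.wikipedia.org/wiki/Varnish",
     "Opera/9.80"],
]

# Space-delimited: spaces inside a field are indistinguishable from
# delimiters, so the field count depends on the field contents.
space_counts = [len(split_fields(" ".join(r), " ")) for r in records]

# Tab-delimited: every line splits back into the same number of fields,
# as long as no field contains a tab.
tab_counts = [len(split_fields("\t".join(r), "\t")) for r in records]

print(space_counts)
print(tab_counts)
```

With tab as the delimiter, downstream tools can address a field by its position without special-casing lines whose free-text fields contain spaces.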
Next steps for the analysis phase
- Debianize dClass for device detection
- Write Pig dClass function
- Write Pig Sessionize function
- Write an Oozie job to automatically run the analysis on a recurring basis (hourly, daily, weekly, etc.).
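The sessionize step listed above could look roughly like the following sketch. It is written in Python rather than Pig purely for illustration, and it is not the planned implementation: the 30-minute inactivity timeout, the `(visitor_id, timestamp)` input shape, and the function name `sessionize` are all assumptions.

```python
from collections import defaultdict

# Assumed inactivity threshold; the plan does not specify one.
SESSION_TIMEOUT_SECONDS = 30 * 60


def sessionize(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (visitor_id, unix_timestamp) pairs into sessions.

    Returns a dict mapping each visitor_id to a list of sessions,
    where each session is a sorted list of timestamps. A new session
    starts whenever the gap since the previous request exceeds `timeout`.
    """
    by_visitor = defaultdict(list)
    for visitor_id, timestamp in events:
        by_visitor[visitor_id].append(timestamp)

    sessions = {}
    for visitor_id, stamps in by_visitor.items():
        stamps.sort()
        visitor_sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            if timestamp - visitor_sessions[-1][-1] > timeout:
                visitor_sessions.append([timestamp])  # gap too long: new session
            else:
                visitor_sessions[-1].append(timestamp)
        sessions[visitor_id] = visitor_sessions
    return sessions


# Hypothetical events: visitor "a" has two requests close together and one
# much later; visitor "b" has a single request.
example = sessionize([("a", 0), ("a", 600), ("a", 4000), ("b", 100)])
print(example)
```

Because sessions are built from consecutive requests by the same visitor, this kind of grouping needs unsampled data, which is why the sampled-log fallback would limit the analysis to raw daily counts.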