User:Spetrea

So you were on some sort of journey on this small ship and the storm broke your sails, the lightning struck your deck, it caught on fire, the waves rocked the ship and the compass broke on the floor. Look at you, you're in the middle of a tempest. Here's a logbook, and some maps and a compass. Maybe maybe, maybe not.

I'm working on the following at the moment:


 * working on Hive queries for kraken
 * wikimetrics project, did some work on the survivor/threshold metrics and asynchronous cohort validation.
 * wikistats project ( gerrit repository, github mirror)
 * debianize scripts -- helper scripts to make debianization easier (discontinued in favor of git buildpackage)
 * webstatscollector
 * udp-filter
 * dClass device detection debian package and jni extension (was released, but the original author created support for JNI as well, The libdclass-dev.deb we created will be available indefinitely)
 * uaparser perl port of the ua-parser project
 * XS Perl extension to parse fields (this was specifically built for the new mobile pageviews reports, was discontinued due to lack of time)

If you have any [mailto:stefan@garage-coding.com questions or requests contact me at this e-mail].

2013-December-02
Almost forgot about this page for a while.

Writing a Hive UDF to count percentages of browser vendors and versions. It's working out well so far. Hive syntax has differences from that of MySQL. Apparently for each additional thing to a query, Hive creates a new job. For example, you want a sub-query, a new job, you want to add a LIMIT clause.. a new job, you want to INNER JOIN, a new job.

So a simple SELECT stuff FROM table LIMIT 10; would take around 40 seconds because of overhead(loading up hive, adding jars for serde/udfs, switching database, telling hive how to find some udfs you'll use in queries).

I suspected this would turn out to be a problem. Now I have a query with a couple of joins and stuff, and I apparently managed to hit ~2m30s running time with a rather simple query. It's still ok I guess. Actually it's not ok... it depends how you view this. If you're like "But MySQL would've taken 0.001 seconds to run that", then you'd be right. I'm actually still trying to cope with this overhead that Hive forces upon you while developing a query or a udf for it. Apparently there's no way to avoid it. If you have any thoughts on this, I'd be very happy to hear from you.

I have my own hand-rolled Hadoop/Hive cluster set up on one of my machines over here. Setting Hadoop/Hive is a pain, but at least I get full control over them and get to learn some puppet in the process. I am thinking about converting all this setup to puppet-cdh4.

2013-April-10
It's been a while since I've updated this page.

500M bump was solved.

This is because we agreed to use mingle instead. So right now I'm working on card-60 and card-551.

The card 551 is focused at documentation, while the card 60 is focused on Mobile paveviews reports.

The documentation is also available under mediawiki format.

It's currently being generated automatically in 3 formats:


 * HTML
 * LaTeX (so it's available in pdf format)
 * POD
 * MediaWiki

2013-March-07
Attempted to fix 500M bump:


 * Modified code for new mobile pageviews to have the same logic in embr_py code. 500M bump still not solved


 * Modified code to use the same languages as embr_py. 500M bump still not solved


 * Used /a/squid/archive/sample instead of /a/squid/archive/sample-geocoded. 500M bump still not solved


 * dropped XS module completely, ported to Perl code, simplified a lot. 500M bump still not solved


 * sequential run(previous ones were parallel) to assure that the bump isn't related to the parallel model 500M bump still not solved

2013-March-06
Fixed a segfault in webstatscollector time_travel branch

2013-February-22
Today work will be done on the 500M bump.

An issue was made to treat backporting in debianize. Not sure what priority this has. I hope to find out more about this in the standup.

2013-February-21
Polished some changes to webstatscollector time_travel branch. Gerrit got in the way, not letting me put a new patchset to the already existing review. But I managed to get the changes in through another changeset.

This change is replacing spaces with underscores for the page titles in pagecounts output of filter in webstatscollector.

Evan Rosen published his research on the way Maxmind's different versions of its database affects the page views statistics over time.

2013-February-20
Provided support for building dClass on OSX, added detailed instructions in the README here.

2013-February-17
Erik provided some very useful details about the logic of wikistats. We'll use them in the new mobile pageviews report. Maybe we can find a way to overcome the 500M bump.

Evan provided this diagram which is the way he is counting pageviews. This will be implemented and compared with the logic in the new mobile pageviews reports.

New features in git2deblogs.pl the debianize:


 * --update implemented
 * tests for --update
 * travis CI integration by writing the .travis.yml file

All tests pass, debianize is green.

All maxmind databases for city/country/etc for all months and years are now packed into a big 11G zip file on stat1:/home/spetrea/maxmind_archive.zip

2013-February-15
More work on Limn debianization. A new user is now created when installing the limn package.

The package is installing limn in /usr/lib/limn

It is also creating a user called limn with a home directory /var/lib/limn.

2013-February-14
Evan Rosen asked that we collaborate in finding a unified definition of a mobile pageview. This page was created and it will be updated to reflect that definition "mobile pageview".

The graph inside is created using graphviz(more info on the dot syntax can be found here). If anyone is reading this and would like to contribute, please contact someone from the Analytics team.

2013-February-13
Taking some days off next week(not now).

Started working on debian package for Limn.

The 500M bump
This is the biggest problem we currently have with the new mobile pageviews reports.

We tried multiple approaches so far:


 * we tried eliminating all /w/api.php requests with action=opensearch
 * we tried disabling bots discarding
 * we tried removing /wiki/Special: urls
 * we tried to disable our 20x/30x check
 * we tried checking the mimetype density of processed requests in december 2012 before and after 14dec.

What we found so far is that:


 * the problem is most likely limited to the /wiki/ urls
 * the problem is limited to the urls with mimetype "-"

What we want to do next:

Classify the mimetype "-" requests with /wiki/ urls by


 * ip
 * url

This is part of a drilldown process so we can find out some features these requests in the 500M bump share together.

2013-February-12
New UI changes for Mobile reports.
 * legend has all the details about all the numbers in the table cell.
 * breakdown and discarded piecharts are now optional (selected through a checkbox)
 * the wiki and api counts are now optional (selected through a checkbox)

A sample was created with processed entries with mimetype "-" for days 1-14dec and 15-31dec.

Top 200 referer domains in december 2012 was generated.

Top 200 referer domains in december 2012 after 14dec.

Top 200 referer domains in december 2012 before 14dec.

New mobile pageviews reports
We have to drill down in the data so we'll make histograms with the mimetypes 1-14december and 15-31december 2012 to find out which of the mimetypes have a bigger share of the total.

Mimetype density chart made for 1-14dec, 15-31dec 2012.

Maxmind's database and the IP block re-allocation
The maxmind database changes as blocks of IPs are re-assigned regularly. We are using Maxmind's database indirectly through udp-filter. There is an archive of all maxmind databases to which we have access to. Udp-filter and any program who does geolocation should take into account the date of the log entry when the geolocation is done.

A solution to this would be to load all the maxmind databases in memory when doing geolocation and depending on the time of the log file to use the appropriate database.

This also applies to bot detection. We currently have in wikistats code that relies on various IP ranges. These IP ranges change across time. I'm not aware of a list of Google, Bing, Yahoo bot IP ranges across time (but it would be very helpful if we could find one)

The problem with Maxmind's geoip database is directly related to the country reports. Because the blocks get re-allocated the counts will be affected from one month to the other.

Ideally we should use different maxmind dbs for different time intervals.

What I'm currently working on
The main areas of focus are:


 * Country report( requested by Amit Kapoor )
 * New Mobile pageviews report (requested by the Mobile Team, in particular Tomasz Finc)
 * Solving bugs in wikistats (the bugs present in Asana requested by Erik Zachte )
 * Limn debianization
 * Device detection through the dClass library (requested by the Mobile Team)

New mobile pageviews reports
It now takes ~2h to generate reports for an entire year of data. And we can write tests for them as well because the functionality is split into classes. In this particular case, we have added map-reduce logic which can crunch the data in parallel. We also use templating to separate html/js code and rendering-specific code from the rest of the code.

Currently we're experiencing some difficulties with the months November and December 2012 and onwards because the API for mobile has changed and there are multiple requests per pageview. The vast majority of these requests are in /wiki/ requests as can be seen in revision 25 of the report here.