User:Spetrea

Maybe you'll find something of use here.

I keep this logbook/diary/journal to remember what to tell people when they ask me what I'm working on.



Stuff I've been fiddling with:


 * working on Hive queries for kraken
 * wikimetrics project, did some work on the survivor/threshold metrics and asynchronous cohort validation.
 * wikistats project ( gerrit repository, github mirror)
 * debianize scripts -- helper scripts to make debianization easier (discontinued in favor of git buildpackage)
 * webstatscollector
 * udp-filter
 * dClass device detection debian package and jni extension (was released, but the original author created support for JNI as well, The libdclass-dev.deb we created will be available indefinitely)
 * uaparser perl port of the ua-parser project
 * XS Perl extension to parse fields (this was specifically built for the new mobile pageviews reports, was discontinued due to lack of time)

If you have any questions, well ... hopefuly you won't have any questions (this page is created for the purpose of answering those questions).

You may find inactive links here(as I intend to write here for an extended period of time), and you may try archive.org for those inactive links.

HDD death & recovery
Got some 16GB of memory for the machine that runs the cluster. Bought an 64GB SSD as well. As soon as I was able to plug the SSD in, the HDD failed. So, hard-disk failure on that machine. New hardware is on the way, should arrive in the next 3h. (have to do some data recovery as well). Unexpected and weird as that HDD was from 2011 (2 years is a long HDD lifespan ? read some stuff on forums about some of these Seagates failing after 1-2 months, maybe it's more about luck). Apparently my usb stick is nowhere to be found, so I had to get one of those as well to do a Ubuntu reinstall on the desktop machine. Any disk-intensive operation causes all processes on the machine to segfault, then some memory corruption occurs and I'm forced to reboot it.

Found a way to do an integrity check for the b0rked HDD.

sudo smartctl -a /dev/sda | less

This doesn't output much (nothing).

Moreover, the failures started happening exactly after I tried to move the Virtualbox .vbox files for the VMs from the HDD => to the SSD.

Update: Did some trial & error, took out disks, put them back in, took out new RAM, put old RAM back in, tried and tried. Turns out the new memory is the problem. The memory is not compatible with the chipset of the motherboard. Nobody told me that. Have to solve it tommorow.

A small dream
If I had infinite time in each day, one of those days I'd write a crawler to get all computer parts specs, throw them in tables in a DB. Then I'd be able to run queries like this one:

SELECT motherboard.model, cpu.model, ram.model, gfxcard.model, hdd.model, ssd.model FROM motherboard JOIN cpu ON (motherboard.cpusocket = cpu.socket AND motherboard.chipset = cpu.chipset) JOIN motherboard ON motherboard.gfxport = gfxcard.port JOIN ram_mobo_compatibility ON (ram_mobo_compatibility.mobo = motherboard.model AND ram_mobo_compatibility.ram = ram.model ) [...];

Then I'd have no problem, everything would be awesome, I could just make a query, find what I need to get, get it, and it would just work and I could just work on what I have to work on and everyone would be happy and the sun would shine on the sky even in wintertime and foggy days. And I could also get what I'd need at the lowest price too. Except I don't have that time, and I have to do it by hand

2013-December-10
Made some progress on the UDF. Still getting some errors, have to figure them out.

Some background on ways of extending Hive
There are 3 ways of extending Hive's query language:


 * User-Defined Functions (UDF)
 * User-Defined Aggregation Functions (UDAF)
 * User-Defined Table Functions(UDTF)

For the purpose of card 1227 we're interested in UDFs.

UDFs should not be confused with SerDes (Serializer/Deserializer). They are both ways of extending the query language but while SerDes seem to be concerned more with serializing/deserializing data(for example the JSON SerDe), UDFs are more geared towards processing/transformations/extraction logic on one or more columns inside a table.

There are at least two types of UDFs:


 * hive.ql.exec.UDF
 * hive.ql.udf.generic.GenericUDF

When developing an UDF one extends one of those two classes and implements various methods for them. For plain UDFs it's much easier than for GenericUDFs. There's a lot more specific datatypes that need to be used for GenericUDFs as this blog post shows. It's also more involved as you have to implement more details of the UDF. For generic UDFs apparently hive.ql.udf.generic.GenericUDF the official documentation is more scarce than for plain UDFs. There are of course other forms of documentation, such as.. reading generic UDFs that other people implemented(or blog posts) and trying to deduce how they work or what is necessary in order to implement one.

For the simple UDFs the documentation is here 1. Apparently the GenericUDFs are used when the input/output is more complex. That's the thought-trajectory/motivation behind reading about GenericUDFs.

More to come soon.

2013-December-04
Read documentation on UDFs

My Standup update
Extend the UDF to get multiple values out of it. The following things will be accessible:


 * minor browser version
 * major browser version
 * browser vendor

Then I can write queries for all the 3 required groupings.

I'm currently reading up on UDFs and GenericUDFs about this. Now the UDF just outputs one text value(the vendor) so I have to extend it to return a Map instead.

Maybe I'll get some time to write some tests as well for this.

Sample output from the Mobile team
This particular section contains sample output for card 1227. This was provided to us by the mobile team. So it's roughly what they would expect.

Vendor table
Mobile Safari 35.87% (4064512) Android 25.14% (2848976) Other 9.45% (1070491) Chrome Mobile 6.35% (719102) Opera Mini 4.55% (515398) total 100.0% (11331934)

Report major version
Mobile Safari 6 29.8% (3377452) Android 4 13.84% (1568531) Android 2 11.2% (1268731) Other unknown 9.45% (1070491)

Report minor version
Mobile Safari 6.0 27.3% (3093981) Android 2.3 9.79% (1108994) Other unknown 9.45% (1070491) Android 4.1 7.27% (823372) Android 4.0 6.2% (702782) Googlebot 2.1 4.25% (481245) Chrome Mobile 18.0

Updated Onboarding page
Update parts of the onboarding page for the analytics team with details on hardware, useful software and mailing lists.

Hive query
Here's some sample output for the query I'm working on(before I added the percentages column): Windows 7,39079 iOS,13776 Windows XP,12781 Android,12006 Mac OS X,6659 Windows 8,4546 Other,3950 Windows Vista,2713 Windows,1209 Linux,839

It's yet to be decided whether we'll use dClass or ua-parser in the final query for this. The output and current implementation are using the java implementation of ua-parser.

Personally I'd like to implement UDFs and queries for both dClass and ua-parser to be able to compare them and have enough information to take an informed decision on which to use.

Some observations about Hive and the cluster I'm playing with
Almost forgot about this page for a while.

So a simple SELECT stuff FROM table LIMIT 10; would take around 40 seconds because of overhead(loading up hive, adding jars for serde/udfs, switching database, telling hive how to find some udfs you'll use in queries).

Writing a Hive UDF to count percentages of browser vendors and versions. It's working out well so far. Hive syntax has differences from that of MySQL. Apparently for each additional thing to a query, Hive creates a new job. For example, you want a sub-query, a new job, you want to add a LIMIT clause.. a new job, you want to INNER JOIN, a new job. All these jobs have a 20-30second overhead, so it all adds up.

I suspected this would turn out to be a problem. Now I have a query with a couple of joins and stuff, and I apparently managed to hit ~2m30s running time with a rather simple query. It's still ok I guess. Actually it's not ok... it depends how you view this. If you compare it to mysql, yeah, it has a lousy overhead. I'm actually still trying to cope with this overhead that Hive forces upon you while developing a query or a udf for it. Apparently there's no way around it. If you have any thoughts on this, I'd be very happy [mailto:i-get-so-much-email-I-can-barely-go@through-it(sarcasm).com to hear from you].

I have my own hand-rolled Hadoop/Hive cluster set up on one of my machines over here. Setting Hadoop/Hive is a pain, but at least I get full control over them and get to learn some puppet and vagrant in the process.

I unsuccesfuly tried to manually install hue. Got it running, logged in with some user I inserted manually through its sqlite user db(had to forward hue ports from master to dev laptop), then tried to run some hive queries on hue and apparently it just hangs. So much for my dream of using hue. Anyway, it seems it had some problems with the hiveserver2 configuration to which hue talks to. Hue being what it is, namely, a Python web app that talks to hiveserver2 through the thrift protocol there's lots of xml and configuration, I probably messed up some configurations. Certainly spent lots of time reading forums and mailing lists for it, but.. to no avail.

I am thinking about converting all this setup to puppet-cdh4 which has hue and all the stuff already packaged.

This is a simple ascii diagram of how I run this home-grown hadoop setup. Of course, it's not very detailed or useful(maybe later). I made the diagram using App::Asciio (you can also get it from the asciio package).

.--.                          | desktop                          | | machine                         | |                                 |                           |                                  |                           |    master   backup               | |    ____     ____                |                           |    |    |   |    |               |                           |    |____|   |____|               |   dev                     |    /::::/   /::::/               | laptop                  |       ^                          | ____                    |       |                          | |    |  --'                          | |____|     ssh & sshfs    |                                  | /::::/                   |    hadoop1  hadoop2  hadoop3     | |    ____     ____     ____       |                           |    |    |   |    |   |    |      |                           |    |____|   |____|   |____|      |                           |    /::::/   /::::/   /::::/      |                           |                                  |                           '--'

2013-April-10
It's been a while since I've updated this page.

500M bump was solved.

This is because we agreed to use mingle instead. So right now I'm working on card-60 and card-551.

The card 551 is focused at documentation, while the card 60 is focused on Mobile paveviews reports.

The documentation is also available under mediawiki format.

It's currently being generated automatically in 3 formats:


 * HTML
 * LaTeX (so it's available in pdf format)
 * POD
 * MediaWiki

2013-March-07
Attempted to fix 500M bump:


 * Modified code for new mobile pageviews to have the same logic in embr_py code. 500M bump still not solved


 * Modified code to use the same languages as embr_py. 500M bump still not solved


 * Used /a/squid/archive/sample instead of /a/squid/archive/sample-geocoded. 500M bump still not solved


 * dropped XS module completely, ported to Perl code, simplified a lot. 500M bump still not solved


 * sequential run(previous ones were parallel) to assure that the bump isn't related to the parallel model 500M bump still not solved

2013-March-06
Fixed a segfault in webstatscollector time_travel branch

2013-February-22
Today work will be done on the 500M bump.

An issue was made to treat backporting in debianize. Not sure what priority this has. I hope to find out more about this in the standup.

2013-February-21
Polished some changes to webstatscollector time_travel branch. Gerrit got in the way, not letting me put a new patchset to the already existing review. But I managed to get the changes in through another changeset.

This change is replacing spaces with underscores for the page titles in pagecounts output of filter in webstatscollector.

Evan Rosen published his research on the way Maxmind's different versions of its database affects the page views statistics over time.

2013-February-20
Provided support for building dClass on OSX, added detailed instructions in the README here.

2013-February-17
Erik provided some very useful details about the logic of wikistats. We'll use them in the new mobile pageviews report. Maybe we can find a way to overcome the 500M bump.

Evan provided this diagram which is the way he is counting pageviews. This will be implemented and compared with the logic in the new mobile pageviews reports.

New features in git2deblogs.pl the debianize:


 * --update implemented
 * tests for --update
 * travis CI integration by writing the .travis.yml file

All tests pass, debianize is green.

All maxmind databases for city/country/etc for all months and years are now packed into a big 11G zip file on stat1:/home/spetrea/maxmind_archive.zip

2013-February-15
More work on Limn debianization. A new user is now created when installing the limn package.

The package is installing limn in /usr/lib/limn

It is also creating a user called limn with a home directory /var/lib/limn.

2013-February-14
Evan Rosen asked that we collaborate in finding a unified definition of a mobile pageview. This page was created and it will be updated to reflect that definition "mobile pageview".

The graph inside is created using graphviz(more info on the dot syntax can be found here). If anyone is reading this and would like to contribute, please contact someone from the Analytics team.

2013-February-13
Taking some days off next week(not now).

Started working on debian package for Limn.

The 500M bump
This is the biggest problem we currently have with the new mobile pageviews reports.

We tried multiple approaches so far:


 * we tried eliminating all /w/api.php requests with action=opensearch
 * we tried disabling bots discarding
 * we tried removing /wiki/Special: urls
 * we tried to disable our 20x/30x check
 * we tried checking the mimetype density of processed requests in december 2012 before and after 14dec.

What we found so far is that:


 * the problem is most likely limited to the /wiki/ urls
 * the problem is limited to the urls with mimetype "-"

What we want to do next:

Classify the mimetype "-" requests with /wiki/ urls by


 * ip
 * url

This is part of a drilldown process so we can find out some features these requests in the 500M bump share together.

2013-February-12
New UI changes for Mobile reports.
 * legend has all the details about all the numbers in the table cell.
 * breakdown and discarded piecharts are now optional (selected through a checkbox)
 * the wiki and api counts are now optional (selected through a checkbox)

A sample was created with processed entries with mimetype "-" for days 1-14dec and 15-31dec.

Top 200 referer domains in december 2012 was generated.

Top 200 referer domains in december 2012 after 14dec.

Top 200 referer domains in december 2012 before 14dec.

New mobile pageviews reports
We have to drill down in the data so we'll make histograms with the mimetypes 1-14december and 15-31december 2012 to find out which of the mimetypes have a bigger share of the total.

Mimetype density chart made for 1-14dec, 15-31dec 2012.

Maxmind's database and the IP block re-allocation
The maxmind database changes as blocks of IPs are re-assigned regularly. We are using Maxmind's database indirectly through udp-filter. There is an archive of all maxmind databases to which we have access to. Udp-filter and any program who does geolocation should take into account the date of the log entry when the geolocation is done.

A solution to this would be to load all the maxmind databases in memory when doing geolocation and depending on the time of the log file to use the appropriate database.

This also applies to bot detection. We currently have in wikistats code that relies on various IP ranges. These IP ranges change across time. I'm not aware of a list of Google, Bing, Yahoo bot IP ranges across time (but it would be very helpful if we could find one)

The problem with Maxmind's geoip database is directly related to the country reports. Because the blocks get re-allocated the counts will be affected from one month to the other.

Ideally we should use different maxmind dbs for different time intervals.

What I'm currently working on
The main areas of focus are:


 * Country report( requested by Amit Kapoor )
 * New Mobile pageviews report (requested by the Mobile Team, in particular Tomasz Finc)
 * Solving bugs in wikistats (the bugs present in Asana requested by Erik Zachte )
 * Limn debianization
 * Device detection through the dClass library (requested by the Mobile Team)

New mobile pageviews reports
It now takes ~2h to generate reports for an entire year of data. And we can write tests for them as well because the functionality is split into classes. In this particular case, we have added map-reduce logic which can crunch the data in parallel. We also use templating to separate html/js code and rendering-specific code from the rest of the code.

Currently we're experiencing some difficulties with the months November and December 2012 and onwards because the API for mobile has changed and there are multiple requests per pageview. The vast majority of these requests are in /wiki/ requests as can be seen in revision 25 of the report here.