User:Spetrea

I keep this logbook/diary/journal to remember what to tell people when they ask me what I'm working on.

At the same time it has a second role, to write down things that I find useful so I can use them or refer to them later.



Stuff I've been fiddling with:


 * working on Hive queries for kraken
 * wikimetrics project, did some work on the survivor/threshold metrics and asynchronous cohort validation.
 * wikistats project ( gerrit repository, github mirror)
 * debianize scripts -- helper scripts to make debianization easier (discontinued in favor of git buildpackage)
 * webstatscollector
 * udp-filter
 * dClass device detection debian package and jni extension (was released, but the original author created support for JNI as well, The libdclass-dev.deb we created will be available indefinitely)
 * uaparser perl port of the ua-parser project
 * XS Perl extension to parse fields (this was specifically built for the new mobile pageviews reports, was discontinued due to lack of time)

If you have any questions, well ... hopefuly you won't have any questions (this page is created for the purpose of answering those questions).

You may find inactive links here(as I intend to write here for an extended period of time), and you may try archive.org for those inactive links.

alternatives to google hangouts
There are alternatives to google hangouts. For the voip(audio) feature one solution is mumble(wiki article) which is a open source, low-latency, high quality voice chat software primarily intended for use while gaming. The difference here between google hangouts and mumble is that the server is also provided so it's a complete solution from that point of view(I've been using it for quite some time now and I can say it's awesome, it requires zero configuration on server-side, it just works and it works great). Mumble also offers text-chat.

For live screensharing vnc is one solution, and one implementation is tightvnc in read-only mode which would offer the same screensharing feature that google hangouts does.

For video, mumble has some plans to implement this in the near future. But in the meantime ffmpeg can create video streams of your webcam perfectly fine.

The problem with these solutions is that they are all fragments of google hangouts. Integration would require some work. But if you're interested in just audio conferences, text-chat then mumble is a good solution for that.

Tightvnc also works ok, although if you want to use vncserver on your machine and you're behind some router and don't have a public IP you're probably going to have to do some trickery with a free DNS of your choice or maybe tunnel it through some machine on labs. In the easy situation where the router's IP is public you can just forward port 5900 to the machine that hosts vncserver. The same goes for murmur(the server associated with mumble) and the port 64738.

Another option is jitsi which was quite bloated and slow the last time I tried it but it worked.

There are some solutions over the SIP protocol, I think those are annoying to configure so I'll leave them out(for now, maybe I'll have a look at them later although I doubt it).

So, why not just use Google Hangouts? It was running very well about 6 months ago, but since then, they keep adding stuff to it. Personally I'm not sure what they're adding, I was really satisfied with what they had back then, I don't understand why they keep changing it, it worked perfectly fine. It's also closed source which makes it a bit strange(although libjingle seems to have nearly all of google talk plugin functionality, that leaves the question where's the rest of it ? unanswered, since it's closed source).

sharing terminal sessions or screencasts
I find it useful to sometimes record a demo, but only when I have something to show. I've made some screencasts in the past with ffmpeg. A ~20 minute screencast recorded with ffmpeg takes around 20-30MB. Recently I found out about showterm.io and asciinema which also allow you to make casts of your terminal. The main difference is the recordings made with showterm/asciinema are purely text-based as opposed to video. I haven't tried them, not sure if they support audio.

alternative to Vagrant's default provider
While VirtualBox is the default provider for Vagrant there are other ones as well. One notable alternative is LXC(linux containers) which is considerably faster than VirtualBox. You can try it out like this:

vagrant plugin install vagrant-lxc

Afterwards, you can get a LXC packed box from vagrantbox.es (at the time of writing this, there were LXC boxes for CentOS, Debian, Ubuntu, OpenMandriva).

You can read more about vagrant-lxc here.

Some drawbacks I can see are property names for the Vagrantfile have different names when using vagrant-lxc than with the default provider(VirtualBox).

There are some other network-related limitations(at this time) for vagrant-lxc.

wandering around
Talked to Ori on IRC. Found out about a project called operations/software/mwprof which is (from my limited understanding) a way of profiling live instances of mediawiki, in particular counting of method calls.

It's similar to webstatscollector, but written for a totally different purpose. I talked to Ori and found out he wanted JSON output instead of XML. So I made a gerrit change with a JSON schema, validation, and some code to write the stats for it in JSON format using json-glib.

2013-December-13
Got the 3rd set of memories. This one doesn't work either. Tommorow I'll have to go at it again, but this time in the computer hardware store, and have other people assemble it and test it for me on the spot. Turns out I'm not a hardware person.

memory compatibility
As mentioned before, the problem was with the memory and not with the disks.

I tried two different sets of RAM, none of which worked, they are incompatible with the motherboard (I was using 8GB RAM cards and was unaware of the frequencies of RAM cards I had to choose until today)

I read tomshardware forums, called Kingston offices and asked about the problem, none of which helped. In the end, I found out the answer through conversations with a friend and by reading the motherboard manual. I also found a QVL inside the manual indicating that ASUS has already tested the motherboard with a lot of different memories from different vendors for compatibility.

Apparently the motherboard only accepts 1,2,4 GB memory DIMMs.

Ordered another set of 4 x 4GB DIMM cards.

It's quite clear I'm not a hardware person by any stretch of the means.

ramdisks, SSDs/SSHDs for performance
Let's say you have a VM that was running something that made extensive use of the disks and you have large amounts of RAM available.

You could make a ramdisk, move your VM to the ramdisk, and then link the VirtualBox directory back to $HOME so vagrant/VirtualBox can find it.

mount -t tmpfs -o size=4096M tmpfs /mnt/ramdisk1 mv ~/VirtualBox VMs/ /mnt/ramdisk1/ cd $HOME ln -s "/mnt/ramdisk1/VirtualBox VMs"

Then you could run and use your VM which now would be stored completely within memory. Of course that would also mean you would lose the contents of the ramdisk once you reboot the physical machine, but you can back it up right before you do, and put it back in the ramdisk when you boot up again.

Another solution is putting the same ~/VirtualBox VMs directory on an SSD or SSHD.

HDD death & recovery
Got some 16GB of memory for the machine that runs the cluster. Bought an 64GB SSD as well. As soon as I was able to plug the SSD in, the HDD failed. So, hard-disk failure on that machine. New hardware is on the way, should arrive in the next 3h. (have to do some data recovery as well). Unexpected and weird as that HDD was from 2011 (2 years is a long HDD lifespan ? read some stuff on forums about some of these Seagates failing after 1-2 months, maybe it's more about luck). Apparently my usb stick is nowhere to be found, so I had to get one of those as well to do a Ubuntu reinstall on the desktop machine. Any disk-intensive operation causes all processes on the machine to segfault, then some memory corruption occurs and I'm forced to reboot it.

Found a way to do an integrity check for the b0rked HDD.

sudo smartctl -a /dev/sda | less

This doesn't output much (nothing).

Moreover, the failures started happening exactly after I tried to move the Virtualbox .vbox files for the VMs from the HDD => to the SSD.

Update: Did some trial & error, took out disks, put them back in, took out new RAM, put old RAM back in, tried and tried. Turns out the new memory is the problem. The memory is not compatible with the chipset of the motherboard. Nobody told me that. Have to solve it tommorow.

A small dream
If I had infinite time in each day, one of those days I'd write a crawler to get all computer parts specs, throw them in tables in a DB. Then I'd be able to run queries like this one:

SELECT motherboard.model, cpu.model, ram.model, gfxcard.model, hdd.model, ssd.model FROM motherboard JOIN cpu ON (motherboard.cpusocket = cpu.socket AND motherboard.chipset = cpu.chipset) JOIN motherboard ON motherboard.gfxport = gfxcard.port JOIN ram_mobo_compatibility ON (ram_mobo_compatibility.mobo = motherboard.model AND ram_mobo_compatibility.ram = ram.model ) [...];

Then I'd have no problem, everything would be awesome, I could just make a query, find what I need to get, get it, and it would just work and I could just work on what I have to work on and everyone would be happy and the sun would shine on the sky even in wintertime and foggy days. And I could also get what I'd need at the lowest price too. Except I don't have that time, and I have to do it by hand

2013-December-10
Made some progress on the UDF. Still getting some errors, have to figure them out.

Some background on ways of extending Hive
There are 3 ways of extending Hive's query language:


 * User-Defined Functions (UDF)
 * User-Defined Aggregation Functions (UDAF)
 * User-Defined Table Functions(UDTF)

For the purpose of card 1227 we're interested in UDFs.

UDFs should not be confused with SerDes (Serializer/Deserializer). They are both ways of extending the query language but while SerDes seem to be concerned more with serializing/deserializing data(for example the JSON SerDe), UDFs are more geared towards processing/transformations/extraction logic on one or more columns inside a table.

There are at least two types of UDFs:


 * hive.ql.exec.UDF
 * hive.ql.udf.generic.GenericUDF

When developing an UDF one extends one of those two classes and implements various methods for them. For plain UDFs it's much easier than for GenericUDFs. There's a lot more specific datatypes that need to be used for GenericUDFs as this blog post shows. It's also more involved as you have to implement more details of the UDF. For generic UDFs apparently hive.ql.udf.generic.GenericUDF the official documentation is more scarce than for plain UDFs. There are of course other forms of documentation, such as.. reading generic UDFs that other people implemented(or blog posts) and trying to deduce how they work or what is necessary in order to implement one.

For the simple UDFs the documentation is here 1. Apparently the GenericUDFs are used when the input/output is more complex. That's the thought-trajectory/motivation behind reading about GenericUDFs.

More to come soon.

2013-December-04
Read documentation on UDFs

My Standup update
Extend the UDF to get multiple values out of it. The following things will be accessible:


 * minor browser version
 * major browser version
 * browser vendor

Then I can write queries for all the 3 required groupings.

I'm currently reading up on UDFs and GenericUDFs about this. Now the UDF just outputs one text value(the vendor) so I have to extend it to return a Map instead.

Maybe I'll get some time to write some tests as well for this.

Sample output from the Mobile team
This particular section contains sample output for card 1227. This was provided to us by the mobile team. So it's roughly what they would expect.

Vendor table
Mobile Safari 35.87% (4064512) Android 25.14% (2848976) Other 9.45% (1070491) Chrome Mobile 6.35% (719102) Opera Mini 4.55% (515398) total 100.0% (11331934)

Report major version
Mobile Safari 6 29.8% (3377452) Android 4 13.84% (1568531) Android 2 11.2% (1268731) Other unknown 9.45% (1070491)

Report minor version
Mobile Safari 6.0 27.3% (3093981) Android 2.3 9.79% (1108994) Other unknown 9.45% (1070491) Android 4.1 7.27% (823372) Android 4.0 6.2% (702782) Googlebot 2.1 4.25% (481245) Chrome Mobile 18.0

Updated Onboarding page
Update parts of the onboarding page for the analytics team with details on hardware, useful software and mailing lists.

Hive query
Here's some sample output for the query I'm working on(before I added the percentages column): Windows 7,39079 iOS,13776 Windows XP,12781 Android,12006 Mac OS X,6659 Windows 8,4546 Other,3950 Windows Vista,2713 Windows,1209 Linux,839

It's yet to be decided whether we'll use dClass or ua-parser in the final query for this. The output and current implementation are using the java implementation of ua-parser.

Personally I'd like to implement UDFs and queries for both dClass and ua-parser to be able to compare them and have enough information to take an informed decision on which to use.

Some observations about Hive and the cluster I'm playing with
Almost forgot about this page for a while.

So a simple SELECT stuff FROM table LIMIT 10; would take around 40 seconds because of overhead(loading up hive, adding jars for serde/udfs, switching database, telling hive how to find some udfs you'll use in queries).

Writing a Hive UDF to count percentages of browser vendors and versions. It's working out well so far. Hive syntax has differences from that of MySQL. Apparently for each additional thing to a query, Hive creates a new job. For example, you want a sub-query, a new job, you want to add a LIMIT clause.. a new job, you want to INNER JOIN, a new job. All these jobs have a 20-30second overhead, so it all adds up.

I suspected this would turn out to be a problem. Now I have a query with a couple of joins and stuff, and I apparently managed to hit ~2m30s running time with a rather simple query. It's still ok I guess. Actually it's not ok... it depends how you view this. If you compare it to mysql, yeah, it has a lousy overhead. I'm actually still trying to cope with this overhead that Hive forces upon you while developing a query or a udf for it. Apparently there's no way around it. If you have any thoughts on this, I'd be very happy [mailto:i-get-so-much-email-I-can-barely-go@through-it(sarcasm).com to hear from you].

I have my own hand-rolled Hadoop/Hive cluster set up on one of my machines over here. Setting Hadoop/Hive is a pain, but at least I get full control over them and get to learn some puppet and vagrant in the process.

I unsuccesfuly tried to manually install hue. Got it running, logged in with some user I inserted manually through its sqlite user db(had to forward hue ports from master to dev laptop), then tried to run some hive queries on hue and apparently it just hangs. So much for my dream of using hue. Anyway, it seems it had some problems with the hiveserver2 configuration to which hue talks to. Hue being what it is, namely, a Python web app that talks to hiveserver2 through the thrift protocol there's lots of xml and configuration, I probably messed up some configurations. Certainly spent lots of time reading forums and mailing lists for it, but.. to no avail.

I am thinking about converting all this setup to puppet-cdh4 which has hue and all the stuff already packaged.

This is a simple ascii diagram of how I run this home-grown hadoop setup. Of course, it's not very detailed or useful(maybe later). I made the diagram using App::Asciio (you can also get it from the asciio package).

.--.                          | desktop                          | | machine                         | |                                 |                           |                                  |                           |    master   backup               | |    ____     ____                |                           |    |    |   |    |               |                           |    |____|   |____|               |   dev                     |    /::::/   /::::/               | laptop                  |       ^                          | ____                    |       |                          | |    |  --'                          | |____|     ssh & sshfs    |                                  | /::::/                   |    hadoop1  hadoop2  hadoop3     | |    ____     ____     ____       |                           |    |    |   |    |   |    |      |                           |    |____|   |____|   |____|      |                           |    /::::/   /::::/   /::::/      |                           |                                  |                           '--'

2013-April-10
It's been a while since I've updated this page.

500M bump was solved.

This is because we agreed to use mingle instead. So right now I'm working on card-60 and card-551.

The card 551 is focused at documentation, while the card 60 is focused on Mobile paveviews reports.

The documentation is also available under mediawiki format.

It's currently being generated automatically in 3 formats:


 * HTML
 * LaTeX (so it's available in pdf format)
 * POD
 * MediaWiki

2013-March-07
Attempted to fix 500M bump:


 * Modified code for new mobile pageviews to have the same logic in embr_py code. 500M bump still not solved


 * Modified code to use the same languages as embr_py. 500M bump still not solved


 * Used /a/squid/archive/sample instead of /a/squid/archive/sample-geocoded. 500M bump still not solved


 * dropped XS module completely, ported to Perl code, simplified a lot. 500M bump still not solved


 * sequential run(previous ones were parallel) to assure that the bump isn't related to the parallel model 500M bump still not solved

2013-March-06
Fixed a segfault in webstatscollector time_travel branch

2013-February-22
Today work will be done on the 500M bump.

An issue was made to treat backporting in debianize. Not sure what priority this has. I hope to find out more about this in the standup.

2013-February-21
Polished some changes to webstatscollector time_travel branch. Gerrit got in the way, not letting me put a new patchset to the already existing review. But I managed to get the changes in through another changeset.

This change is replacing spaces with underscores for the page titles in pagecounts output of filter in webstatscollector.

Evan Rosen published his research on the way Maxmind's different versions of its database affects the page views statistics over time.

2013-February-20
Provided support for building dClass on OSX, added detailed instructions in the README here.

2013-February-17
Erik provided some very useful details about the logic of wikistats. We'll use them in the new mobile pageviews report. Maybe we can find a way to overcome the 500M bump.

Evan provided this diagram which is the way he is counting pageviews. This will be implemented and compared with the logic in the new mobile pageviews reports.

New features in git2deblogs.pl the debianize:


 * --update implemented
 * tests for --update
 * travis CI integration by writing the .travis.yml file

All tests pass, debianize is green.

All maxmind databases for city/country/etc for all months and years are now packed into a big 11G zip file on stat1:/home/spetrea/maxmind_archive.zip

2013-February-15
More work on Limn debianization. A new user is now created when installing the limn package.

The package is installing limn in /usr/lib/limn

It is also creating a user called limn with a home directory /var/lib/limn.

2013-February-14
Evan Rosen asked that we collaborate in finding a unified definition of a mobile pageview. This page was created and it will be updated to reflect that definition "mobile pageview".

The graph inside is created using graphviz(more info on the dot syntax can be found here). If anyone is reading this and would like to contribute, please contact someone from the Analytics team.

2013-February-13
Taking some days off next week(not now).

Started working on debian package for Limn.

The 500M bump
This is the biggest problem we currently have with the new mobile pageviews reports.

We tried multiple approaches so far:


 * we tried eliminating all /w/api.php requests with action=opensearch
 * we tried disabling bots discarding
 * we tried removing /wiki/Special: urls
 * we tried to disable our 20x/30x check
 * we tried checking the mimetype density of processed requests in december 2012 before and after 14dec.

What we found so far is that:


 * the problem is most likely limited to the /wiki/ urls
 * the problem is limited to the urls with mimetype "-"

What we want to do next:

Classify the mimetype "-" requests with /wiki/ urls by


 * ip
 * url

This is part of a drilldown process so we can find out some features these requests in the 500M bump share together.

2013-February-12
New UI changes for Mobile reports.
 * legend has all the details about all the numbers in the table cell.
 * breakdown and discarded piecharts are now optional (selected through a checkbox)
 * the wiki and api counts are now optional (selected through a checkbox)

A sample was created with processed entries with mimetype "-" for days 1-14dec and 15-31dec.

Top 200 referer domains in december 2012 was generated.

Top 200 referer domains in december 2012 after 14dec.

Top 200 referer domains in december 2012 before 14dec.

New mobile pageviews reports
We have to drill down in the data so we'll make histograms with the mimetypes 1-14december and 15-31december 2012 to find out which of the mimetypes have a bigger share of the total.

Mimetype density chart made for 1-14dec, 15-31dec 2012.

Maxmind's database and the IP block re-allocation
The maxmind database changes as blocks of IPs are re-assigned regularly. We are using Maxmind's database indirectly through udp-filter. There is an archive of all maxmind databases to which we have access to. Udp-filter and any program who does geolocation should take into account the date of the log entry when the geolocation is done.

A solution to this would be to load all the maxmind databases in memory when doing geolocation and depending on the time of the log file to use the appropriate database.

This also applies to bot detection. We currently have in wikistats code that relies on various IP ranges. These IP ranges change across time. I'm not aware of a list of Google, Bing, Yahoo bot IP ranges across time (but it would be very helpful if we could find one)

The problem with Maxmind's geoip database is directly related to the country reports. Because the blocks get re-allocated the counts will be affected from one month to the other.

Ideally we should use different maxmind dbs for different time intervals.

What I'm currently working on
The main areas of focus are:


 * Country report( requested by Amit Kapoor )
 * New Mobile pageviews report (requested by the Mobile Team, in particular Tomasz Finc)
 * Solving bugs in wikistats (the bugs present in Asana requested by Erik Zachte )
 * Limn debianization
 * Device detection through the dClass library (requested by the Mobile Team)

New mobile pageviews reports
It now takes ~2h to generate reports for an entire year of data. And we can write tests for them as well because the functionality is split into classes. In this particular case, we have added map-reduce logic which can crunch the data in parallel. We also use templating to separate html/js code and rendering-specific code from the rest of the code.

Currently we're experiencing some difficulties with the months November and December 2012 and onwards because the API for mobile has changed and there are multiple requests per pageview. The vast majority of these requests are in /wiki/ requests as can be seen in revision 25 of the report here.