User:Spetrea

Hello,

My name is Stefan Petrea. For more information visit my website.

I'm working on the following at the moment:


 * wikistats project
 * dClass device detection Debian package and JNI extension
 * uaparser, a Perl port of the ua-parser project
 * XS Perl extension to parse fields (this was specifically built for the new mobile pageviews reports)

If you have any questions or requests, [mailto:stefan@garage-coding.com contact me at this e-mail].

New mobile pageviews reports
We have to drill down into the data, so we'll make histograms of the mimetypes for 1-14 December and 15-31 December 2012 to find out which mimetypes have a bigger share of the total.

Mimetype density chart for 1-14 December and 15-31 December 2012.
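The comparison described above can be sketched roughly as follows. This is only an illustration: the `mimetype_shares` helper and the sample entries are hypothetical, and the real input would come from the log files rather than from an in-memory list.

```python
from collections import Counter
from datetime import date

def mimetype_shares(entries, start, end):
    """Return {mimetype: share of all requests} for entries dated start..end."""
    counts = Counter(mime for day, mime in entries if start <= day <= end)
    total = sum(counts.values())
    return {mime: n / total for mime, n in counts.items()} if total else {}

# Hypothetical sample data: (date of request, mimetype)
entries = [
    (date(2012, 12, 3), "text/html"),
    (date(2012, 12, 5), "text/html"),
    (date(2012, 12, 9), "application/json"),
    (date(2012, 12, 10), "text/html"),
    (date(2012, 12, 16), "application/json"),
    (date(2012, 12, 20), "application/json"),
    (date(2012, 12, 21), "application/json"),
    (date(2012, 12, 28), "text/html"),
]

first_half = mimetype_shares(entries, date(2012, 12, 1), date(2012, 12, 14))
second_half = mimetype_shares(entries, date(2012, 12, 15), date(2012, 12, 31))

# Print the share of each mimetype in both periods side by side.
for mime in sorted(set(first_half) | set(second_half)):
    print(f"{mime:20s} {first_half.get(mime, 0):6.1%} -> {second_half.get(mime, 0):6.1%}")
```

Comparing shares rather than raw counts keeps the two periods comparable even though they cover a different number of days.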

Maxmind's database and the IP block re-allocation
The Maxmind database changes regularly as blocks of IPs are re-assigned. We use Maxmind's database indirectly, through udp-filter. We have access to an archive of all Maxmind databases. Udp-filter, and any other program that does geolocation, should take the date of the log entry into account when the geolocation is done.

A solution to this would be to load all the Maxmind databases into memory when doing geolocation and, depending on the date of the log file, to use the appropriate database.
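A minimal sketch of that lookup, assuming each archived database can be tagged with the date it was released. The `DatedDatabases` class is hypothetical, and the plain dicts below are stand-ins for real Maxmind database handles; the release dates and addresses are made up for illustration.

```python
import bisect
from datetime import date

class DatedDatabases:
    """Pick the database snapshot that was current on a given date."""

    def __init__(self, snapshots):
        # snapshots: iterable of (release_date, database) pairs
        ordered = sorted(snapshots, key=lambda pair: pair[0])
        self._dates = [d for d, _ in ordered]
        self._dbs = [db for _, db in ordered]

    def for_date(self, log_date):
        # Index of the latest snapshot released on or before log_date.
        i = bisect.bisect_right(self._dates, log_date) - 1
        if i < 0:
            # The log entry predates the archive: fall back to the oldest snapshot.
            i = 0
        return self._dbs[i]

# Hypothetical archive: the same address is assigned to different countries over time.
archive = DatedDatabases([
    (date(2012, 11, 6), {"192.0.2.1": "US"}),
    (date(2012, 12, 4), {"192.0.2.1": "CA"}),
])

country = archive.for_date(date(2012, 11, 20))["192.0.2.1"]
```

Whether all databases actually fit in memory at once is an open question here; loading them lazily per date range would be an alternative with the same lookup logic.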

This also applies to bot detection. Wikistats currently contains code that relies on various IP ranges, and these IP ranges change over time. I'm not aware of a list of Google, Bing and Yahoo bot IP ranges over time (but it would be very helpful if we could find one).

The problem with Maxmind's geoip database directly affects the country reports. Because the blocks get re-allocated, the counts will be affected from one month to the next.

Ideally we should use different Maxmind databases for different time intervals.

What I'm currently working on
The main areas of focus are:


 * Country report (requested by Amit Kapoor)
 * New Mobile pageviews report (requested by the Mobile Team, in particular Tomasz Finc)
 * Solving bugs in wikistats (the bugs present in Asana, requested by Erik Zachte)
 * Limn debianization
 * Device detection through the dClass library (requested by the Mobile Team)

Wikistats technical challenges
The technical challenges I'm facing with the wikistats codebase are the following:


 * speed (it takes 22h to generate a report for 1 month of data)
 * hard to test because of speed issues
 * the code is hard to test because it is not factored
 * again referring to the first item, it only uses 1 CPU instead of the 16 CPUs available on stat1, which makes it slow
 * the code produces "hand-crafted" HTML/JS code instead of using templates, so it is quite hard to fix a rendering bug because the report is composed of little pieces spread across many functions
 * undocumented. I should spend some time documenting the WikiReport.pl script, which I currently don't know how to run. This also affects the mobile pageviews report, which I don't know how to generate.
 * the code is procedural as opposed to object-oriented. It does not offer encapsulation, so there are lots of side effects.
 * the code does not use strict (This is a best practice in Perl because it prohibits undeclared global variables and prevents a lot of problems. Nearly all modern Perl code uses it. It is the analog of strict mode in JavaScript.)
 * the code uses global variables across modules. This makes it hard to separate the code and its logic into independent parts, and it also hurts the testability of the code.

All the problems above have a huge impact on the turnaround time to solve a bug in wikistats, because the way I usually work is to write code and test it on a small dataset. But some bugs only show up on a large dataset, and if the reports take a long time to run, then so does fixing the bugs. This is even more problematic because multiple teams request reports within a short period of time (which feels even shorter given the complexity caused by the list above).

If we had everything separated neatly into classes and wrote the counts for one day of data to one file, then we could scale, and we could also have testable code and write tests for it. There is currently a way to write tests for wikistats, which I implemented (and I encourage anyone working on the codebase to write as many such tests as possible). But because the code is not factored, the only way to test is to generate some log files, run SquidCountArchive.pl and SquidReportArchive.pl on them, and then either parse the generated HTML reports and check that the values are correct, or check the values inside the .csv files which are created after SquidCountArchive.pl runs. This is also unsatisfactory because it tests the code as a whole instead of individual methods and classes, since classes and methods do not exist in wikistats at the moment. We only have modules which expose different functions.
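The final checking step of such a whole-pipeline test could look roughly like this. It is only a sketch: the two-column `key,count` layout is an assumption made for illustration and is not the actual format of the wikistats .csv files, and the step that generates the logs and runs the Perl scripts is left out.

```python
import csv
import io

def read_counts(csv_file):
    """Read rows of the assumed form `key,count` into a dict."""
    return {row[0]: int(row[1]) for row in csv.reader(csv_file) if row}

def diff_counts(expected, actual):
    """Return {key: (expected, actual)} for every key whose count differs."""
    keys = set(expected) | set(actual)
    return {k: (expected.get(k, 0), actual.get(k, 0))
            for k in keys if expected.get(k, 0) != actual.get(k, 0)}

# Stand-in for a .csv file produced from a generated log with known contents.
produced = io.StringIO("US,3\nRO,2\nNL,1\n")
# The counts we expect, because we generated the log ourselves.
expected = {"US": 3, "RO": 2, "NL": 1}

mismatches = diff_counts(expected, read_counts(produced))
assert not mismatches, mismatches
```

Reporting the differing keys, instead of a single pass/fail, makes it easier to see which part of a report went wrong when the whole pipeline is the only thing that can be tested.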

This means that the type of testing that can be done is black-box testing rather than unit testing. The cascading effects of the current design go on and on. Another one of them is the impossibility of getting code coverage for wikistats, because we cannot have unit tests in the first place.

So in a sense, it's better that we can write tests now, but this is nowhere near a desirable situation where we can test code properly.

Going back to the speed problems: if it takes 22 hours to generate a report for one month of data, then once we have a fix ready we have to wait 22 hours to see whether it actually fixed the problem for that month. For one year of data, that is 12 × 22 = 264 hours, or 11 days. So if a bug is only reproducible by running the reports for one year of data, we have to wait 11 days just to get those reports. Of course, the solution in that case is to narrow down the data so that the problem can be reproduced on a smaller scale.

New mobile pageviews reports
We have addressed all of the problems mentioned in the section above in the new mobile pageviews report. It now takes ~2h to generate reports for an entire year of data. We can also write tests for it because the functionality is split into classes. In this particular case, we have added map-reduce logic which can crunch the data in parallel. We also use templating to separate HTML/JS and rendering-specific code from the rest of the code.
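To illustrate the general map-reduce idea (this is not the actual code of the report), here is a small sketch. The `count_day`, `merge` and `run` functions and the `<key> <url>` line format are assumptions made for the example.

```python
from collections import Counter
from functools import reduce

def count_day(lines):
    """Map step: count requests per key for one day of log lines."""
    # Assumed line format for this sketch: "<key> <url>"
    return Counter(line.split()[0] for line in lines if line.strip())

def merge(total, partial):
    """Reduce step: add one day's counts into the running total."""
    total.update(partial)
    return total

def run(days, map_fn=map):
    """Apply the map step to every day, then reduce the partial counts."""
    return reduce(merge, map_fn(count_day, days), Counter())

# Hypothetical input: one list of log lines per day.
days = [
    ["en /wiki/Foo", "de /wiki/Bar"],
    ["en /wiki/Baz", "en /wiki/Foo"],
]

totals = run(days)
```

Because each day is counted independently, the map step can be spread over several CPUs, for example by passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn`. Keeping `count_day` and `merge` as small pure functions is also what makes them easy to unit test.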

Currently we're experiencing some difficulties with the data from November and December 2012 onwards, because the API for mobile has changed and there are now multiple requests per pageview. The vast majority of these requests are /wiki/ requests, as can be seen in revision 25 of the report here.