Growth/Analytics updates/Work log/2018-10-05

From mediawiki.org

Account creations from mobile[edit]

In T205760, I was asked to figure out what percentage of accounts were created from mobile devices vs. desktop. This should exclude auto-created accounts, and it would be preferable if it could be split further by device (e.g. iOS and Android).

I found the ServerSideAccountCreation schema, which should enable me to answer these questions. From what I can tell by exploring the data, if the isApi flag is set the account was most likely created using a mobile app. Access to account creation through the API is not restricted to the mobile apps, so we might encounter non-app creations. While this binary heuristic is crude, it did appear to flag the vast majority of cases I looked at, and this project did not allow for developing a more sophisticated heuristic.

If the isApi flag is not set, we can distinguish between the mobile and desktop sites using the displayMobile flag. EventLogging also has the webhost field, but that turns out to never be set to the mobile site in this case (perhaps why the displayMobile flag is there?) These two flags combined allows us to split account creations into three categories: apps, mobile, and desktop.

Exploring the data further I found that I can use the os_family property of the event to identify Android and iOS devices. I chose to not investigate further splits (e.g. identifying Windows phones), and instead put everything that is neither Android nor iOS in "other".

I was interested in getting data from 2016-01-01 onwards and plot a monthly time series, but this was hindered by the Data Lake only having consistent data starting in early February 2018. More data is available, but the tight deadline did not allow for merging of these data sources. Instead, I chose to use 6 months of data, covering February through July 2018. The data for February is not complete, but I am assuming that this affects every category equally since it's a time-based exclusion. Since the task is to get proportions and not specific counts, this should not substantially affect the results.

I ended up writing the HQL query below to grab data for this from the Data Lake. The name of the resulting table and the wiki it grabs data from is changed for each of the six wikis we are grabbing data from.

DROP TABLE IF EXISTS nettrom_growth.cs_creations_by_site;
CREATE TABLE nettrom_growth.cs_creations_by_site
AS SELECT SUBSTRING(u.user_registration_timestamp, 1, 7) AS reg_month,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = FALSE
          AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_desktop_other,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = FALSE
          AND e.os_family = 'Android', 1, 0)) AS n_desktop_android,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = FALSE
          AND e.os_family = 'iOS', 1, 0)) AS n_desktop_ios,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = TRUE
          AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_mobile_other,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = TRUE
          AND e.os_family = 'Android', 1, 0)) AS n_mobile_android,
   SUM(IF(e.is_api = FALSE
          AND e.is_mobile = TRUE
          AND e.os_family = 'iOS', 1, 0)) AS n_mobile_ios,
   SUM(IF(e.is_api = TRUE
          AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_app_other,
   SUM(IF(e.is_api = TRUE
          AND e.os_family = 'Android', 1, 0)) AS n_app_ios,
   SUM(IF(e.is_api = TRUE
          AND e.os_family = 'iOS', 1, 0)) AS n_app_android
FROM wmf.mediawiki_user_history u
JOIN (SELECT event.userId AS user_id,
             event.displayMobile AS is_mobile,
             event.isApi AS is_api,
             useragent.os_family AS os_family
      FROM event.serversideaccountcreation
      WHERE year >= 2018
      AND wiki = 'cswiki') AS e
ON u.user_id = e.user_id
WHERE u.wiki_db = 'cswiki'
AND u.snapshot = '2018-08'
AND u.caused_by_event_type = 'create'
AND TO_DATE(u.user_registration_timestamp) >= '2018-02-01'
AND TO_DATE(u.user_registration_timestamp) < '2018-08-01'
AND u.created_by_self = TRUE
AND u.created_by_system = FALSE
AND u.created_by_peer = FALSE
AND u.is_bot_by_name = FALSE
GROUP BY SUBSTRING(u.user_registration_timestamp, 1, 7);

Proportion of desktop, mobile, and apps[edit]

Let's first look at the top-level split, between the different sites/methods: apps, desktop, and mobile. First, I use the whole dataset and calculate the overall proportion of registrations for each. This results in six bar charts, seen below.

What we can see is that for five out of the six wikis, the majority of registrations come through the desktop site. The Arabic Wikipedia is the exception, where 57.5% of the registrations are on the mobile site, and about one third of the registrations are on desktop. The proportion of mobile registrations on the five others appears to lie in the 20–30% range.

Next, I want to know if these proportions have changed over time. We therefore calculate the same proportions on a monthly basis for the six months in our datasets. The time series graphs are shown below.

The general trend in all these plots is that the proportion of registrations through the mobile site is stable, and there's a reduction in desktop registrations in favour of registrations through the mobile app. This means that our previous plots of the overall proportions is slightly overestimating the desktop registrations, while underestimating the app registrations, compared to the more recent months.

Proportion of Android and Apple[edit]

Our datasets also allow us to split each site by whether the registration happened on an Android, iOS, or other type of device. Because the overarching question is registrations on apps/mobile/desktop, I decided to focus this on splitting each of those sites, rather than answer the question "what's the overall proportion of Android/Apple/other?"

Splitting up the dataset and calculating proportions for each site results in the six bar charts seen below.

In these graphs, one overall trend is that iOS dominates the apps registrations. German Wikipedia is an exception, where there is a large number of registrations falling into the "other" category. We use the isApi flag to categorize account creations into the "apps" category, and while that might be correct in the vast majority cases, account creation through the API is not restricted to the apps. In this case it did not seem worthwhile to spend time on developing a more fine-tuned set of heuristics in order to filter out non-app account creations, which means that we end up with cases like this.

Registrations on the desktop site is typically "other", with Arabic and Korean showing a fair proportion of registrations from Android and iOS devices. Lastly, the trend of Android vs iOS is switched for the mobile site, typically 80/20 in Android's favor. The German Wikipedia is the exception here where it is more even.