Product Analytics/Style guide

To speed up code exchange and review, the Product Analytics team follows these style guides. We encourage others involved in the Wikimedia analytics ecosystem to follow them as well, or to suggest other style guides we should converge on.

General rules

These rules apply to all code, no matter the language:

  • Indent using 4 spaces
    • This is also used by Analytics Engineering (Python, Scala, Java, JavaScript, SQL), Search Platform (Java), and Site Reliability Engineering (Puppet).

SQL

We follow Kickstarter's SQL style guide, except that we indent with 4 spaces instead of 2.

Key points:

  • Write SQL keywords and function names in all caps (e.g. SELECT, CONCAT(), AND).
  • Variable names are lowercase, with words separated by underscores (e.g. variable_name, not variableName).
  • AND goes at the beginning of a line, not at the end.
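
As a rough illustration of the key points above, the sketch below embeds a query formatted this way in a Python script and runs it against an in-memory SQLite database. The edits table, its columns, and the sample rows are hypothetical and exist only to make the example runnable; they are not a real Wikimedia dataset.

```python
import sqlite3

# Keywords and functions in caps, snake_case names, 4-space indents,
# and AND at the start of the line. Table and columns are hypothetical.
QUERY = """
SELECT
    wiki_db,
    COUNT(DISTINCT user_id) AS active_editors
FROM edits
WHERE edit_count >= 5
    AND is_bot = 0
GROUP BY wiki_db
ORDER BY wiki_db
"""


def run_example():
    """Run QUERY against a small in-memory table and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edits ("
        "wiki_db TEXT, user_id INTEGER, edit_count INTEGER, is_bot INTEGER)"
    )
    conn.executemany(
        "INSERT INTO edits VALUES (?, ?, ?, ?)",
        [
            ("enwiki", 1, 10, 0),
            ("enwiki", 2, 3, 0),   # too few edits, filtered out
            ("enwiki", 3, 50, 1),  # bot, filtered out
            ("dewiki", 4, 7, 0),
            ("dewiki", 5, 8, 0),
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for wiki_db, active_editors in run_example():
        print(wiki_db, active_editors)
```

Putting AND at the start of the line keeps each filter condition on its own line, so adding or removing a condition touches only one line in a diff.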

Python

We follow MediaWiki's Python coding conventions, which for the most part defer to PEP 8.
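
For readers unfamiliar with PEP 8, here is a minimal, non-exhaustive sketch of a few of its naming and layout conventions. All names in it are hypothetical, and where MediaWiki's conventions differ from PEP 8, they take precedence.

```python
"""Illustrative module; every name here is hypothetical."""

MAX_RATE = 1.0  # module-level constants use UPPER_SNAKE_CASE


def bot_editor_rate(bot_editors, editors):
    """Return the share of editors that are bots (None if no editors)."""
    if editors == 0:
        return None
    # Functions and variables use snake_case; blocks indent by 4 spaces.
    return min(bot_editors / editors, MAX_RATE)


class EditorCounts:
    """Class names use CapWords; methods and attributes use snake_case."""

    def __init__(self, editors, bot_editors=0):
        self.editors = editors
        self.bot_editors = bot_editors

    def rate(self):
        """Return the bot editor rate for these counts."""
        return bot_editor_rate(self.bot_editors, self.editors)
```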

R

We follow the tidyverse style guide.

Key points:

  • Use lowercase letters and snake case (underscores for separation of words) for naming functions and variables.
  • %>% should always have a space before it, and should usually be followed by a new line. After the first step, each line should be indented (the tidyverse guide uses two spaces; per the general rules above, we use four).
  • If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing ).

Further recommendations:

  • Be liberal with code comments
    • If attaching a package namespace – e.g. library(purrr); x <- map(…) instead of x <- purrr::map(…) – add a note (a reminder) explaining why the package is being used and/or which of its functions you're using.
    • If it's not immediately clear what something is, don't be afraid to clarify it in code.

For example:

library(wmfdata) # for query_hive()
library(glue)    # for interpolating variables into the query string
library(dplyr)   # for data wrangling (mutate() and %>%)
library(brms)    # for Bayesian regression modeling (brm(), bf())

query <- "
WITH bot_editors AS (
    SELECT
        country_code,
        COUNT(DISTINCT CONCAT(wiki_db, user_fingerprint_or_id)) AS bot_editors
    FROM wmf.editors_daily
    WHERE month = '${snapshot}'
        AND size(user_is_bot_by) != 0
        AND country_code != '--'
    GROUP BY country_code
), all_editors AS (
    SELECT
        country_code,
        COUNT(DISTINCT CONCAT(wiki_db, user_fingerprint_or_id)) AS editors
    FROM wmf.editors_daily
    WHERE month = '${snapshot}'
        AND country_code != '--'
    GROUP BY country_code
)
SELECT
    economic_region,
    all_editors.country_code,
    editors,
    bot_editors
FROM all_editors
LEFT JOIN bot_editors
    ON all_editors.country_code = bot_editors.country_code
LEFT JOIN canonical_data.countries AS countries
    ON all_editors.country_code = countries.iso_code;
"

snapshot <- "2020-11"
result <- query_hive(glue(query, .open = "${"))

bot_editor_rates <- result %>%
    mutate(
        bot_editors = as.integer(ifelse(bot_editors == "NULL", 0, bot_editors)),
        bot_editor_rate = bot_editors / editors
    )

zib_model <- brm(
    bf(
        bot_editor_rate ~ economic_region,
        zi ~ economic_region, # zero inflation
        phi ~ economic_region # precision
    ),
    family = zero_inflated_beta(),
    data = bot_editor_rates
)