Product Analytics/Onboarding

Welcome to the Product Analytics team in Wikimedia Product! This document is meant to get you up and running as quickly as possible. It covers a slew of topics, including an overview of various stores of data and how to access them securely. If you encounter something not clear or if you notice something major in your onboarding that is missing from this document, please don’t hesitate to ask for clarification, suggest a change, or edit this page so we can continue to improve this for future folks joining the team.

Phabricator Checklist[edit]

This is a template to copy/paste into a Phabricator ticket for onboarding purposes.

Verify:
[] Access to Office Wiki and Meta (from OIT) (WMF full-time staff only)
[] Added to Google groups:
 [] [[ https://groups.google.com/a/wikimedia.org/forum/?hl=en#!forum/product-analytics | Product Analytics Team ]] (team mailing list)
 [] [[ https://groups.google.com/a/wikimedia.org/g/product-analytics-announce | Product Analytics Announcements ]] (for our team to send updates to stakeholders)
 [] [[ https://groups.google.com/a/wikimedia.org/g/product-all | Product ]] (Product Department mailing list)
 [] [[ https://groups.google.com/a/wikimedia.org/g/foundation-official | Foundation Official ]] (official announcements from the Foundation)
 [] [[ https://groups.google.com/a/wikimedia.org/g/foundation-optional | Foundation Optional ]] (Foundation-wide mailing list)
 [x] [[ https://groups.google.com/a/wikimedia.org/g/engineering-all | Engineering All ]] (recommended, since our work overlaps with engineering; for engineers across the Foundation; automatic since product-analytics@ is a member)
[] Invited to team meetings:
 [] Key meetings for team(s) you support
 [] Product Analytics (planning, board refinement, retro, sharing)
 [] Product Analytics & Data Engineering (sync)
 [] Research Group Lab Meeting (optional)

Onboarding Checklist:
[] Create MediaWiki account
[] Create Wikitech account
[] GitHub
 [] Create account (if you don't already have one)
 [] Ask someone on the team to add you to /wikimedia-research and /wikimedia orgs
[] [Create Phabricator account](https://www.mediawiki.org/wiki/Phabricator/Help#Creating_your_account)
 [ ] Ask someone on the team to add you to the Product Analytics team on Phabricator.
[] [Request Phabricator permissions](https://phabricator.wikimedia.org/project/profile/61/) to see WMF/NDA restricted tickets
[] Request [[ https://wikitech.wikimedia.org/wiki/Production_access | Production Access ]] (see our [[ https://www.mediawiki.org/wiki/Product_Analytics/Onboarding#SSH_Keys,_Stat_machines,_Notebooks,_HUE,_Datagrip,_Groups | Onboarding Notes ]] for groups and machines needed)
[] Create your Meta user page ([[ https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF | here's an example! ]]) (WMF full-time staff only)
[] Add yourself to the [[ https://office.wikimedia.org/wiki/Contact_list | Staff contact list ]] (WMF full-time staff only)
[] Join Slack and our public team channel: working-with-data
[] Ask your manager to be invited to our private team channel
[] Ask your manager to be added to the @product-analytics ping group

Logins for:
[] Namely
[] Fidelity
[] Navia
[] Betterworks

Update:
[] [[ https://www.mediawiki.org/wiki/Product_Analytics/Onboarding | Onboarding Notes ]] if you find something that is incorrect or out of date

Access and security[edit]

To begin your Onboarding journey and for requesting access you will need to create accounts on the following websites:

MediaWiki: is home to the documentation for MediaWiki software and many of the teams at the Foundation, including ours. This account can be used on other Wikimedia projects like Meta wiki (which is where some teams document their work and processes), although we recommend you use a separate, personal account for volunteer work like editing Wikipedia.
Wikitech: is the home of documentation related to the technical projects and infrastructure maintained by the Foundation. This includes production systems, the Wikimedia Cloud Services computing infrastructure (virtual private server (VPS)), the Toolforge hosting environment (platform as a service (PaaS)), and many other technical projects such as the Beta cluster which is used to validate new software before deployment to the production cluster. This is your LDAP account.
Phabricator: is a collaboration platform open to all of Wikimedia. It is used for managing work related to software projects and is going to be the application using which work items will be tracked and assigned to you. You can create an account using your MediaWiki account created as a part of Step 1. Once you have a Phabricator account feel free to Learn how to use Phabricator and to visit the Product-Analytics group and our Workboard. Some Phabricator tasks have a restricted view due to the presence of confidential data such as IP addresses in the task description, comments or query results. In order to get access to Restricted tasks please follow instructions to open a Phabricator ticket to get added to the WMF-NDA project.
GitHub: is where a lot of Wikimedia's open source repositories are hosted and version controlled (under Wikimedia organization). Not all of the repositories you'll come across are maintained directly on GitHub – some are read-only versions, mirrored from Gerrit (WMF's current code review platform, eventually to be replaced with GitLab). Our team belongs to the wikimedia-research organization, where we upload our analyses and other projects. Once you have an account, contact Mikhail or another admin to add you as a member of the Product Analytics team and take a look at Git Handbook and/or Understanding the GitHub flow. Depending on the exact work you will be doing, you may need to also acquaint yourself with Gerrit through this tutorial because that is how we make and review changes to our event schemas repository.

Once this is complete, let's look at some of the technical tasks you will have to complete for the Onboarding...

SSH Keys, Stat machines, Notebooks, HUE, Groups[edit]

You'll need to generate two SSH key-pairs – ED25519 (or 4096-bit RSA) specifically; refer to these instructions for generating those types of keys – one for production use (accessing our analytics machines and our hosted version of JupyterHub) and one for Wikimedia Cloud Services use (if you need to create remote VMs). You'll need to file a Phabricator task tagged as SRE-Access-Requests where you will:

provide your shell name (from creating a developer account on Wikitech, see Product Analytics/Superset Access
provide the public key of the production pair (e.g. wmf_rsa.pub if wmf_rsa is the private key), and
request to be added to analytics-privatedata-users & analytics-product-users groups.

If you are a Foundation full-time employee, specify that you need to be added to the wmf group.

If you are a contractor, specify that you need to be added to the nda group. In addition to the NDA you signed for Legal, you'll also need to digitally sign Acknowledgement of Wikimedia Server Access Responsibilities document for that ticket. Your manager needs to comment on that Phabricator task, stating that they are your manager and that they approve this request.

See Production access: Filing the request for more details and a link to a Phabricator form that automatically adds required fields.

Disclaimer: There are no service-level agreements or SLAs defined by the SRE-Access-Requests team for processing access request tickets submitted by employees or contractors. This could mean that in general, access requests could take more time than anticipated and especially if there is a delay by another team (such as a delay in getting the signed NDA from Legal).

Kerberos: You will also need to obtain a password to authenticate with Kerberos, the system which acts as an additional security layer between you and Hadoop. Follow instructions on this page for creating a Phabricator task. Until you are able to authenticate with Kerberos using kinit, you will not be able to use the hive & beeline command line interfaces (CLIs), query Hive databases in Jupyter notebooks, or store/retrieve files in Hadoop Distributed File System (HDFS). As noted on that page, the authentication lasts for 24 hours.

Once you are added to analytics-privatedata-users and wmf/nda groups, you should be able to login to Hue (a web UI for exploring/querying our Data Lake) using your LDAP info (the username & password you registered on Wikitech aka your Wikimedia Developer Account, which is separate from your MediaWiki account). If you are not able to log in to HUE, open a Phabricator ticket tagged to SRE-Access-Requests, Analytics, Operations and request an admin from the Data Engineering team to sync your LDAP (wikitech username) account with your HUE account. You can also send a message on the #wikimedia-analytics node of IRC to get help from an admin member for this.

Hue is nice when you’re prototyping a query and then you can switch to running the query with hive CLI, R (via wmfdata-r package maintained by Mikhail) or Python (via wmfdata-python package maintained by Neil), either while SSH'd to stat1006/stat1007 or in a Jupyter notebook on Simple Wikimedia Analytics Platform (SWAP). You should also be able to use the same LDAP login info to access Turnilo and Superset, which are tools to visualize & dashboard the data stored in Druid data store.

Yubikeys and VPN[edit]

See this Office wiki page regarding VPN, if you need to have a secured connection on a public Wi-Fi spot. Office IT will issue you a Yubikey for connecting to the VPN, which you can also use for PGP encryption/decryption. (Contractors don’t have access to Office wiki, so email Office IT if you require VPN.) Note: Office IT may be moving away from VPN. Check with them to find out whether they are issuing VPNs or are recommending that new staff/contractors subscribe to a service like Private Internet Access separately (which can be reimbursed).

If you turn on two-factor authentication ("2FA") on your staff Google account (which you should also do on your personal Google account if you haven't already), you can also use that same Yubikey to pass the 2nd step.

Miscellaneous[edit]

It's recommended that you:

learn enough wikitext (if you don't know some already) to edit/create pages and link across wikis

check out Product Analytics' on-wiki data analyses and reports

For WMF staff:

put some info in your User page on Meta wiki (just that one place and then it's used across all wikis); see the source for User:Neil P. Quinn-WMF for an example
- this is also where you can list which languages you speak & fluency levels
add your contact info to the contact list on Office wiki (if applicable)

For contractors:

put some info in your User page on MediaWiki; see the source for User:Neil P. Quinn-WMF for an example

IT Equipment:

For any help with IT equipment or communication access, please feel free to reach out to Office IT: techsupportwikimedia.org

Data and Project Backups[edit]

If you want to store your projects on GitHub, we have an organization wikimedia-research that Mikhail can add you to. Just let him know your GH username. Here are some examples of repositories: SDoC metrics, movement-level Editing metrics, iOS metrics, SEO sameAs A/B test report, and analysis for Search Platform.

Please be careful with what data you commit and push so you don't accidentally upload PII and other private data. If you are publishing country level data, please double check whether it is in the Country Protection List. Somewhat related, a lot of WMF code (e.g. MediaWiki software/extensions and Analytics' Refinery) is on Gerrit. See this page for some resources. A few teams use GitHub instead (e.g. both mobile apps teams). It’s recommended that you generate an SSH keypair specifically for Gerrit. When you do, add the public key on the Settings page.

SSH configuration[edit]

Example config for production[edit]

Host *
    ForwardAgent no
    IdentitiesOnly yes

Host bast
    HostName bast1002.wikimedia.org
    #        ^ from https://wikitech.wikimedia.org/wiki/Bastion
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa

Host *.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bast

Host stat7
    HostName stat1007.eqiad.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bast

# Host gerrit.wikimedia.org
#     Hostname gerrit.wikimedia.org
#     IdentityFile ~/.ssh/gerrit_rsa

Notes:

Fields to customize: User and IdentityFile
Mikhail (bearloga) connects through bast1002 in Virginia since that’s the bastion that's closest one to him. If you’re based in the Bay Area, use bast4005.
The default name for SSH keys is id_rsa, but wmf_rsa, labs_rsa , and gerrit_rsa are easy to remember if you created different pairs (one for production, one for CloudVPS, and one for Gerrit)
Last entry lets you connect to stat1007.eqiad.wmnet via just “ssh stat7” for convenience. You can also make ones for stat1004 and stat1006.

Example config for CloudVPS[edit]

From this documentation:

Host *.wmflabs
    User bearloga
    ProxyJump primary.bastion.wmflabs.org:22

Host primary.bastion.wmflabs.org
    User bearloga
    IdentityFile ~/.ssh/labs_rsa

Example config for SWAP[edit]

If you add the following lines to your ~/.bash_profile:

alias swap-s4="ssh -N stat1004.eqiad.wmnet -L 8000:127.0.0.1:8880"
# ...stat1005, stat1006, stat1007, stat1008 ...
alias swap-s8="ssh -N stat1008.eqiad.wmnet -L 8000:127.0.0.1:8880"

Then you’ll be able to just type swap-sX in Terminal to establish an SSH tunnel to a stat100X host and use Jupyter there by going to http://localhost:8880 in your favorite browser to start using Simple Wikimedia Analytics Platform (SWAP).

Example config for GitHub[edit]

To allow notebook machines to connect to an external site and do so via http, please use git clone. Open each notebook terminal and create .gitconfig file. Type the following into each document:

[url "https://[username]@github.com"]
  insteadOf = https://github.com
[credential]
  helper = cache --timeout=28800
[user]
      email = [Wikimedia email id]
      name = [GitHub username]

*Replace [username] with your own GitHub username and the email id that was used to register to GitHub.

Bash configuration[edit]

On stat100X, make a ~/.bash_profile with the following:

if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi

Then make a ~/.bashrc with the following:

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

This will enable you to access external resources (e.g. if you’re installing R packages from CRAN) from internal machines through the HTTP proxy.

Caveat is that if you need to access internal HTTP resources (e.g. accessing Druid through curl), you will need to unset http_proxy first.

Data Sources[edit]

Most of the data we work with is in Hadoop, maintained by Data Engineering (DE).

Databases in Hive[edit]

HiveQL Language Manual (especially the UDFs page) will be your new BFF (if it’s not already). If you need faster querying you can use Spark or Presto. The following are the major DBs/tables:

wmf[edit]

pageview_hourly & virtualpageview_hourly (for Page Previews)
- traffic (as Page View counts) for individual pages, across our ~1000 wikis
- includes:
  - parsed useragent strings (e.g. browser info, OS info)
  - IP-based geo-location (via our licensed MaxMindDB) down to city-level
  - referer_class
    - “none” for direct traffic
    - “internal” for when user visited one of our pages from another one of our pages
    - “external” (when they came to us from another website)
    - “external (search engine)” when they came to us from Google, Bing, etc.
- for more tables related to traffic data in the Data Lake, refer to this page
mediawiki_history
- monthly snapshots, usually available within a week of the new month
- contains all kinds of refined editing activity, including:
  - indicators when a revision has been reverted
  - revision change tags
  - indicator of when user was created by system (e.g. when you visit a wiki while logged-in on one wiki, an account specific to that wiki is auto-created for you; there is no such thing as a “global ID number”)
  - historical usernames, page titles, namespaces
  - indicator of whether the revision was deleted as part of the page getting deleted
- for more details on edit data in the Data Lake, refer to this page
webrequest
- contains refined HTTP logs, refined using this codebase & this pipeline
- includes: parsed UA, geolocation, IP addresses, ISP data, detailed referrer data, HTTP & cache status codes, content type (e.g. media files), full URI
- for more details on the columns, refer to this page

wmf_raw[edit]

CirrusSearchRequestSet is users’ searches, using the built-in search on Wikimedia projects. NOTE: this dataset has been deprecated; use `event.cirrussearch_request` in Hive.

event[edit]

TODO: change to introduction to Event Platform

EventLogging (EL) is a platform for modelling, logging, and processing analytic data. EL schemas are (currently) specified on Meta wiki in the Schema namespace, then engineers instrument the client-side implementation to send events to the EL endpoint. Those events contain fields defined in the schema, as well as metadata like User-Agent.

EL data is processed and stored in the event database, with a table for each active schema. These tables are partitioned by year, month, day, and hour. The User-Agent strings are parsed into structs, making it possible to query (for example):

WHERE useragent.os_family = 'Windows' AND useragent.browser_family = 'Firefox'

The event column is also a struct and contains the fields specified by the schema.

There is also event_sanitized, which contains sanitized EL data older than 90 days per our Data Retention Guidelines and Privacy Policy. Refer to this documentation for more details, including how to whitelist schemas and fields for retention (and hashing where appropriate).

MediaWiki Replicas[edit]

While DE’s plan is to eventually have all of MediaWiki (MW) edit data available in the Data Lake, the road there is dark and full of terrors. We only recently (starting with the April 2019 snapshot) started to have change tags for revisions, for example, so in the meantime if there is data on editing/editors you need and it’s not in the Data Lake or you can’t wait for the next snapshot to become available, you can connect to the MW replicas to query the latest edit data.

The database layout is described mw:Manual:Database layout.

Communications[edit]

Mailing lists[edit]

There are a few types of mailing lists: Google Groups and lists.wikimedia.org. Some lists which may be relevant to your work include:

For more, refer to mailing lists page on Office wiki.

If you are a contractor, you may not have access to these mailing lists. However, you should ask your manager to be added to the 'product-analytics@wikimedia' email group list. For collaboration, you can share your calendar with specific team members using the instructions on this page and also request team members to share their calendars with you. In order to book meeting rooms in the office, please reach out to techsupport@wikimedia for help with this.