Product Analytics/Onboarding

Welcome to the Product Analytics team in Wikimedia Product! This document is meant to get you up and running as quickly as possible. It covers a slew of topics, including an overview of various stores of data and how to access them securely. If you encounter something not clear or if you notice something major in your onboarding that is missing from this document, please don’t hesitate to ask for clarification, suggest a change, or edit this page so we can continue to improve this for future folks joining the team.

Access and security
You'll need to generate two SSH key-pairs – ED25519 (or 4096-bit RSA) specifically; refer to these instructions for generating those types of keys – one for production use (accessing our analytics machines and Jupyter notebook service "SWAP") and one for Wikimedia Cloud Services use (if you need to create remote VMs). See this page for more info. You'll need to file a Phabricator ticket tagged as SRE-Access-Requests where you will:


 * provide the public key of the production pair (e.g.  if wmf_rsa is the private key), and
 * request to be added to  &   groups.

If you are a Foundation full-time employee, specify that you need to be added to the  group. If you are a contractor, specify that you need to be added to the  group.

In addition to the NDA you signed for Legal, you'll also need to digitally sign Acknowledgement of Wikimedia Server Access Responsibilities document for that ticket. Your manager needs to comment on that Phabricator task, stating that they are your manager and that they approve this request.

Once you are added to  and  /  groups, you should be able to login to Hue (a web UI for exploring/querying our Data Lake) using your LDAP info (the username & password you registered on Wikitech aka your Wikimedia Developer Account, which is separate from your MediaWiki account).

Hue is nice when you’re prototyping a query and then you can switch to running the query with hive CLI, R (via wmf package maintained by Mikhail) or Python (via wmfdata package maintained by Neil), either while SSH'd to stat1006/stat1007 or in a Jupyter notebook on Simple Wikimedia Analytics Platform (SWAP). You should also be able to use the same LDAP login info to access Turnilo and Superset, which are tools to visualize & dashboard the data stored in Druid data store.

Yubikeys and VPN
See this Office wiki page regarding VPN, if you need to have a secured connection on a public Wi-Fi spot. Office IT will issue you a Yubikey for connecting to the VPN, which you can also use for PGP encryption/decryption. (Contractors don’t have access to Office wiki, so email Office IT if you require VPN.) If you need to use a VPN when using public WiFi, you would need to subscribe to a service like Private Internet Access separately.

If you turn on two-factor authentication ("2FA") on your staff Google account (which you should also do on your personal Google account if you haven't already), you can also use that same Yubikey to pass the 2nd step.

Miscellaneous
It's recommended that you:


 * put some info in your User page on Meta wiki (just that one place and then it's used across all wikis); see the source for User:Neil P. Quinn-WMF for example
 * this is also where you can list which languages you speak & fluency levels
 * add your contact info to the contact list on Office wiki (if applicable)
 * learn enough wikitext (if you don't know some already) to edit/create pages and link across wikis
 * check out Product Analytics' on-wiki data analyses and reports

If you want to store your projects on GitHub, we have an organization wikimedia-research that Mikhail can add you to. Just let him know your GH username. Here are some examples of repositories: SDoC metrics, movement-level Editing metrics, iOS metrics, SEO sameAs A/B test report, and analysis for Search Platform.

Please be careful with what data you commit and push so you don't accidentally upload PII and other private data. Somewhat related, a lot of WMF code (e.g. MediaWiki software/extensions and Analytics' Refinery) is on Gerrit. See this page for some resources. A few teams use GitHub instead (e.g. both mobile apps teams). It’s recommended that you generate an SSH keypair specifically for Gerrit. When you do, add the public key on the Settings page.

Example config for production
Host * ForwardAgent no   IdentitiesOnly yes

Host bast HostName bast1002.wikimedia.org #       ^ from https://wikitech.wikimedia.org/wiki/Bastion User bearloga

IdentityFile ~/.ssh/wmf_rsa

Host *.wmnet User bearloga IdentityFile ~/.ssh/wmf_rsa ProxyCommand ssh -a -W %h:%p bast

Host stat7 HostName stat1007.eqiad.wmnet User bearloga IdentityFile ~/.ssh/wmf_rsa ProxyCommand ssh -a -W %h:%p bast

Notes:
 * 1) Host gerrit.wikimedia.org
 * 2)     Hostname gerrit.wikimedia.org
 * 3)     IdentityFile ~/.ssh/gerrit_rsa


 * Fields to customize:  and
 * Mikhail (bearloga) connects through bast1002 in Virginia since that’s the bastion that's closest one to him. If you’re based in the Bay Area, use bast4002.
 * The default name for SSH keys is id_rsa, but wmf_rsa, labs_rsa, and gerrit_rsa are easy to remember if you created different pairs (one for production, one for CloudVPS, and one for Gerrit)
 * Last entry lets you connect to stat1007.eqiad.wmnet via just “ssh stat7” for convenience. You can also make ones for stat1004 and stat1006.

Example config for CloudVPS
From this documentation: Host *.wmflabs User bearloga ProxyJump primary.bastion.wmflabs.org:22

Host primary.bastion.wmflabs.org User bearloga IdentityFile ~/.ssh/labs_rsa

Example config for SWAP
If you add the following lines to your ~/.bash_profile: Then you’ll be able to just type  (or  ) in Terminal to establish an SSH tunnel to notebook1003 (or notebook1004). Then just go to  in your favorite browser to start using Simple Wikimedia Analytics Platform (SWAP).

Bash configuration
On stat100X, make a ~/.bash_profile with the following: Then make a ~/.bashrc with the following: This will enable you to access external resources (e.g. if you’re installing R packages from CRAN) from internal machines through the HTTP proxy.

Caveat is that if you need to access internal HTTP resources (e.g. accessing Druid through curl), you will need to  first.

Data Sources
Most of the data we work with is in Hadoop, maintained by Analytics Engineering (AE).

Databases in Hive
HiveQL Language Manual (especially the UDFs page) will be your new BFF (if it’s not already). If you need faster querying you can use Spark. The following are the major DBs/tables:

wmf

 *   &   (for Page Previews)
 * traffic (as Page View counts) for individual pages, across our ~1000 wikis
 * includes:
 * parsed  strings (e.g. browser info, OS info)
 * IP-based geo-location (via our licensed MaxMindDB) down to city-level
 * “none” for direct traffic
 * “internal” for when user visited one of our pages from another one of our pages
 * “external” (when they came to us from another website)
 * “external (search engine)” when they came to us from Google, Bing, etc.
 * for more tables related to traffic data in the Data Lake, refer to this page
 * monthly snapshots, usually available within a week of the new month
 * contains all kinds of refined editing activity, including:
 * indicators when a revision has been reverted
 * revision change tags
 * indicator of when user was created by system (e.g. when you visit a wiki while logged-in on one wiki, an account specific to that wiki is auto-created for you; there is no such thing as a “global ID number”)
 * historical usernames, page titles, namespaces
 * indicator of whether the revision was deleted as part of the page getting deleted
 * for more details on edit data in the Data Lake, refer to this page
 * contains refined HTTP logs, refined using this codebase & this pipeline
 * includes: parsed UA, geolocation, IP addresses, ISP data, detailed referrer data, HTTP & cache status codes, content type (e.g. media files), full URI
 * for more details on the columns, refer to this page
 * contains refined HTTP logs, refined using this codebase & this pipeline
 * includes: parsed UA, geolocation, IP addresses, ISP data, detailed referrer data, HTTP & cache status codes, content type (e.g. media files), full URI
 * for more details on the columns, refer to this page

wmf_raw

 *   is users’ searches, using the built-in search on Wikimedia projects

event
EventLogging (EL) is a platform for modelling, logging, and processing analytic data. EL schemas are (currently) specified on Meta wiki in the Schema namespace, then engineers instrument the client-side implementation to send events to the EL endpoint. Those events contain fields defined in the schema, as well as metadata like User-Agent.

EL data is processed and stored in the event database, with a table for each active schema. These tables are partitioned by year, month, day, and hour. The User-Agent strings are parsed into structs, making it possible to query (for example): The event column is also a struct and contains the fields specified by the schema.

There is also event_sanitized, which contains sanitized EL data older than 90 days per our Data Retention Guidelines and Privacy Policy. Refer to this documentation for more details, including how to whitelist schemas and fields for retention (and hashing where appropriate).

MediaWiki Replicas
While AE’s plan is to eventually have all of MediaWiki (MW) edit data available in the Data Lake, the road there is dark and full of terrors. We only recently (starting with the April 2019 snapshot) started to have change tags for revisions, for example, so in the meantime if there is data on editing/editors you need and it’s not in the Data Lake or you can’t wait for the next snapshot to become available, you can connect to the MW replicas to query the latest edit data.

The database layout is described mw:Manual:Database layout.

Legacy MariaDB EventLogging
You probably won’t need to query the MariaDB EL database as all of EL data now flows into Hadoop/Hive (see event section above) and only whitelisted schemas also appear in MySQL. For more information, see this documentation.

However, there is still the use-case of beta testing client-side analytics. AE maintains a WMCS-hosted VM for testing, which is documented here.

Mailing lists
There are a few types of mailing lists: Google Groups and lists.wikimedia.org. Some lists which may be relevant to your work include:


 * Analytics
 * Engineering
 * Research-Internal
 * Wiki-research-l

For more, refer to mailing lists page on Office wiki.

Internet Relay Chat (IRC)
If you are a full-time employee, you can request Office IT to give you an IRCCloud account. Otherwise, you can use LimeChat (for macOS) or Pidgin (for Linux) clients for IRC. The following channels may be relevant to your work:


 * #wikimedia-analytics
 * #wikimedia-cloud
 * #wikimedia-dev
 * #wikimedia-office
 * #wikimedia-operations
 * #wikimedia-research
 * #wikimedia-sre
 * #wikimedia-staff

For more information, refer to IRC page on Office wiki.