Product Analytics/Onboarding

Welcome to Wikimedia Product! This document is meant to get you up and running as quickly as possible. It covers a slew of topics, including an overview of various stores of data and how to access them securely. If you encounter something not clear or if you notice something major in your onboarding that is missing from this document, please don’t hesitate to ask for clarification, suggest a change, or edit this page so we can continue to improve this for future folks joining the team.

Access and security
You'll need to generate two SSH key-pairs – ED25519 (or 4096-bit RSA) specifically; refer to these instructions for generating those types of keys – one for production use (accessing our analytics machines and Jupyter notebook service "SWAP") and one for Wikimedia Cloud Services use (if you need to create remote VMs). See this page for more info. You'll need to file a Phabricator ticket tagged as SRE-Access-Requests where you will:


 * provide the public key of the production pair (e.g.  if wmf_rsa is the private key), and
 * request to be added to  &   groups.

If you are a Foundation full-time employee, specify that you need to be added to the  group. If you are a contractor, specify that you need to be added to the  group.

In addition to the NDA you signed for Legal, you'll also need to digitally sign Acknowledgement of Wikimedia Server Access Responsibilities document for that ticket. Your manager needs to comment on that Phabricator task, stating that they are your manager and that they approve this request.

Once you are added to  and  /  groups, you should be able to login to Hue (a web UI for exploring/querying our Data Lake) using your LDAP info (the username & password you registered on Wikitech aka your Wikimedia Developer Account, which is separate from your MediaWiki account).

Hue is nice when you’re prototyping a query and then you can switch to running the query with hive CLI, R (via wmf package maintained by Mikhail) or Python (via wmfdata package maintained by Neil), either while SSH'd to stat1006/stat1007 or in a Jupyter notebook on Simple Wikimedia Analytics Platform (SWAP). You should also be able to use the same LDAP login info to access Turnilo and Superset, which are tools to visualize & dashboard the data stored in Druid data store.

Yubikeys and VPN
See this Office wiki page regarding VPN, if you need to have a secured connection on a public Wi-Fi spot. Office IT will issue you a Yubikey for connecting to the VPN, which you can also use for PGP encryption/decryption. (Contractors don’t have access to Office wiki, so email Office IT if you require VPN.) If you need to use a VPN when using public WiFi, you would need to subscribe to a service like Private Internet Access separately.

If you turn on two-factor authentication ("2FA") on your staff Google account (which you should also do on your personal Google account if you haven't already), you can also use that same Yubikey to pass the 2nd step.

= Miscellaneous = It's recommended that you:


 * put some info in your User page on Meta wiki (just that one place and then it's used across all wikis); see the source for User:Neil P. Quinn-WMF for example
 * this is also where you can list which languages you speak & fluency levels
 * add your contact info to the contact list on Office wiki (if applicable)
 * learn enough wikitext (if you don't know some already) to edit/create pages and link across wikis
 * check out Product Analytics' on-wiki data analyses and reports

If you want to store your projects on GitHub, we have an organization wikimedia-research that Mikhail can add you to. Just let him know your GH username. Here are some examples of repositories: SDoC metrics, movement-level Editing metrics, iOS metrics, SEO sameAs A/B test report, and analysis for Search Platform.

Please be careful with what data you commit and push so you don't accidentally upload PII and other private data. Somewhat related, a lot of WMF code (e.g. MediaWiki software/extensions and Analytics' Refinery) is on Gerrit. See this page for some resources. A few teams use GitHub instead (e.g. both mobile apps teams). It’s recommended that you generate an SSH keypair specifically for Gerrit. When you do, add the public key on the Settings page.

Example config for production
Notes:


 * Fields to customize:  and
 * Mikhail (bearloga) connects through bast1002 in Virginia since that’s the bastion that's closest one to him. If you’re based in the Bay Area, use bast4002.
 * The default name for SSH keys is id_rsa, but wmf_rsa, labs_rsa, and gerrit_rsa are easy to remember if you created different pairs (one for production, one for CloudVPS, and one for Gerrit)
 * Last entry lets you connect to stat1007.eqiad.wmnet via just “ssh stat7” for convenience. You can also make ones for stat1004 and stat1006.

Example config for CloudVPS
From this documentation:

Example config for SWAP
If you add the following lines to your ~/.bash_profile: Then you’ll be able to just type  (or  ) in Terminal to establish an SSH tunnel to notebook1003 (or notebook1004). Then just go to  in your favorite browser to start using Simple Wikimedia Analytics Platform (SWAP).

Bash configuration
On stat100X, make a ~/.bash_profile with the following: Then make a ~/.bashrc with the following: This will enable you to access external resources (e.g. if you’re installing R packages from CRAN) from internal machines through the HTTP proxy.

Caveat is that if you need to access internal HTTP resources (e.g. accessing Druid through curl), you will need to  first.

Data Sources
Most of the data we work with is in Hadoop, maintained by Analytics Engineering (AE).

MediaWiki Replicas
While AE’s plan is to eventually have all of MediaWiki (MW) edit data available in the Data Lake, the road there is dark and full of terrors. We only recently (starting with the April 2019 snapshot) started to have change tags for revisions, for example, so in the meantime if there is data on editing/editors you need and it’s not in the Data Lake or you can’t wait for the next snapshot to become available, you can connect to the MW replicas to query the latest edit data.

The database layout is described mw:Manual:Database layout.

Legacy MariaDB EventLogging
You probably won’t need to query the MariaDB EL database as all of EL data now flows into Hadoop/Hive (see event section above) and only whitelisted schemas also appear in MySQL. For more information, see this documentation. However, there is still the use-case of beta testing client-side analytics. AE maintains a WMCS-hosted VM for testing, which is documented here.

Mailing lists
There are a few types of mailing lists: Google Groups and lists.wikimedia.org. Some lists which may be relevant to your work include:


 * Analytics
 * Engineering
 * Research-Internal
 * Wiki-research-l

For more, refer to mailing lists page on Office wiki.

Internet Relay Chat (IRC)
If you are a full-time employee, you can request Office IT to give you an IRCCloud account. Otherwise, you can use LimeChat (for macOS) or Pidgin (for Linux) clients for IRC. The following channels may be relevant to your work:


 * #wikimedia-analytics
 * #wikimedia-cloud
 * #wikimedia-dev
 * #wikimedia-office
 * #wikimedia-operations
 * #wikimedia-research
 * #wikimedia-sre
 * #wikimedia-staff

For more information, refer to IRC page on Office wiki.