Product Analytics/Onboarding

Welcome to the Product Analytics team in Wikimedia Product! This document is meant to get you up and running as quickly as possible. It covers a slew of topics, including an overview of various stores of data and how to access them securely. If you encounter something that isn't clear, or notice that something major from your onboarding is missing from this document, please don't hesitate to ask for clarification, suggest a change, or edit this page so we can keep improving it for future folks joining the team.

Access and security
To begin your onboarding journey and request access, you will need to create accounts on the following websites:


 * MediaWiki is home to the documentation for MediaWiki software and for many of the teams at the Foundation, including ours. This account can be used on other Wikimedia projects like Meta wiki (which is where some teams document their work and processes), although we recommend using a separate, personal account for volunteer work like editing Wikipedia.
 * Wikitech is the home of documentation related to the technical projects and infrastructure maintained by the Foundation. This includes production systems, the Wikimedia Cloud Services computing infrastructure (virtual private servers (VPS)), the Toolforge hosting environment (platform as a service (PaaS)), and many other technical projects, such as the Beta cluster used to validate new software before deployment to the production cluster. This is your LDAP account.
 * Phabricator is a collaboration platform open to all of Wikimedia. It is used for managing work related to software projects and is where work items will be tracked and assigned to you. You can create an account using the MediaWiki account created above. Once you have a Phabricator account, feel free to learn how to use Phabricator and to visit the Product-Analytics group and our workboard. Some Phabricator tasks have restricted visibility because of confidential data, such as IP addresses, in the task description, comments, or query results. To get access to restricted tasks, open a Phabricator ticket requesting to be added to the WMF-NDA project, and add WMF-NDA-Requests as a subscriber, as shown in this sample ticket.

Once this is complete, let's look at some of the technical tasks you will have to complete for onboarding.

SSH keys, stat machines, notebooks, Hue, DataGrip, groups
You'll need to generate two SSH key pairs – specifically ED25519 (or 4096-bit RSA); refer to these instructions for generating those types of keys. One pair is for production use (accessing our analytics machines and the Jupyter notebook service "SWAP") and one is for Wikimedia Cloud Services use (if you need to create remote VMs). See this page for more info. You'll need to file a Phabricator task (also called a ticket) tagged SRE-Access-Requests, in which you will:


 * provide the public key of the production pair (e.g. wmf_rsa.pub if wmf_rsa is the private key), and
 * request to be added to  &   groups.
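As a sketch, the two key pairs can be generated non-interactively like this (4096-bit RSA here; ED25519 also works). The file names match the SSH config examples later on this page; in practice, omit `-N ""` so that you are prompted to set a real passphrase:

```shell
# Generate one key pair for production access and one for Cloud VPS.
# -N "" sets an empty passphrase for this non-interactive sketch only;
# for real keys, drop it and enter a passphrase when prompted.
mkdir -p ~/.ssh
ssh-keygen -t rsa -b 4096 -f ~/.ssh/wmf_rsa  -C "wmf production" -N ""
ssh-keygen -t rsa -b 4096 -f ~/.ssh/labs_rsa -C "wmf cloud vps"  -N ""
# Paste the contents of the .pub file (never the private key!) into the ticket:
cat ~/.ssh/wmf_rsa.pub
```

Only the `.pub` files ever leave your machine; the private keys stay in `~/.ssh`.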

If you are a Foundation full-time employee, specify that you need to be added to the  group.

If you are a contractor, specify that you need to be added to the  group. In addition to the NDA you signed for Legal, you'll also need to digitally sign the Acknowledgement of Wikimedia Server Access Responsibilities document on that ticket. Your manager needs to comment on the Phabricator task, stating that they are your manager and that they approve the request.

Here's additional information on this, along with a sample Phabricator ticket provided by SRE.

Disclaimer: SRE-Access-Requests has no service-level agreement (SLA) for processing access request tickets submitted by employees or contractors. In practice this means access requests can take longer than anticipated, especially if another team introduces a delay (such as a delay in getting the signed NDA from Legal).

Once you are added to the requested groups, you should be able to log in to Hue (a web UI for exploring/querying our Data Lake) using your LDAP credentials (the username & password you registered on Wikitech, aka your Wikimedia Developer account, which is separate from your MediaWiki account). Be sure to link your LDAP account and your MediaWiki account; this can be done in the settings area on Wikitech, where you'll find a section for configuring external accounts. If you are not able to log in to Hue, open a Phabricator ticket tagged SRE-Access-Requests, Analytics, and Operations, and request that an admin from the Analytics Engineering team sync your LDAP (Wikitech) account with your Hue account. You can also send a message in the #wikimedia-analytics IRC channel to get help from an admin.

Hue is nice when you're prototyping a query; you can then switch to running the query with the hive CLI, R (via the wmf package, maintained by Mikhail), or Python (via the wmfdata package, maintained by Neil), either while SSH'd into stat1006/stat1007 or in a Jupyter notebook on the Simple Wikimedia Analytics Platform (SWAP). You should also be able to use the same LDAP login to access Turnilo and Superset, tools for visualizing and dashboarding the data stored in the Druid data store.

DataGrip by JetBrains is another tool that can be used to access and query the databases. It is easy to use, quick to install and set up, and a great alternative to Hue. Please check out this page for further instructions on how to install and access DataGrip.

Yubikeys and VPN
See this Office wiki page regarding VPN if you need a secure connection on a public Wi-Fi hotspot. Office IT will issue you a Yubikey for connecting to the VPN, which you can also use for PGP encryption/decryption. (Contractors don't have access to Office wiki, so email Office IT if you require VPN.) Note: Office IT may be moving away from VPN. Check with them to find out whether they are issuing VPNs or recommending that new staff/contractors subscribe to a service like Private Internet Access separately (which can be reimbursed).

If you turn on two-factor authentication ("2FA") on your staff Google account (which you should also do on your personal Google account if you haven't already), you can also use that same Yubikey to pass the 2nd step.

Miscellaneous
It's recommended that you:


 * learn enough wikitext (if you don't know some already) to edit/create pages and link across wikis


 * check out Product Analytics' on-wiki data analyses and reports

For WMF staff:


 * put some info in your User page on Meta wiki (just that one place and then it's used across all wikis); see the source for User:Neil P. Quinn-WMF for an example
 * this is also where you can list which languages you speak & fluency levels
 * add your contact info to the contact list on Office wiki (if applicable)

For contractors:


 * put some info in your User page on MediaWiki; see the source for User:Neil P. Quinn-WMF for an example

IT Equipment:

For any help with IT equipment or communication access, please feel free to reach out to Office IT: techsupport@wikimedia

Data and Project Backups
If you want to store your projects on GitHub, we have an organization wikimedia-research that Mikhail can add you to. Just let him know your GH username. Here are some examples of repositories: SDoC metrics, movement-level Editing metrics, iOS metrics, SEO sameAs A/B test report, and analysis for Search Platform.

Please be careful with what data you commit and push so you don't accidentally upload PII and other private data. Somewhat related, a lot of WMF code (e.g. MediaWiki software/extensions and Analytics' Refinery) is on Gerrit. See this page for some resources. A few teams use GitHub instead (e.g. both mobile apps teams). It’s recommended that you generate an SSH keypair specifically for Gerrit. When you do, add the public key on the Settings page.

Example config for production
Host *
    ForwardAgent no
    IdentitiesOnly yes

Host bast
    HostName bast1002.wikimedia.org # from https://wikitech.wikimedia.org/wiki/Bastion
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa

Host *.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bast

Host stat7
    HostName stat1007.eqiad.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bast

Notes:
 * For Gerrit, add:

       Host gerrit.wikimedia.org
           HostName gerrit.wikimedia.org
           IdentityFile ~/.ssh/gerrit_rsa

 * Fields to customize:  and
 * Mikhail (bearloga) connects through bast1002 in Virginia since that's the bastion closest to him. If you're based in the Bay Area, use bast4002.
 * The default name for SSH keys is id_rsa, but wmf_rsa, labs_rsa, and gerrit_rsa are easy to remember if you created different pairs (one for production, one for Cloud VPS, and one for Gerrit).
 * The last entry lets you connect to stat1007.eqiad.wmnet with just "ssh stat7" for convenience. You can also make entries for stat1004 and stat1006.

Example config for CloudVPS
From this documentation:

Host *.wmflabs
    User bearloga
    ProxyJump primary.bastion.wmflabs.org:22

Host primary.bastion.wmflabs.org
    User bearloga
    IdentityFile ~/.ssh/labs_rsa

Example config for SWAP
If you add the following lines to your ~/.bash_profile, you'll be able to just type  (or  ) in Terminal to establish an SSH tunnel to notebook1003 (or notebook1004). Then go to  in your favorite browser to start using the Simple Wikimedia Analytics Platform (SWAP).
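The ~/.bash_profile snippet itself did not survive conversion to this page. As a rough sketch of what such aliases might look like — the notebook hostnames are from the text, but the alias names and port 8000 are assumptions, so check the SWAP page on Wikitech for the current instructions:

```shell
# Hypothetical ~/.bash_profile additions -- alias names and port are assumptions.
# Each opens an SSH tunnel so the notebook host's Jupyter UI is reachable locally.
alias swap1003='ssh -N notebook1003.eqiad.wmnet -L 8000:127.0.0.1:8000'
alias swap1004='ssh -N notebook1004.eqiad.wmnet -L 8000:127.0.0.1:8000'
```

After sourcing the file, typing the alias keeps a tunnel open, and (assuming the port above) http://localhost:8000 in your browser reaches the notebook.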

Example config for GitHub
To allow the notebook machines to connect to an external site over HTTP, use git clone. Open a terminal on each notebook and create a .gitconfig file containing the following:

[url "https://[username]@github.com"]
    insteadOf = https://github.com
[credential]
    helper = cache --timeout=28800
[user]
    email = [Wikimedia email id]
    name = [GitHub username]

Replace [username] and [GitHub username] with your own GitHub username, and [Wikimedia email id] with the email address that was used to register with GitHub.
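Equivalently — as a sketch with placeholder values, not part of the original instructions — the same settings can be written with git config commands rather than editing the file by hand:

```shell
# Write the same settings via `git config` instead of editing ~/.gitconfig by hand.
# "yourname" and the email address are placeholders -- substitute your own.
git config --file "$HOME/.gitconfig" url."https://yourname@github.com".insteadOf "https://github.com"
git config --file "$HOME/.gitconfig" credential.helper "cache --timeout=28800"
git config --file "$HOME/.gitconfig" user.name "yourname"
git config --file "$HOME/.gitconfig" user.email "yourname@example.org"
```

`--file "$HOME/.gitconfig"` writes to the same place as `--global`; it is spelled out here to make the target file explicit.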

Bash configuration
On stat100X, make a ~/.bash_profile with the following: Then make a ~/.bashrc with the following: This will enable you to access external resources (e.g. if you’re installing R packages from CRAN) from internal machines through the HTTP proxy.
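The ~/.bash_profile and ~/.bashrc snippets were likewise lost in conversion. The usual shape of such proxy settings is sketched below; the webproxy hostname and port are assumptions, so confirm them on the relevant Wikitech page before using:

```shell
# Hypothetical proxy settings for ~/.bash_profile / ~/.bashrc on stat100X hosts.
# The webproxy host and port are assumptions -- verify on Wikitech first.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
# Keep local and internal traffic off the proxy:
export no_proxy=127.0.0.1,localhost,.wmnet
```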

One caveat: if you need to access internal HTTP resources (e.g. accessing Druid through curl), you will need to  first.

Data Sources
Most of the data we work with is in Hadoop, maintained by Analytics Engineering (AE).

Databases in Hive
HiveQL Language Manual (especially the UDFs page) will be your new BFF (if it's not already). If you need faster querying, you can use Spark. The following are the major DBs/tables:

wmf

 *   &   (for Page Previews)
   * traffic (as page view counts) for individual pages, across our ~1000 wikis
   * includes:
     * parsed  strings (e.g. browser info, OS info)
     * IP-based geolocation (via our licensed MaxMind DB) down to the city level
     * "none" for direct traffic
     * "internal" for when a user visited one of our pages from another one of our pages
     * "external" for when they came to us from another website
     * "external (search engine)" for when they came to us from Google, Bing, etc.
   * for more tables related to traffic data in the Data Lake, refer to this page
 * monthly snapshots, usually available within a week of the new month
   * contains all kinds of refined editing activity, including:
     * indicators of when a revision has been reverted
     * revision change tags
     * an indicator of when a user account was created by the system (e.g. when you visit a wiki while logged in on another wiki, an account specific to that wiki is auto-created for you; there is no such thing as a "global ID number")
     * historical usernames, page titles, and namespaces
     * an indicator of whether the revision was deleted as part of the page getting deleted
   * for more details on edit data in the Data Lake, refer to this page
 * contains refined HTTP logs, refined using this codebase & this pipeline
   * includes: parsed UA, geolocation, IP addresses, ISP data, detailed referrer data, HTTP & cache status codes, content type (e.g. media files), full URI
   * for more details on the columns, refer to this page

wmf_raw

 *   is users’ searches, using the built-in search on Wikimedia projects. NOTE: this dataset has been deprecated; use `event.cirrussearch_request` in Hive.

event
EventLogging (EL) is a platform for modelling, logging, and processing analytic data. EL schemas are (currently) specified on Meta wiki in the Schema namespace, then engineers instrument the client-side implementation to send events to the EL endpoint. Those events contain fields defined in the schema, as well as metadata like User-Agent.

EL data is processed and stored in the event database, with a table for each active schema. These tables are partitioned by year, month, day, and hour. The User-Agent strings are parsed into structs, so individual fields (e.g. browser family or OS family) can be queried directly. The event column is also a struct and contains the fields specified by the schema.
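Because event tables are partitioned by time, queries should always filter on the partition columns so Hive only scans the relevant slice of data. A sketch of that pattern — the table name some_schema and the exact struct field names are hypothetical:

```shell
# Build (and print) a partition-pruned HiveQL query against a hypothetical
# event table; the year/month/day filters limit the scan to a single day.
QUERY="
SELECT useragent.browser_family, COUNT(*) AS events
FROM event.some_schema            -- hypothetical table name
WHERE year = 2019 AND month = 6 AND day = 1
GROUP BY useragent.browser_family
"
printf '%s\n' "$QUERY"
```

In practice you would pass such a query to the hive CLI or run it in Hue; printing it here only shows the shape of the partition filter.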

There is also event_sanitized, which contains sanitized EL data older than 90 days per our Data Retention Guidelines and Privacy Policy. Refer to this documentation for more details, including how to whitelist schemas and fields for retention (and hashing where appropriate).

MediaWiki Replicas
While AE’s plan is to eventually have all MediaWiki (MW) edit data available in the Data Lake, the road there is dark and full of terrors. We only recently (starting with the April 2019 snapshot) started to have change tags for revisions, for example. So in the meantime, if there is data on editing/editors you need that's not in the Data Lake, or you can't wait for the next snapshot to become available, you can connect to the MW replicas to query the latest edit data.

The database layout is described at mw:Manual:Database layout.

Legacy MariaDB EventLogging
You probably won’t need to query the MariaDB EL database, as all EL data now flows into Hadoop/Hive (see the event section above) and only whitelisted schemas also appear in MySQL. For more information, see this documentation.

However, there is still the use-case of beta testing client-side analytics. AE maintains a WMCS-hosted VM for testing, which is documented here.

Mailing lists
There are two kinds of mailing lists: Google Groups and lists.wikimedia.org lists. Some lists which may be relevant to your work include:


 * Analytics
 * Engineering
 * Research-Internal
 * Wiki-research-l

For more, refer to mailing lists page on Office wiki.

If you are a contractor, you may not have access to these mailing lists; however, you can ask your manager to add you to the product-analytics@wikimedia email group. For collaboration, you can share your calendar with specific team members using the instructions on this page, and request that team members share their calendars with you. To book meeting rooms in the office, please reach out to techsupport@wikimedia for help.

Internet Relay Chat (IRC)
If you are a full-time employee or contractor, you can request an IRCCloud account from Office IT. Otherwise, you can use an IRC client such as LimeChat (macOS) or Pidgin (Linux). If you are having trouble setting up IRC, please email Office IT at: techsupport@wikimedia

The following channels may be relevant to your work:


 * #wikimedia-analytics
 * #wikimedia-cloud
 * #wikimedia-dev
 * #wikimedia-office
 * #wikimedia-operations
 * #wikimedia-research
 * #wikimedia-sre
 * #wikimedia-staff

For more information, refer to IRC page on Office wiki.