User:Memeht/Improving the Wikimedia Performance Portal

This is Christy Okpo's project proposal for Wikimedia's Performance portal.

Public URL: https://www.mediawiki.org/w/index.php?title=Improving_the_Wikimedia_Performance_Portal
Phabricator tasks: https://phabricator.wikimedia.org/maniphest/query/Z4x2ldvXN8Vi/#R
Announcement

http://www.gossamer-threads.com/lists/wiki/wikitech/517200search_string=FOSS%20Outreach%20Program%20for%20Women%20;#517200

Name and contact information

Name: Christy Okpo
Email: ecokpo@gmail.com
IRC or IM networks/handle(s): memeht
Location: San Francisco, CA, USA
Typical working hours: between 11 am and 2 am PDT (but generally flexible)

Micro-Tasks

Because the original project specification did not include a suggested micro-task, I submitted a general patch (https://gerrit.wikimedia.org/r/#/c/70587/). Following the directions in the Project Outline, I studied the dashboards already present on GDash (https://gdash.wikimedia.org/) to understand which metrics are already being collected. I also did background research on the Graphite backend to GDash and on the internal MediaWiki C-Scripts that feed data into the Graphite system.

Further discussions with mentors led to research on performance optimization KPIs and Grafana.

To further aid my understanding of key metrics, I wrote two scripts (Bash for batch downloading and Python for text analysis) to pull all messages from the Wikitech mailing list archive from the last two years and analyze them for frequently discussed MediaWiki performance metrics.
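The text-analysis step could look roughly like the following. This is a minimal sketch, not the script actually written for the micro-task: the term list and the sample messages are hypothetical placeholders, and it assumes the message bodies have already been downloaded and loaded as strings.

```python
import re
from collections import Counter

# Hypothetical terms to search for; the real analysis may have used others.
TERMS = ["page load", "latency", "cache hit", "parser time"]

def count_term_mentions(messages, terms=TERMS):
    """Count how many messages mention each term (case-insensitive)."""
    counts = Counter()
    for body in messages:
        text = body.lower()
        for term in terms:
            # Word boundaries avoid matching a term inside a longer word.
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                counts[term] += 1
    return counts

# Made-up sample messages standing in for mailing list posts.
sample = [
    "Page load regressed after the deploy; latency is up.",
    "Latency on the API cluster looks fine today.",
    "Unrelated message about translations.",
]
result = count_term_mentions(sample)
print(result.most_common())
```

Counting messages rather than raw occurrences keeps one long post from dominating the ranking.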

Synopsis

This project will evaluate the performance timing data aggregated from the Wikimedia cluster in order to select and construct several Key Performance Metrics.

First, research (surveys, mailing list text analysis, etc.) will be performed to pinpoint frequently discussed metrics and benchmarks. Then, currently available data will be grouped by relevance into several categories: Speed, Scalability, Availability, and Multi-device delivery. Within each category, datapoints will be analyzed and aggregated into Key Performance Metrics. These metrics will in turn be visualized using the Grafana frontend.
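As an illustration of the grouping and aggregation step, the sketch below assigns raw metrics to the four categories and reduces each to a median and a 95th percentile. The metric names, the category mapping, and the choice of median/p95 as summary statistics are all assumptions made for the example; the proposal leaves the actual metrics to be determined by the research phase.

```python
import math
from statistics import median

# Hypothetical mapping from raw metric name to category.
CATEGORY_OF = {
    "page_load_ms": "Speed",
    "parse_time_ms": "Speed",
    "requests_per_sec": "Scalability",
    "uptime_pct": "Availability",
    "mobile_load_ms": "Multi-device delivery",
}

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(datapoints):
    """Group metrics by category and reduce each to median and p95."""
    summary = {}
    for metric, values in datapoints.items():
        if not values:
            continue  # nothing to aggregate for this metric
        category = CATEGORY_OF.get(metric, "Uncategorized")
        summary.setdefault(category, {})[metric] = {
            "median": median(values),
            "p95": percentile(values, 95),
        }
    return summary

# Made-up sample values for two metrics.
sample = {
    "page_load_ms": [100, 120, 110, 400, 130],
    "mobile_load_ms": [900, 950, 1000],
}
report = summarize(sample)
print(report)
```

Reporting a high percentile alongside the median shows tail latency that an average would hide.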

At the end of the project, a series of charts and visualizations will be available to be ported onto the Wikimedia Performance page <https://performance.wikimedia.org/>

Along the way, glossaries will be created to help readers interpret the data.

This project will develop a dashboard of metrics that will allow users to understand, at a glance, the timing performance of MediaWiki and its ecosystem of services. It will provide a resource for system tuning, quick assessments of production readiness, and troubleshooting sources of performance problems. On a larger scale, it might give insights into ways to improve continued access to Wikimedia services, particularly on mobile devices.

Particular attention will be paid to compliance with the Wikimedia Foundation's Privacy Policy and the overall protection of user data. Wikimedia Foundation Privacy Policy: https://wikimediafoundation.org/wiki/Privacy_policy

Possible mentors: Ori Livneh, Nikolas Everett

Deliverables

A week before the program starts:

Set up machine, install necessary software, get administrative access (when needed).
Get Grafana up and running.
Send out survey questions.

Week 1 through Week 5:

Parse survey responses.
Perform meta-research and evaluation of the dataset to determine important metrics.
Transform one or more datapoints into Key Performance Metrics (KPM) and create data glossaries.
Present findings to mentor around the second week of January.

Week 6 through Week 9 (working in two-week sprints):

Iterate over graphing/visualization techniques for the selected metrics.
Create batch processes to keep metrics updated.
Submit necessary documentation and glossaries.
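A batch process for keeping metrics updated might be sketched as follows. It builds a request URL for Graphite's render endpoint with JSON output and extracts the latest non-null value per series from a response. The host name and metric path are hypothetical, and the sample payload is made up; a real job would fetch the URL on a schedule instead of parsing a literal string.

```python
import json
from urllib.parse import urlencode

def render_url(base_url, target, since="-1h"):
    """Build a Graphite render URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"

def latest_values(payload):
    """Return {target: latest non-null value} from a Graphite JSON response."""
    latest = {}
    for series in json.loads(payload):
        # Each datapoint is a [value, timestamp] pair; value may be null.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:
            latest[series["target"]] = values[-1]
    return latest

# Hypothetical host and metric path, for illustration only.
url = render_url("https://graphite.example.org/", "site.page_load.median")

# Made-up response in the shape Graphite returns for format=json.
sample_payload = json.dumps([
    {
        "target": "site.page_load.median",
        "datapoints": [
            [812.0, 1386000000],
            [None, 1386000060],
            [790.5, 1386000120],
            [None, 1386000180],
        ],
    }
])
current = latest_values(sample_payload)
print(url)
print(current)
```

Skipping null datapoints matters because the most recent interval in a Graphite series is often not yet populated.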

Week 10 through to End:

Work on porting visualizations to the Wikimedia performance portal.
Determine location for documentation and data glossaries, and localization needs, if necessary.

Progress Reports can be found here

Participation

I plan to maintain a weekly status and in-progress update as a User sub-page, including links to any documentation or code created during that period. I plan to check in with my mentor at least every third or fourth day, or more frequently when help is needed. In addition to the mentor, I will use the MediaWiki IRC channel, wikitech-l and related mailing lists, and the currently available MediaWiki documentation.

In terms of work style, I work best when I have the chance to get a big-picture view of a task, go off to do in-depth research, and return to ask specific questions related to the task. In a way, I learn best with a mix of reflective and action-oriented mentoring. I like to know why things are done a certain way.

About you

Education completed or in progress: Dec 2012 : Eastern Michigan University : B.Sc in Economics, minor in Mathematics
How did you hear about this program?: I heard about the program at an SF meetup.
Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?: Apart from December 25 & 26, December 31 & January 1st, I will be able to commit solely to this program.
We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?: As I am not currently enrolled in University, I am not eligible for the Google Summer of Code program.

Other information

I deeply identify with the FOSS ethos. Wikimedia and its services have been instrumental in providing the kind of access that lies at the heart of FOSS.

I use Wikipedia on an hourly basis, and it has proved an immeasurable resource in getting a big-picture overview of whatever subject I am researching.

System performance is an important ingredient in providing such access, and my hope is that by contributing to this project, I can provide a resource for Wikimedia users and developers to understand and troubleshoot any future performance bottlenecks.

Moreover, by curating and analyzing the performance timing datasets for quality and usefulness, I hope to contribute to making them accessible for reuse in other Wikimedia projects.

Past experience

Please describe your experience with any other FOSS projects as a user and as a contributor

I am very new to contributing to FOSS projects, but I have used a myriad of FOSS applications for quite some time, most notably VLC Media Player, Soundflower and Apache OpenOffice. I routinely work on audio-editing side projects, and VLC and Soundflower have been an asset for audio-conversion tasks. I have used OpenOffice, particularly Calc, in my data analysis work.

I have Ubuntu running on several virtual machines, and its ease of use and the resources available for customizing the OS to my needs are always a joy.

These FOSS projects and others are what really sparked my passion for working within the FOSS field.

Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them (include links)

As part of my ongoing contribution to the Code for San Francisco - Code for America Brigade, I have worked on data transformation scripts written in Python. This skill set will be handy for creating batch processes that analyze the performance metrics once the data has been grouped.

I have approximately two years of professional experience as a Data Analyst, so I have a strong foundation in data analytics and experience working with Python's analytics packages (NumPy, Matplotlib) and UNIX scripting.

Some of the relevant data analytics projects I have worked on in the past include analyzing usage patterns on an online sales application using SQL/Excel/Python and creating custom dashboards for team performance analysis using Excel/Unix/Tableau. I will bring this knowledge to bear in evaluating datasets to develop Key Performance Metrics and visualizations. In particular, my experience will serve as a guide to choosing visualizations that effectively engage viewers.

What project(s) are you interested in (these can be in the same or different organizations)?

I am most interested in working on the Wikimedia Performance Portal project with the Wikimedia Foundation.

Past experience working in open source projects (MediaWiki or otherwise)

I recently started contributing to the Code for San Francisco - Code for America Brigade. Specifically, I am working on a project to aggregate San Francisco service provider data from different City Departments and transform it to conform to Open Referral, an API specification. I am working on the Python scripts that will transform CSV and other related data types into Open Referral-compliant JSON files. I have not pushed my code to the GitHub parent repo, as some schema mapping needs to be completed before I can do so. <https://github.com/sfbrigade/sf-openreferral-transform-scripts/issues>
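A CSV-to-JSON transformation of this kind can be sketched in a few lines. The column names and output field names below are placeholders invented for the example; they are not the Open Referral schema, and this is not the code from the project repository.

```python
import csv
import io
import json

# Hypothetical column-to-field mapping; NOT the actual Open Referral schema.
FIELD_MAP = {"Provider Name": "name", "Phone": "phone", "Dept": "department"}

def csv_to_records(csv_text, field_map=FIELD_MAP):
    """Map each CSV row to a dict with renamed, whitespace-trimmed fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            {out: (row.get(src) or "").strip() for src, out in field_map.items()}
        )
    return records

# Made-up sample input standing in for a department's export.
sample_csv = "Provider Name,Phone,Dept\nExample Shelter, 555-0100 ,Housing\n"
records = csv_to_records(sample_csv)
print(json.dumps(records, indent=2))
```

Keeping the mapping in a single dictionary means the pending schema-mapping work only changes data, not the transformation logic.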

I am also a member of Double Union, a female-only hackerspace in San Francisco. Although it is not directly an open source project, the space works to promote open culture and, in particular, the inclusion of those underrepresented in FOSS-related activities. <https://www.doubleunion.org/>
