User:Zareenf/Outreachy Report

This page includes weekly progress reports from User:Zareenf for the GNOME Outreachy program.

Community Bonding Period

October 19, 2016: internship start date (based on start of daily check-ins)

Communication Plan

4x a week via IRC + 1 hangout per week (check in is usually 30-60 min) with Tilman
Bi-weekly hangout with Jon to discuss issues outside of day to day activities
Attend weekly research team meetings (get introduced to other WMF team members)

Relevant Links

phabricator task

Weekly reports during Community Bonding Period

Week of October 19 - October 21

started generating headers datasets
started brainstorming and implementing ways to check for data integrity (had to recreate most languages 3-4 times before they passed DQ checks)
worked on reading headers datasets into PAWS w/o crashing kernel

Week of October 24 - October 28

October 24 - October 25: Z off
worked with Yuvi to understand PAWS memory limitations
generated results on personal laptop to check memory limits
researched memory usage in pandas dataframes
brainstormed and tested ways to filter rows by namespace = 0 (manipulating the existing dataframe, creating a new dataframe, etc.)
still generating and checking new headers file for data integrity
read the research paper "Wikipedia traffic data and electoral prediction: towards theoretically informed models" to review it for the research blog

Week of October 31 - November 4

November 2 - November 4: Z off
tested generating a list of rows to skip, writing that list to a separate file, and reading in the dataframe while skipping the rows where namespace != 0
still generating and checking new headers file for data integrity
wrote a review of the research paper "Wikipedia traffic data and electoral prediction: towards theoretically informed models" for the research blog
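The skip-list approach tested this week could look roughly like the sketch below. It is only an illustration: the column names (`page_id`, `namespace`, `heading`), the tab-separated layout, and the sample rows are assumptions, since the report does not describe the file format. It also assumes every record occupies exactly one line, and it keeps the skip list in memory rather than writing it to a separate file.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data; the real headings files are not shown in this report.
SAMPLE_TSV = (
    "page_id\tnamespace\theading\n"
    "1\t0\tHistory\n"
    "2\t1\tTalk\n"
    "3\t0\tReferences\n"
    "4\t14\tCategory\n"
)


def rows_to_skip(text):
    """First pass: collect file line numbers of rows whose namespace is not 0."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Line 0 is the header, so the first data row sits on file line 1.
    return [
        line_no
        for line_no, row in enumerate(reader, start=1)
        if row["namespace"] != "0"
    ]


skip = rows_to_skip(SAMPLE_TSV)

# Second pass: pandas never parses the skipped lines into the dataframe.
df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t", skiprows=skip)
print(df)
```

The trade-off is that the file is scanned twice, in exchange for never holding the unwanted rows in a dataframe.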

Week of November 7 - November 11

November 9: Z off
implemented a solution that reads the dataset in chunks of 100,000 rows and changes dtypes to conserve memory
installed Python 3 on personal laptop to generate datasets locally
still generating and checking new headers file for data integrity
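A minimal sketch of chunked reading with narrower dtypes in pandas, under stated assumptions: the column names, the tab-separated layout, and the specific integer widths are hypothetical, and the sample uses a tiny chunk size so the chunking is visible (the report used 100,000 rows). Filtering each chunk to namespace 0 before concatenating is one possible way to combine this with the filtering work from earlier weeks, not necessarily what was done.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a headings dataset.
SAMPLE_TSV = (
    "page_id\tnamespace\theading\n"
    "1\t0\tHistory\n"
    "1\t0\tReferences\n"
    "2\t1\tTalk\n"
    "3\t0\tReferences\n"
)


def read_namespace_zero(source, chunksize=100_000):
    """Read a tab-separated file chunk by chunk, keeping only namespace-0 rows."""
    # Smaller integer types than the default int64 reduce memory per row.
    dtypes = {"page_id": "int32", "namespace": "int16"}
    kept = []
    for chunk in pd.read_csv(source, sep="\t", dtype=dtypes, chunksize=chunksize):
        # Only the filtered part of each chunk is retained.
        kept.append(chunk[chunk["namespace"] == 0])
    return pd.concat(kept, ignore_index=True)


df = read_namespace_zero(io.StringIO(SAMPLE_TSV), chunksize=2)
print(df.dtypes)
print(df)
```

Because only one chunk plus the filtered rows is in memory at a time, peak usage depends on the chunk size rather than on the full file size.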

Week of November 14 - November 18

PAWS was down for 2 days
PAWS statics was down
worked on stripping whitespace w/o crashing PAWS kernel (tested command line tools like sed and awk)
IT and ES languages were complete (according to current understanding)
started writing up draft for IT and ES results
still generating and checking EN, FR, DE headers file for data integrity
realized a logic flaw: the code counted how often each section heading occurs rather than the number of articles in which that heading appears, so the code had to be refactored to answer the correct question
realized that the unique article count from the code differed drastically from the official WP article count statistics
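The difference between the two counts can be illustrated with a small sketch. The `(article_id, heading)` pairs below are made up for illustration and are not taken from the actual datasets.

```python
from collections import Counter

# Hypothetical (article_id, heading) rows.
rows = [
    (1, "References"),
    (1, "References"),  # the same heading repeated within one article
    (1, "History"),
    (2, "References"),
    (3, "History"),
]

# Raw frequency: every occurrence counts, including repeats within an article.
raw_frequency = Counter(heading for _, heading in rows)

# Article count: deduplicate (article, heading) pairs first, so each
# article contributes at most once per heading.
article_counts = Counter(heading for _, heading in set(rows))

# Number of distinct articles, usable as a denominator for shares.
total_articles = len({article_id for article_id, _ in rows})

print(raw_frequency["References"], article_counts["References"], total_articles)
```

Here "References" occurs three times but appears in only two of the three articles, which is the distinction the refactoring addressed.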

Week of November 21 - November 25

November 23 - November 25: Thanksgiving holiday
analyzed Cree language to investigate article count difference
added a counter of article pages while generating the headings dataset and filtered by namespace 0 while generating the dataset
updated to using the 11/01 dump instead of 10/01
worked on the results draft write-up for the meta page
still generating and checking new headers file for data integrity

Week of November 28 - December 2

confirmed that counting article pages while generating the datasets is the better solution
still generating and checking new headers file for data integrity
started writing results draft into meta page (creating tables, translating other language words)
compressed datasets to prepare them for public sharing

Week of December 5 - 6

Drafted village pump notice
Learned about public dataset best practices and uploaded to figshare
published 1st blog post about Outreachy internship

Weekly Reports during First Half of Internship Period

Week 1: December 6 - December 12

wrapped up subproject 0
wrote synopsis of results in meta page and created tables with results
posted announcements on 4 village pumps
completed retrospective for subproject 0 with mentor
pulled data for, created charts for, and wrote comments for my first Wikimedia Readership metrics report
investigated unusual peaks and dips in numbers for readership report
published 1st readership report
posted report to relevant mailing lists
attended CREDIT showcase meeting
started discussion about retention metric project for the following week’s work

Week 2: December 13 - December 19

started work on retention metric project
pulled all data for last 60 days (available timespan in database)
calculated an average in Google Sheets for each of the last 60 days
created chart of average
uploaded charts from Readership metrics report to Commons
presented subproject 0 at weekly research meeting (was cancelled the week earlier)
worked on modifying the retention query to calculate results in Hive rather than Google Sheets
wrote Outreachy blog post #2
watched a MapReduce video tutorial + started reading a few chapters from a Hadoop book (specifically the Hive chapter)

Week 3: December 20 - December 26

December 26: off for holiday
continued modifying query for retention metric project
calculated median, 90th percentile and average for multiple days
worked on setting a fixed last seen date instead of fixed return date

Week 4: December 27 - January 2

December 28 + 30: off for holiday
continued work on retention metrics project
started pulling data for 2nd readership report

Week 5: January 3 - January 9

published 2nd readership report
uploaded charts to Commons
continued work on adjusting retention metric query
wrote blog post on Diving Into Wikimedia Foundation's Data

Week 6: January 10 - January 16

January 9: travel day to SF due to flight cancellation the day before
January 10-11: attended Wikimedia Developer Summit (blog post)
January 12-13: Z off (made these days up during the winter holiday break, WMF also had all hands meetings)
January 16: MLK day holiday, but Z + Tilman met up to work on retention metric project

Week 7: January 17 - January 23

set up Hive queries to run and email the results (in case the SSH connection is dropped)
worked on modifying the query to include different language editions and mobile vs. desktop
attended + spoke at PyLadies Data Science event at WMF (event page, video recording)
met with Leila from research team to discuss expanding stub pages research and how my section headings research can help

Week 8: January 24 - January 30

remotely attended Reading/Community tech Quarterly check-in
watched WMF Metrics and Activities Meeting
ran a query to get the average number of days until next access, broken down by mobile vs. desktop, for all sites, English Wikipedia, Japanese Wikipedia, and Italian Wikipedia, and created charts with these results
worked with the analytics teams to discuss ways to use fewer resources while running large queries (sample data using TABLESAMPLE or create my own table in Hive)
compared results from sampling to running queries over full data

Mid-term evaluation

I’m making solid progress on my project work, although I’m a bit behind schedule as the first projects took more time than originally estimated. I’ve gotten up to speed with various analytics tools and metrics used by the Reading team. I’ve completed subproject 0, worked on 2 readership reports for subproject 1 (and plan to release at least one more report during the internship), and have been working on subproject 4 (adjusted schedule of project work to prioritize most important projects). I attended the Wikimedia Developer Summit and got to meet many of the folks I worked remotely with and many members of the WMF technical community. I also gave a talk about “Becoming a Data Analyst” for a PyLadies meetup at the WMF office. The communication plan set in place has worked well to check in with my mentors regularly.

Weekly Reports during Second Half of Internship Period

Week 9: January 31 - February 6

continued work on running Hive queries in PAWS internal instead of via command line
worked with retention metric data: cleaned, sorted, and saved in PAWS internal
created Phabricator task to request additional functionality in PAWS internal to share + view notebooks
started generating data for next Readership Report
had bi-weekly check-in with Jon

Week 10: February 7 - February 13

finished pulling data for next Readership Report
Started draft of next Readership Report (updating charts, writing comments)
worked on creating charts in matplotlib and/or seaborn for retention project
updated timespan of wmf_last_seen_dates for retention project
worked with Joseph from Analytics to create my own intermediary table from wmf.webrequests so that retention metric queries run faster and put less strain on WMF servers
attended Q2 Reading Metrics meeting + weekly internal Research meeting

Week 11: February 14 - February 20

finished and published third Readership Report
started using data from zareen.webrequest_extract for retention metric and investigated 2 data quality issues (higher averages than with unsampled data and repeated date values)
modified query for average values due to aggregated rows (using view_count field)
started looking into how to set up a daily oozie job to update zareen.webrequest_extract table
created histograms (in Google Sheets) for number of days until next access, started work on moving this into PAWS internal
generated retention metric for all projects (with returns within 31 days and 7 days), broken down by mobile vs. desktop
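To illustrate why aggregated rows call for a modified average, here is a small sketch. Only the `view_count` field name comes from this report; the `days` field and all the values are hypothetical. Each row is assumed to stand for `view_count` identical underlying observations.

```python
# Hypothetical aggregated rows: each one represents `view_count` observations
# that share the same number of days until next access.
rows = [
    {"days": 1, "view_count": 6},
    {"days": 3, "view_count": 3},
    {"days": 10, "view_count": 1},
]


def unweighted_mean(rows):
    """Plain mean over rows; treats every aggregated row as one observation."""
    return sum(r["days"] for r in rows) / len(rows)


def weighted_mean(rows):
    """Mean over the underlying observations, weighting rows by view_count."""
    total_weight = sum(r["view_count"] for r in rows)
    return sum(r["days"] * r["view_count"] for r in rows) / total_weight


print(unweighted_mean(rows))  # 14 / 3
print(weighted_mean(rows))    # 25 / 10
```

In this made-up sample the plain row average overstates the per-observation average because the rare long gaps get the same weight as the common short ones.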

Week 12: February 21 - February 27

worked on modifying the percentile query to account for aggregated rows: created a sample table of values and weights to test queries against (also tested joins and computing a cumulative sum)
started initial conversations about work for engagement project
attended monthly metrics and activities meeting
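One common way to compute a percentile over values with weights is to sort by value and walk a cumulative sum of weights until it reaches the requested share of the total. The sketch below shows that idea in Python rather than Hive; the function name, the nearest-rank definition of a percentile, and the sample pairs are assumptions, not the query used in the project.

```python
def weighted_percentile(pairs, percent):
    """Nearest-rank percentile over (value, weight) pairs.

    Returns the smallest value whose cumulative weight reaches
    `percent` percent of the total weight. Weights are assumed to be
    non-negative integers (for example, view counts).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(pairs)
    total = sum(weight for _, weight in ordered)
    if total <= 0:
        raise ValueError("total weight must be positive")
    running = 0
    for value, weight in ordered:
        running += weight
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return value
    return ordered[-1][0]


# Hypothetical sample table of (value, weight) rows.
sample = [(10, 1), (1, 6), (3, 3)]
print(weighted_percentile(sample, 50))
print(weighted_percentile(sample, 90))
```

A small table like this is easy to check by hand: expanding the weights gives six 1s, three 3s, and one 10, and the nearest-rank percentile of that expanded list should match the weighted result.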

Week 13: February 28 - March 6

Worked on engagement metric project by checking and reporting on quality of data from log.ReadingDepth_16325045
Created the queries to calculate averages and percentiles from log.ReadingDepth_16325045 for engagement metric
Documented project's current status and wrote 2 blog posts (one about retention metric project and one final report)
Discussed future work on projects to close everything out in the next couple weeks

Final Report

Here is my final report blog post.