User:Zareenf/Outreachy Report

This page includes weekly progress reports for User:Zareenf's internship in the GNOME Outreachy program.

Community Bonding Period
October 19, 2016: internship start date (based on start of daily check-ins)

Communication Plan

 * 4x a week via IRC + 1 hangout per week (check in is usually 30-60 min) with Tilman
 * Bi-weekly hangout with Jon to discuss issues outside of day to day activities
 * Attend weekly research team meetings (get introduced to other WMF team members)

Relevant Links

 * phabricator task

Week of October 19 - October 21

 * started generating headers datasets
 * started brainstorming and implementing ways to check data integrity (had to regenerate the datasets for most languages 3-4 times before they passed data quality (DQ) checks)
 * worked on reading headers datasets into PAWS w/o crashing kernel

Week of October 24 - October 28

 * October 24 - October 25: Z off
 * worked with Yuvi to understand PAWS memory limitations
 * generated results on personal laptop to check memory limits
 * researched memory usage in pandas dataframes
 * brainstormed and tested ways to keep only rows where namespace = 0 (manipulating the existing dataframe, creating a new dataframe, etc.)
 * still generating and checking new headers file for data integrity
 * read Wikipedia traffic data and electoral prediction: towards theoretically informed models research paper to review for research blog

Week of October 31 - November 4

 * November 2 - November 4: Z off
 * tested generating a list of rows to skip, writing that list to a separate file and reading in the dataframe while skipping the rows where namespace != 0
 * still generating and checking new headers file for data integrity
 * wrote review of Wikipedia traffic data and electoral prediction: towards theoretically informed models research paper for research blog

Week of November 7 - November 11

 * November 9: Z off
 * implemented a solution that reads the dataset in chunks of 100,000 rows and changes dtypes to conserve memory
 * installed python 3 onto personal laptop to generate datasets on personal laptop
 * still generating and checking new headers file for data integrity
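
The chunked-read approach from this week can be sketched roughly as follows. This is a minimal illustration, not the code used in the internship: the column names (page_id, namespace, heading), the tab separator, and the chosen dtypes are assumptions, since the report does not describe the layout of the headers dataset.

```python
import io

import pandas as pd

# Hypothetical compact dtypes for the numeric columns (assumed layout).
DTYPES = {"page_id": "int32", "namespace": "int8"}


def read_namespace_zero(source, chunksize=100_000):
    """Read a tab-separated file in chunks and keep only namespace 0 rows."""
    kept = []
    for chunk in pd.read_csv(source, sep="\t", dtype=DTYPES, chunksize=chunksize):
        # Filter each chunk before storing it, so rows from other
        # namespaces never accumulate in memory.
        kept.append(chunk[chunk["namespace"] == 0])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for the real file.
sample = "page_id\tnamespace\theading\n1\t0\tHistory\n2\t1\tTalk\n3\t0\tReferences\n"
df = read_namespace_zero(io.StringIO(sample), chunksize=2)
```

Filtering inside the loop means peak memory depends on the chunk size and the number of rows kept, rather than on the size of the whole file.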

Week of November 14 - November 18

 * PAWS is down for 2 days
 * PAWS statics is down
 * worked on stripping whitespace w/o crashing PAWS kernel (tested command line tools like sed and awk)
 * IT and ES languages were complete (according to current understanding)
 * started writing up draft for IT and ES results
 * still generating and checking EN, FR, DE headers file for data integrity
 * realized a logic flaw: the code counted how often each section heading occurs rather than the number of articles in which the heading appears, so had to refactor the code to answer the correct question
 * realized that the unique article count produced by the code differed drastically from the official WP article count statistics
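
The difference between the two counts mentioned above can be shown on a toy table. This is a sketch under assumptions: the column names page_id and heading are hypothetical, and the data is made up for illustration.

```python
import pandas as pd

# One row per heading occurrence; article 1 uses "Notes" twice.
rows = pd.DataFrame(
    {
        "page_id": [1, 1, 1, 2, 3],
        "heading": ["Notes", "Notes", "History", "History", "History"],
    }
)

# Raw frequency: how many times each heading occurs overall.
occurrences = rows["heading"].value_counts()

# Article count: in how many distinct articles each heading appears.
articles = rows.groupby("heading")["page_id"].nunique()
```

Here "Notes" occurs twice but appears in only one article, so the two measures disagree whenever a heading repeats within a single page.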

Week of November 21 - November 25

 * November 23 - November 25: Thanksgiving holiday
 * analyzed Cree language to investigate article count difference
 * added a counter of article pages while generating the headings dataset and filtered by namespace 0 while generating the dataset
 * updated to using the 11/01 dump instead of 10/01
 * worked on results draft write up for meta page
 * still generating and checking new headers file for data integrity

Week of November 28 - December 2

 * confirmed that counter while generating datasets is better solution
 * still generating and checking new headers file for data integrity
 * started writing results draft into meta page (creating tables, translating other language words)
 * compressed datasets to get ready for publicly sharing

Week of December 5 - 6

 * Drafted village pump notice
 * Learned about public dataset best practices and uploaded to figshare
 * published 1st blog post about Outreachy internship

Week 1: December 6 - December 12

 * wrapped up subproject 0
 * wrote synopsis of results in meta page and created tables with results
 * posted announcements on 4 village pumps
 * https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#Frequency_of_section_titles_in_English_Wikipedia
 * https://es.wikipedia.org/wiki/Wikipedia:Caf%C3%A9/Archivo/Miscel%C3%A1nea/Actual#Frequency_of_section_titles_in_Spanish_Wikipedia
 * https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro/8_d%C3%A9cembre_2016#Frequency_of_section_titles_in_French_Wikipedia
 * https://it.wikipedia.org/wiki/Wikipedia:Bar/2016_12_8#Frequency_of_section_titles_in_Italian_Wikipedia
 * completed retrospective for subproject 0 with mentor
 * pulled data for, created charts for, and wrote comments for my first Wikimedia Readership metrics report
 * investigated unusual peaks and dips in numbers for readership report
 * published 1st readership report
 * posted report to relevant mailing lists
 * attended CREDIT showcase meeting
 * started discussion about retention metric project for the following week’s work

Week 2: December 13 - December 19

 * started work on retention metric project
 * pulled all data for last 60 days (available timespan in database)
 * calculated an average in google sheets for each of the last 60 days
 * created chart of average
 * uploaded charts from Readership metrics report to Commons
 * presented subproject 0 at weekly research meeting (was cancelled the week earlier)
 * worked on modifying retention query to calculate results in hive rather than google sheets
 * wrote Outreachy blog post #2
 * watched mapreduce video tutorial + started reading a few chapters from Hadoop book (specifically Hive chapter)

Week 3: December 20 - December 26

 * December 26: off for holiday
 * continued modifying query for retention metric project
 * calculated median, 90th percentile and average for multiple days
 * worked on setting a fixed last seen date instead of fixed return date

Week 4: December 27 - January 2

 * December 28 + 30: off for holiday
 * continued work on retention metrics project
 * started pulling data for 2nd readership report

Week 5: January 3 - January 9

 * published 2nd readership report
 * uploaded charts to Commons
 * continued work on adjusting retention metric query
 * wrote blog post on Diving Into Wikimedia Foundation's Data

Week 6: January 10 - January 16

 * January 9: travel day to SF due to flight cancellation the day before
 * January 10-11: attended Wikimedia Developer Summit (blog post)
 * January 12-13: Z off (made these days up during the winter holiday break, WMF also had all hands meetings)
 * January 16: MLK day holiday, but Z + Tilman met up to work on retention metric project

Week 7: January 17 - January 23

 * set up Hive queries to run and email the results (in case the SSH connection is dropped)
 * worked on modifying query to include different language editions, mobile vs. desktop
 * attended + spoke at Pyladies Data Science event at WMF (event page, video recording)
 * met with Leila from research team to discuss expanding stub pages research and how my section headings research can help

Week 8: January 24 - January 30

 * remotely attended Reading/Community tech Quarterly check-in
 * watched WMF Metrics and Activities Meeting
 * ran query to get average days until next access, broken down by mobile vs. desktop, for all sites, English Wikipedia, Japanese Wikipedia, and Italian Wikipedia, and created charts with these results
 * worked with the Analytics team to discuss ways to use fewer resources while running large queries (sample data using TABLESAMPLE or create my own table in Hive)
 * compared results from sampling to running queries over full data

Mid-term evaluation
I’m making solid progress on my project work, although I’m a bit behind schedule as the first projects took more time than originally estimated. I’ve gotten up to speed with various analytics tools and metrics used by the Reading team. I’ve completed subproject 0, worked on 2 readership reports for subproject 1 (and plan to release at least one more report during the internship), and have been working on subproject 4 (adjusted schedule of project work to prioritize most important projects). I attended the Wikimedia Developer Summit and got to meet many of the folks I worked remotely with and many members of the WMF technical community. I also gave a talk about “Becoming a Data Analyst” for a PyLadies meetup at the WMF office. The communication plan set in place has worked well to check in with my mentors regularly.

Week 9: January 31 - February 6

 * continued work on running Hive queries in PAWS internal instead of via command line
 * worked with retention metric data: cleaned, sorted, and saved in PAWS internal
 * created Phabricator task to request additional functionality in PAWS internal to share + view notebooks
 * started generating data for next Readership Report
 * had bi-weekly check-in with Jon

Week 10: February 7 - February 13

 * finished pulling data for next Readership Report
 * Started draft of next Readership Report (updating charts, writing comments)
 * worked on creating charts in matplotlib and/or seaborn for retention project
 * updated timespan of wmf_last_seen_dates for retention project
 * worked with Joseph from Analytics to create my own intermediary table from wmf.webrequests so that queries for the retention metric run faster and put less strain on WMF servers
 * attended Q2 Reading Metrics meeting + weekly internal Research meeting

Week 11: February 14 - February 20

 * finished and published third Readership Report
 * started using data from zareen.webrequest_extract for retention metric and investigated 2 data quality issues (higher averages than with unsampled data and repeated date values)
 * modified query for average values due to aggregated rows (using view_count field)
 * started looking into how to set up a daily oozie job to update zareen.webrequest_extract table
 * created histograms (in google sheets) for number of days until next access, started work on moving this into PAWS internal
 * generated retention metric for all projects (with returns within 31 days and 7 days), broken down by mobile vs. desktop
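
Why averages over aggregated rows need a weight can be illustrated with a small sketch. It assumes that each row stands for view_count identical observations; that reading of the field, the helper name weighted_mean, and the numbers are assumptions for illustration, not details from the actual table.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts weights[i] times."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical aggregated rows: days until next access and how many
# original rows each aggregated row represents.
days = [1, 2, 10]
view_count = [6, 3, 1]

# Treating every aggregated row as a single observation.
naive = sum(days) / len(days)

# Giving each row its weight recovers the per-observation average.
weighted = weighted_mean(days, view_count)
```

In this toy data the rare 10-day row dominates the unweighted mean, so it comes out noticeably higher than the weighted one.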

Week 12: February 21 - February 27

 * worked on modifying the percentile query to account for aggregated rows: created a sample table of values and weights to test queries against (also tested joins and computing a cumulative sum)
 * started initial conversations about work for engagement project
 * attended monthly metrics and activities meeting
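
The values-and-weights idea from this week can be sketched in Python as a weighted percentile computed from a cumulative sum. This is only one possible definition (the smallest value whose cumulative weight reaches the requested fraction of the total weight), and the function name and sample data are hypothetical; the actual Hive query is not shown in this report.

```python
def weighted_percentile(values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    threshold = q * total
    running = 0
    for value, weight in pairs:
        running += weight  # cumulative sum of weights in value order
        if running >= threshold:
            return value
    return pairs[-1][0]


# Small test table of values and weights, as in the sample-table approach.
values = [1, 2, 10]
weights = [6, 3, 1]

median = weighted_percentile(values, weights, 0.5)
p95 = weighted_percentile(values, weights, 0.95)
```

Sorting by value and walking the running total matches what a join plus cumulative sum would do in SQL, without expanding each aggregated row back into individual rows.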

Week 13: February 28 - March 6

 * Worked on engagement metric project by checking and reporting on quality of data from log.ReadingDepth_16325045
 * Created the queries to calculate averages and percentiles from log.ReadingDepth_16325045 for engagement metric
 * Documented project's current status and wrote 2 blog posts (one about retention metric project and one final report)
 * Discussed future work on projects to close everything out in the next couple weeks

Final Report
Here is my final report blog post.