The Wikipedia Library/Wikilink stability

This page provides an overview of The Wikipedia Library team's Q1 20/21 project to improve the stability of the Wikilink tool and solve its major bugs.

While this project doesn't have many community-facing elements, comments, suggestions, and feedback are welcome on the talk page.

Background
The Wikilink tool, hosted at https://wikilink.wmflabs.org/, tracks and displays data about links to specific websites that are added to Wikimedia projects. The tool has two groups of users: The Wikipedia Library team internally, and publisher partners externally. The Wikipedia Library team uses the data to evaluate the program and its active partnerships, report on key metrics, and understand the program's user base. Publishers want to understand the impact of the access they are providing to editors so that they can evaluate whether to continue providing access to the community.

The tool was built in 2019 as an improvement to our previous process of manually collecting and reporting link totals. We were motivated to build a tool which could track link additions across all Wikimedia projects in real time, and attribute those link additions directly to The Wikipedia Library program. Deployment of the EventStream (see page-links-change) allowed us to do this. The tool was built quickly, and we weren't able to prioritise further development time until now. While data has continued to be collected, manual intervention is often required, and many pages on the tool don't load as expected.
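As a rough illustration of the tracking idea described above, the following sketch filters a single link-change event for newly added external links that point at a tracked website. It is not the tool's actual code: the field names (`added_links`, `link`, `external`) are assumptions about the event payload, and `TRACKED_DOMAINS` and `tracked_links_added` are hypothetical names introduced for this example.

```python
import json
from urllib.parse import urlparse

# Hypothetical set of tracked partner domains, for illustration only.
TRACKED_DOMAINS = {"example.org"}


def tracked_links_added(event_json, tracked_domains=TRACKED_DOMAINS):
    """Return the external links added in one event whose host is tracked.

    Assumes the event is a JSON object with an "added_links" list whose
    entries carry a "link" string and an "external" boolean.
    """
    event = json.loads(event_json)
    matches = []
    for entry in event.get("added_links", []):
        if not entry.get("external"):
            continue  # ignore links that are not external
        host = urlparse(entry.get("link", "")).hostname or ""
        # Match the domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in tracked_domains):
            matches.append(entry["link"])
    return matches
```

A real collector would also need to record which project, page, and user each link addition belongs to; that part is left out here.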

We would like to get the tool to a point where data tracking is reliable and requires little manual intervention, and all program and organisation pages load quickly and consistently.

Roadmap
Our work on this project will centre around two issues, but will additionally include numerous bug fixes and interface improvements.


 * 1) T240673 - Program and organisation pages time out due to expensive database queries.
 * 2) T250084 - Keep the EventStream tracking script connected.

See also the tasks listed at Wikilink-Tool. We will address as many of the highest priority tasks as possible given our timeframe.
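For readers unfamiliar with the connectivity problem in T250084, one common way to keep a long-running stream consumer alive is to reconnect with exponential backoff whenever the connection drops. The sketch below shows that general pattern only; it is not necessarily the approach taken in this project, and `run_with_reconnect` and its parameters are hypothetical names.

```python
import time


def run_with_reconnect(consume, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call consume() until it returns normally or attempts run out.

    consume is any callable that blocks while the stream is healthy and
    raises ConnectionError or TimeoutError when the connection drops.
    Returns the number of connection attempts made.
    """
    attempts = 0
    delay = base_delay
    while attempts < max_attempts:
        attempts += 1
        try:
            consume()
            return attempts  # consume() finished without an error
        except (ConnectionError, TimeoutError):
            sleep(delay)  # wait before reconnecting
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    return attempts
```

Passing `sleep` as a parameter makes the loop easy to exercise without real waiting. A production script would typically retry indefinitely and log each disconnect.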

Updates
Week starting...

3 August 2020
 * Team onboarding for the tool - Sam presented an overview of the tool and the tasks we'll be working on.
 * We updated https://phabricator.wikimedia.org/T258793, https://phabricator.wikimedia.org/T250084, and https://phabricator.wikimedia.org/T240673
 * We decided to start work on upgrading the Production server to Docker Swarm (T258793) and investigating the EventStream connectivity issues (T250084).

17 August 2020
 * Started work on upgrading the Production server to Docker Swarm (T258793) and implementing server error email alerts (T258144). This work will feed into the EventStream connectivity task.

24 August 2020
 * Continued work on the Docker Swarm upgrade.

31 August 2020
 * Began a broader investigation into the EventStream connectivity issues while awaiting review of the Docker Swarm upgrade.
 * Investigation notes are published at T250084.
 * We published a series of next steps for the EventStream issues at T261807.

28 September 2020
 * Over September we:
   * Implemented the Docker Swarm upgrade (T258793), which automatically reboots the EventStream container on exit
   * Upgraded Wikilink to Django 3.1 (T261798)
   * Updated the timeout configuration for the data collection script (T264211)
 * These upgrades should ensure consistent data collection moving forward.
 * We've now started work on the page load issues (T240673)

5 October 2020
 * The EventStream data collection has been stable. While disconnects have happened, the surrounding infrastructure has successfully rebooted without data loss.
 * Set up django-debug-toolbar to get a more granular view of the page load issues. We are currently evaluating the best ways to improve page load times. (T240673)
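
For reference, enabling django-debug-toolbar in a Django project usually involves settings along the following lines. This is a minimal development-only sketch, not Wikilink's actual configuration; a real project's lists contain many more entries, and the `INTERNAL_IPS` value shown is an assumption for a local setup.

```python
# settings.py fragment (development only); illustrative, not Wikilink's config.
DEBUG = True

INSTALLED_APPS = [
    "django.contrib.staticfiles",  # the toolbar serves its assets through staticfiles
    "debug_toolbar",
]

MIDDLEWARE = [
    # Place the toolbar middleware as early in the list as possible.
    "debug_toolbar.middleware.DebugToolbarMiddleware",
]

# The toolbar is only displayed for requests coming from these addresses.
INTERNAL_IPS = ["127.0.0.1"]

# urls.py would additionally need an entry such as:
#   path("__debug__/", include("debug_toolbar.urls"))
```

With the toolbar active, its SQL panel lists the queries run for each page along with their timings, which is the kind of detail needed to find expensive queries.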