User:KHarlan (WMF)/Blog/Diagnosing and fixing a Flow production error with the help of X-Wikimedia-Debug

Diagnosing and fixing production errors with the help of X-Wikimedia-Debug
January 16, 2019

The problem
In February 2018, a user reported that edits from Flow were not appearing in the Recent Changes feeds (IRCFeed, EventStreams) when regular users created topics on a new talk page. As noted in the comments for the issue, the problem was relatively major in that automated patrol systems could not see a significant portion of contributions made on Flow-enabled wikis.

What we usually do
When a bug like this occurs, there are typically three things to do. First, establish the steps to reproduce the bug. That was already done by the task creator and then confirmed by other contributors to the task. Second, attempt to reproduce the issue locally and set breakpoints in code to understand why the problem occurs. Third, check the production logs to look for any messages related to the action.

Unfortunately, we could not reproduce the problem in our local development environment. Nor did we find any relevant messages in the logs. Since reproducing the issue locally wasn't possible, we merged some diagnostic code but still had nothing. Early on, Stephane suggested a hypothesis that Flow's code checked for privileged user permissions, and then would fall back to checking a revision from the replica that didn't yet exist, but we could not reproduce this nor identify exactly where this might occur.

Enter X-Wikimedia-Debug
One of the incredibly useful tools we have available to us is the X-Wikimedia-Debug header. I knew about this header (and its browser extensions) from verifying changes that were being SWAT'ed into production but I had not considered its usefulness for tracking down production bugs. I was using the browser extension with the "Log" checkbox ticked (and still not finding anything useful in Logstash to help isolate this bug) when I realized that I could also profile the problematic request. When you check the box to profile a request, XHProf will profile the code that's executed and make the result available for viewing via Xhgui. Typically you do this to understand performance bottlenecks in your code, but as a bonus to using this tool, you also get a complete list of all functions executed during the request.

With this tool, I followed the steps to reproduce and then switched on the "Profile" option when I posted a new topic on an empty Flow board. Now, I had a profiled request which provided me with information on all the methods called, including which method called another. From here I could quickly see Stephane's hypothesis was correct, and I could also find the exact place where the code execution halted.

Assumptions about replication
With this knowledge, I went back to my local environment. I set up an environment with database replication and set up some breakpoints in the code I suspected was causing the problem. I then found that in, we had this code:

For a new topic on a blank flow board,  is , which would return true for privileged users. But for unprivileged users, Flow would check, and this evaluated as false because the code was querying the database replica, and the title did not exist there yet.

The submitted solution was to patch  to query the master DB when in the context of a POST request, since we know the title would exist in the master DB. With this patch in place, events were once again emitted properly and the bug was fixed.

Conclusion
A few of my conclusions from this experience:


 * Diagnostic code is helpful (even if it didn't pinpoint the problem here) and should be considered instead of "silent" returns
 * Having database replication in your local development environment can help catch issues while developing and when attempting to reproduce a production issue. One can use the Mediawiki-Docker-Dev environment for this, and see also how to adjust its database replication lag.
 * The profiler in the X-Wikimedia-Debug browser extension is really useful in helping narrow down and pinpoint the location of a production problem.