User:KHarlan (WMF)/Blog/Diagnosing and fixing a Flow production error with the help of X-Wikimedia-Debug

The problem
In February 2018, a user reported that edits from Flow were not appearing in the Recent Changes feeds (IRCFeed, EventStreams) when regular users created topics on a new talk page. As Timo noted in the comments for the issue, the problem was relatively major in that automated patrol systems could not see a significant portion of contributions made on Flow-enabled wikis.

What we usually do
When a bug like this occurs, there are typically three things to do. First, establish the steps to reproduce the bug. In this case, that was already done by the task creator and then confirmed by other contributors to the task. Second, attempt to reproduce the issue locally and set breakpoints in code to understand why the problem occurs. Third, check the production logs to look for any messages related to the action.

Unfortunately, we could not reproduce the problem in our local development environment. Nor did we find any relevant messages in the logs. Since reproduce the issue locally wasn't possible, we merged some diagnostic code but still had nothing. Early on, Stephane suggested a hypothesis that Flow's code checked for privileged user permissions, and then would fall back to checking a revision from the replica that didn't yet exist, but we could not reproduce this nor identify exactly where this might occur.

Enter X-Wikimedia-Debug
One of the incredibly useful tools we have available to us is the X-Wikimedia-Debug header. I knew about this header (and its browser extensions) from verifying changes that were being SWAT'ed into production. I was using the browser extension with the "Log" checkbox ticked (and still not finding anything useful in Logstash to help isolate this bug) when I realized that I had a powerful aid available with this header: the ability to profile my request. When you check the box to profile a request, XHProf will profile the code that's executed and make the result available for viewing via Xhgui.

With this tool, I followed the steps to reproduce and then switched on the "Profile" option when I posted a new topic on an empty Flow board. Now, I had a profiled request which provided me with information on all the methods called, including which method called another. From here I could quickly see Stephane's hypothesis was correct.

Assumptions about replication
In, we had this code:

For a new topic on a blank flow board,  is , which would return true for privileged users. But for unprivileged users, Flow would check, and this evaluated as false because the code was querying the database replica, and the title did not exist there yet.

The submitted solution was to patch  to query the master DB when in the context of a POST request. With this patch in place, events were once again emitted properly and the bug was fixed.

Conclusion
A few of my conclusions from this experience:


 * Diagnostic code is helpful (even if it didn't pinpoint the problem here) and should be considered instead of "silent" returns
 * Having database replication in your local development environment can help catch issues while developing and when attempting to reproduce a production issue. One can use the Mediawiki-Docker-Dev environment for this, and see also how to adjust its database replication lag.
 * The profiler in the X-Wikimedia-Debug browser extension is really useful in helping narrow down and pinpoint the location of a production problem.