Extension:TimedMediaHandler/VP9 transition

I have started migrating Wikimedia's video playback transcodes from the older WebM VP8 and Vorbis codecs to the newer WebM VP9 and Opus codecs. The transition started July 30, 2018, and may take some weeks to complete for old files.

Big thanks to Wikimedia's Ops team for reassigning some servers that were freed up recently to help with this transition!

Also thanks to Microsoft for adopting the royalty-free, open-source VP9 and Opus codecs in MS Edge; as of the latest Windows 10 update they "just work" out of the box, as they do in Chrome and Firefox.

What does this mean?
The biggest difference is a change in bandwidth/quality trade-offs for the video files used for on-wiki playback. Many files will be about 30% smaller at similar quality, while others will be similar size or a bit larger but much better quality.

The filenames of the actual playback files will change slightly, from ".webm" to ".vp9.webm". Direct links to these transcoded playback files are discouraged; API clients should use videoinfo/derivatives or transcodestatus to get current URLs.
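As a rough illustration of looking up current playback URLs through the API rather than hard-coding them, here is a minimal Python sketch. The Commons endpoint, the helper names, and the response layout (query, pages, videoinfo, derivatives, src) are assumptions made for the example and should be checked against a live API reply.

```python
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons; other wikis use their own api.php.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def derivatives_query_url(file_title):
    """Build an API URL asking for the playback derivatives of one file."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "videoinfo",
        "viprop": "derivatives",
        "titles": file_title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def derivative_urls(response):
    """Collect the 'src' URL of every derivative in a decoded JSON response.

    The nesting walked here is an assumption about the API output, not a
    documented guarantee.
    """
    urls = []
    for page in response.get("query", {}).get("pages", {}).values():
        for info in page.get("videoinfo", []):
            for derivative in info.get("derivatives", []):
                if "src" in derivative:
                    urls.append(derivative["src"])
    return urls
```

A client would fetch the URL returned by derivatives_query_url("File:Ailurus fulgens.ogv"), decode the JSON, and pass it to derivative_urls() instead of guessing at transcode filenames.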

There is no change in support for uploaded file formats; existing original files are not being modified, and every file that works now will continue to work. URLs to existing original files will continue to work.

Encoding times could be about 2x longer on average, depending on resolution and load.

Browser support
Browser support is almost the same as for the previous configuration.

WebM VP9/Opus videos are natively supported by:
 * Chrome and Chromium-based browsers (Opera, Brave, etc)
 * Firefox
 * Edge 17 and up (Windows 10 version 1803)

And are supported via the ogv.js player shim on:
 * Safari (up to 720p24 on fast machines)
 * Edge 16 and below (up to 720p24 on fast machines)
 * Internet Explorer 11 (up to 240p24 on fast machines; requires Flash for audio)

Note that the old Google WebM Components for IE did allow for native VP8/Vorbis playback in IE 11 if you manually installed it, but that package does not support VP9/Opus. If you require high-resolution video playback and are affected by this, the recommended upgrade path is to use another browser such as Chrome or Firefox in place of IE 11.

Server-side changes
Because VP9 encoding is more complex than VP8, the scaled and transcoded output files will take longer to produce than the old VP8 files did at the same CPU thread count. To compensate and to help along the transition, some additional machines have been reassigned to handle the specialized job queue for video transcoding. (These were previously running image thumbnailing until they were replaced by the newer thumbor system.)

The end-to-end conversion of each file will be about 2x slower per thread, but overall times should stay close to what they were before: updated libvpx and ffmpeg packages supporting macroblock-row-based multithreading are being deployed first, allowing twice as many threads to be used.

Encoding changes
Additionally the encoder quality/speed/bandwidth settings are being changed to err on the side of quality with modest speed. Many files come out with much lower bandwidth than the current VP8 fixed-target-rate settings, while others with high frame rates, or lots of detail and motion, will be the same or even larger in exchange for much better picture quality.

This particularly affects 60fps files, which previously looked poor because they used less than half the recommended bitrate, and files with lots of detail and motion.

This is probably an appropriate trade-off for encyclopedic data, though we may be able to adjust the maximum bitrates later on.

Transition
Newly uploaded files will start producing VP9 output automatically.

Old files will keep their VP8 versions, which will continue to be used for playback until they are replaced.

A batch process (requeueTranscodes.php) will go through the backlog generating new VP9 versions to replace them. This process is throttled to minimize disruption to foreground operations.

Note to bot tool authors: please do not manually reschedule a large number of files for new VP9 transcodes! The batch process throttles itself but anything scheduled via the UI or API will run as soon as possible. Too many will clog up the queue.
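To show why a throttled batch process is gentler on the queue than bulk rescheduling, here is a small Python sketch of the general idea: only add backlog items while the number of pending jobs is below a cap. This is not the logic of requeueTranscodes.php; the function name, the cap, and the in-memory queue are hypothetical stand-ins.

```python
from collections import deque


def requeue_throttled(backlog, pending_jobs, max_pending):
    """Move files from the backlog into the queue until the cap is reached.

    backlog is a deque of file names still waiting for a VP9 transcode,
    pending_jobs is a list standing in for the job queue, and max_pending
    is a hypothetical cap on queue length. Returns the files enqueued in
    this pass.
    """
    enqueued = []
    while backlog and len(pending_jobs) < max_pending:
        title = backlog.popleft()
        pending_jobs.append(title)
        enqueued.append(title)
    return enqueued


backlog = deque(["A.ogv", "B.ogv", "C.ogv"])
pending = ["X.ogv"]  # one job already waiting
first_pass = requeue_throttled(backlog, pending, max_pending=3)
```

In this example only two of the three backlog files are enqueued on the first pass; the third waits until the queue drains. Jobs scheduled directly via the UI or API skip this kind of check, which is why large manual batches can clog the queue.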

The complete transition is expected to take several weeks, with a large margin of error, and should present minimal disruption.

Technical details
Filenames/URLs for old VP8 transcodes end in ".webm"; the new VP9 transcodes end in ".vp9.webm".

For instance this output file: https://upload.wikimedia.org/wikipedia/commons/transcoded/3/3e/Ailurus_fulgens.ogv/Ailurus_fulgens.ogv.480p.webm will be obsoleted by this one (does not yet exist): https://upload.wikimedia.org/wikipedia/commons/transcoded/3/3e/Ailurus_fulgens.ogv/Ailurus_fulgens.ogv.480p.vp9.webm
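The naming pattern can be expressed as a tiny Python helper, shown here only to make the suffix change concrete. The function name is hypothetical, and API clients should still ask the API for current URLs rather than deriving them this way.

```python
def vp9_transcode_name(vp8_name):
    """Map an old VP8 transcode name or URL to the corresponding VP9 one.

    Names that already end in '.vp9.webm' are returned unchanged.
    """
    if vp8_name.endswith(".vp9.webm"):
        return vp8_name
    if not vp8_name.endswith(".webm"):
        raise ValueError("not a WebM transcode name: " + vp8_name)
    return vp8_name[: -len(".webm")] + ".vp9.webm"


old_url = (
    "https://upload.wikimedia.org/wikipedia/commons/transcoded/3/3e/"
    "Ailurus_fulgens.ogv/Ailurus_fulgens.ogv.480p.webm"
)
new_url = vp9_transcode_name(old_url)
```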

Video scaler provisioning notes
I have set $wgFFmpegThreads to 8, which'll provide huge benefits for HD and UHD resolutions.

IIRC the newly set up machines, and also the old machines, have dual 10-core/20-thread processors. In theory this means we could crank the threads up to 40 (with hyperthreading) for maximum parallelism when lightly loaded; however, there are diminishing returns beyond 8 threads (for HD) or 2-4 threads (for SD and low resolutions). Due to the way row-mt threading in libvpx works, not all threads will be loaded at all times, so you won't see sustained 1600% CPU load from a 16-thread ffmpeg process (at full HD you'd expect to see more like 800%, and at lower resolutions more like 400% or 200%). Thus the configuration should overprovision from the point of view of $wgFFmpegThreads multiplied by the number of job runners per machine.
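The overprovisioning argument can be made concrete with some back-of-the-envelope arithmetic, sketched in Python below. The 0.5 utilization factor is an assumption extrapolated from the "16 threads, about 800%" full-HD figure above, and the runner count is a hypothetical example, not the production setting.

```python
HARDWARE_THREADS = 40  # dual 10-core/20-thread processors, per the notes above


def nominal_threads(ffmpeg_threads, runners):
    """Threads requested in total if every runner is encoding at once."""
    return ffmpeg_threads * runners


def expected_busy_threads(ffmpeg_threads, runners, utilization=0.5):
    """Rough estimate of threads actually busy.

    utilization is an assumed average fraction of encoder threads doing
    work at any moment; lower resolutions would likely sit below it.
    """
    return ffmpeg_threads * runners * utilization


def runners_for_full_load(ffmpeg_threads, utilization=0.5):
    """Runner count at which estimated busy threads reach the hardware limit."""
    return int(HARDWARE_THREADS / (ffmpeg_threads * utilization))


runners = runners_for_full_load(8)        # with $wgFFmpegThreads = 8
requested = nominal_threads(8, runners)   # threads requested on paper
busy = expected_busy_threads(8, runners)  # threads estimated to be busy
```

Under these assumptions, 10 runners at 8 threads each request 80 threads on paper but keep roughly 40 busy, which matches the hardware thread count.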

Due to the increased thread count and often longer per-thread running time, we have increased $wgTranscodeBackgroundMemoryLimit and $wgTranscodeBackgroundTimeLimit by a factor of 2 or so.

Risks and contingency plans
Going forward with VP9 on the current Debian 9.x ffmpeg package would have increased encoding time by 3-4x, which might have caused timeouts for very long videos (conference talks, full-length films). We prioritized a backport of libvpx 1.7 to mitigate this (it allows using more cores, which lets us stretch our legs on the 20-core/40-thread queue runners). Moritz completed the backport, and the TMH patch to allow using this mode once it's available has been merged.

The main remaining risk is that we'll uncover new bugs in the encoding process while running batch encodes, which could lead to clogging the queue runners.

The batch process for running the re-encodes is requeueTranscodes.php, which will run on mwmaint1001 in a tmux session and can be shut down by a root in case of emergency. Shutting down this process will prevent new encoding jobs from being enqueued for old uploads, but won't stop jobs for new uploads.

If something goes awry with the transcoder processes, problems should be isolated to the set of machines running job queue runners for the WebVideoTranscode / WebVideoTranscodePrioritized queues. They can be shut down, reset, or otherwise handled as necessary.

If a major compatibility problem is found that requires rolling back to VP8, contingency plan is roughly:
 * comment out the $wgEnabledTranscodes section to restore VP8 defaults
 * create new VP8 .webm files for newly-uploaded files with requeueTranscodes.php
 * leave the .vp9.webm files where they are or delete them with requeueTranscodes.php
 * clear out the job queue manually if necessary

Deployment notes
Monday
 * initial deployment of new config went out ok Monday, July 30 2018

Tuesday
 * found two files converted overnight that tried to resize to 0x0, which failed. Rerunning manually worked. Keep an eye out for this failure mode.
 * throttle on requeueTranscodes.php is broken because the production job queue doesn't report lengths back as expected (https://phabricator.wikimedia.org/T200813). Have a workaround ready to stage. In the meantime, a couple hundred transcodes are running. :)
 * handy grafana link for load monitoring!
 * throttle fix deployed
 * requeueTranscodes.php running on mwmaint1001!
 * (late night) encountered some failures (T200873), stopping batch process pending figuring out what was going on
 * errors appear to be related to hhvm config updates. continuing batch process, will clean up errors tomorrow.

Wednesday 2018-08-01
 * found a couple errors with the 'frames left over' bug; workaround is already in queue. scheduling to deploy later today