Wikimedia Platform Engineering/Site performance and architecture

Rationale
Many small architectural changes and improvements are made all the time without much fanfare. This page is the general activity area where we communicate changes of that kind.

April-June 2013

 * JobQueue improvements
 * Eqiad migration wrapup
 * Migrate fenari to tin.eqiad.wmnet
 * Migration to Ceph - still running sync scripts, possible split-brain issues with memcache
 * Migrate hume to terbium.eqiad.wmnet

Mysterious future
As yet unscheduled work for the (hopefully) near term:

Deployment sprint
We plan to put the items below in a deployment infrastructure sprint sometime between July and December 2013:

(All bugs are listed in a sortable BZ search.)


 * git-deploy
 * tracking
 * - auditing salt scripts for completeness
 * - deal with dirty git fetches properly
 * Questions:
 * Will Platform take over maintenance of git-deploy?
 * monitoring
 * - Better 500 error/PHP exception monitoring
 * - create monitoring for the issue in that bug
 * Questions:
 * Where should this be stashed in our monitoring?
 * deployment script improvements
 * - Make updates atomic (e.g. symlink + directory move tricks or git-deploy?)
 * - Some improvements for the deployment scripts
 * - Reconciling the use of timestamps on JavaScript files (rsync vs ResourceLoader vs git)
 * - resetUserTokens.php not usable on large wikis
 * Kill deployment hacks with fire - live hacks that are still applied as of 2013-05-16
 * multi-site awareness
 * - mwscript.php and mctest.php do not know about memcache in both datacenters
 * - migrate scripts from hume to terbium
 * Database config cleanup -- multisite awareness in MediaWiki
 * Beta related
 * - allowing extensions to be run from a branch other than master
 * Monitoring of betalabs?
 * It is on Ops to set that up, whether or not it actually alerts
 * Vagrant on Labs for quick dev environments? Probably Q4
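
For the "Make updates atomic" item above, the symlink trick can be sketched roughly as follows. This is only an illustration of the general technique on a POSIX filesystem, not the actual deployment scripts or git-deploy. The function name `switch_release` and the "current" link layout are hypothetical.

```python
import os


def switch_release(current_link: str, new_release_dir: str) -> None:
    """Point current_link at new_release_dir using a single rename.

    Hypothetical helper: the name and the directory layout are assumptions
    made for illustration only.
    """
    # Create the new symlink under a temporary name next to the real one,
    # so the rename below stays within one filesystem.
    tmp_link = f"{current_link}.tmp.{os.getpid()}"
    os.symlink(new_release_dir, tmp_link)
    try:
        # rename(2) replaces the old symlink itself in one step, so readers
        # see either the old target or the new one, never a missing link.
        os.replace(tmp_link, current_link)
    except OSError:
        # Clean up the temporary link if the swap failed.
        os.unlink(tmp_link)
        raise
```

A deploy would first copy the new code into its own directory and only then call `switch_release("/path/to/current", "/path/to/new-release")`, so a half-synced tree is never served. The paths here are placeholders.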

Performance sprint
(All bugs listed in a sortable BZ search.)
 * - resetUserTokens.php not usable on large wikis
 * - Rewrite jobs-loop.sh in a proper programming language
 * - Separate Cache-Control header for proxy and client
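
One common way to give proxies and clients different lifetimes, which may or may not be what the Cache-Control item above has in mind, is the standard HTTP `s-maxage` directive. It applies to shared caches only, while `max-age` applies to all caches unless `s-maxage` overrides it. A minimal sketch follows. The helper name `cache_control` is hypothetical and is not MediaWiki code.

```python
def cache_control(client_max_age: int, proxy_max_age: int) -> str:
    """Build a Cache-Control value with separate client and proxy lifetimes.

    Hypothetical helper for illustration. max-age is what browsers use;
    s-maxage overrides it for shared caches such as a caching proxy.
    """
    if client_max_age < 0 or proxy_max_age < 0:
        raise ValueError("cache lifetimes must be non-negative")
    return f"public, max-age={client_max_age}, s-maxage={proxy_max_age}"


# Example: let the proxy layer keep a page for five minutes while
# browsers revalidate on every request.
header_value = cache_control(client_max_age=0, proxy_max_age=300)
```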

Shell automation sprint
As yet completely unscheduled:
 * - Enable importing across all Wikimedia projects

Documents

 * Task management: Bugzilla
 * Release management plan:
 * Communications plan: