NOLA Hackathon 2011/Friday

Notes copied from the Etherpad for New Orleans 2011 hackathon, day 1 (Friday 14 October).

Non-MySQL support
Attending: Tim, DJ Bauch, Ben (MS), Brion, (?)

Per DJ, who's been maintaining the MS SQL Server support: non-MySQL backends are a lot more stable than they used to be! Continued work on PostgreSQL and SQLite has helped.

Per Brion -- that's good news! Running our unit tests and such on several other DB backends should help detect breaks...

Test box on 1.16 using SQL server: http://mwazure2.cloudapp.net/ times out a lot, be warned ;)

SQL server working pretty well in 1.16 -- need to check on 1.17 state

Azure cloud -- coord w/ Ben @ MS to set us up with a test SQL server account so we don't have to worry about maintaining a Windows Server box for running regression tests.

(Try to select a cloud location that's close to FL or VA to minimize latency from our DCs.)


 * MySQL -- production-quality
 * PostgreSQL -- usually mostly works, but not regularly tested by core devs
   * Not working in a fresh install of MW 1.17
 * SQLite -- used for various tests, but known to have limitations; updates mostly won't work?
   * Should be actively maintained. Ask MaxSem when he arrives
 * MSSQL -- DJ Bauch actively maintaining; known good in 1.16, need to check 1.17
 * Oracle -- Freakolowski actively maintaining; got disabled from installer in 1.17. Check status.
   * I think it was re-enabled for 1.18
   * yay!
 * Ibm_Db2 -- unknown status
   * RIP?

Swift media & stuff
Attending: Ben (WMF), Russ, Tim, Brion, hazmat (Kapil), Ben (MS), DJ, Salvatore, Kevin

(need TL;DR summary)

Background info:
 * Swift
 * Media server/FileBackend

1a) Short-term prep for using Swift for thumbnails

 * run ms5 as direct backing for Swift
 * modify purge code to know about Swift (transitional & required; should take a few days)
 * temp hook in File::purgeAllThumbs etc?
 * can deploy Swift as a dumb cache in front of thumbs

Owners:
 * Aaron: purge code
 * Ben: ops work
 * Need to ask Russ what he has here

1b) More thumb stuff...

 * run image scalers as direct backing for Swift
 * move more of 404 handler's logic to thumb.php or similar (simplifies front-end; should take a few days)

Owners:
 * Ben: taking ms5 out of the loop
 * Tim: 404 handler
 * Russ: ?

2) Code refactoring to prep for real deployment (should take a couple months?)

 * refactor to create FileBackend
 * meta backend to save files to multiple backends (e.g. both old NFS and Swift) (needed for safe migration from NFS to Swift)
 * refactor object-store parts of SwiftBackend to a base ObjectStoreBackend class
 * -> can reuse logic for Swift, Azure, S3, etc
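The "meta backend" idea above -- one write fanning out to both the old NFS store and Swift during migration -- can be sketched roughly like this. All class names here are hypothetical (the real refactor is PHP inside MediaWiki); this is just a Python illustration of the fan-out:

```python
# Hypothetical sketch of a meta backend: one store() call fans out to
# several real backends (e.g. legacy NFS plus Swift), so files land in
# both places during a migration window.

class FileBackend:
    """Minimal interface a concrete backend would implement."""
    def store(self, path, data):
        raise NotImplementedError

class MemoryBackend(FileBackend):
    """In-memory stand-in for the NFS and Swift backends in this sketch."""
    def __init__(self):
        self.files = {}
    def store(self, path, data):
        self.files[path] = data

class MultiWriteBackend(FileBackend):
    """Write to every child backend; any failure propagates loudly."""
    def __init__(self, backends):
        self.backends = backends
    def store(self, path, data):
        for backend in self.backends:
            backend.store(path, data)

nfs, swift = MemoryBackend(), MemoryBackend()
meta = MultiWriteBackend([nfs, swift])
meta.store("a/ab/Example.jpg", b"...bytes...")
```

The point of the wrapper is that calling code only ever sees one FileBackend, so cutting over from "NFS + Swift" to "Swift only" becomes a configuration change.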

Owners:
 * Tim, Russ, Aaron?
 * Ben Lobaugh for Azure port of the refactored bits?

3) Do the real deployment for original files

 * use of the new backend classes
 * migration copying etc

Owners:
 * TBD
 * Ben & ops test and prep the actual deploy

Use 404 handler to push thumb.php logic into swift?

Note: other cloud storage systems will have similar properties, but... Swift has explicit provision for some middleware to do the 404 handling logic? (or is that a top layer?)

Current swift code copies temp files to local filesystem for thumbnailing, looking up properties etc. Can we do without having to copy locally?

When using "simple" cloud storage (eg Azure or S3's object stores) that doesn't have logic to redirect on 404, have to push the thumbnails ahead of time... or else use a front-end like thumb.php.

Three thumbnail strategies:
 * Create ahead of time (default MW install):
   * create thumb file at reference time if not already present
   * output /image/thumb/blah.jpg links directly
 * Lazy creation (ugly URLs):
   * output /thumb.php?... links
   * thumb.php proxies or else creates the thumbnail on demand
 * Lazy creation (pretty URLs):
   * output /image/thumb/blah.jpg links directly
   * 404 handler on image web server calls back to thumb.php if needed, proxies the data
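The "pretty URLs" lazy-creation flow can be sketched like this. The URL layout is an assumption loosely modelled on upload.wikimedia.org thumb paths, and `render` stands in for whatever thumb.php actually does:

```python
import re

# Sketch of the pretty-URL 404 handler: decode the requested thumb URL
# into (file name, width), then ask a renderer (thumb.php in reality,
# a stub here) to produce the thumbnail on demand.
# The path pattern below is an assumption, not the real Wikimedia layout.

THUMB_RE = re.compile(r'^/thumb/./../(?P<name>[^/]+)/(?P<width>\d+)px-')

def parse_thumb_url(path):
    """Return (name, width) for a thumb path, or None if unrecognized."""
    m = THUMB_RE.match(path)
    if not m:
        return None
    return m.group('name'), int(m.group('width'))

def handle_404(path, render):
    """Called when the image server misses: render the thumb and proxy it."""
    parsed = parse_thumb_url(path)
    if parsed is None:
        return 404, b''
    name, width = parsed
    return 200, render(name, width)
```

This is also why the notes suggest moving the URL-decode logic into core: every front end (404 handler, thumb.php, Swift middleware) needs exactly this parsing step.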

Two main parts of the immediate tasks:

Easier part:
 * refactor SwiftMedia store to a more general ObjectStoreBackend
 * common code to handle "check this object out", then it goes to the existing MediaHandlers
 * basic operations (see Media server/FileBackend):
   * "write this blob"
   * "give me a (seekable) local file copy" -> needed to shell out etc
   * "read this blob as (unseekable) stream"
 * have Swift-specific code on top of that
   * makes it easy to add Azure, S3, etc support
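The split described above -- generic object-store logic in a base class, storage-specific blob I/O in subclasses -- might look roughly like this. Class and method names are illustrative only, not the real MediaWiki API:

```python
import io
import os
import tempfile

# Sketch of the proposed layering: ObjectStoreBackend implements the
# three basic operations from the notes on top of raw blob I/O, and a
# Swift/Azure/S3 subclass only has to supply put_blob/get_blob.

class ObjectStoreBackend:
    def put_blob(self, key, data):
        """'write this blob' -- storage-specific, implemented by subclass."""
        raise NotImplementedError

    def get_blob(self, key):
        """Fetch raw bytes -- storage-specific, implemented by subclass."""
        raise NotImplementedError

    def get_stream(self, key):
        """'read this blob as (unseekable) stream'."""
        return io.BytesIO(self.get_blob(key))

    def get_local_copy(self, key):
        """'give me a (seekable) local file copy' -- needed to shell out
        to scalers etc. Caller is responsible for deleting the temp file."""
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(self.get_blob(key))
        return path

class DictBackend(ObjectStoreBackend):
    """In-memory stand-in for a Swift/Azure/S3 container."""
    def __init__(self):
        self.blobs = {}
    def put_blob(self, key, data):
        self.blobs[key] = data
    def get_blob(self, key):
        return self.blobs[key]
```

With this shape, adding an Azure or S3 backend is two methods, while the "check this object out for the MediaHandlers" plumbing lives once in the base class -- which also bears on the question above about avoiding local copies: only callers that truly need a seekable file pay for one.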

Harder part:
 * refactor out code duplication between metadata & backend sections -> 'FileBackend'
 * reduce the amount of code that has to be done for the generic object store support

Additional tasks:
 * integrate the 404 handler logic (as on upload.wikimedia.org) into main wikimedia?

Migration story?
 * we mostly have plans for online migration, but need to flesh things out

For prioritization:
 * how important is it to switch over in the immediate future? May help us prioritize. Check with ops?
 * current servers get pretty bogged down for a few minutes if front-end squids go down
 * most of the disk i/o is thumbs on the core storage?
 * -> swift as front-end cache for thumbs may be an immediate aid

Deployment notes:
 * plan to deploy for thumbs now as a dumb cache
 * REQUIRED TASK: TEMP SCAFFOLDING CODE: write HTCP listener to detect cache purges and remove old versions from the Swift cache -- OR MediaWiki plugin to send Swift deletes directly at purge time (in File::purgeThumbList?)
 * set up replication between tampa & ashburn -- but DO NOT just make them one big cluster, this could end up with too much cross-traffic
 * find out what sort of replication lag there might be, if that causes limitations for failover switches etc
 * move the 404 handler's thumb URL decode logic into core -- makes the 404 handler logic in the swift cluster easier (look at old WebStore code -- has fallen a bit behind)
 * then do more general refactoring before doing a full migration
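The required scaffolding task above (sending Swift deletes at purge time, the non-HTCP option) amounts to mapping each purged thumb URL to its object name and issuing a DELETE. A minimal sketch, assuming a hypothetical URL layout and with `delete_object` standing in for a real Swift client call (e.g. python-swiftclient's `delete_object(container, obj)`):

```python
from urllib.parse import urlparse

# Sketch of purge-time cleanup: when MediaWiki purges thumbnail URLs
# (File::purgeThumbList territory), map each URL to its object name in
# the Swift thumbs container and delete it so the dumb cache can't
# serve stale renders. THUMB_PREFIX is an assumed layout.

THUMB_PREFIX = '/wikipedia/commons/thumb/'

def swift_object_name(url):
    """Turn a purged thumb URL into a Swift object name, or None."""
    path = urlparse(url).path
    if not path.startswith(THUMB_PREFIX):
        return None
    return path[len(THUMB_PREFIX):]

def purge_thumbs(urls, delete_object):
    """Delete every recognized thumb from the cache; return what we deleted."""
    deleted = []
    for url in urls:
        name = swift_object_name(url)
        if name is not None:
            delete_object('thumbs', name)  # (container, object)
            deleted.append(name)
    return deleted
```

The HTCP-listener alternative would do the same mapping, just driven by sniffed purge packets instead of a MediaWiki hook.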

Failure plan?
 * can we switch back?
 * from thumbs cache -- easy
 * from full original files going into swift -- harder!
 * have Swift code also save files into an NFS archive in the same old directory tree? (breaks some abstraction, but meh)
 * can we do this as a meta store class?
 * use a separate audit log (that we could reconstruct from)?
 * job queue to copy? (considered not stable enough)

Documentation needs

 * You NEED to enable caching on your MediaWiki installation, or ResourceLoader (RL) may make things a zillion times slower. -- says Chad
 * opcode caching (APC etc to avoid recompiling PHP over and over) or memcache-style caching? RL should be making active use of objectcache table (if not let's fix it), but every hit to load.php is slow without an opcode cache.

Slices of life
"Currently at NOLA Hackathon, Day 1 -- a group is learning/discussing/talking about git, another about SWIFT, and other folks are setting up dev envs (6:24pm)"

"Right now at the New Orleans hackathon: RoanKattouw_away talking with Salvatore about ResourceLoader 2 & Salvatore's AMICUS; MaxSem working on API sandbox; Mark Mims fixing juju bugs; Brion reviewing code; Kevin Brown getting his dev env up; Chad is "working on jenkins, still :)" (9:28pm)"