Talk:Reading/Multimedia/Architecture/Tech Debt Backlog

= Fall 2013 Backlog =
Discussion of issues for the Fall 2013 backlog grooming.

Goal: find 4-6 big hitter problems that will give User:BDavis_(WMF) a roadmap to work on.

2013-08-23 Grooming Meeting

 * User:Bawolff, User:BDavis_(WMF), User:MarkTraceur, User:Fabrice_Florin_(WMF)
 * Discussed high level issues
 * Helped Fabrice understand terms "Varnish" and "Squid" and caching in general
 * Discussed difference between a feature and tech debt
 * User:Bawolff gave his top item list including relative priority:
 * Chunked upload: session handling and more reliability in general
 * Upload class: make it more extensible (the chunked-upload case is very ugly)
 * Deadlocks and long-held locks in the FileBackend / File classes when moving/deleting/uploading files
 * Current image thumbnail URLs should include a timestamp for cleaner cache invalidation
 * Partial failures in FileBackend (if things fail, either nothing should happen or everything should happen; sometimes the page gets deleted but not the file, or vice versa)
 * Various Swift-layer improvements / thumbnail caching (e.g. caching thumbs at a different layer; storing content at an SHA-1-addressed location with pretty URLs exposed only to users)
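The all-or-nothing requirement in the partial-failures item can be sketched as a stage-and-rollback pattern. This is a toy illustration with hypothetical dict-backed stores, not the actual FileBackend API:

```python
def delete_page_and_file(page_store, file_store, title):
    """All-or-nothing delete: remove the description page and the
    file together, rolling back the page delete if the file delete
    fails, so we never end up with a page but no file or vice versa."""
    page_backup = page_store.pop(title)  # raises KeyError if absent
    try:
        file_backup = file_store.pop(title)
    except KeyError:
        page_store[title] = page_backup  # roll back: restore the page
        raise
    return page_backup, file_backup
```

A real fix would need the same stage/commit/rollback discipline across the database transaction and the storage backend, which is exactly where the current code reportedly leaves half-done states behind.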

Copied from Multimedia/Architecture
Copied list and discussion from prior home on Multimedia/Architecture.


 * API upload is all kinds of bad; everything under the /upload directory would be nice to rewrite.
 * though please be sure to test well with Upload Wizard and other API consumers; it's very easy to accidentally break edge cases (e.g. blacklist failures, duplicate checks, yadda yadda yadda) -- User:Eloquence
 * Backlink invalidation on Commons would be nice to fix (bug 22390)
 * Redirecting the old filename on file move would be nice (for hotlinkers like the WMF blog, and because the caches of Commons client wikis are not purged)
 * this is hard to implement as a general MediaWiki feature without changing to run files through PHP by default, but could be done with our special file backends... -- User:Brion_VIBBER
 * Our use case is probably the most important. I don't think third parties use file redirects that much. Patch https://gerrit.wikimedia.org/r/#/c/80135/ -- User:Bawolff
 * versioned urls for thumbnails
 * +1 (or any other solution to outdated thumbnails getting cached) -- User:Kaldari
 * large file support in general - deadlocks abound
 * The way FileRepo stores files is bad in general; it requires pessimistic locking (see Requests for comment/Refactor on File-FileRepo-MediaHandler)
 * Why chunked upload should be fixed:
 * Upload actions that go through the job queue should be more reliable.
 * Suggestion: long term, stop using session data for this (people log in, log out, etc.)
 * possible several pieces here: fix job queue to be sane, fix upload jobs to not (ab?)use session storage, and ? -- User:Brion_VIBBER
 * HTCP purges need active monitoring - DONE
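The "versioned urls for thumbnails" item (and the timestamp idea in Bawolff's list) boils down to embedding a version derived from the original's upload timestamp in every thumb URL, so a re-upload produces new URLs and stale cached thumbnails simply stop being referenced. A hypothetical sketch of the URL scheme, not MediaWiki's actual one:

```python
import hashlib

def thumb_url(base, title, width, upload_ts):
    """Build a thumbnail URL that embeds a short version token derived
    from the original's upload timestamp.  Re-uploading the file (new
    timestamp) changes every thumb URL, so no cache PURGE is needed."""
    version = hashlib.sha1(upload_ts.encode()).hexdigest()[:8]
    return f"{base}/thumb/{title}/{version}/{width}px-{title}"

# Same file re-uploaded later: the URL changes, old cached thumbs age out.
old = thumb_url("https://upload.example.org", "Cat.jpg", 220, "20130823120000")
new = thumb_url("https://upload.example.org", "Cat.jpg", 220, "20130901080000")
```

The trade-off is that every page embedding the image must be re-parsed to pick up the new URL, which ties back to the backlink-invalidation item above.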

Faidon via email, 2013-08-27
"An additional point I'd like to raise is technical debt surrounding the image scaler platform that I don't see mentioned there at all. The current way we do image scaling is crude and error-prone. See for example BZ #49118 but there are quite a few other short-comings arising from how the whole platform works (fire-and-forget launching of shell scripts). I think a designed-from-scratch service that spawns well-contained & well-monitored processes, failing gracefully and logging errors is long overdue but not unreasonably difficult to build, and I think it would fit well under the multimedia team's agenda."
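The "well-contained & well-monitored processes" Faidon describes, as opposed to fire-and-forget shell scripts, might look roughly like this. A minimal sketch assuming an ImageMagick-style `convert` binary; the function name and shape are illustrative, not an existing service:

```python
import logging
import subprocess

log = logging.getLogger("scaler")

def scale_image(src, dest, width, timeout=30):
    """Run one scaling job as a supervised child process: bounded
    runtime, captured stderr, and a logged, graceful failure instead
    of a silently abandoned shell script."""
    cmd = ["convert", src, "-resize", f"{width}x", dest]
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        log.error("scale timed out after %ss: %s", timeout, src)
        return False
    except FileNotFoundError:
        log.error("scaler binary not found: %s", cmd[0])
        return False
    if proc.returncode != 0:
        log.error("scale failed (%d): %s", proc.returncode,
                  proc.stderr.decode(errors="replace"))
        return False
    return True
```

Every failure mode here produces a log line and a clean boolean, which is the hook a metrics layer would need; the current limit.sh approach (see BZ #49118) gives the caller neither.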

IRC chat 2013-08-27
12:04	Aaron|home	I was thinking about what things are the most actionable
12:04	Aaron|home	bd808: bug 53400 might be an OK start
12:05	Aaron|home	basically, writeToDatabase should at least use onTransactionIdle and put the REPLACE in a callback/closure there
12:07	Aaron|home	bd808: in terms of RfC'ish stuff, the thumbnail coalescing isn't too bad of a place to dig into
12:07	Aaron|home	the amount of code change needed wouldn't be that huge, though some of it would be some small varnish module code
12:08	Aaron|home	bd808: the whole issue of large uploads bothers me because it relates to a bunch of problems that are hard to fix without rewriting everything (or horribly hacking around with job queue + persistent locks)
12:08	Aaron|home	we can make large uploads work better for the first stage of the pipeline (upload) though re-upload, move, delete, restore will still suck horribly
12:09	bd808	I'm not 100% sure, but I think roblaAWAY is open to major rewrite type projects
12:09	Aaron|home	that said, if videos tend to just be uploaded once and not changed, and it's badly wanted, it could be worth it I suppose
12:10	Aaron|home	well, there are different levels of "huge rewrites" ;)
12:11	Aaron|home	bd808: I think for someone new to MW, the thumbnail thing is a better place to get started rather than going down that rabbit whole just yet (which still scares me after all these years)
12:12	bd808	*listens to sage advice*
12:12	Aaron|home	of course, if the priority for the quarter was already decided, I guess you don't have much choice though ;)
12:12	paravoid	thumbnail coalescing?
12:13	Aaron|home	paravoid: whatever you call, fudging vcl_hash to group them for PURGEs
12:13	Aaron|home	maybe not using swift/ceph anymore for this and not having 7 copies of everything
12:13	paravoid	ah, that
12:13	paravoid	so, I was looking a bit at that in the past
12:13	bd808	I think the real priority at this point is "do things to make multimedia less sucky"
12:13	paravoid	remember the linear search issue?
12:14	Aaron|home	yes
12:14	bd808	but smoothing problems in the upload path seems to be a recurring theme
12:14	paravoid	so Tim was saying that he didn't expect this to be a huge problem, how many thumbs can a file have
12:14	paravoid	then you know what I pointed out?
12:14	paravoid	PDF and multi-page TIFFs
12:14	paravoid	1000-page PDF with 3-4 thumb sizes
12:15	paravoid	that's not uncommon at all
12:15	ori-l	heh
12:15	paravoid	there's a few wikis that use that a lot
12:15	Aaron|home	our djvu/pdf handling sucks too
12:15	Aaron|home	bd808: oh, wait, I told you that already
12:15	paravoid	arwikisource I think?
12:15	Aaron|home	like loading the whole text is metadata and slowing down category views
12:15	Aaron|home	only fixed the OOM aspect of that
12:15	ori-l	questions of the grammatical form "how many ___ could possibly ___..." are prayers to sauron
12:16	paravoid	but the solution could be handling pdf/tiff/djvu entirely differently
12:16	Aaron|home	paravoid: if nothing else, one could except by file extension and use the old system for those
12:16	paravoid	right
12:16	paravoid	heh
12:16	Aaron|home	and fix that crap later
12:16	paravoid	but yeah, this needs to be done with care
12:16	Aaron|home	we don't want to get caught in the spiderweb of having to redoing everything though, but breaks things into bits
12:17	paravoid	I'll leave that up to the people actually doing the work :)
12:17	paravoid	I'm merely pointing out the issue
12:17	Aaron|home	paravoid: sure
12:18	paravoid	but yeah, not having to store millions of tiny thumb files into media storage would be hugely appreciated
12:19	bd808	I'm of the naive opinion that "we" need to document the use cases and acceptance tests, evaluate current impl and design next-gen solution.
12:19	bd808	Then we need to figure out how to build that solution in smallish chunks
12:19	bd808	but I'm also talking out of my ass as to the specifics
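The "thumbnail coalescing" Aaron mentions (fudging vcl_hash to group thumbs for PURGEs) amounts to hashing on the original file's name rather than the full thumbnail URL, so one purge key covers every size, and every page of a PDF/TIFF, avoiding the linear-search problem paravoid raises. A rough Python sketch of the idea, not actual VCL:

```python
def purge_group(thumb_url):
    """Map every thumbnail URL of one original to a single cache
    purge key, so one PURGE invalidates all sizes (and all pages of
    a multi-page PDF/TIFF) instead of requiring a linear search."""
    # Assumed layout: .../thumb/<shard>/<shard>/<Original.ext>/<size>px-<name>
    parts = thumb_url.split("/thumb/", 1)[1].split("/")
    original = parts[2]  # e.g. "Cat.jpg"
    return f"purge:{original}"

a = purge_group("https://up.example.org/thumb/a/ab/Cat.jpg/220px-Cat.jpg")
b = purge_group("https://up.example.org/thumb/a/ab/Cat.jpg/640px-Cat.jpg")
```

In Varnish this grouping would live in vcl_hash (so variants share a hash family for purging) while full URLs still distinguish cached objects for lookup, which is the part that needs the "small varnish module code" Aaron refers to.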

Chunked Upload
This seems to show up on everyone's list. It looks to me like there are several sub-issues here:
 * Use of the PHP user session to store data is problematic for several reasons
 * There are open bugs related to file size: 36587, 51730
 * Job queue instability
 * Deadlocks
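One way around the session-storage sub-issue above is to key chunk state by a stable upload token in a persistent store, so a log-out or session expiry mid-upload does not lose chunks. A minimal sketch with a hypothetical in-memory class, not MediaWiki's UploadStash API:

```python
class ChunkedUpload:
    """Track chunks by byte offset under a stable key (not the PHP
    session); assemble only once the declared total size has arrived."""

    def __init__(self, key, total_size):
        self.key = key
        self.total_size = total_size
        self.chunks = {}  # offset -> bytes; survives session changes

    def add_chunk(self, offset, data):
        """Store one chunk; return True when the upload is complete."""
        self.chunks[offset] = data
        return self.received() == self.total_size

    def received(self):
        return sum(len(c) for c in self.chunks.values())

    def assemble(self):
        if self.received() != self.total_size:
            raise ValueError("upload incomplete")
        return b"".join(self.chunks[o] for o in sorted(self.chunks))

up = ChunkedUpload("token123", 10)
up.add_chunk(5, b"WORLD")          # chunks may arrive out of order
done = up.add_chunk(0, b"HELLO")   # now complete
```

In production the chunk map would live in durable storage and assembly would run as a job, which is where the job-queue and deadlock sub-issues come back in.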

I haven't found much documentation of how this component is intended to work. There is a small blurb on API:Upload and some other info at commons:Commons:Chunked_uploads.

I guess my question here is where to start? Should I do a deep dive with an initial goal of documenting the current implementation so it can be reviewed by a larger group for architectural flaws or should I just poke at the edges with small fixes and not worry about the big picture?

--BDavis (WMF) (talk) 18:13, 26 August 2013 (UTC)


 * documenting this is seeming like a pretty good place to start to me. --BDavis (WMF) (talk) 17:45, 28 August 2013 (UTC)

Domain Name
Constraining multimedia operations to the domain upload.wikimedia.org would be preferable, as it helps carriers participating in Wikipedia Zero distinguish low-bandwidth UI (.zero.wikipedia.org) access from normal/high-bandwidth UI (.m.wikipedia.org) access. upload.wikimedia.org requests originating from a .zero.wikipedia.org page are generally supposed to prompt the user before proceeding, whereas similar Wikipedia Zero requests originating from .m.wikipedia.org generally do not require such a prompt. Some users have higher-capability devices yet access zero.wikipedia.org because, the thinking goes, that domain is advertised in areas that tend to have lower bandwidth.

--ABaso(WMF) (talk) 11:33, 28 August 2013 (UTC)
 * I can't imagine we have any plans to move images away from upload.wikimedia.org. It's good to have it on a separate domain for same-origin-policy paranoia.


 * What about non-image multimedia such as video and audio? If there's a different domain name, it would be ideal if that domain were at least behind the same load balancer as upload.wikimedia.org. Alternatively, up-front establishment of the load-balancer IPs for such a forthcoming multimedia cluster would be helpful, as we hope to communicate a firm set of IP addresses for in-scope Wikipedia Zero-related servers. --ABaso(WMF) (talk) 20:16, 10 September 2013 (UTC)

SVG
Some notes on svg issues. I really don't think these are that high priority (relative to some of the other issues), although they would make commons folk happy if a bunch of the rsvg issues were fixed. Bawolff (talk) 17:57, 28 August 2013 (UTC)
 * svg lang not working: https://gerrit.wikimedia.org/r/#/c/69027/
 * svg rendering bugs:
 * rsvg upstream is not very active; I don't think anyone is actively working on it. If we want these fixed, we may be looking at sending patches upstream ourselves.
 * Most of the bugs in the Wikimedia/SVG Rendering component lack a minimal test case, are not filed upstream, and sometimes don't even have a concise description. Cleaning these up (filing them upstream with a minimal test case/steps to reproduce, and verifying they are still present in the latest rsvg) would probably be a good thing. That's more clerical work than dev work; maybe we could sucker some folks into doing it in some sort of public bug day. You don't need programming knowledge for that, but you do need knowledge of the SVG standard, which means many Commons contributors are qualified.

2013-09-11 Meetup
Took advantage of Tech Days time to talk about these issues with Aaron and Faidon.

Faidon reminded me that his issues with the image scaler pipeline are not represented in the top issues list:
 * no failure metrics
 * no concurrency locking/checking to stop duplicating work
 * potential DOS due to low number of parallel operations available
 * limit.sh lacks reporting/metrics on memory-cap failures, etc.
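The concurrency-locking item above is about ensuring that simultaneous requests for the same missing thumbnail render it once, roughly what MediaWiki's PoolCounter does for parses. A simplified, single-process sketch with hypothetical names:

```python
import threading

_locks = {}
_locks_guard = threading.Lock()
_results = {}

def render_once(key, render):
    """Serialize work per thumbnail key: the first caller runs
    render(); concurrent callers for the same key block on the lock
    and then reuse the cached result instead of duplicating the work."""
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key not in _results:
            _results[key] = render()
        return _results[key]
```

A production version would use a shared lock service across scaler hosts and expire cached results; the point here is only the lock-then-check-then-render ordering that prevents the duplicated work (and the DOS amplification) noted above.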

Aaron and Faidon both agreed that if SVG is on the list, then TIFF, PDF, and DjVu should be as well. In general, multi-page media types share a class of issues.

Thumbnail fixes are high on both of their lists, probably the number-one priority. Major changes may affect the datacenter build-out work in Faidon's queue; the last copy of the image data took a month.

--BDavis (WMF) (talk) 22:45, 11 September 2013 (UTC)