Fall 2013 Backlog
Discussion of issues for Fall 2013 backlog grooming.
Goal: find 4-6 big hitter problems that will give User:BDavis_(WMF) a roadmap to work on.
2013-08-23 Grooming Meeting
- User:Bawolff, User:BDavis_(WMF), User:MarkTraceur, User:Fabrice_Florin_(WMF)
- Discussed high level issues
- Helped Fabrice understand terms "Varnish" and "Squid" and caching in general
- Discussed difference between a feature and tech debt
- User:Bawolff gave his top item list including relative priority:
- Chunked upload: session handling and better reliability in general
- Upload class (more extensible. The chunked uploading case is very ugly)
- deadlocks and long held locks in filebackend / file class when moving/deleting/uploading file.
- current image thumbnail should have a timestamp in its URL for better cache clearing
- Partial failures in filebackend (If things fail, either nothing should happen or everything should happen. Sometimes pages get deleted but not the file, or vice versa)
- Various swift layer improvements/thumbnailing cache (making thumbs cached at a different layer, maybe. Storing content at a sha1-based location and showing pretty urls only to the user)
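As an illustration of the "timestamp in the thumbnail" item above, here is a minimal sketch. The URL layout, host name and function name are assumptions made up for this example and are not the current MediaWiki scheme; the point is only that a re-upload yields a different URL, so caches never serve the old thumbnail under the new one.

```python
from urllib.parse import quote


def versioned_thumb_url(base: str, file_name: str, width: int, timestamp: str) -> str:
    """Build a thumbnail URL that embeds the timestamp of the current file version.

    Hypothetical layout: <base>/thumb/<name>/<timestamp>/<width>px-<name>
    """
    safe_name = quote(file_name.replace(" ", "_"))
    return f"{base}/thumb/{safe_name}/{timestamp}/{width}px-{safe_name}"


# upload.example.org is a placeholder host, not a real endpoint.
old = versioned_thumb_url("https://upload.example.org", "Example image.jpg", 120, "20130823101500")
new = versioned_thumb_url("https://upload.example.org", "Example image.jpg", 120, "20130917090000")

# A re-upload changes the timestamp, so the thumbnail URL changes with it.
assert old != new
```

With a scheme like this, old URLs can be left to expire on their own instead of being purged one by one, at the cost of pages needing to be re-rendered to pick up the new URL.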
Copied from Multimedia/Architecture
Copied list and discussion from prior home on Multimedia/Architecture.
- API upload is all kinds of bad. Everything under the directory /upload would be nice to rewrite.
- though please be sure to test well with Upload Wizard and other API consumers, it's very easy to accidentally break edge cases (e.g. blacklist failures, duplicate checks, yadda yadda yadda) -- User:Eloquence
- backlink invalidation in commons would be nice to fix (22390)
- Redirect of old filename on file move would be nice. (for hotlinkers like the wmf blog, and due to lack of purging cache of commons client wikis) bug 35721
- this is hard to implement as a general MediaWiki feature without changing to run files through PHP by default, but could be done with our special file backends... -- User:Brion_VIBBER
- versioned urls for thumbnails
- +1 (or any other solution to outdated thumbnails getting cached) -- User:Kaldari
- large file support in general - deadlocks abound
- way FileRepo stores files is bad in general...requires pessimistic in general (see Requests for comment/Refactor on File-FileRepo-MediaHandler)
- Why chunked upload should be fixed: bug 36587
- Upload actions that go through the job queue should be more reliable.
- Suggest that long term stop using session data for this (people log in, log out, etc).
- possible several pieces here: fix job queue to be sane, fix upload jobs to not (ab?)use session storage, and ? -- User:Brion_VIBBER
HTCP purges need active monitoring - DONE
Faidon via email; 2013-08-27
An additional point I'd like to raise is technical debt surrounding the image scaler platform that I don't see mentioned there at all. The current way we do image scaling is crude and error-prone. See for example BZ #49118 but there are quite a few other short-comings arising from how the whole platform works (fire-and-forget launching of shell scripts). I think a designed-from-scratch service that spawns well-contained & well-monitored processes, failing gracefully and logging errors is long overdue but not unreasonably difficult to build, and I think it would fit well under the multimedia team's agenda.
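To make the contrast with fire-and-forget shell scripts concrete, here is a rough sketch of what "well-contained and well-monitored" could look like for a single scaling command. Everything in it is an assumption for illustration: the function name, the in-process counter standing in for real metrics, and the Python interpreter standing in for an actual scaler binary. Memory caps (what limit.sh does today) are not covered.

```python
import logging
import subprocess
import sys
from collections import Counter

log = logging.getLogger("scaler")
failure_counts = Counter()  # hypothetical stand-in for a real metrics backend


def run_contained(cmd, timeout_s=30.0):
    """Run one command with a time limit; count and log every kind of failure.

    Returns True on success and False on timeout, spawn error or non-zero exit.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        failure_counts["timeout"] += 1
        log.error("timed out after %ss: %s", timeout_s, cmd)
        return False
    except OSError as exc:
        failure_counts["spawn_error"] += 1
        log.error("could not start %s: %s", cmd, exc)
        return False
    if result.returncode != 0:
        failure_counts["nonzero_exit"] += 1
        log.error("%s exited with %d: %s", cmd, result.returncode,
                  result.stderr.decode(errors="replace"))
        return False
    return True


# Stand-in commands; a real service would invoke the scaler here.
ok = run_contained([sys.executable, "-c", "pass"])
bad = run_contained([sys.executable, "-c", "import sys; sys.exit(3)"])
```

The caller always gets a definite answer and every failure leaves both a log line and a counter increment, which is the part the current platform lacks.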
IRC chat 2013-08-27
12:04 Aaron|home I was thinking about what things are the most actionable
12:04 Aaron|home bd808: bug 53400 might be an OK start
12:05 Aaron|home basically, writeToDatabase() should at least use onTransactionIdle() and put the REPLACE in a callback/closure there
12:07 Aaron|home bd808: in terms of RfC'ish stuff, the thumbnail coalescing isn't too bad of a place to dig into
12:07 Aaron|home the amount of code change needed wouldn't be that huge, though some of it would be some small varnish module code
12:08 Aaron|home bd808: the whole issue of large uploads bothers me because it relates to a bunch of problems that are hard to fix without rewriting everything (or horribly hacking around with job queue + persistent locks)
12:08 Aaron|home we can make large uploads work better for the first stage of the pipeline (upload) though re-upload, move, delete, restore will still suck horribly
12:09 bd808 I'm not 100% sure, but I think roblaAWAY is open to major rewrite type projects
12:09 Aaron|home that said, if videos tend to just be uploaded once and not changed, and it's badly wanted, it could be worth it I suppose
12:10 Aaron|home well, there are different levels of "huge rewrites" ;)
12:11 Aaron|home bd808: I think for someone new to MW, the thumbnail thing is a better place to get started rather than going down that rabbit whole just yet (which still scares me after all these years)
12:12 bd808 *listens to sage advice*
12:12 Aaron|home of course, if the priority for the quarter was already decided, I guess you don't have much choice though ;)
12:12 paravoid thumbnail coalescing?
12:13 Aaron|home paravoid: whatever you call, fudging vcl_hash to group them for PURGEs
12:13 Aaron|home maybe not using swift/ceph anymore for this and not having 7 copies of everything
12:13 paravoid ah, that
12:13 paravoid so, I was looking a bit at that in the past
12:13 bd808 I think the real priority at this point is "do things to make multimedia less sucky"
12:13 paravoid remember the linear search issue?
12:14 Aaron|home yes
12:14 bd808 but smoothing problems in the upload path seems to be a recurring theme
12:14 paravoid so Tim was saying that he didn't expect this to be a huge problem, how many thumbs can a file have
12:14 paravoid then you know what I pointed out?
12:14 paravoid PDF and multi-page TIFFs
12:14 paravoid 1000-page PDF with 3-4 thumb sizes
12:15 paravoid that's not uncommon at all
12:15 ori-l heh
12:15 paravoid there's a few wikis that use that a lot
12:15 Aaron|home our djvu/pdf handling sucks too
12:15 Aaron|home bd808: oh, wait, I told you that already
12:15 paravoid arwikisource I think?
12:15 Aaron|home like loading the whole text is metadata and slowing down category views
12:15 Aaron|home only fixed the OOM aspect of that
12:15 ori-l questions of the grammatical form "how many ___ could possibly ___..." are prayers to sauron
12:16 paravoid but the solution could be handling pdf/tiff/djvu entirely differently
12:16 Aaron|home paravoid: if nothing else, one could except by file extension and use the old system for those
12:16 paravoid right
12:16 paravoid heh
12:16 Aaron|home and fix that crap later
12:16 paravoid but yeah, this needs to be done with care
12:16 Aaron|home we don't want to get caught in the spiderweb of having to redoing everything though, but breaks things into bits
12:17 paravoid I'll leave that up to the people actually doing the work :)
12:17 paravoid I'm merely pointing out the issue
12:17 Aaron|home paravoid: sure
12:18 paravoid but yeah, not having to store millions of tiny thumb files into media storage would be hugely appreciated
12:19 bd808 I'm of the naive opinion that "we" need to document the use cases and acceptance tests, evaluate current impl and design next-gen solution.
12:19 bd808 Then we need to figure out how to build that solution in smallish chunks
12:19 bd808 but I'm also talking out of my ass as to the specifics
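A toy model of the "group them for PURGEs" idea from the chat, written in Python rather than VCL. It assumes thumbnails live at /<repo>/thumb/<a>/<ab>/<name>/<params>-<name> and originals at /<repo>/<a>/<ab>/<name>; the class and function names are invented for this sketch. It does not model how a real cache finds one variant inside a group, which is where the lookup-cost concern for 1000-page PDFs would come in.

```python
from collections import defaultdict


def group_key(path: str) -> str:
    """Map an original or thumbnail path to the path of the original file."""
    parts = path.strip("/").split("/")
    if "thumb" in parts:
        i = parts.index("thumb")
        # Drop the "thumb" segment and the trailing thumbnail file name.
        parts = parts[:i] + parts[i + 1:-1]
    return "/" + "/".join(parts)


class GroupedCache:
    """Cache whose entries are grouped under the original file's path."""

    def __init__(self):
        self._groups = defaultdict(dict)

    def put(self, path, body):
        self._groups[group_key(path)][path] = body

    def get(self, path):
        return self._groups.get(group_key(path), {}).get(path)

    def purge(self, path):
        """Drop the original and all of its thumbnails; return how many were removed."""
        return len(self._groups.pop(group_key(path), {}))


cache = GroupedCache()
cache.put("/wikipedia/commons/a/ab/Foo.pdf", b"original")
cache.put("/wikipedia/commons/thumb/a/ab/Foo.pdf/page1-120px-Foo.pdf.jpg", b"t1")
cache.put("/wikipedia/commons/thumb/a/ab/Foo.pdf/page2-120px-Foo.pdf.jpg", b"t2")
removed = cache.purge("/wikipedia/commons/a/ab/Foo.pdf")
```

One purge of the original clears every thumbnail without anyone having to keep or enumerate a list of thumbnail names.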
This seems to show up on everyone's list. It looks to me like there are several sub-issues here:
- Use of php user session to store data is problematic for several reasons
- There are open bugs related to file size: bugzilla:36587, bugzilla:51730
- job queue instability
I guess my question here is where to start? Should I do a deep dive with an initial goal of documenting the current implementation so it can be reviewed by a larger group for architectural flaws or should I just poke at the edges with small fixes and not worry about the big picture?
- documenting this is seeming like a pretty good place to start to me. --BDavis (WMF) (talk) 17:45, 28 August 2013 (UTC)
Constraining multimedia operations to the domain upload.wikimedia.org would be preferable, as it helps carriers participating in Wikipedia Zero distinguish low-bandwidth UI (<lang>.zero.wikipedia.org) access from normal/high-bandwidth UI (<lang>.m.wikipedia.org) access. Attempts to access upload.wikimedia.org from <lang>.zero.wikipedia.org pages are in general supposed to prompt the user on whether to proceed, whereas similar Wikipedia Zero access attempts originating from <lang>.m.wikipedia.org pages don't, as a general rule, require such a prompt. Some users have higher-capability devices yet access zero.wikipedia.org because, the thinking goes, that domain is advertised in areas that tend to have lower bandwidth.
- I can't imagine we have any plans to move images away from upload.wikimedia.org. It's good to have it in a separate domain for same-origin policy paranoia.
- What about non-image multimedia such as video and audio? If there's a different domain name, it would be ideal if that domain name is on the same load balancer as upload.wikimedia.org, at least. Alternatively, up-front establishment of the load balancer IPs for such a forthcoming multimedia cluster would be helpful, as we hope to communicate a firm set of IP addresses for in-scope Wikipedia Zero-related servers. --ABaso(WMF) (talk) 20:16, 10 September 2013 (UTC)
Some notes on svg issues. I really don't think these are that high priority (relative to some of the other issues), although they would make commons folk happy if a bunch of the rsvg issues were fixed.
- svg lang not working: https://gerrit.wikimedia.org/r/#/c/69027/
- svg rendering bugs:
- rsvg upstream is not very active. I don't really think anyone is actively working on it. If we want to fix these, we may be looking at sending patches to upstream ourselves
- Most of the bugs in Wikimedia/SVG Rendering component lack a minimal test case, are not filed upstream, and sometimes don't even have a concise description. Cleaning these up, so that they are all filed upstream, have a minimal test case/steps to reproduce, are tested to make sure they are still present on latest rsvg, would probably be a good thing. (That's more clerical work than dev work. Maybe we could sucker some folks into doing that in some sort of public bug day. You don't need knowledge of programming to do that, however you do need knowledge of svg standard, which means many commons contributors are qualified).
Took advantage of Tech Days time to talk about these issues with Aaron and Faidon.
Faidon reminded me that his issues with the image scaler pipeline are not represented in the top issues list:
- no failure metrics
- no concurrency locking/checking to stop duplicating work
- potential DOS due to low number of parallel operations available
- limit.sh lacks reporting/metrics on memory cap failures, etc
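For the "no concurrency locking/checking to stop duplicating work" point in the list above, one possible shape is to let the first request for a thumbnail do the render while later requests for the same key wait for its result. The sketch below is in-process only and all names are hypothetical; a real deployment would need a lock shared across scaler hosts.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class RenderCoalescer:
    """Run at most one render per key at a time; concurrent callers share its result."""

    def __init__(self, render):
        self._render = render
        self._lock = threading.Lock()
        self._in_flight = {}

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            owner = future is None
            if owner:
                future = Future()
                self._in_flight[key] = future
        if owner:
            try:
                future.set_result(self._render(key))
            except Exception as exc:
                # Waiters see the same failure instead of hanging.
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        return future.result()


calls = []


def slow_render(key):
    calls.append(key)
    time.sleep(0.3)  # stand-in for an expensive scaling job
    return f"rendered:{key}"


coalescer = RenderCoalescer(slow_render)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(coalescer.get, ["Foo.svg/120px"] * 8))
```

Eight concurrent requests for the same thumbnail trigger a single render here, which also takes some pressure off the low number of parallel operations mentioned in the list.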
Aaron and Faidon both agreed that if SVG is on the list, tiff, pdf and djvu should be as well. In general, multi-page media types share a class of issues.
Thumbnail fixes are high on both of their lists, probably of number one importance. Major changes may affect datacenter buildout work that Faidon has in his work queue. The last image data copy took a month.
2013-09-17 Backlog List Archive
- Chunked upload love:
- Stop using session data
- More reliable job queue
- Fix open bugs or know why we can't
- Thumbnail changes:
- more robust monitoring and logging of failures in generation
- versioned URLs to help stop cache problems
- varnish cache changes so we don't need to keep list of names
- store generated assets differently to reduce replica clutter
- Improve large file operations:
- allow rename without copy
- reduce lock contention
- examine queued operations possibility for things we can't make faster
- Improve SVG rendering:
- Lots of SVG bugs in Bugzilla
- Support for multilingual SVGs
- Make sure rsvg and fonts are up to date
- Consider adding more fonts for rendering support (possibly including non-free fonts)
- UploadWizard improvements: