Talk:Reading/Multimedia/Cache Invalidation Misses

Chat Logs
9:15:10 AM bawolff: Were you planning to try and tackle some of the monitoring aspects of that bug, or the bug itself?
9:15:41 AM bd808: Good question. robla said monitoring was needed but I'm game for whatever
9:15:54 AM bawolff: The monitoring might be a better place to start
9:16:27 AM bd808: works for me.
9:17:03 AM bawolff: as bug 49362 proper is kind of in a "things don't always work, we don't know how often they don't work, and we don't know what to do about it" state
9:17:04 AM bawolff: Monitoring has a more clear path of what needs to be done
9:17:43 AM bawolff: So yeah, purging. I'm not sure how familiar you are with Wikimedia's caching infrastructure, but I'm going to guess not that familiar yet
9:18:04 AM bd808: You win on that. I'm a blank slate :)
9:18:43 AM bd808: I'm pretty good at reading code if you can point me to a good place to start unraveling
9:19:08 AM bd808: and give me a general idea of what you and robla are thinking about putting in place
9:19:51 AM bawolff: How thumbnails work on Wikimedia is basically: User requests at specific size -> Varnish server (layer 1) checks if it has that url cached -> layer 2 of varnish server checks if it is cached -> goes to image scaler, which checks if it is cached in swift (which is like looking to see if it exists locally) -> If not, a php script is executed to resize the thumb from the original image
9:21:00 AM bawolff: When something happens to the thumbnail to make it change, MediaWiki deletes the local copies of the thumbnail, and sends an HTCP purge request (which is a multicast udp packet) to the varnish servers
9:21:19 AM bawolff: This stage doesn't seem to always work
9:22:00 AM bawolff: and we really don't have any monitoring of it.
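The purge fan-out described above is fire-and-forget multicast UDP. Real HTCP (RFC 2756) is a binary protocol on its registered port 4827; the sketch below is a deliberately simplified stand-in with a plain-text payload and a made-up multicast group, just to illustrate why a dropped datagram silently loses the purge: `sendto` reports local success whether or not any cache ever receives it.

```python
import socket

# Hypothetical group address for illustration; 4827 is HTCP's registered port.
PURGE_GROUP = "239.0.0.1"
PURGE_PORT = 4827

def send_purge(url: str) -> int:
    """Send a purge notification for `url` as one multicast UDP datagram.

    Fire-and-forget: there is no ack and no retry, so a dropped packet
    means the caches never hear about the change. Returns bytes sent
    (a *local* success only)."""
    payload = url.encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Keep the datagram on the local machine for this illustration.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton("127.0.0.1"))
    try:
        return sock.sendto(payload, (PURGE_GROUP, PURGE_PORT))
    finally:
        sock.close()

sent = send_purge("https://upload.wikimedia.org/example/thumb.jpg")
```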
Part of the problem is probably packet loss, as udp is unreliable, but we don't have any monitoring for the entire system falling on its face, which has happened before
9:22:53 AM bawolff: The code on the mediawiki side for sending these purges is in includes/cache/SquidUpdate.php
9:23:22 AM bd808: ok. so some accounting of packets sent vs packets received to watch for patterns?
9:24:11 AM bawolff: That would be ideal probably. However that might be complicated by the fact that there are many servers that send these packets
9:24:42 AM bawolff: I was thinking it might just be better to get a list of files that have recently been modified, fetch them, and check to see if they've been updated
9:25:34 AM bawolff: I guess there's two types of monitoring one might want - general monitoring of packet loss vs monitoring of "does the entire system work at all"
9:26:16 AM bawolff: In the past we've had situations where the servers had a wrong acl, and rejected all purges, and nobody noticed until an angry mob of users started yelling at us a couple days later
9:26:56 AM bd808: nver fun
9:27:00 AM bd808: *never
9:27:26 AM bawolff: There's a bug somewhere with basically 80 comments of users being angry
9:28:49 AM bd808: so any number of head end boxen could receive a changed file and request a purge of cache. and I'm guessing any number of caches could miss the purge order?
9:29:05 AM bawolff: Other things in the past that have happened: some of the varnish servers are in europe. There's a server that acts as a relay, reading the multicast packets, transmitting them to europe over unicast, and then another server at the other end re-assembles them, which has been fragile in the past
9:29:09 AM bawolff: yeah basically
9:29:57 AM bd808: I suppose replacing the whole thing with 0mq topics would be way out of scope :)
9:30:34 AM bawolff: hey by all means if you want to replace it with something that doesn't suck...
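The "fetch recently modified files and check whether they've been updated" idea above boils down to a staleness test: if a cached copy's `Age` header says it entered the cache *before* the file was last modified, the purge that should have evicted it never landed. A minimal sketch of that decision (the function name and the `grace` window for purge propagation delay are assumptions, not existing code):

```python
from datetime import datetime, timedelta

def looks_stale(modified_at: datetime, fetched_at: datetime, age_seconds: int,
                grace: timedelta = timedelta(seconds=60)) -> bool:
    """Return True if the cached copy predates the file's last modification.

    `age_seconds` is the Age header of the cached response; subtracting it
    from the fetch time gives the moment the copy entered the cache. `grace`
    allows for normal purge propagation delay."""
    cached_since = fetched_at - timedelta(seconds=age_seconds)
    return cached_since + grace < modified_at

# A file modified 10 minutes ago whose cached thumb reports Age: 3600
# entered the cache ~50 minutes before the edit: the purge was missed.
now = datetime(2013, 7, 30, 12, 0, 0)
missed = looks_stale(now - timedelta(minutes=10), now, 3600)  # True
ok = looks_stale(now - timedelta(minutes=10), now, 120)       # False
```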
9:30:38 AM bawolff: but yes probably 9:31:18 AM bd808: There seems to be a core architectural issue here in that the current system is designed to be async and lossy but users want eventually consistent behavior 9:31:43 AM bawolff: The original system dates back quite a while 9:32:09 AM bd808: So you can patch around it here and there but as long as it's multicast udp with no watchdog there will be holes 9:32:32 AM bawolff: indeed 9:33:01 AM bd808: but having a better idea of how pervasive the issue is gives input to priority for a replacement vs workarounds. 9:33:13 AM bawolff: For the non-image use case, users are generally not bothered by a missed packet now and again, but generally for images its a much bigger deal 9:33:31 AM bawolff: yes 9:34:25 AM bawolff: Longer term, the idea that's been floated is to change our multimedia architecture so that image urls have a timestamp in them, and then new versions would just have a different url, and we wouldn't need to worry about purging old versions of the cache for images 9:34:49 AM bd808: That would be much slicker for cache invalidation 9:35:06 AM bawolff: very much so 9:35:23 AM bd808: 2 hard things in CS: naming, cache invalidation and fence post errors 9:36:09 AM bd808: so a monitoring script to start with? 9:36:24 AM bd808: where do tools like that generally live? 9:36:52 AM bd808: or do you make the work happen randomly when something else is requested? 9:37:50 AM bawolff: We have labs for general scripts (including things from the community). The ops folks also have various monitoring tools hooked into their own system that I don't know much about 9:38:46 AM bawolff: Even if its not easy to get specific packet loss monitoring/super accurate monitoring, I still think even rough monitoring of "Is this totally broken to the user" would be useful 9:40:01 AM bawolff: I had some thoughts at https://bugzilla.wikimedia.org/show_bug.cgi?id=43449#c15. 
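The timestamp-in-URL idea floated at 9:34:25 sidesteps invalidation entirely: a new file version gets a brand-new URL, so stale cached copies are simply never requested again. A sketch of what such URL construction could look like (the path layout here is invented for illustration, not the actual scheme):

```python
from datetime import datetime

def versioned_thumb_url(base: str, name: str, width: int,
                        uploaded: datetime) -> str:
    """Build a thumb URL that embeds the upload timestamp. Re-uploading a
    file changes the timestamp and therefore the URL, so no purge of the
    old URL is ever needed; old entries just age out of the cache."""
    ts = uploaded.strftime("%Y%m%d%H%M%S")
    return f"{base}/{ts}/{width}px-{name}"

url = versioned_thumb_url("https://upload.example.org/thumb",
                          "Heloderma_suspectum_02.jpg", 120,
                          datetime(2013, 7, 30, 9, 0, 0))
# url == "https://upload.example.org/thumb/20130730090000/120px-Heloderma_suspectum_02.jpg"
```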
Brandon somewhat disagrees with me on how useful that type of monitoring would be. You may also want to chat with him too
9:41:14 AM bawolff: On the most basic side, just going to https://commons.wikimedia.org/wiki/File:Heloderma_suspectum_02.jpg?action=purge, waiting a couple seconds, and checking the age header on https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Heloderma_suspectum_02.jpg/120px-Heloderma_suspectum_02.jpg
9:41:30 AM bawolff: would be a good first check to make sure things aren't totally broken
9:43:02 AM bd808: Does the `purge` verb require any special privileges or can it be thunked anonymously?
9:43:45 AM bawolff: purge (on the mediawiki side) requires you to be logged in to do it from a GET request
9:44:05 AM bawolff: if you're logged out, you need to go through a form
9:44:22 AM bawolff: We don't allow people to send PURGE http requests directly to the cache server
9:44:43 AM bd808: good. makes ddos a little harder
9:48:16 AM bd808: Is there any particular place that you guys use to write problem summaries/design notes? I'm thinking the first place to start here is making a wiki page explaining some of the issues and ideas. Then I can do some POC work on monitor pings to go with it
9:48:44 AM bawolff: Generally we use pages on MediaWiki.org
9:49:37 AM bawolff: Something like https://www.mediawiki.org/wiki/Multimedia/
9:50:24 AM bd808: Excellent.
9:51:07 AM bd808: I'll start a writeup and show it to you before I shop it around for other input. Sound like a reasonable start?
9:51:19 AM bawolff: Sure.
9:52:14 AM bd808: One last n00b question. Is there a page you'd consider a good example of working through a problem of this nature?
9:53:02 AM bawolff: Umm, not sure off the top of my head
9:53:37 AM bd808: no worries. I'll feel it out.
9:53:55 AM bd808: thanks for your help.
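The manual check at 9:41:14 (purge the file page, wait a couple of seconds, then look at the `Age` header on the thumb URL) automates naturally: a freshly re-cached thumb should report an `Age` no larger than the time waited, while a large `Age` means the old copy survived the purge. A sketch under those assumptions; the function names are invented, and `thumb_age` is shown but not exercised here since it needs the live site:

```python
from urllib.request import Request, urlopen

def thumb_age(url: str) -> int:
    """HEAD the thumb and return its Age header: seconds the response
    has been sitting in the cache (0 if absent, i.e. a fresh fill)."""
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return int(resp.headers.get("Age", "0"))

def purge_worked(age_after: int, wait: int = 5) -> bool:
    """After purging and waiting `wait` seconds, the cached copy should
    be at most `wait` seconds old; anything older dodged the purge."""
    return age_after <= wait

# e.g. Age: 86400 observed a few seconds after a purge => purge was missed
fresh = purge_worked(3)      # True
survived = purge_worked(86400)  # False
```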
I'm sure I'll be back to ask more dumb questions
9:54:01 AM bawolff: I'm sure there is one somewhere (or many such pages, but I can't seem to find any examples)
9:55:28 AM bawolff: The other thing I should mention that might be useful: we also give an incrementing id to the htcp packets. I think this was a past attempt at packet loss monitoring, but I don't know what happened to it
9:56:23 AM bawolff: but on the php side we still have https://git.wikimedia.org/blobdiff/mediawiki%2Fcore.git/c020b52748e462b973ef72861f413815e34cf647/includes%2Fcache%2FSquidUpdate.php
9:56:28 AM bd808: ah. which could give a cache an idea of how far it was behind (and catch up if there was a way to rewind in the stream)
9:58:36 AM bawolff: To be honest, I'm most familiar with the cache invalidation from the php side of things. The further things get toward the ops side, the less I know
9:59:18 AM bd808: Would Brandon be a good contact for that side of the stack?
9:59:26 AM bawolff: I think so
9:59:52 AM bd808: cool. I'll reach out to him soon too then
10:00:00 AM bawolff: He wrote the daemon that we currently use to turn the multicast purges into things varnish understands
10:00:45 AM bawolff: (Squid understood the htcp packets, which is why we used them in the first place, but then we moved to varnish, which doesn't support it, so we have a bridge thingy on each varnish server to turn the htcp packet into an HTTP purge)
10:01:21 AM bawolff: Anyways, that's in the operations/software/varnish/vhtcpd repo if it's useful --BDavis (WMF) (talk) 17:07, 30 July 2013 (UTC)
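The incrementing id mentioned at 9:55:28 is exactly what a receiver-side loss counter needs: track the last sequence number seen per sender and count gaps. A minimal sketch of that bookkeeping (class and field names are assumptions; wraparound and reordering are ignored for brevity):

```python
class PurgeStream:
    """Track the incrementing id carried in each purge packet from one
    sender and count gaps, giving a rough packet-loss figure."""

    def __init__(self) -> None:
        self.last_id = None   # last sequence number observed
        self.received = 0     # packets actually seen
        self.missed = 0       # packets implied by gaps in the ids

    def observe(self, seq: int) -> None:
        self.received += 1
        if self.last_id is not None and seq > self.last_id + 1:
            # Ids between last_id and seq were sent but never arrived.
            self.missed += seq - self.last_id - 1
        self.last_id = seq

stream = PurgeStream()
for seq in (1, 2, 3, 7, 8):   # ids 4-6 never arrived
    stream.observe(seq)
loss = stream.missed / (stream.missed + stream.received)  # 3 lost of 8 sent
```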