Talk:Reading/Multimedia/Cache Invalidation Misses

Private Chat between bawolff and bd808 on 2013-07-30
9:15:10 AM bawolff: Were you planning to try and tackle some of the monitoring aspects of that bug, or the bug itself?
9:15:41 AM bd808: Good question. robla said monitoring was needed but I'm game for whatever
9:15:54 AM bawolff: The monitoring might be a better place to start
9:16:27 AM bd808: works for me.
9:17:03 AM bawolff: as bug 49362 proper is kind of in a "things don't always work, we don't know how often they don't work, and we don't know what to do about it" state
9:17:04 AM bawolff: Monitoring has a more clear path of what needs to be done
9:17:43 AM bawolff: So yeah, purging. I'm not sure how familiar you are with Wikimedia's caching infrastructure, but I'm going to guess not that familiar yet
9:18:04 AM bd808: You win on that. I'm a blank slate :)
9:18:43 AM bd808: I'm pretty good at reading code if you can point me to a good place to start unraveling
9:19:08 AM bd808: and give me a general idea of what you and robla are thinking about putting in place
9:19:51 AM bawolff: How thumbnails work on Wikimedia is basically: User requests at specific size -> Varnish server (layer 1) checks if it has that url cached -> layer 2 of varnish server checks if it is cached -> goes to image scaler, which checks if it is cached in swift (which is like looking to see if it exists locally) -> If not a php script is executed to resize the thumb from the original image
9:21:00 AM bawolff: When something happens to the thumbnail to make it change, MediaWiki deletes the local copies of the thumbnail, and sends an HTCP purge request (which is a multicast udp packet) to the varnish servers
9:21:19 AM bawolff: This stage doesn't seem to always work
9:22:00 AM bawolff: and we really don't have any monitoring of it.
Part of the problem is probably packet loss, as udp is unreliable, but we don't have any monitoring for the entire system falling on its face, which has happened before
9:22:53 AM bawolff: The code on the mediawiki side for sending these purges is in includes/cache/SquidUpdate.php
9:23:22 AM bd808: ok. so some accounting of packets sent vs packets received to watch for patterns?
9:24:11 AM bawolff: That would be ideal probably. However that might be complicated by the fact that there are many servers that send these packets
9:24:42 AM bawolff: I was thinking it might just be better to get a list of files that have recently been modified, fetch them, and check to see if they've been updated
9:25:34 AM bawolff: I guess there's two types of monitoring one might want - general monitoring of packet loss vs monitoring of does the entire system work at all
9:26:16 AM bawolff: In the past we've had situations where the servers had a wrong acl, and rejected all purges, and nobody noticed until an angry mob of users started yelling at us a couple days later
9:26:56 AM bd808: nver fun
9:27:00 AM bd808: *never
9:27:26 AM bawolff: There's a bug somewhere with basically 80 comments of users being angry
9:28:49 AM bd808: so any number of head end boxen could receive a changed file and request a purge of cache. and I'm guessing any number of caches could miss the purge order?
9:29:05 AM bawolff: Other things in the past that have happened, is that some of the varnish servers are in europe. There's a server that acts as a relay, reading the multicast packets, transmitting them to europe over unicast, and then another server at the other end re-assembles them, which has been fragile in the past
9:29:09 AM bawolff: yeah basically
9:29:57 AM bd808: I suppose replacing the whole thing with 0mq topics would be way out of scope :)
9:30:34 AM bawolff: hey by all means if you want to replace it with something that doesn't suck...
9:30:38 AM bawolff: but yes probably
9:31:18 AM bd808: There seems to be a core architectural issue here in that the current system is designed to be async and lossy but users want eventually consistent behavior
9:31:43 AM bawolff: The original system dates back quite a while
9:32:09 AM bd808: So you can patch around it here and there but as long as it's multicast udp with no watchdog there will be holes
9:32:32 AM bawolff: indeed
9:33:01 AM bd808: but having a better idea of how pervasive the issue is gives input to priority for a replacement vs workarounds.
9:33:13 AM bawolff: For the non-image use case, users are generally not bothered by a missed packet now and again, but generally for images it's a much bigger deal
9:33:31 AM bawolff: yes
9:34:25 AM bawolff: Longer term, the idea that's been floated is to change our multimedia architecture so that image urls have a timestamp in them, and then new versions would just have a different url, and we wouldn't need to worry about purging old versions of the cache for images
9:34:49 AM bd808: That would be much slicker for cache invalidation
9:35:06 AM bawolff: very much so
9:35:23 AM bd808: 2 hard things in CS: naming, cache invalidation and fence post errors
9:36:09 AM bd808: so a monitoring script to start with?
9:36:24 AM bd808: where do tools like that generally live?
9:36:52 AM bd808: or do you make the work happen randomly when something else is requested?
9:37:50 AM bawolff: We have labs for general scripts (including things from the community). The ops folks also have various monitoring tools hooked into their own system that I don't know much about
9:38:46 AM bawolff: Even if it's not easy to get specific packet loss monitoring/super accurate monitoring, I still think even rough monitoring of "Is this totally broken to the user" would be useful
9:40:01 AM bawolff: I had some thoughts at https://bugzilla.wikimedia.org/show_bug.cgi?id=43449#c15.
Brandon somewhat disagrees with me on how useful that type of monitoring would be. You may also want to chat with him too
9:41:14 AM bawolff: On the most basic side, just going to https://commons.wikimedia.org/wiki/File:Heloderma_suspectum_02.jpg?action=purge, waiting a couple seconds, and checking the age header on https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Heloderma_suspectum_02.jpg/120px-Heloderma_suspectum_02.jpg
9:41:30 AM bawolff: would be a good first check to make sure things aren't totally broken
9:43:02 AM bd808: Does the `purge` verb require any special privileges or can it be thunked anonymously?
9:43:45 AM bawolff: purge (on the mediawiki side) requires you to be logged in to do it from a GET request
9:44:05 AM bawolff: if you're logged out, you need to go through a form
9:44:22 AM bawolff: We don't allow people to send PURGE http requests directly to the cache server
9:44:43 AM bd808: good. makes ddos a little harder
9:48:16 AM bd808: Is there any particular place that you guys use to write problem summaries/design notes? I'm thinking the first place to start here is making a wiki page explaining some of the issues and ideas. Then I can do some POC work on monitor pings to go with it
9:48:44 AM bawolff: Generally we use pages on MediaWiki.org
9:49:37 AM bawolff: Something like https://www.mediawiki.org/wiki/Multimedia/
9:50:24 AM bd808: Excellent.
9:51:07 AM bd808: I'll start a writeup and show it to you before I shop it around for other input. Sound like a reasonable start?
9:51:19 AM bawolff: Sure.
9:52:14 AM bd808: One last n00b question. Is there a page you'd consider a good example of working through a problem of this nature?
9:53:02 AM bawolff: Umm, not sure off the top of my head
9:53:37 AM bd808: no worries. I'll feel it out.
9:53:55 AM bd808: thanks for your help.
I'm sure I'll be back to ask more dumb questions
9:54:01 AM bawolff: I'm sure there is one somewhere (or many such pages, but I can't seem to find any examples)
9:55:28 AM bawolff: The other thing I should mention that might be useful, is we also give an incrementing id to the htcp packets. I think this was a past attempt at packet loss monitoring, but I don't know what happened to it
9:56:23 AM bawolff: but on the php side we still have https://git.wikimedia.org/blobdiff/mediawiki%2Fcore.git/c020b52748e462b973ef72861f413815e34cf647/includes%2Fcache%2FSquidUpdate.php
9:56:28 AM bd808: ah. which could give a cache an idea of how far it was behind (and catch up if there was a way to rewind in the stream)
9:58:36 AM bawolff: To be honest, I'm most familiar with the cache invalidation from the php side of things. The more on the ops side of stuff things get, the less I know
9:59:18 AM bd808: Would Brandon be a good contact for that side of the stack?
9:59:26 AM bawolff: I think so
9:59:52 AM bd808: cool. I'll reach out to him soon too then
10:00:00 AM bawolff: He wrote the daemon that we currently use to turn the multicast purges into things varnish understands
10:00:45 AM bawolff: (Squid understood the htcp packets, which is why we used them in the first place, but then we moved to varnish, which doesn't support it, so we have a bridge thingy on each varnish server to turn the htcp packet to an HTTP purge)
10:01:21 AM bawolff: Anyways, that's in the operations/software/varnish/vhtcpd repo if it's useful
--BDavis (WMF) (talk) 17:07, 30 July 2013 (UTC)
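The basic end-to-end check bawolff describes above (hit `action=purge` on a file page, wait a couple of seconds, then look at the Age header of a thumbnail) could be sketched as a small probe script. This is a hypothetical sketch, not the tool that was actually built; the URL is the example from the chat, and the staleness rule (an Age value larger than the seconds elapsed since the purge means the cached copy predates the purge) is my assumption.

```python
import urllib.request

# Example thumbnail URL from the chat above.
THUMB_URL = ("https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/"
             "Heloderma_suspectum_02.jpg/120px-Heloderma_suspectum_02.jpg")

def purge_was_missed(age_header, seconds_since_purge):
    """Assumed staleness rule: a Varnish object whose Age exceeds the time
    elapsed since the purge was cached *before* the purge, i.e. the HTCP
    purge never reached this cache. No Age header is treated as a fresh
    fetch (purge worked)."""
    if age_header is None:
        return False
    return int(age_header) > seconds_since_purge

def check_thumb(url, seconds_since_purge):
    """Fetch a thumb and report whether its cached copy predates the purge."""
    with urllib.request.urlopen(url) as resp:
        return purge_was_missed(resp.headers.get("Age"), seconds_since_purge)
```

A monitor built on this would call `check_thumb(THUMB_URL, ...)` shortly after triggering the purge; note the chat later raises a real objection to this approach (polling URLs disturbs the caches' LRU state).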

bd808, paravoid and bblack in #wikimedia-labs on 2013-08-01
[16:12:30] what are you looking for in graphite?
[16:12:34] out of curiosity mostly
[16:13:38] just poking around. I'm working on a small project to add some tracking of cache purges and will be pushing data there eventually
[16:13:48] https://www.mediawiki.org/wiki/Multimedia/Cache_Invalidation_Misses
[16:13:49] what kind of tracking?
[16:13:52] oh
[16:14:31] we want to get some visibility into purge packet loss
[16:14:42] we already have such means
[16:14:59] o_O
[16:15:06] it's kinda strange you're working on this, our team is also working on that
[16:15:18] the vhtcpd stats?
[16:15:21] yes

...

[16:16:29] we have also discussed pgm/0mq a bit
[16:16:41] this is not related to multimedia at all btw
[16:16:47] this is how we do purges everywhere
[16:16:59] Brandon seems to be against 0mq somewhat
[16:17:09] oh you've spoken to brandon already, that's good
[16:17:20] a little. one set of emails
[16:18:11] I'm starting with the idea of checking Age headers for URIs that the db has marked as overwritten
[16:18:49] all urls you mean?
[16:18:49] robla thinks this may let us be more proactive about finding things that are borked
[16:19:07] that messes with LRU, I don't think this is such a good idea
[16:19:44] I think for the short term checking vhtcpd stats is a good measure and for the longer term the solution would be to switch to a lossless transport
[16:19:47] hmmm that makes sense. would make things hot arbitrarily
[16:20:02] but I think bblack is working on all that?
[16:20:32] he has stats in the daemon. I don't think they are being graphed yet.
[16:20:42] which may be the best thing for me to add first
[16:20:58] I was under the impression he was planning for a ganglia plugin
[16:21:31] we should have a larger discussion. what mailing list is good for stuff like this?
[16:21:41] or a hangout or something
[16:21:55] well, to me this sounds like right within ops' realm
[16:22:00] so I'd say ops@
[16:22:10] I think robla and bawolff should be in on it
[16:22:27] otoh, you've filed this under mediawiki.org/Multimedia, so it's getting a little strange :)
[16:23:33] haha. well it landed there because I'm currently on loan to the MM team and they are concerned about bug 49362
[16:23:44] but yeah, varnish & vhtcpd are definitely ops
[16:24:02] ops@lists.wikimedia.org
[16:24:03] being n00b and a designated utility player will make lots of things I do weird

...

[16:28:45] bd808: it would be more correct to say that I don't think the 0mq+pgm solution is worth it
[16:29:00] bd808: I was actually the guy who came up with that idea in the first place, so ... :P
[16:29:11] bblack: noted. Not trying to put words in your mouth
[16:29:57] I'm glad I poked my head in here randomly. I think this is a great discussion
[16:30:43] (I'm glad too, usually #-operations is a better forum for such discussions though)
[16:31:11] as in, you have higher chances of getting meaningful answers/discussions :)
[16:31:54] bd808: the problem with the 0mq+pgm thing is, if you really want it to be reliable, architecting it becomes really complicated. You need some redundant hubs that handle publish/subscribe between the PHP sender nodes and the vhtcpd receivers, and then you can't lose pubs and subs either, or lose the retrans when one of them crashes
[16:32:17] it's not like we don't have a hub to make multicast->unicast right now though :)
[16:32:23] (for esams)
[16:32:24] from the 1000-ft view, I think you either end up poorly implementing that and adding to the failures, or implementing it really well but the complexity cost isn't worth the results

...

[16:33:33] bblack: I've never done it, but it's theoretically possible to stick RabbitMQ into a 0MQ network to act as a reliable relay
[16:33:46] I'd rather see us first monitor vhtcpd better (which is an ops task), detect whether there are any remaining failure modes that matter enough to care. I suspect we'll see a few instances of multicast relay failure or bursts of dropped multicast here or there, but that there will be ways to reduce those issues as they're debugged
[16:34:10] stick X into Y and wave magic wand and things are reliable never actually works out so easily in practice :)
[16:34:22] I think a reliable transport is something we should plan for the mid term though
[16:34:39] we have persistent caches, this means that if we need to reboot for a kernel upgrade or the motherboard fails or something
[16:34:46] we'll lose N minutes of purges
[16:35:07] but still serve out of cached content
[16:35:18] so this has worked so far but it's a bit suboptimal I think
[16:35:24] I still don't have a full grasp of the whole purging picture, but I suspect there may be other things we can hack around on that lessen the impact of missed purges in the first place.
[16:35:32] bblack: can I help the process by writing the ganglia plugin for your stats?
[16:36:25] if you really want to, feel free :)
[16:36:29] root@cp1040:~# cat /tmp/vhtcpd.stats
[16:36:30] start:1375268516 uptime:106410 inpkts_recvd:35459921 inpkts_sane:35459921 inpkts_enqueued:35459921 inpkts_dequeued:35459921 queue_overflows:0
[16:36:53] start is a unix gmt timestamp, uptime is how long the daemon has been alive, the file is overwritten every so often (30s I think?)
[16:37:12] we have similar monitoring of udp2log log streams
[16:38:11] inpkts_recvd is multicast that hit the daemon, _sane means it parsed correctly and wasn't garbage, _enqueued means it made it into the sending queue, and _dequeued means it was sent as TCP PURGE to all applicable varnish instances
[16:38:28] queue_overflows means the queue backlogged by such a huge volume of memory that the daemon gave up and wiped out all queued requests
[16:39:02] we also have nagios alerts for udp2log streams I think
[16:39:07] so you just have to track those counters over time into a graph, basically
[16:39:23] but these come with a seq number so it's easy to measure packet loss
[16:39:42] bblack: sounds easy in theory
[16:39:58] graphing is easy, detecting anomalies isn't :)
[16:40:15] I mean, sure, it's easy to measure "0", but how about 1-2% packet loss? :)
[16:40:19] some easy/obvious triggers would be any of the rates being zero for a few samples in a row
[16:40:28] or non-zero queue_overflow numbers
[16:40:44] I've never really understood why more people don't use the predictive stuff in rrdtool...
[16:41:39] graphite has holt-winters and some other stuff that might be useful
[16:41:56] the hard part is figuring out the expected rates in all this
[16:41:56] because that's for detecting general trends, not precise measures such as 1-2% packet loss
[16:42:13] I wonder if we could instrument the outbound side in mw?
[16:43:45] re the LRU cache tampering inherent in polling for Age headers: is there an HTTP verb that would give us data without corrupting the state?
[16:44:34] you could engineer that within varnish, but honestly I think it's silly to hit purged URLs to see if the purge succeeded
[16:45:33] I think bawolff's idea here is to do statistical sampling to get an idea about that possible 1-2% loss rate
[16:45:55] and some idea if it's constant or peaks with other events
[16:46:12] it's probably a lot smaller than 1% too
[16:46:27] I think it's an entirely wrong layer to approach this
[16:46:39] if we have 1-2% packet loss, it's either network packet loss or some issue with vhtcpd
[16:46:40] it's just one of those things in an eventually consistent system that makes content authors crazy
[16:46:47] vhtcpd has detailed stats so we can detect this there
[16:46:57] network packet loss... well, we can easily measure that can't we :)
[16:47:55] without ever checking http
[16:48:46] well, probably the most interesting and ripe area for loss would be in the sending and receiving machine's kernel-level network buffers
[16:49:17] on the receiving end, vhtcpd is pretty damned efficient at keeping the queue empty and setting the queue size to "super huge", but do the sending machines ever drop on outbound queues?
[16:49:55] the amount of htcp on the sending side is much lower than on receiving side though :)
[16:50:03] 1/N where N is the amount of appservers
[16:50:14] and appservers are much less busy in general
[16:50:15] linux kernel has (had?) some udp rate issues. Facebook hit them when they switched to UDP for memcache
[16:50:32] they're worse machines but not much worse
[16:50:40] ok
[16:50:50] well, not _that_ worse
[16:51:12] noted. worse, but not _that_ worse, but not much worse :)
[16:51:27] lol :)
[16:52:11] that's my impression anyway, have a look, maybe you'll see something that I won't
[16:52:11] I think the long term solution for images is to change the thumb url structure to self version.
That won't fix anon content cache staleness but that's less likely to be noticed by editors
[16:52:32] I've heard this idea before
[16:52:35] I don't like it much tbh
[16:52:48] bd808: stepping out a few layers again, I think some of the reaction to this problem is still grounded in the rate of problems reported in the past. we knew the major fault was the relay daemon, and we replaced it with something better. Need more data now to determine exactly how much of a problem still exists, to know what kinds of solutions are "worth it"
[16:53:19] I don't like creating second class citizens wrt caching
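The `/tmp/vhtcpd.stats` line pasted earlier is plain `key:value` pairs, so the ganglia-plugin work bd808 volunteers for mostly amounts to parsing it and flagging the obvious failure signals bblack names (rates stuck at zero, or non-zero `queue_overflows`). A minimal sketch, with function names of my own invention rather than anything from the eventual plugin:

```python
def parse_vhtcpd_stats(text):
    """Parse one /tmp/vhtcpd.stats line of 'key:value' pairs into ints."""
    return {k: int(v) for k, v in (field.split(":") for field in text.split())}

def stats_look_healthy(stats):
    """Only the crude signals mentioned in the chat: no queue overflows,
    and everything received was eventually sent on to varnish. Counters
    are cumulative since daemon start, so a real plugin would diff
    successive samples rather than compare absolutes like this."""
    return (stats["queue_overflows"] == 0
            and stats["inpkts_recvd"] == stats["inpkts_dequeued"])

# The sample from cp1040 quoted in the chat.
SAMPLE_STATS = ("start:1375268516 uptime:106410 inpkts_recvd:35459921 "
                "inpkts_sane:35459921 inpkts_enqueued:35459921 "
                "inpkts_dequeued:35459921 queue_overflows:0")
```

Since the file is rewritten every ~30s, a plugin polling on that cadence and graphing per-interval deltas of each counter would give the rate graphs discussed below.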

...

[16:53:59] bblack: agreed. we need to know what's broke and how often before we can reliably evaluate a fix
[16:54:06] bblack: do you know if there are indications when the kernel drops udp?

...

[16:54:14] /proc/sockstat or something similar?
[16:54:30] bblack: otherwise we are just poking things and hoping something different happens
[16:54:41] /proc/net/sockstat even :)
[16:55:31] well /proc/net/snmp has
[16:55:31] Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
[16:55:31] Udp: 1104530187 4270 876485 117501449 123041 0
[16:55:46] RcvbufErrors == overflow?
[16:56:53] /proc/net/udp has a "drops" field too
[16:56:55] but without a way to reset those, it's tricky. those stats go back to reboot, and we've done a lot of restarting things and fixing things since then, undoubtedly some of that had the daemon not picking up packets for brief intervals, etc
[16:57:12] well, rate is what we'll measure, won't it?
[16:57:19] ganglia maybe
[16:57:33] we monitor drops for udp2log procs
[16:57:55] good point
[16:57:59] 15021: 00000000:12DB 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 2821439826 2 ffff8807f95f0e00 0
[16:58:10] ottomata: :)
[16:58:11] :12DB is the multicast listen port, no drops on this particular machine
[16:58:29] ottomata: I pointed out the similarity of these problems with udp2log monitoring a bit earlier
[16:58:41] aye ja, saw that,
[16:58:58] i think the udp2log /proc/net/udp ganglia plugin right now should be genericized
[16:59:03] perfect
[16:59:08] it is written to aggregate multiple udp2log instances
[16:59:19] i think it should be pluggable to just give the port you want to monitor
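The two `Udp:` lines pasted from /proc/net/snmp are a header row and a value row, so pairing them up gives named counters; `RcvbufErrors` is the receive-buffer overflow count being asked about. A rough sketch of reading them, assuming the field layout shown in the paste (these counters are cumulative since boot, so as bblack notes only their rate over time is meaningful):

```python
def parse_snmp_udp(snmp_text):
    """Pair the 'Udp:' header row with its value row from /proc/net/snmp."""
    udp_rows = [line.split()[1:] for line in snmp_text.splitlines()
                if line.startswith("Udp:")]
    names, values = udp_rows[0], udp_rows[1]
    return dict(zip(names, (int(v) for v in values)))

# The sample paravoid pasted in the chat.
SAMPLE_SNMP = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1104530187 4270 876485 117501449 123041 0")
```

A monitor would read `/proc/net/snmp` on an interval, diff `RcvbufErrors` (and `SndbufErrors` on the sending appservers) between samples, and graph the delta as a drop rate.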

...

[17:00:27] so monitoring /proc/net/udp on both senders & receivers, vhtcpd stats and network loss should be enough
[17:00:37] bblack: you were saying about overengineering? :)
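Genericizing the udp2log plugin the way ottomata suggests boils down to: scan /proc/net/udp for sockets bound to a given local port (the address column is hex, which is why port 4827 appears as `:12DB` in the row pasted above) and read the last column, the per-socket drop counter. A sketch under those assumptions, not the actual plugin code:

```python
def udp_drops_for_port(proc_net_udp_text, port):
    """Sum the kernel 'drops' counter for all sockets bound to `port`.

    /proc/net/udp rows look like the cp1040 sample below: the second
    field is hex local ip:port, the final field is cumulative drops.
    """
    total = 0
    for line in proc_net_udp_text.splitlines():
        fields = line.split()
        if len(fields) < 13 or ":" not in fields[1]:
            continue  # header row or malformed line
        local_port = int(fields[1].split(":")[1], 16)
        if local_port == port:
            total += int(fields[-1])
    return total

# Header plus the sample row from the chat (0x12DB == 4827, no drops).
SAMPLE_PROC_NET_UDP = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when "
    "retrnsmt   uid  timeout inode ref pointer drops\n"
    "15021: 00000000:12DB 00000000:0000 07 00000000:00000000 00:00000000 "
    "00000000     0        0 2821439826 2 ffff8807f95f0e00 0")
```

As with the /proc/net/snmp counters, the drops value only resets at reboot, so the plugin would report per-interval deltas.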

...

[17:03:19] 	 paravoid's plan would be great for the packets lost on the wire aspect

...

[17:04:05] bd808: I should mention, there are others that will have an opinion about this and much more experience with past failures than I, you should definitely raise it on the list
[17:04:11] mark & asher for sure
[17:04:18] either ops@ or wikitech@
[17:04:27] bawolff reminded me in side chat that there are other things we want to get a handle on. Apparently there have been php issues in the past that kept packets from being sent at all
[17:05:37] the php side borks may have been the root of the idea to poll Age headers
[17:05:40] (wikitech@ being completely public)
[17:06:25] Age has multiple other issues too
[17:06:33] our caching is tiered for one
[17:08:08] true, but the thumbs bawolff is interested in are overwrites that should have gone all the way back to the bottom
[17:09:17] great discussion. I'll add stuff to the wiki and try to get a discussion going on ops@ (as soon as I'm moderated in)
--BDavis (WMF) (talk) 17:58, 1 August 2013 (UTC)