User:ATDT/Notes

Chat about Tagging
Bits of this should be cannibalized for the documentation.

11:44 You have joined the channel
11:44 ori-l has joined (~ori-l@wikipedia/ori-livneh)
11:44 lindbohm.freenode.net has changed mode: +ns
11:44 ori-l has changed mode: +t
11:45 halfak has joined (halfak@tako.cs.umn.edu)
11:45 DarTar has joined (~DarTar@wikimedia/DarTar)
11:45 (halfak) Howdy
11:45 (ori-l) hey
11:45 (DarTar) ahoy
11:45 (DarTar) so to wrap up the general intro,
11:45 (halfak) Can you remind me where I find the docs for our use of redis?
11:45 (halfak) Oh sure.
11:45 (DarTar) 1) we're deprecating clicktracking
11:46 (DarTar) and replacing it with a new event logger
11:46 (DarTar) http://www.mediawiki.org/wiki/Event_logging
11:46 (halfak) When you say "clicktracking", you mean the current method of tracking such things, right?
11:46 (DarTar) loosely documented there
11:46 (DarTar) I actually mean Clicktracking with a capital C
11:46 (DarTar) aka the Mediawiki extension
11:46 (halfak) Gotcha
11:46 (DarTar) the main cause of pain over the last 2 years of work at WMF
11:47 (halfak) I have some contenders for that :P
11:47 (DarTar) so, we're discontinuing it and using a new logger instead that allows us to send arbitrary key,value arrays from the client
11:47 (DarTar) ha ha
11:47 (ori-l) it's broader in scope, tho
11:47 (DarTar) so there will be no more logs on emery
11:47 (ori-l) so 'event logging' might be a more useful way of thinking about it. it's an end-to-end data pipeline
11:47 (halfak) Awesome.
11:47 (DarTar) but data being stored in real time in redis on a machine called vanadium
11:48 (DarTar) which is all ours rooarrrr
11:48 (DarTar) we can already access redis on vanadium via a client installed on stat1
11:48 (DarTar) (whether or not we will perform data analysis on vanadium itself is a question we need to figure out)
11:48 (DarTar) for the time being it's ok to pull data from there
11:49 (DarTar) and even write into it!
11:49 (DarTar) ori-l: wanna follow up on this?
11:49 (ori-l) sure, i'll jump in
11:49 (ori-l) so first a 2-minute intro to redis
11:49 (halfak) Excellent
11:49 (ori-l) redis is, nominally at least, a key-value store, so you can think of it as a networked python dictionary or hash map
11:49 (ori-l) have you used memcached?
11:50 (halfak) Just a bit. I get the notion.
11:50 (ori-l) redis extends the memcached concept by providing richer data types than memcached, which treats every value as a string
11:50 (halfak) How complicated can "keys" be?
11:50 (DarTar) add this page to your bookmarks: http://www.mediawiki.org/wiki/Redis
11:50 (ori-l) you get sets, sorted sets, hashes, lists, etc.
11:50 (halfak) json-like?
11:50 (halfak) or json proper?
11:51 (halfak) oh wait... sets != json. :)
11:51 (ori-l) json is a data *serialization* format
11:51 (halfak) I <3 sets
11:51 (ori-l) hashes map onto json very neatly
11:51 (halfak) Meh. JSON also defines datatypes.
11:51 (ori-l) erm, okay -- let's set that aside for a moment
11:51 (halfak) Sure
11:52 (ori-l) the data structures that you get, plus built-in capacity for lua scripting and pub/sub
11:52 (ori-l) makes redis excellent in two key respects
11:52 (ori-l) one, it's close to the metal, so you get really awesome throughput
11:52 (ori-l) two, the data types it gives you are in many respects the same ones you'll see in an algorithms textbook
11:52 (ori-l) so it gives you the tools you need to model a pretty broad range of problems
11:53 (DarTar) three, the data structures and commands are very intuitive and python-ish!
11:53 (ori-l) :)
11:53 (ori-l) the major downside, if you compare redis to a typical relational database like mysql
11:54 (ori-l) is that it places a lot more responsibility on you to organize the data and provide means of discovery / introspection / pruning / etc.
11:54 (ori-l) so, here's how we're using it -- i'll start with a general overview and then dive into tagging specifically
11:55 (ori-l) how is all this so far -- comprehensible? any pressing questions?
11:55 (halfak) No questions. Very straightforward.
11:56 (ori-l) cool cool
11:56 (halfak) I'm reading some docs while you type :)
11:56 (ori-l) (aside: the redis prompt in the docs is interactive!)
11:56 (ori-l) okay, so data is going to come in to redis via two routes:
11:57 (ori-l) one, generated server-side by mediawiki and sent via udp to vanadium. this doesn't scale, so we're going to restrict it to a well-defined set of events that are generally valuable
11:57 (ori-l) the set currently includes article creation, article edit, and account creation, each with their associated metadata
11:58 (ori-l) but it's not finalized, so we may add a couple of additional events. but the basic thing to keep in mind is that this isn't a public API for any PHP code to log stuff at will.
11:58 (halfak) can you explain how mediawiki->udp->redis doesn't scale?
11:58 (ori-l) basically, the udp logging infrastructure is moribund
11:59 (ori-l) (opinions differ (sharply) as to when it'll actually be retired)
11:59 (halfak) Why log something like an article edit when MySQL already handles this?
11:59 (ori-l) excellent question, but i'll defer answering it for a minute or so if that's ok
12:00 (halfak) sure
12:00 (ori-l) one way it doesn't scale is that udp datagrams exceeding a certain length -- which is network hardware specific -- just get dropped silently
12:01 (DarTar) halfak: that's a "data policy" I'm also very interested in, we can chat about this separately from this intro
12:01 (ori-l) so to have valid data we need to have certain guarantees about the maximum length of various fields
12:01 (halfak) Ahh... yes.
12:01 (ori-l) otherwise we'll just miss some stuff without knowing it
12:01 (ori-l) which is the sort of stuff that keeps you up at night :P
12:02 (ori-l) okay, so data conduit #2 is the public api
12:02 (ori-l) this is a special url endpoint on bits.wikimedia.org, the host that is currently serving static assets on wmf wikis
12:02 (ori-l) basically, any requests to /event.gif followed by an arbitrary query string get logged in a special way
12:03 (ori-l) http://bits.wikimedia.org/event.gif?key=val&foo=bar
12:03 (halfak) Interesting. We are limited by the length of a query string in a similar way.
12:03 (ori-l) yes, 2000 characters. but we wrap the operation with a guard function that throws a JS error if it is exceeded
12:05 (DarTar) actually, size limitations for the parameters is a separate, important issue, but I'd rather have you guys focus on the general architecture and we'll review this later
12:05 (ori-l) we may also decide to constrain the input for the logger function in other ways, and impose a stricter data model.. but there's just a lot of stuff that just needs to be operationalized before that becomes pressing
12:06 (ori-l) currently the only requirement is that there be an event_id=some_foo_bar key/val
12:07 (ori-l) each incoming event is logged into redis in the following way:
12:07 (ori-l) first, a UUID is generated for that specific event
12:08 (ori-l) we look up the sorted set that is keyed to the event_id, and push that UUID, with a "score" that is the event's timestamp
12:09 (ori-l) next, we create a redis hash, keyed to the UUID
12:09 (ori-l) and simply deserialize the query string into hash keys and values
12:09 (halfak) (Non-important note: This is an interesting definition of "set".)
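
The storage steps ori-l describes above (generate a UUID, add it to the sorted set for the event_id with the timestamp as score, write the query-string fields into a hash keyed by the UUID) can be sketched in plain Python. This is only an in-memory model under stated assumptions, not the logger's actual code: `log_event`, `events_between`, `sorted_sets`, and `hashes` are hypothetical names. Against a real Redis server the corresponding commands would be along the lines of ZADD, HSET, and ZRANGEBYSCORE.

```python
import time
import uuid
from urllib.parse import parse_qsl

# In-memory stand-ins for the Redis structures (hypothetical names).
sorted_sets = {}  # event_id -> list of (timestamp, uuid), kept in score order
hashes = {}       # uuid -> dict of the event's key/value fields


def log_event(query_string, timestamp=None):
    """Store one event as described in the chat and return its UUID."""
    fields = dict(parse_qsl(query_string))
    # The only stated requirement: an event_id key must be present.
    if "event_id" not in fields:
        raise ValueError("event_id is required")
    if timestamp is None:
        timestamp = time.time()

    # Step 1: generate a UUID for this specific event.
    event_uuid = str(uuid.uuid4())

    # Step 2: add the UUID to the sorted set keyed by event_id,
    # using the event's timestamp as the score.
    members = sorted_sets.setdefault(fields["event_id"], [])
    members.append((timestamp, event_uuid))
    members.sort()

    # Step 3: create a hash keyed by the UUID from the query-string fields.
    hashes[event_uuid] = fields
    return event_uuid


def events_between(event_id, start, end):
    """Return UUIDs of event_id events with start <= timestamp <= end."""
    return [u for ts, u in sorted_sets.get(event_id, []) if start <= ts <= end]
```

Keeping the timestamp as the score is what makes a time-window query such as "the last 5 days of 'click' events" a range lookup on the sorted set, followed by one hash read per UUID.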
12:09 (halfak) define: hash
12:09 (ori-l) nested dictionary
12:10 (halfak) something like: get "dictionary" "key"
12:10 (ori-l) exactly
12:10 (halfak) OK
12:10 (ori-l) so if redis = {}
12:11 (ori-l) redis['UUID-1234-4567'] = { 'event_id': 'click', 'user_token': 'ABCD', 'some_property': 'some_value' }
12:11 (halfak) understood
12:12 (halfak) get UUID-1234-4567 event_id --> "click"
12:12 (ori-l) right. you can also get all keys and values
12:12 (ori-l) finally, we also "publish" the event in a "channel" that has the name of the event_id. what this means is
12:13 (ori-l) you can connect to redis and say "tell me whenever an event 'foo' comes in"
12:13 (ori-l) you are then subscribed to "foo" events
12:13 (ori-l) if the event_id on an incoming event matches your subscription, you get notified
12:13 (halfak) This is familiar to me from working with xmpp servers.
12:13 (ori-l) ah awesome! okay, yeah, if you know xmpp architecture this should be trivial
12:14 (ori-l) the pub/sub functionality allows us to use redis as simply a piece of middleware rather than the final end-point for certain data sets
12:14 (ori-l) so it's trivially simple (and i'll have sample code for this) to write a small python adapter that listens to event_ids foo, bar, and fizz, and inserts them into MyFavoriteDataStore
12:15 (halfak) This would be awesome.
12:15 (halfak) I'd love to listen to the recentchanges table.
12:15 (ori-l) whether or not you do the analysis on redis or simply treat it as a pit stop will depend on your (meaning you / dario / faulkner) needs and preferences
12:15 (ori-l) heh. http://kubo.wmflabs.org/editstream.html
12:16 (halfak) (This is a side topic, but I really want to know how this is done. Short polling or proper pushing?)
12:17 (ori-l) socket.io, so websockets if your browser supports it, with fallbacks to flash and ajax long polling
12:17 (halfak) nevermind... looked at js. I'm satisfied. Thank you for showing me this. I have a use for this immediately.
12:17 (ori-l) so proper pushing unless you're running something archaic
12:17 (ori-l) kubo.wmflabs.org/series.html
12:18 (ori-l) the decomposition of events isn't accurate -- there are some bugs in my dashboard code -- you can also see from the blanks in the graph when i took it down to update / upgrade / fix bugs :)
12:19 (ori-l) but it's pretty cool -- if you look at API edits / inserts you can "see" cron-scheduled bots start up and shut down
12:19 (halfak) Pretty cool, for sure. Is this coming through redis?
12:19 (ori-l) yeah
12:20 (halfak) With this system, could I ask about the last 5 days' events?
12:20 (ori-l) aha so! we have different standards here because we're E3 chauvinists
12:20 (ori-l) so the answer is, *you* can
12:20 (halfak) yay!
12:20 (halfak) I'm special
12:21 (ori-l) because the edit / insert stream contains no non-public data we're actually going to expose it as a public service on stat1001
12:21 (ori-l) but regarding other event types
12:21 (ori-l) they won't be exposed in general
12:22 (ori-l) and whether or not they'll be stored in redis or simply published will depend on the internal consumer
12:22 (ori-l) if it's us and we want it to persist, it will
12:22 (halfak) Cool. That's exactly where I was hoping to use it.
12:22 (ori-l) if it's other teams, it'll be their responsibility to have a listener running that stuffs the incoming data into their data store
12:23 (ori-l) heads up: both dario and i have to go to some lunch thing in 5 mins
12:23 (ori-l) we barely got to tagging, but i'm happy to continue this afterwards or at a time of your choosing..
12:24 (halfak) Can you tell me about where I can safely test out redis and point me to docs?
12:24 (DarTar) I think it's important that Aaron sees at least the barebone tag writing/tag reading
12:24 (halfak) We may not need to meet again.
12:24 (ori-l) but the short answer to your question from earlier (why reproduce data that is already in mysql)
12:24 (ori-l) is: in general we shouldn't, *unless* it's useful to apply that data as a filter on real-time event streams
12:24 (DarTar) we should have that question in a FAQ on mw
12:24 (DarTar) or
12:24 (ori-l) yeah!
12:25 (ori-l) um, okay, so do you have access to my home dir on stat1?
12:25 (ori-l) can you try to cd to /home/olivneh/glass/glass?
12:25 (DarTar) well, forget my or - that's good enough
12:25 (halfak) looking @ glass
12:26 (ori-l) you can look at that code to see how tagging works
12:27 (halfak) I see that you specify the datatype when asking for the value for a key.
12:27 (ori-l) in general it's an example of how a problem domain (tagging) can be mapped onto a "dumb" data type (bit array) using python as an abstraction layer
12:27 (halfak) getbit
12:27 (DarTar) I have to go, if you guys can reconvene on IRC around 1 I should be able to join again, if not feel free to find a time that works best for you
12:27 (ori-l) yeah
12:27 (ori-l) there's a lot more to cover there so maybe we should reconvene later?
12:27 (halfak) You guys should head. Let me play around.
12:27 (ori-l) cool cool
12:28 (halfak) How do I not put trash in a place where it will be trouble?
12:28 (ori-l) install redis locally?
12:28 (ori-l) it's very small
12:28 (DarTar) confirmed
12:28 (ori-l) ok, gotta run
12:28 (ori-l) i'll be back at 1
12:28 (halfak) Thanks guys.
12:28 (halfak) I got a lot from this
12:28 (ori-l) yayayay
12:28 (ori-l) i'm super excited about this so happy that you are too
12:28 (halfak) :)
12:28 (ori-l) tty soon
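
The chat ends before tagging is covered, so the following is only a guess at the general idea ori-l mentions: mapping tagging onto a "dumb" bit array with Python as the abstraction layer. It is a minimal in-memory sketch, not the code in glass. It assumes the things being tagged have non-negative integer IDs and that each tag owns one bit array in which bit N is set when item N carries the tag. `BitArray`, `tags`, `tag_item`, and `has_tag` are hypothetical names; `setbit` and `getbit` are modelled on the Redis SETBIT and GETBIT commands, where offset 0 is the most significant bit of the first byte.

```python
class BitArray:
    """Growable bit array with SETBIT/GETBIT-like semantics."""

    def __init__(self):
        self._bytes = bytearray()

    def setbit(self, offset, value):
        """Set the bit at offset to 0 or 1 and return its previous value."""
        byte_index, bit_index = divmod(offset, 8)
        # Grow with zero bytes so the target byte exists.
        if byte_index >= len(self._bytes):
            self._bytes.extend(bytes(byte_index + 1 - len(self._bytes)))
        mask = 1 << (7 - bit_index)
        previous = 1 if self._bytes[byte_index] & mask else 0
        if value:
            self._bytes[byte_index] |= mask
        else:
            self._bytes[byte_index] &= ~mask & 0xFF
        return previous

    def getbit(self, offset):
        """Return the bit at offset; bits past the end read as 0."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._bytes):
            return 0
        return (self._bytes[byte_index] >> (7 - bit_index)) & 1


tags = {}  # tag name -> BitArray (hypothetical layout: one bit array per tag)


def tag_item(tag, item_id):
    """Mark item_id as carrying the given tag."""
    tags.setdefault(tag, BitArray()).setbit(item_id, 1)


def has_tag(tag, item_id):
    """Return True if item_id carries the given tag."""
    return tag in tags and tags[tag].getbit(item_id) == 1
```

For example, `tag_item("treatment", 42)` followed by `has_tag("treatment", 42)` returns `True`, while `has_tag("treatment", 41)` returns `False`. The appeal of this layout is that a membership check is a single bit read, at a storage cost of one bit per possible ID per tag.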