Analytics/Archive/Infrastructure/Oozie

[3:47] ottomata: drdee!!!! YESSSSSS IT IS WORKING
[3:47] ottomata: http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/mobile_hour_by_continent_A
[3:47] ottomata: it's running now, going through the existing dat
[3:47] ottomata: data
[3:47] drdee: COOOL BEANZ MR OTTOMATA!!!!!
[3:47] dschoon: hot!
[3:47] ottomata: using 6 input files each run, but only generating output for a single hour each run
[3:47] ottomata: it's really smart!
[3:48] ottomata: drdee, can I show you how this works real quick?
[3:48] drdee: yes please do!
[3:48] ottomata: http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/oozie/webrequest/count_by_hour_by_continent_A?file_filter=any
[3:48] drdee: coordinator?
[3:48] ottomata: there is also a coordinator.properties file, but it has the usual stuff you'd expect
[3:48] ottomata: yeah
[3:48] ottomata: look at workflow.xml first
[3:49] ottomata: because you're used to it
[3:49] drdee: yup
[3:49] ottomata: actually, I think everything in workflow is stuff you have seen before
[3:49] ottomata: with the parameters etc.
[3:49] drdee: yes i have
[3:49] ottomata: the only thing I added for my stuff was ${HOUR_REGEX}
[3:49] ottomata: i'll show you how that gets computed
[3:49] ottomata: but, that can be anything you want to figure out which hours you are interested in
[3:49] ottomata: we can probably abstract that concept out later for any timestamp
[3:50] drdee: only one minor thing is to change the kill action, and make it send an email
[3:50] ottomata: when we get better at abstracting pig stuff
[3:50] drdee: but that's minor
[3:50] ottomata: ah ok
[3:50] ottomata: cool
[3:50] ottomata: ok cool, so that's workflow
[3:50] ottomata: now check out coordinator.xml
[3:50] drdee: aight
[3:50] drdee: reading
[3:50] ottomata: at the top is the dataset definition
[3:50] drdee: k
[3:50] ottomata: you've seen that before too
[3:50] ottomata: for mobile
[3:50] ottomata: frequency is 15 minutes
[3:50] drdee: yup
[3:51] ottomata: ok, the new stuff is in input-events and output-events
[3:51] ottomata: just below that
[3:51] ottomata: <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(0)}</end-instance>
[3:51] ottomata: here i'm creating an input event parameter called INPUT
[3:51] drdee: k
[3:51] ottomata: and saying that it includes all webrequest-wikipedia-mobile dataset instances from 5 instances ago through the current one
[3:52] ottomata: so
[3:52] ottomata: 6 instances total
[3:52] ottomata: so if the current one is 08:00
[3:52] ottomata: that would be ${coord:current(0)}
[3:52] ottomata: then
[3:52] ottomata: ${coord:current(-5)}
[3:52] ottomata: would be
[3:52] ottomata: 06:45
[3:52] ottomata: so for 15 minute intervals starting at 06:45 and ending at 08:00
[3:52] ottomata: that ends up being
[3:53] ottomata: 06:45 07:00 07:15 07:30 07:45 08:00
[3:53] drdee: smart indeed
[3:53] ottomata: so
[3:53] ottomata: even though coord:current(0) is 08:00
[3:54] ottomata: we are actually interested in computing the data for the 07:00 hour
[3:54] ottomata: because that is the hour for which we know we have all the data
[3:54] ottomata: so
[3:54] ottomata: the output instance is defined as
[3:54] ottomata: <instance>${coord:current(-4)}</instance>
[3:54] ottomata: coord:current(-4) == 07:00
[3:54] ottomata: (you could compute it a few different ways, i'm sure, if you wanted to)
[3:54] ottomata: but that works
[3:54] drdee: soooooooo, copy this explanation from the chat log and put it in a wiki
[3:55] ottomata: I will! I will! I'm not quite ready yet
[3:55] ottomata: one more thing
[3:55] ottomata: ok
[3:55] drdee: ok ok ok ok
[3:55] ottomata: i tell you what, i'll copy it for now
[3:55] ottomata: but i won't organize it yet
[3:55] ottomata: it will be organized, believe you me!
[3:55] ottomata: but, yeah, one more thing
[3:55] drdee: I BELIEF YOU!
[3:55] ottomata: hehe
[3:55] ottomata: down below,
[3:56] ottomata: in the <configuration> section
[3:56] ottomata: there are a bunch of properties defined, those get passed to the workflow as variables
[3:56] ottomata: (that's how OUTPUT and INPUT get set for the pig script)
[3:56] drdee: and that makes it a full circle!
[3:56] ottomata: I also made it so pig will filter for only an hour regex
[3:56] ottomata: which is by default .*
[3:56] drdee: k
[3:56] ottomata: so i'm setting $HOUR_REGEX
[3:57] ottomata: to ${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), 'yyyy-MM-dd_HH')}
[3:57] ottomata: coord:nominalTime() is just the timestamp for the current run
[3:57] drdee: and final final request is to run this using the 'stats' user
[3:57] ottomata: in my example that would be 08:00
[3:57] ottomata: so, i'm subtracting an hour from 08:00
[3:57] ottomata: and putting it in the hour format that the pig script is filtering on
[3:57] ottomata: yupyup
[3:58] ottomata: anyway, yeah!
[3:58] ottomata: it's running
[3:58] ottomata: whenever they release a new hue
[3:58] drdee: this is really f***** cool
[3:58] ottomata: we should be able to select multiple datasets like this with hue
[3:58] ottomata: for now, that's not supported though
[3:58] ottomata: so you have to edit your .xml files and submit via cli
[3:58] drdee: and this data is already visualized through limn as well
[3:58] drdee: ?
[3:59] ottomata: not yet, that's what I need to figure out next
[3:59] ottomata: best way to script that together
[3:59] drdee: k
[3:59] drdee: aight
[3:59] ottomata: hopefully as an oozie action in the workflow too
[3:59] ottomata: erosen is going to make limnify accept via stdin
[3:59] ottomata: and i'm going to look into running a shell oozie action with it
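
The instance arithmetic walked through in the log can be checked with a small Python sketch. It is not Oozie code and not the actual coordinator. It only mimics what `coord:current(n)` and the `HOUR_REGEX` expression evaluate to, assuming the nominal time falls exactly on a 15-minute dataset instance. The function names and the calendar date are made up for illustration; only the 08:00 run time, the 15-minute frequency, and the offsets -5, 0, and -4 come from the log.

```python
from datetime import datetime, timedelta

# Dataset frequency mentioned in the log.
FREQUENCY = timedelta(minutes=15)


def current(nominal, n, frequency=FREQUENCY):
    """Approximate ${coord:current(n)}: the instance n steps from the nominal time.

    Simplification: assumes `nominal` is already aligned to a dataset instance.
    """
    return nominal + n * frequency


def input_instances(nominal, start=-5, end=0):
    """Instances covered by start-instance current(-5) through end-instance current(0)."""
    return [current(nominal, n) for n in range(start, end + 1)]


def hour_regex(nominal):
    """Approximate formatTime(dateOffset(nominalTime, -1, 'HOUR'), 'yyyy-MM-dd_HH')."""
    return (nominal - timedelta(hours=1)).strftime("%Y-%m-%d_%H")


# Hypothetical date; only the 08:00 time of day is taken from the log.
nominal = datetime(2013, 1, 1, 8, 0)

window = [t.strftime("%H:%M") for t in input_instances(nominal)]
output = current(nominal, -4).strftime("%H:%M")

print(window)               # the six input instances, 06:45 through 08:00
print(output)               # the output instance, 07:00
print(hour_regex(nominal))  # the hour the pig script filters on
```

The six input instances bracket the 07:00 hour on both sides (06:45 before it, 08:00 after it), which is why each run reads six files but only emits output for the single hour selected by the regex.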