User:ATDT/IRC logs

From mediawiki.org

Event logging security

#mediawiki-dev, 12-Oct-2012

(ori-l) Hi, I have a design question I could use some help with. I'm pushing client-side event data to bits (bits.wikimedia.org) from *.wikipedia.org pages. I want some protection from CSRF, but the machine this data is ending up on doesn't have easy access to the production mediawiki cluster, so I can't simply randomly generate some value and bury it in the session, and then retrieve it again. I figure I ought to either pre-generate some large quantity of keys and have them available at both ends (mediawiki + log receiver) or derive keys from some shared secret value, but do so in a way that won't be easily crackable.
(ori-l) My questions are: can I use one of the existing token interfaces (edit tokens, API auth tokens) for this? If not, can some existing implementation be easily extended to fulfill this purpose? And if not, how should I implement it?
(TimStarling) you can use the referrer
(ori-l) easily spoofed
(TimStarling) how can it be spoofed?
(ori-l) curl -H Referer ?
(TimStarling) CSRF doesn't affect curl
(ori-l) well, that's a good point, but i guess CSRF isn't all i'm worried about
(TimStarling) what exactly do you want protection from?
(ori-l) i don't want someone to mess up the result of A/B tests by just curling fake events in a loop from the command line
(TimStarling) you want to make it slightly more difficult?
(TimStarling) if you require a session, then someone could write a bot to get lots of sessions
(TimStarling) presumably you're not intending on using a captcha
(ori-l) no :)
(ori-l) ideally, having to retrieve a page for every 5 or so events you send would make it impractical
(TimStarling) so what you want is security by obscurity
(TimStarling) don't get me wrong
(TimStarling) it's often maligned but it's better than no security
(TimStarling) you want to have a JavaScript module that makes requests that are somehow difficult to generate without understanding the relevant JavaScript code
(ori-l) i mean, each event must declare an event id which references a data model that is available both to mediawikis and the log endpoint
(ori-l) i could hash that using some gnarly js code with lots of bitwise operators and what have you
(TimStarling) you can't use the existing interfaces (edit tokens etc.) because you said that you can't use sessions
(ori-l) but that doesn't seem very smart
(TimStarling) and the existing interfaces rely on sessions
(ori-l) what level of security do you think is reasonable for something like this?
(ori-l) the ClickTracking API endpoint has no such security and we don't have evidence that anyone has been screwing with it. of course, they could be doing it in a subtle way that we're not detecting. but at least there's no blatant junk being written by anyone.
(TimStarling) what generates the event ID?
(TimStarling) the server or the client?
(TimStarling) in clicktracking, the client generates an event ID, correct?
(ori-l) well, there's a reference to a data model. the data model is put in place by us and assigned an id by us, but events generated on the client reference that id. there's also a uuid assigned to each event instance, but that only happens on the log collector
(ori-l) yeah
(TimStarling) I think having no security is a reasonable way to do this
(ori-l) what's your rationale?
(TimStarling) well, there is no profit motive, not much of a motive of any kind for fudging the numbers
(TimStarling) lots of similar statistics gathering is already live with no security, and there are no reports of people screwing with it
(TimStarling) but if you're uncomfortable with that...
(TimStarling) the next step up is probably IP-based rate limiting
(TimStarling) just discard events from an IP if it seems to be sending you an unlikely number of events
(ori-l) yeah, already doing that
(TimStarling) you can adjust for the effect at the analysis stage
(ori-l) that's (partly) the basis for me saying no one is currently screwing with us, to my knowledge
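[Editor's note] The IP-based rate limiting described above can be sketched roughly as follows. This is a minimal illustration, not the code either participant was running: the class name IpRateLimiter, the sliding-window approach, and the default thresholds are all assumptions chosen for the example.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical sliding-window limiter: at most `max_events` per IP
    within any `window_seconds` span. Thresholds are illustrative only."""

    def __init__(self, max_events=5, window_seconds=60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # Maps an IP address to the timestamps of its accepted events.
        self._seen = defaultdict(deque)

    def allow(self, ip, now=None):
        """Return True if the event should be kept, False if discarded."""
        if now is None:
            now = time.monotonic()
        stamps = self._seen[ip]
        # Forget accepted events that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_events:
            return False  # unlikely number of events from this IP
        stamps.append(now)
        return True
```

A log receiver could call allow(ip) once per incoming event and drop the event when it returns False. Discarded events could also be counted rather than silently dropped, so the effect can be adjusted for at the analysis stage.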
(TimStarling) so if you're doing that and you're still worried, what's the attack scenario?
(TimStarling) someone with a botnet?
(ori-l) i'm not paranoid so much as self-doubting. i'm just wondering if there's some standard and simple way of securing a setup like that, that i'm not reaching for because of ignorance.
(ori-l) if you say there isn't one, maybe that's good enough
(TimStarling) well, you can tie it to wiki user accounts
(TimStarling) that could be made to be reasonably secure
(TimStarling) but I don't think there's any way to secure anonymous events
(ori-l) yeah, that just opens a different can of worms (privacy, legal issues, ethical issues, etc.)
(TimStarling) unless you want to talk about increasingly complex methods which do increasingly little to stop abuse
(TimStarling) like client-side hashing of event IDs
(TimStarling) if someone already has a botnet and understands the protocol and what they want to achieve, a few shifts and rotates probably won't slow them down much
(ori-l) yes, you're right
(ori-l) do you know how third-party analytics providers handle this? presumably there could be a profit motive (if your competitor A/B tests heavily, you could systematically skew their results)
(TimStarling) I don't know how they handle it, but I would be surprised if they did anything secure
(TimStarling) you would think that if there was any event interface that was secure, it would be advertising referrals
(ori-l) true
(TimStarling) but I've read articles that say that even that is not secure, despite widespread fraud
(TimStarling) costing large amounts of money
(ori-l) okay, so i'm really going for no security at all, aside from post-hoc sanity checks on the data which we already do
(TimStarling) sounds good
(ori-l) i think mangling the data client-side won't improve security and it'll hurt the nice debuggability we get from having pretty readable query strings flying around
(ori-l) and it might also mislead data analysts into thinking the setup is more secure than it actually is
(ori-l) so it's probably best to just have it be nakedly insecure and if that seriously unnerves someone they probably shouldn't be using it
(ori-l) thanks TimStarling
(TimStarling) yw
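
[Editor's note] For reference, the option ori-l raises at the start of the log, deriving keys from a shared secret held by both MediaWiki and the log receiver, might look something like the sketch below. It is an assumption-laden illustration, not something either participant proposed in this form: the function names, the choice of HMAC-SHA256, and the idea of binding the token to a schema id plus a time bucket are all hypothetical. As the discussion concludes, a token like this only forces a client to fetch a page periodically. Anyone who fetches the page can read the token, so it does not secure anonymous events.

```python
import hashlib
import hmac


def derive_token(secret: bytes, schema_id: str, time_bucket: int) -> str:
    """Derive a token from the shared secret (hypothetical scheme).

    `schema_id` stands in for the data-model id that events reference;
    `time_bucket` is e.g. the current Unix time divided by a fixed interval.
    """
    message = f"{schema_id}:{time_bucket}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_token(secret: bytes, schema_id: str, time_bucket: int, token: str) -> bool:
    """Recompute the token on the log receiver and compare in constant time."""
    expected = derive_token(secret, schema_id, time_bucket)
    return hmac.compare_digest(expected, token)
```

Because both ends can recompute the token from the secret alone, the log receiver would not need access to sessions on the production cluster.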