Thread:Talk:Requests for comment/DataStore/IRC meeting 2013-10-02

<TimStarling> I think Brion has already said something favourable about this
<TimStarling> I don't see it on the page, maybe it was in an in-person meeting
<gwicke> I like the general idea of having a key/value store available without creating extra tables
<legoktm> what gwicke said
<mwalker> I voiced in the comments that I think this should have some sort of defined structure per key -- that way we can have a unified upgrade process (like we have with a database) and also have a way of filling in initial values
<yuvipanda> +1
<mwalker> otherwise I failed to see the difference between this and just using memcache
<legoktm> this would be persistent
<TimStarling> well, persistence
<legoktm> memcache isn't
<gwicke> in distributed storage range queries are not free, so it might make sense to make those optional
<gwicke> similar with counters
<TimStarling> mwalker: so you're thinking of some sort of schema definition for values?
<mwalker> yes
<mwalker> that way you have a defined upgrade / update process
<MaxSem> schema definition: serialize( $struct )
<mwalker> MaxSem: how do you handle a multiversion jump though? if the structure has evolved and suddenly you dont have the data you expect
<mwalker> you can handle that in the consuming code of course -- but that's a lot of boilerplate that I think is redundant
<TimStarling> mwalker: what do you imagine the upgrade process would be/
<MaxSem> if you want schemas and upgrades, it's a good reason to use MySQL tables
<TimStarling> ?
<legoktm> mwalker: i think that's something that the extension needs to handle, with proper deprecation
<legoktm> and migration
<gwicke> mwalker: all you need is a way to traverse the keys and update all values I guess
<mark> some update handler per key/value on fetch?
<gwicke> you can have a version key in each JSON blob you store for example
<mark> supported by the extension
<mwalker> could do it on fetch, or could do it in a maintenance script
<mark> whichever comes first
<TimStarling> mwalker: how would a schema help with upgrading? what boilerplate would be abstracted exactly?
<gwicke> we might want different kinds of key/value stores: those that are randomly ordered and only support listing all keys, those that are ordered and allow efficient range queries, and those with special support for counter values
<mwalker> TimStarling: I imagine that this will probably be abused to store dicts and arrays -- if we know what we're coming from and going to; we can define transforms for the old data into the new
<TimStarling> the requirement for prefix queries does appear to limit the backends you could use
<gwicke> yes, or at least it creates extra overhead for those that don't need the feature
<legoktm> mwalker: i dont think storing an array is abusing the feature ;)
<TimStarling> mwalker: abused?
<MaxSem> gwicke, if you don't want to use prefix queries, don't use them
<mwalker> TimStarling: the examples given in the RfC are simple values
<gwicke> MaxSem: yes, that's why I propose to have different key/value storage classes
<MaxSem> because there can be multiple stores, you can always make some assumptions about the store you're using
<mwalker> I say abused because I see no provision for dealing with more complex values (which is what I'm proposing :))
<gwicke> /ordered-blob/ vs /blob/ for example
<TimStarling> mwalker: maybe you misunderstood MaxSem then, because he just said he thinks values should be serialized with serialize
<mark> it would probably be good to classify those different stores in the RFC, define the ones likely needed
<yuvipanda> mwalker: perhaps add more data types to the RFC? Lists and Hashes, maybe.
I guess different stores can define different datatypes that they support
<gwicke> mark: I have some notes at https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage#Key.2Fvalue_store_without_versioning
<mwalker> TimStarling: yes; but serialization doesn't solve the problem of knowing what's in the structure
<MaxSem> mark, the proposal comes with a skeleton code for an SQL store and has a Mongo as another example
<mwalker> if you serialize a php class for example -- deserializing it into a class with the same name but different structure gives very unexpected results
* gwicke lobbies for JSON over serialize
<TimStarling> I imagine it would be used like the way memcached is used
<mark> yeah, nothing too PHP specific ;)
<MaxSem> gwicke, doable
<TimStarling> i.e. avoiding objects wherever possible, primarily serializing arrays, including a version number in the array
<MaxSem> :)
<TimStarling> when you fetch a value with the wrong version, the typical response in memcached client code is to discard it
<TimStarling> with persistent data, you would instead upgrade it
<gwicke> MaxSem: ok ;)
<TimStarling> that upgrade could be done by some abstracted schema system
<TimStarling> or it could be done by the caller, correct?
<mark> also, is this proposal intended to embrace larger key/value storage applications like... images? external storage?
<mwalker> TimStarling: yes -- that's where I'm going -- but I'm agitating for the schema system so the caller doesn't have to care every place its used
<mark> it doesn't seem to be, but I believe it's not mentioned
<MaxSem> mark, I intended to maybe use it for storing images on small wikis
<TimStarling> mwalker: I think you should write about your idea in more detail
<TimStarling> since this is not exactly a familiar concept for most MW developers
<gwicke> mark: the Cassandra stuff just came up in parallel
<mwalker> TimStarling: ok; I'll write that up tonight
<TimStarling> maybe you could even write a competing RFC
<mwalker> which do you think would be better?
<MaxSem> but it's too generic for an image store of our scale
<mark> when we're either talking about many objects into the millions, or potentially very large objects into the gigabytes, that can matter a lot :)
<TimStarling> mwalker: I would like to know what the API will look like before I decide
<TimStarling> and I would want comments from more people
<MaxSem> mark, the key here is "small wikis":)
<mwalker> TimStarling: ok; I'll write it up as a separate RfC
<TimStarling> yeah, I think that would be easiest
<TimStarling> now, there are obvious applications for a schemaless data store
<gwicke> mark: objects into the gigabytes are unlikely to be handled well by a backend that is also good at small objects
<mark> gwicke: that is my point
<TimStarling> because there are already schemaless data stores in use
<TimStarling> ExternalStore, geo_updates, etc.
<gwicke> mark: I'm interested in the 'at most a few megabytes' space
<MaxSem> so far to move this proposal forward I'd like people to agree upon interface
<gwicke> primarily revision storage
<mark> yes, we should probably make that a bit more explicit in the RFC
<TimStarling> is it possible to have both a schema data store and a non-schema data store?
<TimStarling> one could be implemented using the other
<TimStarling> I think that would suit existing developers better
<mark> 2 layers of abstraction
<TimStarling> yeah, well that seems like the minimum here
<TimStarling> schemas are not so simple that you would want to do them in a few lines of code embedded in a data store class, right? you would want to have a separate class for that
<mwalker> I think this could even overlay our current memcache
<gwicke> schema as in actually storing structured data and allowing complex queries on it?
<gwicke> that sounds like sql..
<mwalker> just getStore('temporary') or something
<MaxSem> another question: does anybody want eg getMulti and setMulti?
<MaxSem> mwalker, temporary is BagOStuff
<TimStarling> MaxSem: ObjectCache callers don't use getMulti very often...
<gwicke> MaxSem: I think it would be great to have that capability for any service backend
<yuvipanda> +1
<mwalker> this is a PersistantBagOStuff though :) why should the API be different
<TimStarling> in core, just filebackend, by the looks
<gwicke> can be based on curl_multi
<TimStarling> but it is generally considered to be a good thing to have
<mark> it's not always efficient to implement temp/expiry/caching with every service backend
<mark> oh, misunderstood
<TimStarling> no, I think persistent storage does need a different API
<mwalker> yes; mark raised a point I hadn't thought of
<TimStarling> well, ideally
<TimStarling> redis handles persistent storage well enough with a mixed API
<gwicke> there are some backends with built-in expiry
<mwalker> *if you set a TTL of zero; it goes into the persistent store?
<gwicke> amazon handles the ttl with special request headers
<TimStarling> anyway, BagOStuff brings a lot of baggage (ha ha)
<gwicke> mwalker: you set it per object normally
<TimStarling> presumably DataStore would be simpler than BagOStuff
<gwicke> same is available in cassandra
<gwicke> but would be good to check other backends
<TimStarling> it wouldn't have incr/decr or lock/unlock
<mark> swift does it, the swift compatible ceph counterpart doesn't
<TimStarling> with a simpler API, DataStore could have more backends than BagOStuff
<MaxSem> TimStarling, I actually have increment - wonder if it's really needed
<PleaseStand> Would we need an atomic increment for things like ss_total_edits?
<gwicke> I'm pushing for a web service API
<MaxSem> it could be helpful eg for implementing SiteStats with DataStore
<MaxSem> gwicke, web service API will be one of backends
<gwicke> PleaseStand: not atomic, but consistent
<gwicke> that should be a special storage class
<MaxSem> gwicke, know why memcached doesn't work over HTTP?
<TimStarling> MaxSem: maybe you should write a bit on the RFC about what backends you imagine this using, and what their capabilities are
<gwicke> MaxSem: efficiency for very small fetches
<mark> it's not UDP? ;)
<gwicke> afaik it's tcp
<TimStarling> w.r.t. prefix search, increment, lock, etc.
<mwalker> facebook wrote one with udp
<TimStarling> add, cas?
<MaxSem> stupid facebook
<TimStarling> ObjectCache provides all these atomic primitives
<mark> max size of objects
<gwicke> TimStarling: cas on etag?
<gwicke> can be supported optionally in some backends
<TimStarling> I just would like to know if the applications require all these atomic primitives
<TimStarling> and if that limits our backend choice
<MaxSem> TimStarling, cas doesn't seem to be very mixable with eventual-consistency backends
<TimStarling> essentially, there is a tradeoff between feature count and backend diversity, right?
<gwicke> I'd start with the minimal feature set initially
<TimStarling> so we want to know where on the spectrum to put DataStore
<mark> i think an application like gwicke is interested in (external storage like) is already quite different from the counter/stats like applications also discussed here
<gwicke> and then consider adding support for something like CAS when the use case and backend landscape is clearer
<TimStarling> that tradeoff is not discussed on the RFC, so I would like to see it discussed
<mark> agreed
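
The versioning approach raised in the meeting (a version number stored inside each JSON value, with out-of-date values upgraded on fetch instead of discarded) could look roughly like the sketch below. It is only an illustration in Python, not the RFC's API: the `DictStore` class, the `UPGRADERS` table, the `store_value`/`fetch_value` helpers, and the example field names are all hypothetical. Chaining one upgrade step per version is one possible way to cover the "multiversion jump" case mwalker asked about.

```python
import json

CURRENT_VERSION = 3  # hypothetical current format version


def _v1_to_v2(value):
    # Hypothetical change: version 2 renamed "name" to "title".
    value = dict(value)
    value["title"] = value.pop("name", "")
    return value


def _v2_to_v3(value):
    # Hypothetical change: version 3 added a "tags" list.
    value = dict(value)
    value.setdefault("tags", [])
    return value


# One upgrade step per source version; steps are chained on fetch.
UPGRADERS = {1: _v1_to_v2, 2: _v2_to_v3}


class DictStore:
    """In-memory stand-in for a persistent key/value backend (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, blob):
        self._data[key] = blob


def store_value(store, key, value):
    # Wrap the value in an envelope that records the format version.
    store.set(key, json.dumps({"version": CURRENT_VERSION, "value": value}))


def fetch_value(store, key):
    blob = store.get(key)
    if blob is None:
        return None
    record = json.loads(blob)
    version, value = record["version"], record["value"]
    needs_upgrade = version < CURRENT_VERSION
    # Apply each step in turn so a value several versions behind still upgrades.
    while version < CURRENT_VERSION:
        value = UPGRADERS[version](value)
        version += 1
    if needs_upgrade:
        # Persist the upgraded form so later fetches skip the upgrade.
        store_value(store, key, value)
    return value
```

The same upgrade chain could be run from a maintenance script that traverses all keys, which matches the "on fetch, or in a maintenance script, whichever comes first" suggestion in the log.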