Talk:Architecture Summit 2014/Storage services

Session Storage Services

= one set of notes =
breakout session from https://etherpad.wikimedia.org/p/summit_day2

API versioning
Initial question: How do we obsolete things that are awkward/broken/etc?
 * Or how do we improve things without breaking them?

Criteria for switching off deprecated features:
 * time-based? deadline?
 * usage-based? did people stop using it?
 * something else?

Discussion

 * Sumana: how do other similar apps deal with this?
 * GW: they have a lot of staff!
 * brad: they are monolithic and can just say "now we close this"; our community comes after us with pitchforks

... ... Convincing script writers to fix their stuff? have back-compat scripts to help transition; show warnings back to the users...


 * Yuri: we have warnings, people don't read them
 * Mark: add a sleep call on the deprecated one ;)
 * duesentrieb: break things with error message progressively....?
 * ?: it'll take at least a year to convert people!
 * aude: toolserver, labs, lots of things are changing --
 * sumanah: maybe change one thing at a time?
 * tor: don't think waiting is a good strategy; identify key users of each API -- the things you don't want to go down during a switchover -- and try to communicate with them
 * yuri: sometimes all we have is their IP; we can DDoS them to get attention... ;)
 * yuri: encouraging people to use better user-agent strings with email addresses, etc.
 * bryan: netflix example -- ships code that goes in long-running consumer electronics. elaborate system of API shim layers -- old API version still available, converts to the newer API and wraps back the expected format
 * GW: that works for a lot of stuff, but for example parsoid may change its HTML structure and that's harder to convert with a long-term window
 * yuri: may be good for internal code structure (progressive updates), plus clear path to kill things -- remove the shims when you're ready to drop
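
The shim idea from the Netflix example can be sketched roughly as follows. This is only an illustration, not MediaWiki or Netflix code: the handler names (`handle_v2`, `handle_v1_via_shim`), the parameter names, and both response shapes are hypothetical.

```python
# Hypothetical sketch of an API shim layer: the old (v1) entry point stays
# available, but is implemented purely as a translation around the new (v2)
# handler. All names and formats here are invented for illustration.

def handle_v2(params):
    """New-style handler: takes a list of titles, returns a list of page dicts."""
    return {"pages": [{"title": t, "length": len(t)} for t in params["titles"]]}


def handle_v1_via_shim(params):
    """Old-style entry point kept alive as a shim.

    The (hypothetical) v1 format took a pipe-separated 'titles' string and
    returned a dict keyed by title. The shim up-converts the request, calls
    v2, then wraps the result back into the format old clients expect.
    """
    v2_params = {"titles": params["titles"].split("|")}
    v2_result = handle_v2(v2_params)
    return {
        "query": {p["title"]: {"length": p["length"]} for p in v2_result["pages"]},
        # Repeating the deprecation notice in every response gives clients a
        # chance to notice before the shim is removed.
        "warnings": ["v1 is deprecated; please migrate to v2"],
    }
```

Under this layout, dropping v1 means deleting the shim function, which is the "clear path to kill things" yuri mentions. It does not address GW's point: a change such as a new HTML structure may have no mechanical back-conversion to wrap.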

....
 * didn't we convert before from query.php?
 * yuri: yes but usage was relatively small and it was explicitly a trial; also a simpler migration path
 * tim: i have a feature request: mandatory API key with email address <- actually have a way to contact people. user-agents aren't customizable in the browser, and it's not always easy for people.
 * yuri: if you do that in GET you may damage caching; in POST it may as well be a header?
 * danielk: keys can be shared/stolen, hmm
 * tim: key identifies the author of the software, not the user
 * [key talk may need to be sidelined for now]
 * sumana: requiring keys with registration -> slippery slope to new policy requirements, make us like "the other guys"
 * sumana: what is the current roadmap for central repo for gadgets etc? (relates to timeline -- as we're doing this, maybe same time to do community management w/ api changes -> reduce friction)
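
yuri's caching concern can be illustrated with a toy cache key. The sketch assumes a cache that keys on the URL only, and that the header carrying the key is not listed in `Vary`; the `apikey` parameter and the `X-Api-Key` header are hypothetical names, not an existing MediaWiki feature.

```python
from urllib.parse import urlencode


def cache_key(path, query):
    """Toy cache key: the full URL, as seen by a URL-keyed HTTP cache."""
    return path + "?" + urlencode(sorted(query.items()))


query = {"action": "query", "titles": "Foo"}

# API key in the query string: each client produces a different URL, so the
# same logical request is cached once per key.
keys_in_url = {
    cache_key("/w/api.php", {**query, "apikey": k}) for k in ("alice", "bob")
}

# API key in a (hypothetical) request header: the URL is identical for all
# clients, so one cached response can serve both.
requests_with_header = [
    (cache_key("/w/api.php", query), {"X-Api-Key": k}) for k in ("alice", "bob")
]
keys_with_header = {url for url, _headers in requests_with_header}
```

Here `keys_in_url` ends up with two distinct entries and `keys_with_header` with one.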

...... ....
 * antoine: new versions, ....
 * something about language links API breaking due to internal changes not being fixed; somehow this got done wrong and exposed internal details to the outside

...... ... apple example...
 * danielk: not duplicating logic..... generating lists vs generating things out of lists (?)

Ideas to investigate

 * yuri: some good ideas in there:


 * shimming old versions
 * api key/useragent stuff .... discuss more

DataStore
The DataStore would be a basic key-value store to be available for extensions that don't need to maintain their own fancy tables.


 * get(key)
 * put(key, data)

It would be like BagOStuff but persistent. The design is generic -- potentially allows non-sql backend. It has simple, atomic operations -- nothing fancy.

It would be searchable by key prefix and would allow data migration between stores -- mostly a single key space, but you can specify separate storage spaces which aren't shared.
 * (kinda like DB settings but funkier)

So extensions can share, or not, if they choose.
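
A minimal in-memory sketch of the interface described above, for illustration only. It is not the proposed implementation: the real proposal is persistent, and only `get` and `put` come from the notes. The class layout, the `store` argument, and the `get_by_prefix` name are assumptions.

```python
class DataStore:
    """Toy stand-in for the proposed key-value store (in memory, not persistent)."""

    _stores = {}  # store name -> dict, shared by all instances of that name

    def __init__(self, store="default"):
        # Everyone who asks for the same store name sees the same key space.
        self._data = DataStore._stores.setdefault(store, {})

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, data):
        self._data[key] = data

    def get_by_prefix(self, prefix):
        """Hypothetical name for the 'searchable by key prefix' feature."""
        return {k: v for k, v in sorted(self._data.items()) if k.startswith(prefix)}


# Two extensions sharing the default key space, told apart by key prefix.
shared = DataStore()
shared.put("myext:user:1", {"seen": True})
shared.put("myext:user:2", {"seen": False})
shared.put("other:flag", 1)

# A separate storage space that is not shared with the default one.
private = DataStore("myext-private")
private.put("myext:user:1", "separate")
```

Prefixing keys is purely a convention in this sketch; nothing enforces it.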

Discussion

 * How is this actually different from BagOStuff with a persistent backend instead of memcached? [I didn't hear a real answer to this.]
 * [...] some other key-value store -- worth looking at?
 * max: but it's not invented here ;))


 * brion: use a basic api and wrap it -- then can hook up to fancier backends with dependencies. keep it simple in core.

Outcome

 * [brion & tim: yeah we're liking this so far]
 * [aaron: recommend adding getMulti options]

Storage service
We need a revision store for html, json+wikitext, etc for parsoid interaction with public content API:
 * lack of high-level interface for this and other storage use cases in MW
 * testing hard
 * unnecessary complexity for storage users
 * less storage primitive reuse than desirable (example: simple key-value store)
 * storage backend abstraction
 * share storage implementations (reuse across apps)
 * extensibility -- harder :D
 * scalability -- easily add more boxes, let the backend handle resharding?
 * reliability -- avoid SPOF, do replication cross-DC
 * use essentially the same URL scheme externally and internally
 * return the same content internally and externally, and make links in content work in both contexts without rewriting

Requests for comment/Storage service
first step: Rashomon revision store, a node.js rest server w/ cassandra backend
 * first use case -- parsoid html/jsonmeta/wikitext
 * in theory adjacent revisions are adjacent on disk and should compress well (handled transparently by cassandra)
 * [question: security/auth?]

PHP Virtual REST Service

 * example got running parallel requests, POST data etc
 * [brion would this be purely for MW's internal use, or shared with tools like Parsoid? How to maintain guarantees....]
 * tim: surprised to see titles in keys
 * GW: that's a bad example :D
 * tim: yay
 * GW: the longer answer is that rashomon implements non-destructive renames by just creating a new revision at the new name, but not moving the old revisions. The plus is that this works well with eventually consistent storage and you can answer 'how did URL X look at time Y'. Traditional history can be implemented by storing a summary of renames.
 * yuri - plugin arch? things created on fly as needed?
 * mark: overlap w/ max's rfc.......... maybe one could be a backend for the other
 * brad- Does it handle RevDel and such? Answer: Not yet. It could, maybe. [currently permissions would be handled at the above app layer, like how we enforce them in MW today. ES lets you fetch any text item, even if it's struck from its revision]
 * [doesn't work well when Gabriel intends this to be public-facing too]
 * [yeah we still need to work out some details here i think. but having parsoid ping revision storage without having to go through PHP overhead has a certain appeal to it]
 * [I just hope the answer isn't "reimplement the whole of MW's access control in nodejs" or whatever] eeek! [see the parsoid model :( ]
 * mark: features team implementing it so far, eventually may migrate to core :D
 * GW split some concerns: let's make a decision on the PHP interface RFC first as that overlaps with Datastore; discuss the backend & revision storage ideas later
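
The non-destructive rename GW describes can be sketched as below. This is a toy model, not Rashomon's actual data model or API: the class, the method names, and the integer timestamps are all assumptions made for illustration.

```python
class RevisionStore:
    """Toy append-only revision store keyed by name (hypothetical design)."""

    def __init__(self):
        self._revs = {}    # name -> list of (timestamp, content), appended in time order
        self.renames = []  # summary of renames: (timestamp, old_name, new_name)

    def save(self, name, timestamp, content):
        # Assumes callers save with non-decreasing timestamps per name.
        self._revs.setdefault(name, []).append((timestamp, content))

    def at(self, name, timestamp):
        """Answer 'how did name X look at time Y' (None if it did not exist)."""
        for ts, content in reversed(self._revs.get(name, [])):
            if ts <= timestamp:
                return content
        return None

    def rename(self, old, new, timestamp):
        """Create a new revision at the new name; old revisions stay where they are."""
        content = self.at(old, timestamp)
        if content is None:
            raise KeyError(old)
        self.save(new, timestamp, content)
        self.renames.append((timestamp, old, new))

    def history(self, name, before=float("inf")):
        """Traditional history: own revisions plus those made under earlier names."""
        revs = [r for r in self._revs.get(name, []) if r[0] < before]
        for ts, old, new in self.renames:
            if new == name and ts < before:
                revs += self.history(old, ts)
        return sorted(set(revs))
```

Because a rename only appends, nothing already written is rewritten, which is why this style fits eventually consistent storage. What the old name should serve after the rename (a redirect, nothing, the last content) is left open here, as it is in the notes.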

Outcomes
going back to the API stuff we think we're going to refine this a bit
 * ACCEPT - Max's data store
 * ACCEPT - GW's more general PHP interface for services (with parallel operations ability)
 * DEFER / out of time - Storage service
 * DEFER / out of time - REST content API

= second set of notes =

Proposed Agenda:

3 minute lightning talk API Versioning

20 minute discussion

3 minute lightning talk Storage Services

20 minute discussion

3 minute lightning talk DataStore

20 minute discussion

API Versioning
Slides: https://docs.google.com/presentation/d/1H6JYzTR2V7RibJzuEc60EItqylGMijnh1YNMa4kokIE/edit?usp=sharing

API is broken, but partly because it is hard to break backwards compatibility. API Versioning can help.
 * What are the criteria for deprecation of APIs?
 * What are the strategies for migrating to new APIs?

 * Sumana: How do similar applications deal with this?
 * Daniel: it's case-by-case.
 * TOR: waiting is not a good strategy. Identify the key users of an API. The long tail will always lag until something is turned off anyway.
 * Bryan: Netflix up-transforms their API calls to newer versions
 * Gabriel: It can work in some cases, but not infinitely
 * Yuri: Having an extensible API (hooks, extensions etc) also makes that difficult. [I assume the internal API for controlling that would also be super complex]
 * TheDJ: Scripts and gadgets on the wikis are not being maintained. Haven't we had this problem/discussion before?
 * Tim: feature request for mandatory API keys
 * Yuri: passing api key in get request kills caching
 * Sumana: gadgets were easier because there's a central repo, what about a similar solution
 * Antoine: Thinks current API is a bit messy and would support a new version 2 change with better architecture. Would deprecate the existing and not focus on backwards compatibility.
 * We are starting to hit the time limit
 * There is no one best practice
 * Often dictated for deprecation
 * Community "vibe" is conservative in this area
 * can use a combination of mechanisms (versioning, feature flag), depending on situation
 * content api may be a rewrite but we want to reuse
 * Siebrand makes a comment about this being a communication problem too (hey, we changed something)
 * How do we get GOOG to switch from XML to json?
 * Easier with large users, harder with scripters/botters
 * Just because you get a deprecated warning doesn't mean it will be respected
 * Mark made a joke about inserting sleep into deprecated API calls. :) (was he joking?) :-) Mark was only half joking :)
 * You can also rewrite api calls internally, change params, redirect etc
 * Use the api mailing list. Pick a time frame (1 year, etc)
 * The Google approach is to kill services (but it's not very nice)
 * There is more logging being added in this area
 * Yuri says there is some effort to make user-agent fields with contact info (required?)
 * Require email registration
 * Fixes the technical challenges of modifying the user-agent
 * You can write varnish rules to fix that (or use HTTP headers)
 * How would you prevent someone else from using the key
 * Tim: API key identifies developer of gadget
 * Brad: What keeps a script kiddy from using the key of a popular gadget?
 * Yuri: Do we force people to register? Isn't that against some policies and practices?
 * ALL: we should continue on mailing list
 * Brad: Similar to OAuth keys
 * Sumana: If you are going to make a policy change then you need to ensure trust and prevent misuse/abuse
 * Issues with copying gadgets from one wiki to another with API keys?
 * TheDJ: create less friction by combining change
 * Yuri: If extensions are in our git repos we can search for problem use cases and fix them / notify developers
 * When there are security fixes related to gadgets, who is doing the maintenance
 * Yuri: You can invest in support or solve it by moving forward on the server - it's a trade off
 * Yuri: Current api is very big. It would be a lot of code to get rid of.
 * Antoine: Freeze old api and only add new features to version 2
 * Brion: How does the "API" rely on the internal implementation? Isn't it the implementation of the "API" that has this dependency?
 * Antoine: Would like to see official client libraries in multiple languages that encapsulate these details
 * Tyler: +1 for client library. build an SDK.  There are also tools which generate API clients in multiple languages based on an API spec
 * There are also API calls which mirror / duplicate functionality of the web interface, there is duplicated logic
 * Yuri: orthogonal issue, implementation detail
 * Brion: Why aren't internal and external API the same?
 * RobLa: Sometimes there are very impactful breakages with slow moving clients (eg Apple desktop dictionary). Are there lots of slow moving clients like this?
 * Brion: there are OAI feeds, some other custom software users
 * Gabriel: comment about duplication of logic between Web UI/Special Page/API code. Refactoring core code into service layer addresses this issue.

Data Store
http://mediawiki.org/wiki/DataStore
slides: https://docs.google.com/presentation/d/176xp-1ccpikLy043ESp5jeFL2X0sChS43T6TP4XqU48/edit?usp=sharing
 * There are many use cases for storing simple blobs of data
 * Very simple get/put API, schemaless, backend agnostic; migration is supported by restricting the interface to a lowest common denominator
 * Sample code link: https://gerrit.wikimedia.org/r/79029
 * Brion: What about key namespacing (per user)? Is there anything enforced by this proposal?
 * Max: No, that's up to the developer, use common sense. There is also one default store.
 * Question about BagOStuff persistence, with some discussion of how that works:
 * BagOStuff could be cleared accidentally by a cache clear
 * BagOStuff doesn't have a "clear" method
 * Doctrine also has a key/value store
 * Max: That is a big external dependency, also NIH. (Core does not have a clear external dependency policy yet)
 * Brion: Seems like a good idea to have a simple interface like this in core
 * Tim: Prepared to accept this RFC and discuss implementation details on mailing list
 * Max: Wanted to make it an RFC to get more feedback from developers

Storage service
slides: https://docs.google.com/presentation/d/1H6JYzTR2V7RibJzuEc60EItqylGMijnh1YNMa4kokIE/edit?usp=sharing

Gabriel:
 * need revision store for html, json and wikitext for parsoid
 * Wants a storage backend abstraction, and to be able to share storage implementations
 * Scalability, Reliability of course
 * Wants to have a public content API (exposing this directly also affects some implementation details, like links, domain names etc)
 * Sample implementation in javascript https://github.com/gwicke/rashomon (cassandra)
 * Performance details in slides
 * Example of simple versioned API endpoints
 * Implementation supports compression (down to 16%-18% of original size for wikitext) and immutable writes for revisions
 * Can work as a more generic storage bucket and also support more specialised use cases (counters)
 * Has a PHP service which talks to it, and supports parallel/batching of curl requests
 * Rashomon is missing auth, and bucket creation, PHP interface is not yet implemented (?)
 * What are the next steps

 * Brion: Is this an internal service for mediawiki, with things like auth enforced at the application layer?
 * Gabriel: Authorization should be as late as possible
 * Tim: Surprised to see title strings in the keys instead of page ids
 * Gabriel: Just an example for the slides
 * Tim: Points out that things like page renames/moves were very expensive in early versions of mediawiki
 * Gabriel: Actions like renames should be non-destructive, graph operations to allow for finding old versions. Just create 1 new node for an action
 * Yuri: Is the mediawiki engine then the rendering engine for content that comes from this data store?
 * Gabriel: That's the approach which will be used for parsoid, there may be other use cases
 * Brad: If you can see versions from the past, what deals with vandal actions?
 * Gabriel: Needs to be implemented but has ideas of how to add support
 * Mark: There is some overlap with Max's proposal
 * Gabriel: The interface here is more complex to support parallel operations, and there is the external API use case
 * TheDJ: Are you proposing a new core storage mechanism or an additional service?
 * Gabriel: For use in Parsoid, may be interesting for others.
 * Tim: What's the status - is this integrated into parsoid, what is the general application use case
 * Gabriel: Not merged yet but the work is being done. There are other general use cases, yes
 * Brion: Explaining some additional use cases, doesn't seem to overlap with Max's use case
 * Tim: Springle making noises about the size of the revision table (sharding etc) and links tables. Would this be good for that?
 * Faidon: Springle seems open to Cassandra for these sorts of use cases
 * Gabriel: Wants to focus on the PHP Interface first [ I think there was some side chatter about the REST interface itself, vs the storage system ]
 * Antoine: Can't we save to Swift? We are getting to have a lot of service dependencies (swift, elasticsearch, ...)
 * Gabriel: We need an interface abstraction and use the best technology behind it that exists. Ops can figure it out. :)
 * Gabriel: We are picking things based on the features that they have (cassandra is not good as a cache)
 * Bryan: The PHP service interface can be a more generic entry point to many backends and this wasn't as clear in the presentation
 * Gabriel: Yes, it's generic, you can map a namespace/prefix to a backend for instance
 * Owen (that's me!): You don't necessarily have to choose between these two interfaces
 * Gabriel: Sure, and one can be an implementation for the other
 * TOR: So when can we have this?
 * Gabriel: Rashomon will be tested in production soon. There is still some implementation work to be done (buckets, auth, revision store). 2 months?
 * TOR: What about the testing with parsoid
 * Gabriel: Should be testing next week
 * Brion: There's a lot of things to like here, potentially for storage of all kinds of things
 * Gabriel: There are 3 RFCs here. Proposes to leave out Rashomon for now and focus on the PHP/REST interface first.
 * Brion: The parallel request feature has been sorely missing, likes the interface for parallel REST
 * Gabriel: Max's data store is definitely a possible "local" implementation
 * Tim: Accept Max's RFC. Consider Rashomon when it comes up later. Accept PHP client interface.
 * Brad: API RFC got good feedback, will continue to refine on mailing list and wiki

(Outcome clarified Feb 5 by Brion: "on the REST thing, I'm pretty sure we agreed to approve the interface, with an initial implementation using DataStore key-value as a backend")
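
The approved shape (a generic service interface that maps a prefix to a backend and can run requests in parallel, with a simple key-value store as the first backend) might look roughly like the sketch below. It is not the PHP interface from the RFC: every name here is hypothetical, a plain dict stands in for DataStore, and Python threads stand in for the parallel curl requests mentioned in the notes.

```python
from concurrent.futures import ThreadPoolExecutor


class VirtualRESTService:
    """Toy request router: path prefixes are mounted onto backend callables."""

    def __init__(self):
        self._mounts = {}  # path prefix -> callable(method, rest_of_path, body)

    def mount(self, prefix, backend):
        self._mounts[prefix] = backend

    def _route(self, path):
        matches = [p for p in self._mounts if path.startswith(p)]
        if not matches:
            raise LookupError(f"no backend mounted for {path}")
        prefix = max(matches, key=len)  # the longest matching prefix wins
        return self._mounts[prefix], path[len(prefix):]

    def run(self, request):
        backend, rest = self._route(request["path"])
        return backend(request["method"], rest, request.get("body"))

    def run_multi(self, requests):
        """Run several requests concurrently; results keep the request order."""
        with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
            return list(pool.map(self.run, requests))


def make_kv_backend():
    """A 'local' backend over a plain dict, standing in for a key-value store."""
    data = {}

    def backend(method, key, body):
        if method == "PUT":
            data[key] = body
            return {"status": 201}
        if method == "GET":
            if key in data:
                return {"status": 200, "body": data[key]}
            return {"status": 404}
        return {"status": 405}

    return backend
```

Callers only ever see paths and responses, so a mounted prefix could later be pointed at a different backend without changing them.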