Wikibase/Indexing

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

Goals / Requirements

 * ideally, public web service
 * external requests return within a few seconds, use reasonable resources
 * how to enforce this constraint still needs to be determined and will influence the architecture
 * internal requests are allowed to use more resources & time
 * internal requests must not crash external requests, and external requests must not crowd out internal ones
 * needs to support continuous updates to reflect latest Wikidata state
 * A lag of seconds, or even a minute or two, seems acceptable at this point, but nothing beyond that.
 * support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
 * handle high request volumes (horizontal scaling)
 * handle a large data set (sharding)
 * robust: automatic handling of node failures, cross-datacenter replication, proven in production
 * reasonable operational complexity

Titan

 * Distributed graph database
 * Supports online modification (OLTP), so can reflect current state
 * Expressive query language (Gremlin), shared with other graph databases like Neo4j
 * Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
 * async multi-cluster replication can be used for isolation of research clusters, DC fail-over
 * Supports relatively rich indexing, including complex indexes using ElasticSearch
 * Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
 * TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface
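
The last point above — gradually converting complex queries into simpler ones by propagating derived information onto the graph and indexing it — can be sketched in plain Python. This is a toy model, not Titan's or Gremlin's API; the Wikidata property IDs P31 ("instance of") and P279 ("subclass of") are used purely for illustration:

```python
# Toy sketch: turn a complex traversal into a simple indexed lookup
# by materializing the derived fact and indexing it.

# (subject, property, object) triples; P31 = "instance of",
# P279 = "subclass of" (Wikidata property IDs, illustrative data)
edges = [
    ("Q42", "P31", "Q5"),        # Douglas Adams -> human
    ("Q937", "P31", "Q5"),       # Albert Einstein -> human
    ("Q5", "P279", "Q215627"),   # human -> person
]

def complex_query(target):
    """Traversal: entities whose P31 class is transitively a subclass of target."""
    subclass = {}
    for s, p, o in edges:
        if p == "P279":
            subclass.setdefault(s, set()).add(o)

    def reaches(cls):
        if cls == target:
            return True
        return any(reaches(o) for o in subclass.get(cls, ()))

    return sorted(s for s, p, o in edges if p == "P31" and reaches(o))

# Materialize the derived fact once; the same question then becomes a
# single index lookup instead of a repeated traversal.
index = {}
for entity in complex_query("Q215627"):
    index.setdefault("person", set()).add(entity)

assert complex_query("Q215627") == ["Q42", "Q937"]
assert index["person"] == {"Q42", "Q937"}
```

In Titan the materialized fact would be an added edge or property covered by one of its (possibly ElasticSearch-backed) indexes; the principle is the same.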

Magnus' Wikidata Query service

 * Custom in-memory graph database implemented in C++
 * Relatively expressive, custom query language
 * Limited to a single machine
 * Current memory usage: 5G RSS
 * The initial dump conversion is extremely slow, so if a server process crashes and its dump gets corrupted, you face a prolonged outage. Even optimizing it by an order of magnitude would still leave it slow.
 * Server startup time is also poor, and will only grow with the dataset. That matters less for something like Redis that we keep running for months, but the less mature nature of WDQ, and the need to cater for development and future bugs, will mean a lot of restarts, each of them painful.
 * The DB query used to retrieve the latest changes can't be run in production, so this part will have to be redone completely.
 * The update routine also has a possible race condition that makes it miss some changes.
 * Each entity's properties are retrieved with a separate, uncacheable HTTP request to Special:EntityData, which isn't very fast; as the rate of changes increases, WDQ will hit this hard and won't be able to keep up with updates.
 * The thread-synchronization model needs to be redone completely, as the spinlocks currently used don't scale.
 * Some code paths in the update code are missing synchronization, which makes one wonder how often it crashes now and how much more often it would crash under production load. Adding synchronization there would likely bring everything to a halt through lock contention.
 * With a huge number of very small objects in the same heap, a long-running server process would suffer from heap fragmentation and poor memory-management performance, requiring custom memory management, etc.
 * Scalability would be very primitive: each added machine holds a full copy of the data and makes its own update requests, etc.
 * Would need manual solutions for approximate clustering, failover, etc.
 * Would require in-house maintenance.
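
The race condition mentioned above is the classic timestamp-cursor problem: a transaction can commit *after* a poll yet carry a timestamp at or before the cursor, so a naive "changes since T" query silently skips it. A minimal Python sketch of the failure mode and the usual overlap-and-dedup fix (this is an assumption about the mechanism, not WDQ's actual code):

```python
# Sketch: timestamp-cursor polling can miss late-committing changes;
# re-reading an overlap window and deduping by change id avoids that.

changes = []  # (id, timestamp) rows, in the order they become visible

def naive_poll(cursor_ts):
    """Changes strictly after cursor_ts; misses rows that commit later
    with a timestamp <= cursor_ts (the race described above)."""
    return [c for c in changes if c[1] > cursor_ts]

def safe_poll(cursor_ts, seen_ids, overlap=5):
    """Re-read an overlap window behind the cursor, dedupe by id."""
    new = [c for c in changes
           if c[1] > cursor_ts - overlap and c[0] not in seen_ids]
    seen_ids.update(c[0] for c in new)
    return new

changes.append((1, 12))
assert naive_poll(11) == [(1, 12)]
changes.append((2, 10))             # late commit, timestamp before cursor
assert naive_poll(11) == [(1, 12)]  # change 2 is silently missed

seen = {1}
assert safe_poll(11, seen) == [(2, 10)]  # overlap window picks it up
```

A monotonic change id (like MediaWiki's rc_id) as the cursor avoids the problem entirely, at the cost of requiring such an id to exist.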

OrientDB

 * Apache 2 License
 * Distributed graph database
 * Supports online modification (OLTP)
 * Supports Gremlin, as well as its own SQL-like query language with graph features and no JOINs
 * Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
 * Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
 * ACID isolation weakens a bit when distributed (intermediate results may show briefly)
 * Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
 * Since I'd like to ingest data from hub feeds (starting with dumps), async replication isn't really an issue; both DCs would just use the same process to pull in updates
 * Actually OrientDB 2.0 supports async replication via config
 * Supports automatic round-robin sharding as well as application-directed sharding (specifying clusters for class item insertions and reads; reads default to checking all partitions)
 * Supports various indexes (SB-Tree, hash; unique or non-unique) and primitive as well as embedded data structures (sets/lists/maps) that can be indexed
 * Queries on the JSON itself are also possible regardless of nesting levels
 * Supports Tinkerpop blueprints
 * Good query timeout support via TIMEOUT
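
OrientDB's TIMEOUT clause bounds query execution on the server, which matters for the "external requests return within a few seconds" requirement above. A rough Python sketch of the idea — a deadline check while materializing results — follows; it illustrates the concept only and is not OrientDB's implementation:

```python
import time

# Sketch: the kind of deadline check a server-side TIMEOUT clause
# applies while a query iterates over candidate rows.

class QueryTimeout(Exception):
    pass

def run_query(rows, predicate, timeout_s):
    """Filter rows, aborting once the wall-clock deadline passes."""
    deadline = time.monotonic() + timeout_s
    out = []
    for row in rows:
        if time.monotonic() > deadline:
            raise QueryTimeout("query exceeded %.2fs" % timeout_s)
        if predicate(row):
            out.append(row)
    return out

# Fast query completes well within the deadline:
assert run_query(range(100), lambda r: r % 2 == 0, 1.0)[:3] == [0, 2, 4]

# A slow, effectively unbounded scan gets cut off:
slow_rows = iter(lambda: time.sleep(0.01) or 0, 1)  # infinite iterator
raised = False
try:
    run_query(slow_rows, lambda r: True, 0.05)
except QueryTimeout:
    raised = True
assert raised
```

OrientDB can also be told (via the TIMEOUT strategy) whether to raise an error or return the partial result; either behavior keeps a runaway external query from monopolizing the server.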

Cayley

 * Graph database "inspired by" Freebase and Google's Knowledge Graph

ArangoDB

 * Apache 2 License
 * Multi-model database: key/value, document and graph DB with a query language combining all three models
 * Complex queries including joins between multiple collections
 * ACID: transactions spanning multiple documents and collections
 * distributed: replication (asynchronous master-slave) and sharding
 * AQL (ArangoDB Query Language), a declarative query language similar in spirit to SQL, designed with multi-model in mind and allowing joins. Other querying options are also available.
 * HTTP API for RESTful interfaces
 * API extensible by user-defined JavaScript code (V8 embedded) via the Foxx framework
 * Supports relatively rich indexing (skip-list, hash, geo, full-text)
 * Extensive reference documentation in gitbook format
 * Convenient web front end integrated
 * Packages for Debian, other Linux distributions, Mac OS X and Windows available in the repository on the website
 * Virtual machine images and docker container available
 * Very modern code base (C++11) with many unit tests
 * Good community and professional support
 * Few open issues on github, new ones are quickly dealt with
 * Regular major releases every two to three months
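
What an AQL-style join between collections buys can be illustrated with a toy model in plain Python (hypothetical item and claim documents; real AQL would express the same join declaratively over document collections):

```python
# Sketch: joining a document collection of items with an edge-like
# collection of claims, as a multi-model query language would.

items = [
    {"_key": "Q42", "label": "Douglas Adams"},
    {"_key": "Q5",  "label": "human"},
]
claims = [  # edge-like documents linking items by key
    {"_from": "Q42", "property": "P31", "_to": "Q5"},
]

def join(items, claims):
    """Resolve each claim's endpoints against the item collection."""
    by_key = {d["_key"]: d for d in items}
    return [
        {"item": by_key[c["_from"]]["label"],
         "property": c["property"],
         "value": by_key[c["_to"]]["label"]}
        for c in claims
    ]

assert join(items, claims) == [
    {"item": "Douglas Adams", "property": "P31", "value": "human"}
]
```

Being able to mix document lookups, key/value access, and graph traversal in one query is the main appeal of the multi-model approach for data shaped like Wikidata's items and claims.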

Other possible candidates

 * Virtuoso cluster
 * 4store

Open questions

 * Paging of large result sets
 * Handling of cycles in the graph
 * How to index the graph for efficient common query use cases
 * Efficient updates for materialized complex query results
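
For the first open question, one common answer is keyset (cursor) paging, which keeps each page cheap no matter how deep the client has paged, unlike OFFSET/LIMIT. A minimal sketch, assuming results can be ordered by a unique id:

```python
# Sketch: keyset (cursor) paging over a result set sorted by unique id.
# The client passes back the last id it saw instead of an offset.

rows = [{"id": i, "value": i * 10} for i in range(1, 8)]  # sorted by id

def page(rows, after_id, limit):
    """Return up to `limit` rows with id > after_id, plus the next cursor."""
    out = [r for r in rows if r["id"] > after_id][:limit]
    next_cursor = out[-1]["id"] if len(out) == limit else None
    return out, next_cursor

p1, cur = page(rows, after_id=0, limit=3)
assert [r["id"] for r in p1] == [1, 2, 3] and cur == 3
p2, cur = page(rows, cur, 3)
assert [r["id"] for r in p2] == [4, 5, 6] and cur == 6
p3, cur = page(rows, cur, 3)
assert [r["id"] for r in p3] == [7] and cur is None
```

In a real store the `id > after_id` filter would be an indexed range scan; the trade-off is that clients can only walk forward through the ordering, not jump to an arbitrary page.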

Not candidates

 * Neo4j (replication only in "Enterprise Edition")
 * Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
 * RDF databases (Wikidata's data model is richer than RDF, so there would likely be some loss of precision)
 * Wikidata Query service: single node, no sharding, no replication
 * Cayley: no serious production use
 * ArangoDB: looks like it might work, but they don't allow upstream contributions. If they solve that problem and OrientDB and Titan fall through, we can reconsider them.