Wikibase/Indexing

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

Goals / Requirements

 * ideally, public web service
 * external requests must return within a few seconds and use reasonable resources
 * how to enforce this constraint still needs to be determined and will influence the architecture
 * internal requests are allowed to use more resources & time
 * internal requests must not crash or crowd out external requests, and external requests must not crowd out internal ones
 * needs to support continuous updates to reflect latest Wikidata state
 * a lag of seconds, or even a minute or two, seems acceptable at this point, but nothing beyond that
 * support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
 * handle high request volumes (horizontal scaling)
 * handle a large data set (sharding)
 * robust: automatic handling of node failures, cross-datacenter replication, proven in production
 * reasonable operational complexity
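One possible way to meet the external-latency budget while keeping internal and external traffic from starving each other is separate worker pools with a timeout on the external side. This is purely a sketch: the pool sizes, the 5-second budget, and the `run_query` API are invented for illustration, not decided values.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Separate pools so internal (research) queries cannot starve external ones.
# Pool sizes and the 5 s budget are illustrative assumptions.
EXTERNAL_POOL = ThreadPoolExecutor(max_workers=8)
INTERNAL_POOL = ThreadPoolExecutor(max_workers=2)
EXTERNAL_TIMEOUT_S = 5.0

def run_query(query_fn, *, internal=False):
    """Run query_fn in the pool matching its origin; cap external latency."""
    pool = INTERNAL_POOL if internal else EXTERNAL_POOL
    future = pool.submit(query_fn)
    if internal:
        return future.result()  # internal callers may wait longer
    try:
        return future.result(timeout=EXTERNAL_TIMEOUT_S)
    except TimeoutError:
        future.cancel()  # best effort; a running worker may still finish
        raise
```

A real deployment would more likely enforce this at the query-engine level (per-query time and memory limits), but the isolation idea is the same.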

Titan

 * Distributed graph database
 * Supports online modification (OLTP), so can reflect current state
 * Expressive query language (Gremlin), shared with other graph databases such as Neo4j
 * Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
 * async multi-cluster replication can be used for isolation of research clusters, DC fail-over
 * Supports relatively rich indexing, including complex indexes using Elasticsearch
 * Can gradually convert complex queries into simpler ones by propagating information through the graph and adding indexes
 * TinkerPop Blueprints support, including Gremlin and the GraphSail RDF interface
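To give a flavour of the traversal style Gremlin encourages, here is a toy in-memory analogue in Python. This is not Titan's actual API, and the adjacency data below is a hand-picked illustration (Q42 "Douglas Adams", P31 "instance of", P279 "subclass of"):

```python
# Toy stand-in for a labelled property graph, illustrating Gremlin-style
# traversal as repeated edge-following from a set of vertices.
EDGES = {
    ("Q42", "P31"): ["Q5"],       # Douglas Adams -> instance of -> human
    ("Q42", "P106"): ["Q36180"],  # Douglas Adams -> occupation -> writer
    ("Q5", "P279"): ["Q215627"],  # human -> subclass of -> person
}

def out(vertices, label):
    """Follow all outgoing edges with the given label (like Gremlin's out())."""
    result = []
    for v in vertices:
        result.extend(EDGES.get((v, label), []))
    return result

# Roughly analogous to the Gremlin traversal g.V('Q42').out('P31').out('P279')
targets = out(out(["Q42"], "P31"), "P279")
print(targets)  # -> ['Q215627']
```

In Titan the same shape of query would run against the distributed store, with indexes deciding which traversal steps are cheap.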

Magnus' Wikidata Query service

 * Custom in-memory graph database implemented in C++
 * Relatively expressive, custom query language
 * Limited to a single machine
 * Current memory usage: 5G RSS

Cayley

 * Graph database "inspired by" Freebase and Google's Knowledge Graph

ArangoDB

 * Key-value, document and graph DB with replication (asynchronous master-slave) and sharding
 * Strong consistency / ACID / transactions
 * AQL (ArangoDB Query Language), a declarative query language similar to SQL; other query options are also available
 * Supports relatively rich indexing
 * Debian packages available from a repository on the project website

Other possible candidates

 * Virtuoso cluster
 * 4store

Open questions

 * Paging of large result sets
 * Handling of cycles in the graph
 * How to index the graph for efficient common query use cases
 * Efficient updates for materialized complex query results
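For the cycle question, a common backend-independent answer is to track visited vertices during traversal so that cyclic "subclass of"-style chains still terminate. A minimal sketch:

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first traversal that tolerates cycles by tracking visited nodes."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

# A deliberate cycle A -> B -> C -> A: traversal still terminates.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(reachable(cyclic, "A"))  # -> ['A', 'B', 'C']
```

Whether the candidates do this internally (and how it interacts with result paging) is part of what needs evaluating.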

Not candidates

 * Neo4j (replication only in "Enterprise Edition")
 * Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
 * RDF databases (Wikidata's data model is richer than RDF, so there would likely be some loss of precision)
 * Wikidata Query service: single node, no sharding, no replication
 * Cayley: no serious production use