Wikibase/Indexing

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

Goals / Requirements

 * ideally, public web service
 * external requests return within a few seconds, use reasonable resources
 * how to enforce that constraint needs to be determined and influences the architecture
 * internal requests are allowed to use more resources & time
 * these need to not crash external requests and external cannot crush internal
 * needs to support continuous updates to reflect latest Wikidata state
 * Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
 * support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
 * handle high request volumes (horizontal scaling)
 * handle a large data set (sharding)
 * robust: automatic handling of node failures, cross-datacenter replication, proven in production
 * reasonable operational complexity

Titan

 * Distributed graph database
 * Supports online modification (OLTP), so can reflect current state
 * Expressive query language (Gremlin); shared with other graph dbs like Neo4j
 * Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
 * async multi-cluster replication can be used for isolation of research clusters, DC fail-over
 * Supports relatively rich indexing, including complex indexes using ElasticSearch
 * Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
 * TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface

Magnus' Wikidata Query service

 * Custom in-memory graph database implemented in C++
 * Relatively expressive, custom query language
 * Limited to a single machine
 * Current memory usage: 5G RSS
 * Problematic locking strategy
 * Pull updates would need to be rewritten for production use
 * Would need some manual solutions for approximate clustering, failover, etc.
 * Would require in-house maintenance

OrientDB

 * Apache 2 License
 * Distributed graph database
 * Supports online modification (OLTP)
 * Supports Gremlin, as well as it's own SQL-like query language with graph features and no JOINs
 * Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
 * Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
 * Isolation in ACID drops a bit when distributed (intermediate results may show briefly)
 * Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
 * Since I'd like to input data from hub feeds (starting with dumps), async replication isn't really an issue; both DC would just use the same process to pull in updates
 * Actually OrientDB 2.0 supports async replication via config
 * Supports automatic round-robin sharding as well as application directed sharding (specifying clusters for class item insertions and reads, reads default to checking all partitions)
 * Supports various indexes (SBTree,hash, both unique or not unique) and primitive as well as embedded data structures (sets/lists/map) that can be indexed
 * Queries on the JSON itself are also possible regardless of nesting levels
 * Supports Tinkerpop blueprints
 * Good query timeout support via TIMEOUT

Cayley

 * Graph database "inspired by" Freebase and Google's Knowledge Graph

ArangoDB

 * Key-value, document and graph DB with replication (asynchronous master-slave) and sharding
 * Strong consistency / ACID / transactions
 * AQL (ArangoDB Query Language), a declarative query language similar to SQL. But also has other querying options.
 * Supports relatively rich indexing
 * Packages for Debian available in repository on website

Other possible candidates

 * Virtuoso cluster
 * 4store

Open questions

 * Paging of large result sets
 * Handling of cycles in the graph
 * How to index the graph for efficient common query use cases
 * Efficient updates for materialized complex query results

Not candidates

 * Neo4j (replication only in "Enterprise Edition")
 * Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
 * RDF Databases (Wikidata's data model is more rich than RDF so there would likely be some loss of precision)
 * Wikidata Query service: single node, no sharding, no replication
 * Caley: no serious production use