Wikibase/Indexing

Goals

 * ideally, public web service
 * external requests return within a few seconds, use reasonable resources
 * how to enforce that constraint needs to be determined and influences the architecture
 * internal requests are allowed to use more resources & time
 * these need to not crash external requests and external cannot crush internal
 * high concurrency, failover / replication
 * needs to support continuous updates to reflect latest Wikidata state
 * Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
 * support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
 * reasonable operational complexity

Candidate solution: Titan

 * Distributed graph database
 * Supports online modification (OLTP), so can reflect current state
 * Expressive query language (Gremlin); shared with other graph dbs like Neo4j
 * Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
 * async multi-cluster replication can be used for isolation of research clusters, DC fail-over
 * Supports relatively rich indexing, including complex indexes using ElasticSearch
 * Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes

Candidate solution: Magnus' Wikidata Query service

 * Custom in-memory graph database implemented in C++
 * Relatively expressive, custom query language
 * Limited to a single machine
 * Current memory usage: 5G RSS

Open questions

 * Paging of large result sets
 * Handling of cycles in the graph