Wikibase/Indexing

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

End of March

 * Publicly downloadable prototype (query and import and update)
 * Labs install (details hazy, same prototype)
 * Hardware requested
 * 12 month roadmap

Phase 1: MVP (Before end of June)

 * Support use-cases of WikiGrok, remain flexible in architecture to eventually support external requests/third parties.

Must have:

 * query console for experimentation

As a user, I want to see a WikiGrok question immediately on the mobile site, as soon as I load an article that has any active WikGrok campaigns, so that I can respond quickly and keep getting more questions in real time, based on the ones I've already answered.

Requirements:
 * available through an API


 * available through server-side requests, connecting through PHP
 * for any query, result output in JSON (XML would be nice to have)
 * results may sometimes come in the form of lists (e.g., List of all possible occupations) and sometimes in the form of a single item
 * Simple (single-item) queries: (e.g., "is this item X and not Y?")
 * generate live and run quickly – as fast as possible – to serve immediately to users via WikiGrok (and potentially continue serving more results on the fly after user input)
 * Complex queries: (e.g., list of all possible suggested occupations)
 * pre-generate results and store in a table or cache, so these queries can run longer (but still within some reasonable timeframe, e.g. 1 hour)
 * regenerate as often as practical/possible

Phase 2: Support for public/external requests

 * ideally, public web service
 * external requests return within a few seconds, use reasonable resources
 * how to enforce that constraint needs to be determined and influences the architecture
 * internal requests are allowed to use more resources & time
 * these need to not crash external requests and external cannot crush internal
 * needs to support continuous updates to reflect latest Wikidata state
 * Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
 * handle high request volumes (horizontal scaling)
 * handle a large data set (sharding)
 * robust: automatic handling of node failures, cross-datacenter replication, proven in production
 * reasonable operational complexity

Titan

 * Distributed graph database
 * Supports online modification (OLTP), so can reflect current state
 * Expressive query language (Gremlin); shared with other graph dbs like Neo4j
 * Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
 * async multi-cluster replication can be used for isolation of research clusters, DC fail-over
 * Supports relatively rich indexing, including complex indexes using ElasticSearch
 * Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
 * TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface

Magnus' Wikidata Query service

 * Custom in-memory graph database implemented in C++
 * Relatively expressive, custom query language
 * Limited to a single machine
 * Current memory usage: 5G RSS
 * Load balanced systems are available
 * The initial dump conversion is extremely slow, so if a server process crashes and its dump gets corrupt you face a prolonged outage. Even if we optimize it by an order of magnitude, it will still be slow.
 * Server startup time is not nice either, and will only grow with the growth of the dataset. While not critical e.g. for Redis that we keep running for months, the less mature nature of WDQ and the need to cater for development/future bugs will mean a lot of restarts, each being a PITA.
 * You can't run in production the DB query used to retrieve latest changes is, so this part will have to be redone completely.
 * And this update routine results in a possible race condition making it miss some changes.
 * Each entity's properties are retrieved with a separate uncacheable HTTP request to Special:EntityData which isn't very fast so as the rate of changes increases, WDQ will bump into it hard, not being able to cope with updates.
 * Thread synchronization model needs to be totally redone, as currently used spinlocks aren't scalable.
 * Some code paths in the update code are missing a synchronization which makes one wonder how often does it crash and how much often will it crash under a production load. And if you add synchronization there, locking is probably going to bring everything to a halt.
 * With a crapload of very small objects in the same heap, a long-running server process would have problems with heap fragmentation/memory management performance, requiring custom memory management, etc.
 * Scalability would be very primitive: as you add more machines they start making more requests, etc.
 * Would need some manual solutions for approximate clustering, failover, etc.
 * Would require in-house maintenance.

OrientDB

 * Apache 2 License
 * Distributed graph database
 * Supports online modification (OLTP)
 * Supports Gremlin, as well as it's own SQL-like query language with graph features and no JOINs
 * Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
 * Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
 * Isolation in ACID drops a bit when distributed (intermediate results may show briefly)
 * Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
 * Since I'd like to input data from hub feeds (starting with dumps), async replication isn't really an issue; both DC would just use the same process to pull in updates
 * Actually OrientDB 2.0 supports async replication via config
 * Supports automatic round-robin sharding as well as application directed sharding (specifying clusters for class item insertions and reads, reads default to checking all partitions)
 * Supports various indexes (SBTree,hash, both unique or not unique) and primitive as well as embedded data structures (sets/lists/map) that can be indexed
 * Queries on the JSON itself are also possible regardless of nesting levels
 * Supports Tinkerpop blueprints
 * Good query timeout support via TIMEOUT

ArangoDB

 * Apache 2 License
 * Multi-model database: key/value, document and graph DB with a query language combining all three models
 * Complex queries including joins between multiple collections
 * ACID: transactions spanning multiple documents and collections
 * distributed: replication (asynchronous master-slave) and sharding
 * AQL (ArangoDB Query Language), a declarative query language similar in spirit to SQL, designed with multi-model in mind and allowing joins. Has also other querying options.
 * HTTP API for RESTful interfaces
 * API extensible by user-defined JavaScript code (V8 embedded) via the Foxx framework
 * Supports relatively rich indexing (skip-list, hash, geo, full-text)
 * Extensive reference documentation in gitbook format
 * Convenient web front end integrated
 * Packages for Debian, other Linux distributions, Mac OSX and Windows available in repository on website
 * Virtual machine images and docker container available
 * Very modern code base (C++11) with many unit tests
 * Good community and professional support
 * Few open issues on github, new ones are quickly dealt with
 * Regular major releases every two to three months
 * Concerns
 * Memory-only indexes
 * Only one full-text index per collection, but a river plugin for elastic search exists
 * Complex clustering/sharding model which is still in development, integration with mesos is being worked on
 * No indexes on array properties yet, this is being worked on

Neo4j

 * GPL & AGPL
 * Some capabilities may be enterprise-only
 * Dynamic schema, supports fulltext indexes and probably geospatial with plugin, also looks like pluggable indexes are possible
 * Cypher SQL-like query language, TP3 support
 * Labels may be used to make efficient lookups for yes/no properties (see in the blog)
 * Has query planner/profiler
 * Good HA support, no sharding though
 * Lucene indexes
 * Concerns
 * Mixed indexes support - i.e. is it possible to use 2 indexes together?
 * Edge indexes only old-style - either manual-fill or pre-configured autoindexer which needs to be configured each time anew
 * Not clear how we handle geodata
 * No ability to store arbitrary JSON/blob for non-indexed secondary lookup
 * Multi-valued properties support - TP3 implementation uses node-per-value
 * What we're missing by running non-Enterprise version (if anything)
 * How sharding-less replication works in practice on our data size?
 * Questions
 * Are edge properties first-class citizens re: indexing?
 * Looking up property values - more efficient as edge properties or vertex properties? I.e. are claims edges or vertices?
 * Is Enterprise version just AGPL or has some other requirements?

Other possible candidates

 * "Big Data" - GPLv2, Tinkerpop 2.5
 * Graphx - Apache License, Alpha, Uses Spark
 * Virtuoso cluster
 * 4store
 * Neo4j - GPL and AGPL - Tinkerpop 3.0 reference implementation
 * Build one directly against Elasticsearch - Horrible/provocative idea - We currently have the expertise but totally against most of our goals – OTOH we were resigned to having to hack on Titan's Elasticsearch integration anyway

Open questions

 * Paging of large result sets
 * Handling of cycles in the graph
 * How to index the graph for efficient common query use cases
 * Efficient updates for materialized complex query results

Not candidates

 * Neo4j (replication only in "Enterprise Edition")
 * Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
 * Wikidata Query service: single node, no sharding, no replication
 * Caley: no serious production use
 * ArangoDB: looks like it might work but they don't allow upstream contributions. Answer to this: This was never true and is not true, see for the CLA procedure. When they get that problem solved and OrientDB and Titan fall through then we can promote them again.