Extension:Wikibase QueryEngine/SQL schema proposal

This page describes the (proposed) SQL schema to be used with the Wikibase QueryEngine.

wbq_entities

 * Description: Establishes the identity of entities used as "subjects" in the indexes.
 * Status: Tentative. This is largely redundant to wb_entity_per_page.
 * Columns
 * could consider using a (tiny)int instead
 * We may want a meta-info blob here, containing e.g. hashes over the respective entity's entries in each of the  tables. That information could then be used to determine which tables need to be updated when the entity was changed. This would minimize read- and especially write access when updating entities.
 * could consider using a (tiny)int instead
 * We may want a meta-info blob here, containing e.g. hashes over the respective entity's entries in each of the  tables. That information could then be used to determine which tables need to be updated when the entity was changed. This would minimize read- and especially write access when updating entities.


 * Expected size (for wikidata.org): a few dozen million. Roughly like the page table. One row per entitiy (data item, property, query, etc).
 * Typical queries:


 * Proposed Indexes:

wbq_mainsnak_

 * Description: Search index for a given value type, associating proverty/value pairs from statements' main snaks with the respective entities (subjects).
 * Status: Beta. Pretty clear, but We may need to add some columns still.
 * Number: About ten: One per value type (possibly, one per value type and entity type).
 * Columns
 * could consider using a (tiny)int instead
 * Used for efficient equality checks. Depending on the data type represented by the table, this may contain the actual value, or a hash of the value.
 * We may want another field in there indicating the entity type, to avoid joining against wbq_entities. Alternatively, the table could be split so there is one per value type and entity type.
 * We may want to have the statement GUID here too, but that would be a lot of overhead for no obvious benefit.
 * ...and type specific fields, see below.
 * Used for efficient equality checks. Depending on the data type represented by the table, this may contain the actual value, or a hash of the value.
 * We may want another field in there indicating the entity type, to avoid joining against wbq_entities. Alternatively, the table could be split so there is one per value type and entity type.
 * We may want to have the statement GUID here too, but that would be a lot of overhead for no obvious benefit.
 * ...and type specific fields, see below.
 * We may want to have the statement GUID here too, but that would be a lot of overhead for no obvious benefit.
 * ...and type specific fields, see below.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. One row for each statement, split by value type.
 * Typical queries:
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.


 * Proposed Indexes
 * for housekeeping
 * Could add `statement_rank` and `entity_type` to the index, so it would cover most retrieval queries.
 * for "data retrieval" queries
 * should perhaps be a HASH index. Though we might want to sort the result by entity_id.
 * used when updating items (subjects)
 * should perhaps be a HASH index: we are only intersted in identity, not in order.
 * used when deleting properties
 * should perhaps be a HASH index: we are only intersted in identity, not in order.
 * ...plus the indexes on type-specific fields as used for querying, see below.
 * used when deleting properties
 * should perhaps be a HASH index: we are only intersted in identity, not in order.
 * ...plus the indexes on type-specific fields as used for querying, see below.
 * ...plus the indexes on type-specific fields as used for querying, see below.

wbq_qualifier_

 * Description: Search index for a given value type, associating proverty/value pairs from statements' qualifier snaks with the respective entities (subjects) or statements.
 * Status: Tentative. May be left empty or removed for now. Schema should be changed to contain more information about the respective statement's main snak.
 * Number: About ten: One per value type (possibly, one per value type and entity type).
 * Columns:
 * for now, just like 
 * We may want to include the main snak's property_id.
 * We may even want to include the main snak's value hash too.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. A few per statement, split by value type.
 * Typical queries:
 * It's still a bit unclear what a query involving qualifiers would typically look like.


 * Proposed Indexes: much like
 * If we have the main snak's property_id (and perhaps value_identity), these would be included in the indexes. This needs more thought.

wbq_ _string

 * Description: Index table for string values.
 * Status: Pretty definite
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * `value_identity` would store a hash of the string
 * We may want a `value_prefix` field, for sorting, etc.
 * We may want a full `value` field, for direct access or fulltext search.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * Initially, all queries will be for exact values, and can thus use value_identity. The examples below are for later use.
 * Prefix search:


 * Proposed Indexes:
 * the ones defined on and, respectively
 * , if we use the value_prefix column.

wbq_ _entityid

 * Description: Index table for entity IDs.
 * Status: Pretty definite
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * `value_identity` would store the ID of the target entity itself.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * All queries for this type of data will be for exact values, and can thus use value_identity. We could index and query the ID field, but that seems pointless.


 * Proposed Indexes:
 * just the ones defined on and, respectively

wbq_ _time

 * Description: Index table for time values.
 * Status: Pretty definite
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * `value_identity` would store a hash of the full value


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:


 * Proposed Indexes:
 * the ones defined on and, respectively
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * An RTREE index has been suggested for queries against ranges, should be investigated.

wbq_ _quantity

 * Description: Index table for quantities
 * Status: Pretty definite
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * `value_identity` would store a hash of the full value


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:


 * Proposed Indexes:
 * the ones defined on and, respectively
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * An RTREE index has been suggested for queries against ranges, should be investigated.

wbq_ _geo

 * Description: Index table for geo-coordinates
 * Status: Pretty definite
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * `value_identity` would store a hash of the full value


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:


 * Proposed Indexes:
 * the ones defined on and, respectively
 * note that value_globe has a very low cardinality (about 3), and one value (the one for earth) will be used in 99% of the entries. This it has been omitted from the index.
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * note that value_globe has a very low cardinality (about 3), and one value (the one for earth) will be used in 99% of the entries. This it has been omitted from the index.
 * An RTREE index has been suggested for queries against ranges, should be investigated.
 * note that value_globe has a very low cardinality (about 3), and one value (the one for earth) will be used in 99% of the entries. This it has been omitted from the index.
 * An RTREE index has been suggested for queries against ranges, should be investigated.

wbq_ _special

 * Description: Index tables for snaks which represent a special status of the value (e.g. indicating that no value exists, or an unknown value exists).
 * Status: Tentative, may be left empty or omitted at first.
 * Columns:
 * like or, respectively.
 * `value_identity` would store either 'novalue' or 'somevalue'


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:


 * Proposed Indexes:
 * the ones defined on and, respectively