Extension:Wikibase QueryEngine/SQL schema proposal

This page describes the (proposed) SQL schema to be used with the Wikibase QueryEngine.

wbq_entities

 * Description: Establishes the identity of entities used as "subjects" in the indexes.
 * Status: Tentative. This is largely redundant to wb_entity_per_page.
 * Columns
 * We may want a meta-info blob here, containing e.g. hashes over the respective entity's entries in each of the  tables. That information could then be used to determine which tables need to be updated when the entity was changed. This would minimize read- and especially write access when updating entities.
 * We may want a meta-info blob here, containing e.g. hashes over the respective entity's entries in each of the  tables. That information could then be used to determine which tables need to be updated when the entity was changed. This would minimize read- and especially write access when updating entities.
 * We may want a meta-info blob here, containing e.g. hashes over the respective entity's entries in each of the  tables. That information could then be used to determine which tables need to be updated when the entity was changed. This would minimize read- and especially write access when updating entities.


 * Expected size (for wikidata.org): a few dozen million. Roughly like the page table. One row per entitiy (data item, property, query, etc).
 * Typical queries:


 * Proposed Indexes:

wbq_mainsnak_

 * Description: Search index for a given value type, associating proverty/value pairs from statements' main snaks with the respective entities (subjects).
 * Status: Beta. Pretty clear, but We may need to add some columns still.
 * Number: About ten: One per value type (possibly, one per value type and entity type).
 * Columns
 * could perhaps be 3x BIG INT for performance
 * for some value types, we may not need or want this, because the value is self-identified (e.g. short strings or simple numbers).
 * we could call this value_identity and use it to store the value proper or a value hash or some other kind of fingerprint, depending on the value type.
 * We may want another field in there indicating the entity type, to avoid joining against wbq_entities. Alternatively, the table could be split so there is one per value type and entity type.
 * We may want to have the statement GUID here too, but that would be a lot of overhead for no obvious benefit.
 * The statement's rank would need to go here if we decided to let users filter by rank at will. If we do not include the rank here, we have to decide which values to include and which to ignore while generating the index.
 * ...and type specific fields, see below.
 * We may want to have the statement GUID here too, but that would be a lot of overhead for no obvious benefit.
 * The statement's rank would need to go here if we decided to let users filter by rank at will. If we do not include the rank here, we have to decide which values to include and which to ignore while generating the index.
 * ...and type specific fields, see below.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. One row for each statement, split by value type.
 * Typical queries:
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.
 * ...and type specific select queries, see below.
 * an auto-increment row_id column may be added if needed, but it doesn't seem to be necessary.


 * Proposed Indexes
 * would need to include the rank too, probably as the first field.
 * ...plus the indexes on type-specific fields as used for querying, see below.
 * would need to include the rank too, probably as the first field.
 * ...plus the indexes on type-specific fields as used for querying, see below.
 * ...plus the indexes on type-specific fields as used for querying, see below.

wbq_qualifier_

 * Description: Search index for a given value type, associating proverty/value pairs from statements' qualifier snaks with the respective entities (subjects) or statements.
 * Status: Tentative. May be left empty or removed for now. Schema should be changed to contain more information about the respective statement's main snak.
 * Number: About ten: One per value type (possibly, one per value type and entity type).
 * Columns: for now, just like
 * We may want to include the main snak's property_id.
 * We may even want to include the main snak's value hash too.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. A few per statement, split by value type.
 * Typical queries:
 * It's still a bit unclear what a query involving qualifiers would typically look like.


 * Proposed Indexes: much like
 * If we have the main snak's property_id (and perhaps value_hash), these would be included in the indexes. This needs more thought.

wbq_ _string

 * Description: Index table for string values.
 * Status: Pretty definite, but using a separate prefix columsn should be considered.
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * like or, respectively.
 * We may want to store a prefix in a varchar, for sorting, etc.
 * We may also have a "short string" type in addition to the general "string" type, for values < 255 bytes.
 * We may also have a "short string" type in addition to the general "string" type, for values < 255 bytes.


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * Initially, all queries will be for exact values, and can thus use value_hash. The examples below are for later use.
 * Prefix search:
 * we will probably need to also filter by entity type, using a join against wbq_entities or a where-condition on an extra column.


 * Proposed Indexes:
 * ...or the respective column containing the prefix for sorting, etc.
 * ...or the respective column containing the prefix for sorting, etc.

wbq_ _entityid

 * Description: Index table for entity IDs.
 * Status: Pretty definite, but the type column may be removed.
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * this is probably not needed
 * this is probably not needed
 * this is probably not needed


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * All queries for this type of data will be for exact values, and can thus use value_hash. We could index and query the ID field, but that seems pointless.


 * Proposed Indexes:
 * none needed

wbq_ _time

 * Description: Index table for time values.
 * Status: Not implemented at present, tentative design.
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * (this is a rough proposal):


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * Note that reange queries like the one above are not in the initial feature set, but queries on the time data type really don't make sense for exact values (exact what - year? day? milisecond?)
 * Note that reange queries like the one above are not in the initial feature set, but queries on the time data type really don't make sense for exact values (exact what - year? day? milisecond?)


 * Proposed Indexes:

wbq_ _quantity

 * Description: Index table for quantities
 * Status: Not implemented at present, tentative design.
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * Note that reange queries like the one above are not in the initial feature set, but queries on exact quantities make little sense (most quantities arn't even defined as a precise value, but with margins of uncertainty).
 * Note that reange queries like the one above are not in the initial feature set, but queries on exact quantities make little sense (most quantities arn't even defined as a precise value, but with margins of uncertainty).


 * Proposed Indexes:

wbq_ _geo

 * Description: Index table for geo-coordinates
 * Status: Implemented, but needs more thought.
 * Number: already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Columns:
 * should probably be a range (min_lat, max_lat)
 * should probably be a range (min_lon, max_lon)
 * deprecated field, should probably just go
 * lat/lon can only be interpreted based on the globe identifier.
 * This can probably be removed
 * deprecated field, should probably just go
 * lat/lon can only be interpreted based on the globe identifier.
 * This can probably be removed
 * This can probably be removed
 * This can probably be removed


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:
 * Should probably support range-type queries over the lat and lon columns, like the tables for time and quantity types. It makes no sense to search for geo-coordinates with absolute precision, down to the micrometer.
 * Note that all queries should check the globe first before comparing the lat/long ranges.


 * Proposed Indexes:
 * Range based indexes for latitude and longitude, similar to the ones used by the tables for time and quantity types.
 * Indexes should likely be composite, with globe being the first component.

wbq_valueless_snaks

 * Description: Index tables for snaks without a specific value (no-value snaks and some-value snaks).
 * Status: Tentative, may be left empty or omitted at first. Unclear whether we need this at all. It would perhaps be easier to treat "novalue" and "somevalue" like special value types. Needs more thought.
 * Columns:


 * Expected size (for wikidata.org): A few hundred million. On the order of the pagelinks table. Already covered by the counts for wbq_mainsnak_ and wbq_qualifier_ above.
 * Typical queries:


 * Proposed Indexes: