Extension talk:Wikibase QueryEngine/SQL schema proposal

Add topic
From mediawiki.org
Latest comment: 10 years ago by Duesentrieb in topic Some comments

Some comments[edit]

  • wbq_entities
  • wbq_mainsnak_<type>
    • Same stuff with 'id' field as above
    • PRIMARY KEY is kind of long, especially for the expected table size, unless the index is well sorted for full-row lookups of the vast majority of queries, it might make sense to just make this UNIQUE and pick another index (or surrogate) to be the PRIMARY. The hash column could save some space as base36.
  • wbq_qualifier_<type>
    • (similar to above)
  • wbq_<role>_string
    • (similar to above)
    • Also, how big can 'value' be? I care more if it will be indexed.
  • wbq_<role>_entityid
    • What is the PRIMARY KEY?
  • wbq_<role>_time
    • I guess that would need a surrogate key
  • wbq_<role>_quantity
    • I guess that would need a surrogate key
    • I wonder if RTREE indexes might be useful, depending on the final column layout, this kind of table is tricky SQL
  • wbq_<role>_geo
    • I guess that would need a surrogate key
    • Would be nice to see the proposed index statements (would some use RTREE?), this is another tricky one
    • Why is 'alt' deprecated?
  • wbq_valueless_snaks
    • Blobs in the Primary Key are a little worrying, actually why are they not varbinary? Even still, it would make the PRIMARY kind of long

Aaron (talk) 00:07, 8 November 2013 (UTC)Reply

Thanks for the comments Aaron! I'll reply section by section:

  • wbq_entities
    • Are 'id' values thinks like Q<hex>? Can NUMERIC be used instead? We will have non-numeric IDs for future types of items, like media meta-data, sub-entities for word senses, etc.
    • Can constants be used for 'type' so it can be UNSIGNED INT? In theory yes, but that means a schema change on the live site when we introduce a new entity type. And since different entity types can be defined by different extensions, there's no single place to maintain the full list of possible entity types. It's configurable and depends on which extensions are present.
      • Not following why adding a constant to the app layer would require a schema change? SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)Reply
        • Sorry, I was thinking of a MySQL ENUM. For just using constants: this would be inconsistent with the use of entity types elsewhere in wikibase, both in the database and the code. When we considered the same thing for the ContentHandler (i.e. page_content_model), a decision was made (mostly by Tim) to use human readable strings rather than more brittle/obscure numeric identifiers. The same reasoning holds here, I think, unless there are compelling reasons to use an integer here. -- Duesentrieb 16:55, 13 November 2013 (UTC)Reply
      • Not being able to predict entity types seems like a code smell or normalization issue. A field performing two roles should probably be two fields. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)Reply
  • wbq_mainsnak_<type>
    • Same stuff with 'id' field as above
    • PRIMARY KEY is kind of long, especially for the expected table size, unless the index is well sorted for full-row lookups of the vast majority of queries, it might make sense to just make this UNIQUE and pick another index (or surrogate) to be the PRIMARY. The hash column could save some space as base36.
      • the vast majority of queries would select by property_id and value_hash (aka value_identity), the subject_id is added to the key to make it unique and allow paging.
      • we could use an auto-increment row ID as the primary, and make the above a "simple" unoique key. But how woudl that be better? Doesn't a unique key use just as much room as a primary key?
  • wbq_<role>_string
    • Also, how big can 'value' be? I care more if it will be indexed. I tend to say that we can limit it to a 255 byte prefix, and use the value_hash (aka value_identity) if we want to check fro equality of longer strings. The prefix would then be used only for sorting and prefix matches.
  • wbq_<role>_entityid
    • What is the PRIMARY KEY? See wbq_mainsnak_<type>. wbq_mainsnak_entityid would have the fields and indexes described for wbq_mainsnak_<type> as well as the ones described for wbq_<role>_entityid. The value_hash would be based on the entity id (or we would just use a value_identity column to contain the entity id - then we wouldn't need any additional column at all to represent entity ids).
  • wbq_<role>_time
    • I guess that would need a surrogate key. No, since what I specified for wbq_mainsnak_<type> applies.
  • wbq_<role>_quantity
    • I wonder if RTREE indexes might be useful, depending on the final column layout, this kind of table is tricky SQL. We should investigate this (as well as the possibility to just not have this in MySQL, and use Elastic for this kind of search). I don't know anything about RTREE.
  • wbq_<role>_geo
    • Why is 'alt' deprecated? Because from the perspective of the data model, it's more useful to treat the altitude separately (different source, different method of measurement, different margin of error, different unit, etc).
  • wbq_valueless_snaks
    • Blobs in the Primary Key are a little worrying, actually why are they not varbinary? This is an oversight, they should indeed by varbinary.

I'll try to make the draft a bit more concise when I have a bit more feedback. I currently tend to replace value_hash with value_identity (which may contain a hash or, in the trivial case e.g. for entityid, the full value). -- Duesentrieb ⇌ 10:57, 9 November 2013 (UTC)Reply