Talk:Wikibase/DataModel/JSON

From mediawiki.org

This page is helpful https://www.wikidata.org/wiki/Help:Wikidata_datamodel

not JSON

JSON doesn't include sequences of JSON objects so the dump is not really in JSON. This should be prominently mentioned.

Also to be mentioned - each entity is in the dump.

Did you look at the dump? It has the form [ {...}, {...}, ... ]. A sequence of (entity) objects (though the order is insignificant). -- Daniel Kinzler (WMDE) (talk) 12:34, 24 August 2016 (UTC)

Typos?

  • In section #Time, it says: Universal time universal time. Looks like the link intended is w:Universal time.
  • It says: "In JSON dumps, each entity is encoded in as a single line". IMO the second "in" is to be removed (or is it my understanding of English?). - DePiep (talk) 17:08, 5 April 2018 (UTC)

Disputed interpretation

In an edit at wikidata:Help:Dates, Jarekt changed a statement to read "Wikibase software interprets years 1801-1900 with precision 7 as 19th century" and "Wikibase software interprets years 1001-2000 with precision 6 as second millenium". There is discussion on the associated talk page.

I believe Jarekt is referring to the interactive user interface, but I consider it wrong to refer to that interface as "Wikibase software": the JSON API is just as much a part of Wikibase software as the interactive interface. The JSON data model documentation does not use the term millennium at all, and only uses "century" in a section that speculates about future developments. The document states "That is, 1988-07-13T00:00:00 with precision 8 (decade) will be interpreted as 198?-??-??". Clearly this should be extended to interpreting precision 7 (100 years) as 18??-??-??, so according to this document it is incorrect to interpret 1900/precision 7 as being in the same range of uncertainty as 1801/precision 7.
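
The literal reading argued for here can be sketched in a few lines of Python. This is only an illustration: `mask_by_precision` is a hypothetical helper, not part of Wikibase, and it simply extends the digit masking that the quoted sentence shows for precision 8 to the other precision codes (6 millennium, 7 century, 8 decade, 9 year, 10 month, 11 day). It drops the sign of the year and does not reproduce what the interactive interface displays.

```python
import re

# Number of trailing year digits hidden for the coarse precisions
# (6 = millennium, 7 = century, 8 = decade).
_MASKED_YEAR_DIGITS = {6: 3, 7: 2, 8: 1}


def mask_by_precision(timestamp: str, precision: int) -> str:
    """Replace the digits below the given precision with '?'."""
    if not 6 <= precision <= 11:
        raise ValueError("this sketch only handles precisions 6 to 11")
    match = re.match(r"^[+-]?(\d{4,})-(\d{2})-(\d{2})", timestamp)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {timestamp!r}")
    year, month, day = match.groups()
    if precision <= 10:  # month precision or coarser: the day is unknown
        day = "??"
    if precision <= 9:   # year precision or coarser: the month is unknown
        month = "??"
    hidden = _MASKED_YEAR_DIGITS.get(precision, 0)
    if hidden:
        year = year[: len(year) - hidden] + "?" * hidden
    return f"{year}-{month}-{day}"


decade = mask_by_precision("+1988-07-13T00:00:00Z", 8)
century_1801 = mask_by_precision("+1801-00-00T00:00:00Z", 7)
century_1900 = mask_by_precision("+1900-00-00T00:00:00Z", 7)
```

Under this reading, 1801 and 1900 at precision 7 land in different masked ranges, which is the disagreement with the "1801-1900" wording.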

Empty labels/descriptions/aliases/sitelinks give empty array instead of object

On Wikidata, when an entity doesn't have a single label/description/alias/sitelink (this probably applies to claims as well, though I haven't verified), the corresponding property in the JSON has an empty array as its value, rather than a plain object as usual. Is this intended? Here's an example: https://www.wikidata.org/wiki/Special:EntityData/Q61519072.json (if it has been changed, try finding another item to look at on wikidata:Special:ItemsWithoutSitelinks). I imagine this may cause issues for clients in some cases. Luckily in JavaScript an array is just a specialized type of object, so JS clients may not even notice, but other client languages might not be so lucky. --NoInkling (talk) 01:33, 6 February 2019 (UTC)
Turns out it's already in the bug tracker, I've added a link. --NoInkling (talk) 01:52, 6 February 2019 (UTC)
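
A client in a stricter language can work around this by normalising the entity right after decoding. The following is a minimal Python sketch under assumptions: `normalize_empty_maps` is a hypothetical helper name, the sample entity is made up rather than copied from Wikidata, and "claims" is included only as a precaution since the behaviour was not verified for it.

```python
import json

# Top-level members that are normally JSON objects (maps).
# "claims" is included as a precaution; the post did not verify it.
MAP_MEMBERS = ("labels", "descriptions", "aliases", "sitelinks", "claims")


def normalize_empty_maps(entity: dict) -> dict:
    """Return a shallow copy where empty arrays in map members become {}."""
    fixed = dict(entity)
    for key in MAP_MEMBERS:
        if fixed.get(key) == []:
            fixed[key] = {}
    return fixed


# Made-up sample shaped like the reported output (not real Wikidata content).
raw = (
    '{"id": "Q0", "labels": {"en": {"language": "en", "value": "x"}}, '
    '"descriptions": [], "aliases": [], "sitelinks": []}'
)
entity = normalize_empty_maps(json.loads(raw))
```

After this step, code such as `entity["descriptions"].get("en")` works whether or not the entity has any descriptions.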

JSON model much too verbose for languages

  "labels": {
    "en": {
      "language": "en",
      "value": "New York City"
    },
    "ar": {
      "language": "ar",
      "value": "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643"
    },
    "fr": {
      "language": "fr",
      "value": "New York"
    }
  },
  "aliases": {
    "en": [
      {
        "language": "en",
        "value": "NYC"
      },
      {
        "language": "en",
        "value": "New York"
      }
    ],
    "fr": [
      {
        "language": "fr",
        "value": "New York City"
      }
    ]
  },
  "descriptions": {
    "en": {
      "language": "en",
      "value": "largest city in New York and the United States of America"
    },
    "it": {
      "language": "it",
      "value": "citt\u00e0 degli Stati Uniti d'America"
    }
  },

Why isn't it simply this?

  "labels": {
    "en": "New York City",
    "fr": "New York",
    "ar": "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643"
  },
  "aliases": {
    "en": [
      "NYC",
      "New York"
    ],
    "fr": [
      "New York city"
    ]
  },
  "descriptions": {
    "en": "largest city in New York and the United States of America",
    "it": "citt\u00e0 degli Stati Uniti d'America"
  },

There's no need to create a separate subobject for each value just to pair it with its language code: the value can be used directly as the value of the language key (for labels and descriptions), or as an element of an ordered list (with implicit numeric keys) assigned to that language key (for aliases).
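
For illustration, the conversion from the current format to this simplified one can be sketched client-side in Python. `compact_terms` is a hypothetical helper, not an existing API; it only assumes the record layout shown in the first sample, and it treats a missing or empty member as an empty map.

```python
def compact_terms(entity: dict) -> dict:
    """Collapse {"language": ..., "value": ...} records into plain strings."""
    labels = entity.get("labels") or {}
    descriptions = entity.get("descriptions") or {}
    aliases = entity.get("aliases") or {}
    return {
        "labels": {lang: rec["value"] for lang, rec in labels.items()},
        "aliases": {
            lang: [rec["value"] for rec in records]
            for lang, records in aliases.items()
        },
        "descriptions": {
            lang: rec["value"] for lang, rec in descriptions.items()
        },
    }


# Abridged copy of the verbose sample above.
verbose = {
    "labels": {
        "en": {"language": "en", "value": "New York City"},
        "fr": {"language": "fr", "value": "New York"},
    },
    "aliases": {
        "en": [
            {"language": "en", "value": "NYC"},
            {"language": "en", "value": "New York"},
        ],
    },
    "descriptions": [],  # an empty array, as some entities report
}
compact = compact_terms(verbose)
```

This only shows that no information is lost by the simplification; it does not save anything on the server side, since the verbose form still has to be loaded first.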

Could a new API using a simplified JSON be provided, to save a lot of memory while loading objects from queries? Note that the strings loaded from Wikibase are NOT atomized: many keys and values are repeated (like "en", "fr", "New York city", and "New York" in this example) and considered as separate values. And each returned subobject is a separate array even if it has the same content (the same {"language":"code", "value":"xyz"}, when it could simply be a single string "xyz"). In Scribunto, each new object is counted even if it is garbage collected.

This model does not really help the reuse of Wikidata. The only case where one would want more properties on the values (other than the language) would be if the value was associated with some other qualifiers, but there are no qualifiers for labels, aliases and descriptions.

And in fact, given that aliases are also in an ordered list, the labels and aliases could be in the same array, where the first position is the label (and the following positions are aliases):

  "labels": {
    "en": [
      "New York City",
      "NYC",
      "New York"
    ],
    "ar": [
      "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643"
    ],
    "fr": [
      "New York",
      "New York city"
    ]
  },
  "descriptions": {
    "en": "largest city in New York and the United States of America",
    "it": "citt\u00e0 degli Stati Uniti d'America"
  },
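
A small Python sketch of this merge, assuming labels and aliases are already plain strings keyed by language. `merge_labels_and_aliases` is a hypothetical name, and using `None` in the first position for a language that has aliases but no label is an assumption of this sketch; the proposal itself does not say how that case would be represented.

```python
from typing import Dict, List, Optional


def merge_labels_and_aliases(
    labels: Dict[str, str], aliases: Dict[str, List[str]]
) -> Dict[str, List[Optional[str]]]:
    """Build one list per language: position 0 is the label, the rest aliases."""
    merged: Dict[str, List[Optional[str]]] = {}
    for lang in sorted(set(labels) | set(aliases)):
        # Assumption: None marks "no label" when only aliases exist.
        merged[lang] = [labels.get(lang)] + list(aliases.get(lang, []))
    return merged


labels = {
    "en": "New York City",
    "ar": "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643",
    "fr": "New York",
}
aliases = {"en": ["NYC", "New York"], "fr": ["New York city"]}
merged = merge_labels_and_aliases(labels, aliases)
```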

If you later want some (language, value) pair to carry other qualifiers (like best-ranking, usage, or grammatical info such as gender), you can still replace the isolated string by an object with a required "value" key for the string, and other optional keys for the qualifiers and their own values, but there is still no need to import the language into it:

  "labels": {
    "fr": [
      {"value": "New York", "P<property-id for usage>": "common", "P<property-id for gender>": "Q<item-id for feminine>"},
      "New York city"
    ]
  },

But for now everything looks as if this JSON followed the model of RDF triples, without trying to compress the common references in triples, creating a lot of unnecessary duplication of objects that should be the same. This has severe performance penalties. All JSON data returned from queries should use a more compact format to reduce the memory footprint.

Note as well that there's still no easy way to query a specific property of an item without loading all properties of the item (except for labels, aliases and descriptions, which can be filtered for a specific language if needed when we don't care about resolving other fallback languages, or fetched as a list for the specified language(s) and their fallbacks). Here also this has a severe cost.

Note that labels/aliases and descriptions are just particular properties, exactly like the property "first name", except that they don't need a property ID as these properties have static names "label", "description".

This also means that the compression indicated above can be applied as well to values of a property and their attached qualifiers, i.e. their attached (prop-id, value) pairs.

Verdy p (talk) 15:51, 12 May 2020 (UTC)

The representation of claims is also much too verbose, with a lot of redundancy. There are many requirements, such as enforcing the presence of a "type":"*" pair which can only be valid if its value matches an expected type (like "statement"). Such enforcement could just be checked by the client interface, then dropped from the returned data. Likewise, statements listed for a given property always reference the property itself (whose ID should match). The "numeric-id" is not needed at all and can be dropped; it should even be deprecated completely (only the full ID as a string, like "Q12" or "P50", is accurate, not 12). The "hash":"*" pair is also not needed (it is equivalent to the unique object reference in Lua: once the object is built from this unique hash and has a reference, we don't need it in the client). All this consumes a lot of memory (notably when we use getEntity(), which loads everything).
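
As an illustration of how much of a statement is affected, here is a minimal Python sketch that strips the members named above from decoded JSON. `strip_redundant` is a hypothetical helper and the sample statement is made up (the statement ID and hash are placeholders); only "type": "statement" is dropped, so that the "type" member of a datavalue is kept.

```python
# Members the post considers unnecessary wherever they appear.
DROP_KEYS = {"hash", "numeric-id"}


def strip_redundant(node):
    """Recursively drop "hash", "numeric-id" and the constant "type": "statement"."""
    if isinstance(node, dict):
        return {
            key: strip_redundant(value)
            for key, value in node.items()
            if key not in DROP_KEYS
            and not (key == "type" and value == "statement")
        }
    if isinstance(node, list):
        return [strip_redundant(item) for item in node]
    return node


# Made-up statement shaped like the documented structure.
statement = {
    "id": "Q60$placeholder",
    "type": "statement",
    "rank": "normal",
    "mainsnak": {
        "snaktype": "value",
        "property": "P31",
        "hash": "placeholder-hash",
        "datatype": "wikibase-item",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "numeric-id": 515, "id": "Q515"},
        },
    },
}
slim = strip_redundant(statement)
```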
getEntity() is very costly: there's no way to automatically filter the set of properties we really need (the only filter we have is getBestStatements(), which is not enough).
We should be able to load an object partially, with ONLY the list of property IDs (not necessarily sorted) that have matching statements (possibly filtered by minimum rank, so we can immediately discard statements with insufficient rank, as well as properties whose values are "special/unknown"), so that we can check very fast whether an object has a defined property with at least one defined value, and then query just those values (and their details like their type and rank). Another filter would say whether we want the qualifiers or not (and just the list of qualifier types, so we can manually check whether some qualifiers are interesting to get, and then query the qualifiers selectively for a specific statement identified by the item, property type, and property number of that type if the property has multiple values).
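
The semantics of such a filter can be sketched in Python over an already decoded entity. This is only an illustration of what the requested query would return: `property_ids_with_values` is a hypothetical function, the rank ordering is an assumption of the sketch, and the property IDs in the sample are placeholders. Running it client-side does not avoid the loading cost described above.

```python
# Assumed ordering of the three statement ranks, lowest first.
RANK_ORDER = {"deprecated": 0, "normal": 1, "preferred": 2}


def property_ids_with_values(entity: dict, min_rank: str = "normal") -> list:
    """List property IDs having at least one statement of sufficient rank
    whose main snak holds an actual value (not "somevalue" or "novalue")."""
    threshold = RANK_ORDER[min_rank]
    found = []
    for property_id, statements in (entity.get("claims") or {}).items():
        for statement in statements:
            rank_ok = RANK_ORDER[statement.get("rank", "normal")] >= threshold
            has_value = statement["mainsnak"]["snaktype"] == "value"
            if rank_ok and has_value:
                found.append(property_id)
                break  # one matching statement is enough
    return found


# Made-up entity: P1 has a normal value, P2 only a deprecated one,
# P3 only an unknown value.
entity = {
    "claims": {
        "P1": [{"rank": "normal", "mainsnak": {"snaktype": "value"}}],
        "P2": [{"rank": "deprecated", "mainsnak": {"snaktype": "value"}}],
        "P3": [{"rank": "normal", "mainsnak": {"snaktype": "somevalue"}}],
    }
}
usable = property_ids_with_values(entity)
with_deprecated = property_ids_with_values(entity, min_rank="deprecated")
```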
With all these filters, Wikibase could be tuned to implement sublayers of caches. For now it caches only full items (up to 15) or full statements (up to 50). It could cache qualifiers separately; the cache for items does not need to contain the statements and qualifiers, and the cache for statements need not contain the qualifiers directly: they can just remain weak references which can be autocleaned. It would be up to the client application to maintain its own cache for the items, statements and qualifiers that it is really interested in, and the default caches, containing less data per element, could hold more elements than just 15 full items or 50 full statements. The cache for labels should be separated as well: the items in cache would just have a weak reference for labels/descriptions/aliases, with one entry per language of interest (this means that the language must be one of the available filters, the item keeping only a list of language codes, and the query could also implement BCP47 for loading only the languages needed for fallback resolution). Verdy p (talk) 17:09, 19 May 2020 (UTC)
Thank you. We're currently collecting input for API improvements and will take this into account. --Lydia Pintscher (WMDE) (talk) 13:49, 9 June 2020 (UTC)

Missing documentation about various item types

  1. In the “Top Level Structure” section:
    • Change this:
      type
      The entity type identifier: “item” for data items, “property” for properties, “lexeme” for lexemes, “form” for lexicographic forms of lexemes, and “sense” for distinct semantic senses of lexemes (which allow pairing lexemes across linguistic translations).
      → This adds missing entity types.
  2. In the “Labels, Descriptions and Aliases” subsection:
    • Change this:
      Labels, descriptions and aliases are represented using the same basic data structure. For each language, a record (for labels and descriptions) or a list of records (for aliases) is associated to the language code, each record defining the following fields:
      → The structure for aliases is different from labels and descriptions, as the (language,value) records are nested in a distinct list for each language (these records per language are listed in arbitrary order, with non-significant number keys).
  3. In the “Statements” section:
    • Change this:
      type
      Always “statement”. (Historically, “claim” used to be another valid value here.)
      → This should be italic (its static value is always “statement”, it is clearly redundant, and may be safely removed from JSON data, just like “claim” was merged as equivalent)
    • Change this:
      mainsnak
      The Snak representing the value to be associated with the property. See Snaks below. The Property specified in the main Snak must be the same as the Property the Statement is associated with.
      → This is a required item that only contains one snak (i.e. a single occurrence of snaktype, datatype, datavalue, and property in its members). This encapsulation is not needed at all: all the members could be moved upward as direct members of the statement itself, without creating any conflict. “mainsnak” should not be in italic as long as it is required (if it's no longer required, then it will be removed and its members merged upward in the declared item of any list of properties).
  4. In the “Snaks” subsection:
    • This section actually describes the “claims” type of statement (for now there's no other type of statement, such as “excludes” to create exclusions of the listed properties, or “probably” for claims that are most probable but not really asserted, as there are possible cases where some claims would be wrong).
      → There should then be a prior “Statements” section saying that the value of “claims” is an unordered list of properties (keyed by property ID).
      → There should be another prior “Properties” section saying that its value is an ordered list of snaks (without any key, but with an implicit integer index). As there's no "properties-order" member, this order is not really significant, but it can be used by default (in the Wikidata interface, there's no easy way to reorder the list of snaks for the same property, nor to reorder the list of aliases for items).
      Clearly this description is in the wrong section and should be moved up before the Snaks themselves, inside the Statement section, and before the list of members of actual statements (id, mainsnak, rank, qualifiers, references).
    • The top-level specification says that “Keys in JSON objects are unique, their order is not significant.” But what then is the use of “snaks-order”? This makes no sense at all, except as a hack for the Wikidata UI. Other orders used in Wikidata do not need such data (e.g. discriminating properties and constraints, or ordering properties in Wikidata), and this order does not have to be followed on other sites that may order or group properties differently, notably infoboxes on Commons or the various Wikipedias, each having its own preferred order of presentation. The Wikidata UI also offers no way at all to easily reorder the set of aliases for a given language, the set of snaks for a given property (they are not even sorted automatically by rank!), or the set of properties in qualifiers and their own internal order of snaks by property. The only existing way to change the order is by complex editing, which is very error-prone and time-consuming: removing the entries and re-adding them at the end of the lists. This is a defect of the Wikidata UI, not of Wikibase. In all cases, "snaks-order" has no meaning at all and should not even be exposed in JSON data. Note that some clients expect an order in lists of properties or lists of values, notably infoboxes, which often have restrictions: they display only one or a small number of values, and these will be the first ones found in the JSON order for unkeyed lists even if this order is not significant, or the first properties in the quite arbitrary key order. In all cases, “snaks-order”, which lists one or more property IDs (and not necessarily all of them), plays no role at all, as it is not this partial order that really limits the number of values displayed when such a limitation applies!
    • Change this:
      property
      The ID of the property this Snak is about.
      → This is always a redundant member which can be safely removed. In fact snaks could be created for objects other than just properties: for example, a snak could be set for lexicographic forms to indicate that a lexeme has no feminine form, no plural form, no grammatical case-sensitive form, or no form for some subject or for some conjugating tense or mode; in that case there would be no property to refer to, so this field would necessarily become optional. As the “property” member is not required at all, it should be italic.
  5. In the “Data Values” subsection, some value types are presented, but the list of possible values for “[snak].datavalue.type”, with the respective type or structure for their associated “[snak].datavalue.value” field is incomplete:
    • wikibase-entityid
      Entity IDs are used to reference entities on the same repository. They are represented by a map structure containing three fields:
      • “entity-type”: defines the type of the entity, such as “item”, “property”, “lexeme”, “form”, or “sense”.
      • “id”: the full entity ID.
      • “numeric-id”: for some entity types (item, property), the numeric part of the entity ID.
        WARNING: not all entity IDs have a numeric ID – using the full ID is highly recommended.
      → These “datavalue.type”s also match the “datatype” member of any snak, set respectively to “wikibase-item”, “wikibase-property”, “wikibase-form”, “wikibase-sense” (which are undocumented; these are actually the same types, but the values of the two fields in JSON are distinct, with or without the “wikibase-” prefix; there's also no “wikibase-” prefix at the top level, but it is always exposed in the “datatype” member of snaks, where it should be removed, so that “datavalue.type” without this prefix would become redundant).
      → So there's a redundancy between “datatype” (actually a subtype) and “datavalue.type” (the parent type), which also exists when the “datatype” (subtype) is “string” (whose parent type in “datavalue.type” is also “string”, without needing another "wikibase-" prefix). The doc is really unclear about this.
    • string
      commonsMedia
      external-id
      geo-shape
      tabular-data
      math
      musical-notation
      url
      These are all based on a base type “string” in “[snak].datavalue.type”. They are language-independent. Only “[snak].datatype”=“string” is unrestricted; it is generally used for various codes and notations when other subtypes are not suitable (those have custom editing modes in Wikidata or custom rendering, like geo shapes, math formulas, and references to external databases by ID or URL, including media on Commons, with the exception of "site links", which are associated to base entities and represented separately outside of their statements).
      → These subtypes are missing and should be added to the list.
      → Note that the subtype “commonsMedia” in “[snak].datatype” is the only one exposed in JSON in camelCase instead of using lowercase only and hyphens “-” between words (another incoherence)...
    • monolingualtext
      Its value is a record (used instead of base type “string”) for text which is language-dependent but which is not the label or one of the aliases of another entity (if this is the case, “wikibase-entityid” should be used instead); it may, however, be the label or one of the aliases of the current entity (for example in the value of some property for the "official name" of the entity in a designated language), or a code. The record contains two fields:
      • “language”: a language code.
      • “text”: the actual text in that language.
      → This subtype is used but missing in the documentation; it should be added to the list of possible values for “[snak].datavalue.type”. It's in fact the same type of record used for individual labels, individual descriptions, and individual items listed in aliases, except that the record field containing the actual text is named “text” instead of “value”. (Its base type in “[snak].type” is also “monolingualtext”, which currently has no other derived subtype in “[snak].datavalue.type”; other possible subtypes could eventually be defined for language-dependent math formulas or URLs; in property values of base type “string”, the language can be specified with qualifiers; subtypes may however be needed for language-dependent values of qualifiers.)
    • globecoordinate
      → This subtype (used in “[snak].datavalue.type”) has for now a single derived type, named differently as “globe-coordinate” (with a hyphen!) in “[snak].datatype”; no other types are defined.
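
The pairing between the two fields described in this list can be written down as a lookup table. The Python sketch below covers only the “[snak].datatype” values named in this post, so it is not exhaustive, and `is_consistent` is a hypothetical helper for checking a decoded snak against the table.

```python
# "[snak].datatype" -> expected "[snak].datavalue.type",
# limited to the datatypes named in the list above.
DATATYPE_TO_VALUE_TYPE = {
    "wikibase-item": "wikibase-entityid",
    "wikibase-property": "wikibase-entityid",
    "wikibase-form": "wikibase-entityid",
    "wikibase-sense": "wikibase-entityid",
    "string": "string",
    "commonsMedia": "string",
    "external-id": "string",
    "geo-shape": "string",
    "tabular-data": "string",
    "math": "string",
    "musical-notation": "string",
    "url": "string",
    "monolingualtext": "monolingualtext",
    "globe-coordinate": "globecoordinate",
}


def is_consistent(snak: dict) -> bool:
    """Check that a snak's datavalue type matches its datatype.

    Snaks without a value ("somevalue", "novalue") must carry no datavalue.
    Datatypes missing from the table are reported as inconsistent.
    """
    if snak.get("snaktype") != "value":
        return "datavalue" not in snak
    expected = DATATYPE_TO_VALUE_TYPE.get(snak.get("datatype"))
    return expected is not None and snak["datavalue"]["type"] == expected


good = {
    "snaktype": "value",
    "datatype": "globe-coordinate",
    "datavalue": {"type": "globecoordinate", "value": {}},
}
bad = {
    "snaktype": "value",
    "datatype": "url",
    "datavalue": {"type": "monolingualtext", "value": {}},
}
```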

In summary, the subtypes used in “[snak].datavalue.type” (and listed in the JSON model) have no value at all in applications: they are just an internal representation for effective values that are defined more completely in snaks, either with the “[snak].datatype” field (only when “[snak].snaktype” indicates a “value”), or with “[snak].snaktype” (which indicates special values like “novalue” and “somevalue”, which have no significant internal representation needing “[snak].datatype” and “[snak].datavalue”). So this JSON model should just expose the whole internal representation of datavalues, separately from “[snak].datatype”, which is extensible (and actually used by client applications) and has many more types, listed completely on another special page and used to define entities (properties/constraints, qualifiers, etc.).

In summary: the presentation of this page is very confusing. It is badly structured (very bad if this is official API documentation), incomplete (missing types), and unfortunately blocked from editing. The whole page should be rewritten. The blocking message instructs us to go to a GitHub page which does not exist at all (or was removed, because there was no longer any maintenance there). The addition of lexical items in Wikibase for Wikidata made this page even worse! It is clearly unusable as is. The only thing we can do is to inspect the JSON data itself, and... guess. And the "Wikidata" module does not remove any of the redundancy, and it has very poor performance, using a lot of memory or CPU time and having no filters at all (it explodes on Commons).

All this demonstrates that the JSON schema was not properly tested or coherently documented. This makes Wikidata costly to use in Lua modules (e.g. on Commons or Wikipedia) and other applications. All these are in fact bugs of Wikibase in its PHP implementation (not really of Wikidata itself, which should have made this). Wikibase should be reworked to expose another, coherent JSON data model, without breaking its genericity, but with much less verbose, unnecessary or incoherent output that is hard to parse. The terminology adopted also mixes several uses of the term "type" where it should be "basetype" or "subtype"... Statements, properties, snaks (individual property values) have to be reformulated much more clearly.

I'm about to write an alternative to the existing Wikidata module (based only on mw.wikibase) that will compress the data, parse the snaks, and allow filtering out properties or ranks we are not interested in, so that we can save memory and avoid costly repeated queries (because the native Wikibase client in PHP has very small caches). Verdy p (talk) 15:22, 21 May 2020 (UTC)

General Row Structure?

I think it makes sense to also add this section right above the Top Level Structure section, to make things clearer for our users:

General Row Structure

Programmatically and for performance, we provide the dump in this flexible format so that you do not necessarily have to treat or read it as one huge JSON array. Instead, it can be streamed or read by any line-reader program, library, or tool (JSON lines, ETL tools, etc.), optionally skipping the first and last lines as a header and footer, and parsing each line (in no significant order) as a unique JSON object representing an entity in Wikidata at the time of the dump.

[
{},
{},
{}
]
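
For example, a line-by-line reader could look like the following Python sketch. It assumes the layout shown above (one entity per line, each but the last followed by a comma, with "[" and "]" on their own lines); `iter_entities` is a hypothetical helper name and the sample text stands in for a real dump.

```python
import io
import json
from typing import Iterator, TextIO


def iter_entities(lines: TextIO) -> Iterator[dict]:
    """Yield one decoded entity per line without loading the whole array."""
    for line in lines:
        line = line.strip()
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        # Every entity line except the last ends with a separating comma.
        yield json.loads(line.rstrip(","))


# Tiny stand-in for a dump file.
sample = io.StringIO('[\n{"id": "Q1"},\n{"id": "Q2"}\n]\n')
ids = [entity["id"] for entity in iter_entities(sample)]
```

The same function accepts any text file object, for instance one returned by `gzip.open(path, "rt")` for a compressed dump.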

Thoughts, Agreement? --Thadguidry (talk) 17:47, 8 December 2020 (UTC)