Talk:Wikibase/DataModel/JSON

This page is helpful: https://www.wikidata.org/wiki/Help:Wikidata_datamodel

not JSON
JSON doesn't include sequences of JSON objects so the dump is not really in JSON. This should be prominently mentioned.

Also to be mentioned - each entity is in the dump.


 * Did you look at the dump? It has the form [ {...}, {...}, ... ]. A sequence of (entity) objects (though the order is insignificant). -- Daniel Kinzler (WMDE) (talk) 12:34, 24 August 2016 (UTC)
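For anyone who wants to check the shape described above, here is a minimal Python sketch. It assumes the layout mentioned on this page: a top-level array with one entity per line, and the opening and closing brackets on their own lines. The sample text and the name `iter_entities` are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per dump line, skipping the array brackets.

    Assumption: each entity sits on its own line and every entity line
    except the last one ends with a comma.
    """
    for raw in lines:
        line = raw.strip()
        if line in ("", "[", "]"):
            continue
        # Drop the trailing comma that separates array elements.
        yield json.loads(line.rstrip(","))


# Made-up two-entity sample in the assumed layout.
sample_dump = '[\n{"id": "Q1", "type": "item"},\n{"id": "Q2", "type": "item"}\n]\n'

# The whole text is one valid JSON array ...
whole = json.loads(sample_dump)
# ... and it can also be read entity by entity without loading it all.
entity_ids = [entity["id"] for entity in iter_entities(sample_dump.splitlines())]
```

Read as a whole, the text is a single JSON array; the line-by-line reading is only a convenience for large files.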

Typos?

 * In section #Time, it says: Universal time universal time. Looks like the link intended is w:Universal time.
 * It says: "In JSON dumps, each entity is encoded in as a single line". IMO the second "in" is to be removed (or is it my understanding of English?). - DePiep (talk) 17:08, 5 April 2018 (UTC)

Disputed interpretation
In an edit at wikidata:Help:Dates, Jarekt changed statements to read "Wikibase software interprets years 1801-1900 with precision 7 as 19th century" and "Wikibase software interprets years 1001-2000 with precision 6 as second millenium". There is discussion on the associated talk page.

I believe Jarekt is referring to the interactive user interface, but I consider it wrong to refer to that interface as "Wikibase software". I believe the JSON API is just as much a part of Wikibase software as the interactive interface. The JSON data model documentation does not use the term millennium at all, and only uses "century" in a section that speculates about future developments. The document states "That is, 1988-07-13T00:00:00 with precision 8 (decade) will be interpreted as 198?-??-??", and clearly this should be extended to precision 7 (100 years), so that a year such as 1801 is interpreted as 18??-??-??. Interpreting 1900 with precision 7 as being in the same range of uncertainty as 1801 with precision 7 is therefore incorrect, according to this document.
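The digit-masking reading argued for above can be sketched in Python as follows. This is only an illustration of that reading, not a statement of what Wikibase actually does; the function name `mask_by_precision` is made up, and it assumes a non-negative four-digit year and the precision codes 6 (millennium) through 11 (day).

```python
def mask_by_precision(timestamp: str, precision: int) -> str:
    """Replace every digit finer than the given precision with '?'.

    Assumptions: four-digit non-negative year, precision between
    6 (millennium) and 11 (day), timestamp shaped like YYYY-MM-DDT...
    """
    if not 6 <= precision <= 11:
        raise ValueError("this sketch only handles precisions 6 to 11")
    date_part = timestamp.lstrip("+").split("T")[0]
    year, month, day = date_part.split("-")
    if len(year) != 4:
        raise ValueError("this sketch only handles four-digit years")
    # Precision 9 keeps all four year digits, 8 keeps three, 7 two, 6 one.
    kept = 4 - max(0, 9 - precision)
    year = year[:kept] + "?" * (4 - kept)
    if precision < 10:
        month = "??"
    if precision < 11:
        day = "??"
    return f"{year}-{month}-{day}"


decade = mask_by_precision("1988-07-13T00:00:00", 8)
early = mask_by_precision("1801-01-01T00:00:00", 7)
late = mask_by_precision("1900-01-01T00:00:00", 7)
```

Under this reading, 1801 and 1900 at precision 7 land in different ranges, which is the point of disagreement with the "19th century" wording.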

Empty labels/descriptions/aliases/sitelinks give empty array instead of object
On Wikidata, when an entity doesn't have a single label/description/alias/sitelink (this probably applies to claims as well, though I haven't verified), the corresponding property in the JSON will have an empty array as its value, rather than a plain object as usual. Is this intended? Here's an example: https://www.wikidata.org/wiki/Special:EntityData/Q61519072.json (if it's been changed, try finding another item to look at on wikidata:Special:ItemsWithoutSitelinks). I imagine this may cause issues for clients in some cases. Luckily in JavaScript an array is just a specialized type of object, so JS clients may not even notice, but other client languages might not be so lucky. --NoInkling (talk) 01:33, 6 February 2019 (UTC) Turns out it's already in the bug tracker, I've added a link. --NoInkling (talk) 01:52, 6 February 2019 (UTC)
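Until that is resolved, a client in a stricter language can normalize the fields before use. A minimal Python sketch, assuming the field names discussed above; `normalize_empty_maps` is a made-up helper, and including "claims" in the list is an unverified guess:

```python
from typing import Any

# Fields that are normally objects keyed by language, site or property.
# "claims" is included as an assumption; it was not verified above.
MAP_FIELDS = ("labels", "descriptions", "aliases", "sitelinks", "claims")


def normalize_empty_maps(entity: dict[str, Any]) -> dict[str, Any]:
    """Return a shallow copy in which empty-array map fields become {}."""
    fixed = dict(entity)
    for field in MAP_FIELDS:
        if fixed.get(field) == []:
            fixed[field] = {}
    return fixed


# Made-up entity showing the empty-array case next to a normal field.
raw_entity = {
    "id": "Q0",
    "labels": {"en": {"language": "en", "value": "example"}},
    "descriptions": [],
    "sitelinks": [],
}
clean_entity = normalize_empty_maps(raw_entity)
```

After this step, code such as `clean_entity["descriptions"].get("en")` works whether or not the entity has any descriptions.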

JSON model much too verbose for languages
Why isn't it simply this?

There's no need to create a separate subobject for each value, each grouped with its language code, when the values can be used directly as the value of the language key (for labels and descriptions) or as items in an ordered list (with implicit numeric keys) assigned to that same language key (for aliases).
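As an illustration of the compaction described here, a Python sketch that converts the current verbose term structure into the flatter one. The helper name `compact_terms` and the sample values are assumptions made for the example; this is not an existing API.

```python
def compact_terms(entity: dict) -> dict:
    """Flatten {"language": ..., "value": ...} subobjects to plain strings.

    Hypothetical helper: labels and descriptions become lang -> string,
    aliases become lang -> ordered list of strings.
    """
    # "or {}" also covers fields that are missing or arrive as an empty array.
    labels = entity.get("labels") or {}
    descriptions = entity.get("descriptions") or {}
    aliases = entity.get("aliases") or {}
    return {
        "labels": {lang: term["value"] for lang, term in labels.items()},
        "descriptions": {lang: term["value"] for lang, term in descriptions.items()},
        "aliases": {
            lang: [term["value"] for term in terms]
            for lang, terms in aliases.items()
        },
    }


# Made-up sample in the current verbose shape.
verbose = {
    "labels": {
        "en": {"language": "en", "value": "New York City"},
        "fr": {"language": "fr", "value": "New York"},
    },
    "descriptions": {"en": {"language": "en", "value": "city in the United States"}},
    "aliases": {"en": [{"language": "en", "value": "New York"}]},
}
compact = compact_terms(verbose)
```

In the compact form each language code appears once per term type, as a key, instead of a second time inside every subobject.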

Could a new API using a simplified JSON be used, to save a lot of memory while loading objects from queries? Note that the strings loaded from Wikibase are NOT atomized: many keys are repeated (like "en", "fr", "New York city", and "New York" in this example) and are considered separate values. And each returned subobject is a separate array even if they have the same content (the same {"language":"code", "value":"xyz"} when it could simply be a single string "xyz"). In Scribunto, each new object is counted even if it is garbage collected.

This model does not really help the reuse of Wikidata much; the only case where one would want more properties on the values (other than the language) would be if the value was associated with some other qualifiers; but there are no qualifiers for labels, aliases and descriptions.

And in fact, given that aliases are also in an ordered list, the labels and aliases could be in the same array, where the first position is the label (and the following positions are aliases):
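A Python sketch of that merged layout, starting from already flattened labels and aliases. The name `merge_labels_and_aliases` is hypothetical, and using `None` in the first position for a language that has aliases but no label is an assumption of this sketch, since the proposal does not say how to handle that case.

```python
from typing import Optional


def merge_labels_and_aliases(
    labels: dict[str, str],
    aliases: dict[str, list[str]],
) -> dict[str, list[Optional[str]]]:
    """Build lang -> [label, alias1, alias2, ...].

    Assumption: a language with aliases but no label gets None in
    position 0 so that position 0 always means "the label".
    """
    merged: dict[str, list[Optional[str]]] = {}
    for lang in sorted(set(labels) | set(aliases)):
        merged[lang] = [labels.get(lang), *aliases.get(lang, [])]
    return merged


# Made-up sample values.
names = merge_labels_and_aliases(
    {"en": "New York City", "fr": "New York"},
    {"en": ["New York", "NYC"], "de": ["Big Apple"]},
)
```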

If you later want some (language, value) pair to carry other qualifiers (like best-ranking, or usage, or grammatical info like gender), you can still replace the isolated string by an object with a required "value" key for the string, and other optional keys for the qualifiers and their own values, but there is still no need to import the language into it:
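A reader for such a mixed format stays simple. A Python sketch, in which the "gender" key and the helper name `term_value` are purely illustrative and not part of any existing format:

```python
from typing import Union

# A term is either a bare string or an object with a required "value" key.
Term = Union[str, dict]


def term_value(term: Term) -> str:
    """Return the text of a term, whichever of the two shapes it has."""
    return term if isinstance(term, str) else term["value"]


# Made-up sample: the French label carries a hypothetical qualifier.
labels = {
    "en": "New York City",
    "fr": {"value": "New York", "gender": "feminine"},
}
texts = {lang: term_value(term) for lang, term in labels.items()}
```

Clients that ignore qualifiers only ever call `term_value`, so adding optional keys later would not break them.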

But for now everything looks as if this JSON followed the model of RDF triples, without trying to compress the common references in triples, creating a lot of unnecessary duplication of objects that should be the same. This has severe performance penalties. All the JSON data returned from queries should use a more compact format to reduce the memory footprint.

Note as well that there's still no easy way to query a specific property for an item (except labels, aliases and descriptions, which can be filtered for a specific language if needed when we don't care about resolving other fallback languages; or to get a list for the specified language(s) and their fallbacks), without loading all properties of the item. Here also this has a severe cost.

Note that labels/aliases and descriptions are just particular properties, exactly like the property "first name", except they don't need a property ID as these properties have static names "label", "description".

This also means that the compression indicated above can be applied as well to values of a property and their attached qualifiers, i.e. their attached (prop-id, value) pairs.

Verdy p (talk) 15:51, 12 May 2020 (UTC)