Talk:Wikibase/DataModel/JSON

This page is helpful: https://www.wikidata.org/wiki/Help:Wikidata_datamodel

not JSON
The JSON specification doesn't include sequences of JSON objects, so the dump is not really in JSON. This should be prominently mentioned.

Also to be mentioned: each entity is included in the dump.


 * Did you look at the dump? It has the form [ {...}, {...}, ... ]. A sequence of (entity) objects (though the order is insignificant). -- Daniel Kinzler (WMDE) (talk) 12:34, 24 August 2016 (UTC)
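A minimal sketch of the dump layout described above (entity objects abbreviated to a few fields; real entities carry far more data). The whole file is one JSON array, with each entity object serialized on its own line:

 [
 {"type":"item","id":"Q1","labels":{},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{}},
 {"type":"item","id":"Q2","labels":{},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{}}
 ]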

Typos?

 * In section #Time, it says: "Universal time universal time". It looks like the intended link is w:Universal time.
 * It says: "In JSON dumps, each entity is encoded in as a single line". IMO the second "in" is to be removed (or is it my understanding of English?). - DePiep (talk) 17:08, 5 April 2018 (UTC)

Disputed interpretation
In an edit at wikidata:Help:Dates, Jarekt changed statements to read "Wikibase software interprets years 1801-1900 with precision 7 as 19th century" and "Wikibase software interprets years 1001-2000 with precision 6 as second millennium". There is discussion on the associated talk page.

I believe Jarekt is referring to the interactive user interface, but I consider it wrong to refer to that interface as "Wikibase software". I believe the JSON API is just as much a part of Wikibase software as the interactive interface. The JSON data model documentation does not use the term millennium at all, and only uses "century" in a section that speculates about future developments. The document states "That is, 1988-07-13T00:00:00 with precision 8 (decade) will be interpreted as 198?-??-??", and clearly this should be extended to precision 7 (100 years): 1988 would be interpreted as 19??-??-??, and 1801 as 18??-??-??. So interpreting 1900/precision 7 as being in the same range of uncertainty as 1801/precision 7 is incorrect, according to this document.
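For reference, a time value with precision 7 (a century) has this shape in the JSON model under discussion; this sketch follows the documented time format, with the disputed year 1900:

 {
   "time": "+1900-00-00T00:00:00Z",
   "timezone": 0,
   "before": 0,
   "after": 0,
   "precision": 7,
   "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
 }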

Empty labels/descriptions/aliases/sitelinks give empty array instead of object
On Wikidata, when an entity doesn't have a single label/description/alias/sitelink (this probably applies to claims as well, though I haven't verified), the corresponding property in the JSON will have an empty array as its value, rather than a plain object as usual. Is this intended? Here's an example: https://www.wikidata.org/wiki/Special:EntityData/Q61519072.json (if it's been changed, try finding another item to look at on wikidata:Special:ItemsWithoutSitelinks). I imagine this may cause issues for clients in some cases. Luckily, in JavaScript an array is just a specialized type of object, so JS clients may not even notice, but clients in other languages might not be so lucky. --NoInkling (talk) 01:33, 6 February 2019 (UTC) Turns out it's already in the bug tracker; I've added a link. --NoInkling (talk) 01:52, 6 February 2019 (UTC)
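To make the discrepancy concrete, a sketch of the two shapes (abbreviated; the non-empty case follows the documented format):

 With at least one label (an object keyed by language code):
  "labels": {"en": {"language": "en", "value": "example"}}
 With no labels at all (an empty array instead of an empty object):
  "labels": []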

JSON model much too verbose for languages
Why isn't it simply this?
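(A reconstruction of the kind of comparison presumably meant here, since the original example is missing from this page; it uses the New York labels referenced below, with illustrative values:)

 Current format:
  "labels": {
    "en": {"language": "en", "value": "New York city"},
    "fr": {"language": "fr", "value": "New York"}
  },
  "aliases": {
    "en": [{"language": "en", "value": "NYC"}],
    "fr": [{"language": "fr", "value": "New York city"}]
  }

 Simplified format:
  "labels": {"en": "New York city", "fr": "New York"},
  "aliases": {"en": ["NYC"], "fr": ["New York city"]}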

There's no need to create a separate subobject for each value, each one grouped with its language code, when these values can be used directly as the value of the language key (for labels and descriptions), or as items of an ordered list (with implicit numeric keys) which is the value assigned to the same parent language key (for aliases).

Could a new API using a simplified JSON be used, to save a lot of memory while loading objects from queries? Note that the strings loaded from Wikibase are NOT atomized: many keys are repeated (like "en", "fr", "New York city", and "New York" in this example) and are counted as separate values. And each returned subobject is a separate object, even when several have the same content (the same {"language":"code", "value":"xyz"}, when they could simply be a single string "xyz"). In Scribunto, each new object is counted even if it is garbage collected.

This model does not really help the reuse of Wikidata much; the only case where one would want to attach more properties to the values (other than the language) would be if the value were associated with some other qualifiers; but there are no qualifiers for labels, aliases and descriptions.

And in fact, given that aliases are also in an ordered list, the labels and aliases could be in the same array, where the first position is the label (and the following positions are aliases):
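Something like this (a sketch of the proposal, not an existing format; the combined key name "names" is hypothetical, and the first element of each array is the label, the rest being aliases):

 "names": {
   "en": ["New York city", "NYC"],
   "fr": ["New York", "New York city"]
 }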

If you ever want, later, some (language, value) pair to carry other qualifiers (like best-ranking, or usage, or grammatical info like gender), you can still replace the isolated string by an object with a required "value" key for the string, and other optional keys for the qualifiers and their own values, but there is still no need to repeat the language in it:
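For instance (a sketch of that idea; the "gender" qualifier key is hypothetical):

 "labels": {
   "en": "New York city",
   "fr": {"value": "New York", "gender": "feminine"}
 }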

But for now, everything looks as if this JSON followed the model of RDF triples without trying to compress the common references in triples, creating a lot of unnecessary duplication of objects that should be the same. This has severe performance penalties. All this JSON data returned from queries should use a more compact format to reduce the memory footprint.

Note as well that there's still no easy way to query a specific property of an item without loading all the properties of the item (except labels, aliases and descriptions, which can be filtered for a specific language if needed when we don't care about resolving other fallback languages, or fetched as a list for the specified language(s) and their fallbacks). Here too this has a severe cost.

Note that labels/aliases and descriptions are just particular properties, exactly like the property "first name", except that they don't need a property ID, as these properties have the static names "label" and "description".

This also means that the compression indicated above could be applied as well to the values of a property and their attached qualifiers, i.e. their attached (prop-id, value) pairs.
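A sketch of what that compaction might look like for a statement (a hypothetical compact shape, not an existing Wikibase format; "P50", "P585" and the values are only illustrative):

 "claims": {
   "P50": [
     {"value": "Q12", "qualifiers": {"P585": "+2020-00-00T00:00:00Z"}}
   ]
 }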

Verdy p (talk) 15:51, 12 May 2020 (UTC)
 * As well, the representation of claims is much too verbose, with a lot of redundancy: there are many requirements, such as enforcing the presence of a required "type":"*" pair which can only be valid if its value matches an expected type (like "statement"). Such enforcement could just be checked by the client interface, then dropped from the returned data. Likewise, statements listed for a given property always reference the property itself (whose id must match). The "numeric-id" is not needed at all and can be dropped; it should even be deprecated completely (only the full id as a string, like "Q12" or "P50", is accurate, not 12). The "hash":"*" pair is also not needed (it is equivalent to the unique object reference in Lua: once the object is built from this unique hash and has a reference, we don't need the hash in the client). All this consumes a lot of data in memory, notably when we use getEntity, which loads everything (see the sketch after this list for a concrete example of these redundant fields).
 * getEntity is very costly; there's no way to automatically filter the set of properties we really need (the only filter we have is getBestStatements, which is not enough).
 * We should be able to load an object partially, with ONLY the list of property ids (not necessarily sorted) that have matching statements, possibly filtered by a minimum rank (so we can immediately discard statements with insufficient rank, as well as properties whose values are "special/unknown"). That way we could check very fast whether an object has a defined property with at least one defined value, and then query just those values (and their details, like their type and rank). Another filter would say whether we want the qualifiers or not (and just the list of qualifier types, so we can manually check whether some qualifiers are interesting to get, and then query the qualifiers selectively for a specific statement, identified by the item, the property, and the index of the statement within that property if the property has multiple values).
 * With all these filters, Wikibase could be tuned to implement sublayers of caches. For now it caches only full items (up to 15) or full statements (up to 50); it could cache qualifiers separately. The cache for items does not need to contain the statements and qualifiers, and the cache for statements need not contain the qualifiers directly: they can remain weak references which can be autocleaned. It would then be up to the client application to maintain its own cache of the items, statements and qualifiers it is really interested in, and the default cache sizes, with each element containing less data, could be increased beyond just 15 full items or 50 full statements. As well, the cache for labels should be separated: items in the cache would just keep a weak reference to labels/descriptions/aliases, with one entry per language of interest (this means the language must be one of the available filters, the item keeping only a list of language codes, and the query could also implement BCP 47 for loading only the languages needed for fallback resolution). Verdy p (talk) 17:09, 19 May 2020 (UTC)
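For reference, a sketch of the current statement shape criticized in this thread, showing the "type", "numeric-id" and "hash" fields and the repeated property id (field names follow the documented JSON model; the concrete ids, hash and statement GUID are illustrative placeholders):

 "claims": {
   "P50": [
     {
       "type": "statement",
       "id": "Q64$example-guid",
       "rank": "normal",
       "mainsnak": {
         "snaktype": "value",
         "property": "P50",
         "hash": "0123456789abcdef0123456789abcdef01234567",
         "datavalue": {
           "type": "wikibase-entityid",
           "value": {"entity-type": "item", "numeric-id": 12, "id": "Q12"}
         },
         "datatype": "wikibase-item"
       }
     }
   ]
 }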