Talk:Wikibase/Indexing/RDF Dump Format

Jc3s5h (talkcontribs)

Disputed interpretation

In an edit at wikidata:Help:Dates, Jarekt changed statements to read "Wikibase software interprets years 1801-1900 with precision 7 as 19th century" and "Wikibase software interprets years 1001-2000 with precision 6 as second millenium". There is discussion on the associated talk page.

I believe Jarekt is referring to the interactive user interface, but I consider it wrong to refer to that interface as "Wikibase software". I believe the RDF API is just as much a part of Wikibase software as the interactive interface. The RDF data model documentation indicates that RDF follows the ISO 8601 and XSD 1.1 standards. Both of those indicate reduced precision by truncating the unneeded information. So, for example, for a precision of 100 years and a year of 1900, the value would be truncated to 19 and understood to include any year from 1900 to and including 1999. Jc3s5h (talk) 22:30, 17 July 2018 (UTC)
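
The two readings under dispute can be compared side by side. The following is only a minimal sketch: the function names are hypothetical, it handles positive (CE) years only, and it is not taken from the Wikibase code. `truncation_range` follows the truncation reading described above, and `ordinal_range` follows the "19th century = 1801-1900" reading quoted from Help:Dates.

```python
def truncation_range(year: int, span: int) -> tuple[int, int]:
    """Years covered when the year is truncated to a multiple of `span`."""
    start = (year // span) * span
    return start, start + span - 1


def ordinal_range(year: int, span: int) -> tuple[int, int]:
    """Years covered when periods are counted from year 1 (1-100, 101-200, ...)."""
    index = (year - 1) // span  # zero-based index of the period
    start = index * span + 1
    return start, start + span - 1


# A stored year of 1900 with a precision of 100 years:
print(truncation_range(1900, 100))  # (1900, 1999)
print(ordinal_range(1900, 100))     # (1801, 1900)
```

The same stored value therefore names two different hundred-year spans depending on which reading a consumer applies.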

Jarekt (talkcontribs)

RDF Dump Format deals with how data is stored not what it means. The current Wikidata standard of interpreting the 1st century as years 1-100, the 2nd century as years 101-200, etc., and the 1st century BC as years 100 BC-1 BC, is perfectly consistent with the international understanding of those terms for hundreds of years. See 1st century, etc. If you would like to redefine those terms, then the ''RDF Dump Format'' talk page is not the right place for that discussion. Wikidata is not consistent with the ISO 8601 and XSD 1.1 standards, which is unfortunate, but as wikidata:Help:Dates mentions, Wikidata dates are "resembling ISO 8601" but do not follow it. Another difference is how we store BC dates; the section d:Help:Dates#Years_BC explains the conversions that are done from the format used by Wikidata to the RDF Dump format.

I do not understand your point about the Wikidata GUI not being part of "Wikibase software". I never tried this, but I believe that if you create a different instance of Wikibase using the Wikibase software, then it comes with the GUI. So I am lost...

Jc3s5h (talkcontribs)

When I wrote "I consider it wrong to refer to that interface as 'Wikibase software'", I meant that the interactive user interface is not the only Wikibase-provided method to read and write data, so writing as if the meanings implicit in that interface were followed by all interfaces is incorrect. I completely reject the notion that "RDF Dump Format deals with how data is stored not what it means." Explaining what the RDF Dump Format means is the purpose of this page.

Reply to "Disputed interpretation"
Jc3s5h (talkcontribs)

My first question is how to update this document. A "sister" document, Wikibase/DataModel/JSON, says that the document should not be edited in the normal way, but rather "NOTE: The canonical copy of this document can be found in the Wikibase source code and should be edited there. Changes can be requested by filing a ticket on Phabricator"

Does a similar process apply to this document, or is this document edited directly?

The change I think should be made is as follows, with bold showing material to add:

The full value includes the simple value above under wikibase:timeValue, precision and timezone as integers and calendar model as IRI. The timezone parameter has never been implemented and should be ignored; all times in the database are local times; that is, the timezone is not recorded as part of the time and must be deduced from other clues, such as the place an event occurred.

I consider this important because editors may be unwilling to contribute to Wikidata if it forces them to make false statements, such as the time zone being UT when it is really United States Eastern Daylight Time.

I will be making a parallel request for revision of Wikibase/DataModel/JSON.

[Text above added September 2016. Text below added 31 December 2016.]

At wikibase:Project chat#Adding a source @Pasleim: stated

If a date is given with day precision, one has to ignore all information which make more precise claims including the time zone. The time zone parameter is only needed for dates/times with at least hour precision. So Wikidata doesn't say anything if a date is in UTC or in local time. Basically a specific day is a time period of 50 hours, from 12:00a.m. in UTC+14:00 (Q7130) to 11:59p.m. in UTC−12:00:

While this may have been a defensible interpretation of earlier versions of this document, and maybe even a defensible interpretation of the JSON dump spec, this spec says

The simple value of the time value is either datetime value of type xsd:dateTime, if the value can be converted to Gregorian date in ISO format, or a string as represented in the database, if not. The xsd:dateTime dates follow XSD 1.1 standard...

Considerable effort has recently been expended to respect the XSD standard's requirement to always use the Gregorian calendar, by creating code to convert Julian dates to Gregorian dates. If this effort has been put into respecting the Gregorian calendar aspect of the XSD spec, I infer that the meaning of the "Z" at the end of the representations, which means Universal Time, would be equally respected. Jc3s5h (talk) 15:58, 31 December 2016 (UTC)
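
The 50-hour figure in the statement quoted from Pasleim, and the 24-hour window implied by reading the trailing "Z" as Universal Time, can both be checked with a short calculation. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the end of the day is taken as midnight rather than 11:59 p.m.

```python
from datetime import datetime, timedelta, timezone

EARLIEST_ZONE = timezone(timedelta(hours=14))  # UTC+14:00
LATEST_ZONE = timezone(timedelta(hours=-12))   # UTC-12:00


def day_window_any_zone(year: int, month: int, day: int):
    """Window if the day may lie in any time zone (the quoted reading)."""
    start = datetime(year, month, day, tzinfo=EARLIEST_ZONE)
    end = datetime(year, month, day, tzinfo=LATEST_ZONE) + timedelta(days=1)
    return start, end


def day_window_utc(year: int, month: int, day: int):
    """Window if the trailing 'Z' is taken to mean Universal Time."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def hours(window) -> float:
    """Length of a (start, end) window in hours."""
    start, end = window
    return (end - start).total_seconds() / 3600


print(hours(day_window_any_zone(2016, 12, 31)))  # 50.0
print(hours(day_window_utc(2016, 12, 31)))       # 24.0
```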

Lydia Pintscher (WMDE) (talkcontribs)

No, this one is maintained here :)

Jc3s5h (talkcontribs)

In view of a Wikidata ambiguity about the date of Isaac Newton's death, I believe the section should also be revised to state that the year is always deemed to begin on January 1, even though historically some countries have observed other dates to increment year numbers.

Reply to "Time revision"
Yurik (talkcontribs)

When storing sitelinks, shouldn't it be normalized URL ('_' instead of spaces)? Otherwise they differ from the canonical wiki representation. CC: @smalyshev (WMF):
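
For illustration, the normalization being asked about could look like the sketch below. The function name and the `/wiki/` path are assumptions made for the example, and percent-encoding is delegated to `urllib.parse.quote`, which may not match MediaWiki's own encoding rules in every case.

```python
from urllib.parse import quote


def sitelink_url(site_base: str, title: str) -> str:
    """Build an article URL from a page title, using '_' instead of spaces."""
    normalized = title.strip().replace(" ", "_")
    # quote() leaves letters, digits, '_', '.', '-', '~' and '/' unescaped.
    return site_base.rstrip("/") + "/wiki/" + quote(normalized)


print(sitelink_url("https://en.wikipedia.org", "Isaac Newton"))
# https://en.wikipedia.org/wiki/Isaac_Newton
```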

Smalyshev (WMF) (talkcontribs)
Bobdc (talkcontribs)

A query for the schema:about value of <https://en.wikipedia.org/wiki/Duck> shows that it's wd:Q3736439, but the Sitelinks section of this page says that it's wd:Q3. Am I misunderstanding something or does this example need to be corrected?

Mbch331 (talkcontribs)

The example doesn't have correct values. It just has values to show how it's formatted. Examples usually have just random values as does this example.

Reply to "wrong example for Duck?"

Units representation and conversion

Smalyshev (WMF) (talkcontribs)

Planned units representation and conversion for full values:

  1. There will be a configuration on wiki (not specified here) describing conversion of units to standard units.
  2. Values in standard units will be expressed as "normalized" values (which later can be expanded to also represent other kinds of normalized values, e.g. for times, external IDs, etc.)
  3. Each of psv/pqv/prv gets a duplicate called psn/pqn/prn linking to a normalized value node in parallel to a regular value. The normalized value can be the same value (if the value is already in standard units) or a different one. If the value has no units or the units cannot be converted to a standard unit, no normalized value is produced and no psn/pqn/prn predicate is generated.
  4. Original value gets wikibase:quantityNormalized predicate linking it to the normalized value. This predicate is generated even if the value is already normalized - in this case, it links the value to itself. Normalized value also has wikibase:quantityNormalized linking to itself, since normalized value is its own normalization.

Unit representation & conversion for simple values is still TBD.

Smalyshev (WMF) (talkcontribs)

Example:

 wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37 a wikibase:Statement, wikibase:BestRank ;
       psv:P2 wdv:cb213eea7a0b90d1d7f65c6eabfab9da ;
       psn:P2 wdv:3efb2709bd74285cfc7e72b6a599125b .
 wdv:cb213eea7a0b90d1d7f65c6eabfab9da a wikibase:QuantityValue ;
   wikibase:quantityAmount "+123"^^xsd:decimal ;
   wikibase:quantityUpperBound "+124"^^xsd:decimal ;
   wikibase:quantityLowerBound "+122"^^xsd:decimal ;
   wikibase:quantityNormalized wdv:3efb2709bd74285cfc7e72b6a599125b ;
   wikibase:quantityUnit <http://www.wikimedia.org/entity/Q828224> .
 wdv:3efb2709bd74285cfc7e72b6a599125b a wikibase:QuantityValue ;
   wikibase:quantityAmount "+123000"^^xsd:decimal ;
   wikibase:quantityUpperBound "+124000"^^xsd:decimal ;
   wikibase:quantityLowerBound "+122000"^^xsd:decimal ;
   wikibase:quantityNormalized wdv:3efb2709bd74285cfc7e72b6a599125b ;
   wikibase:quantityUnit <http://www.wikimedia.org/entity/Q11573> .
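
The normalization rules described in this thread can be sketched roughly as follows. This is only an illustration: the `Quantity` class, the `normalize` function and the conversion table are hypothetical and are not the Wikibase implementation. The table holds just the one conversion that matches the example above (a factor of 1000 from Q828224 to Q11573).

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical on-wiki configuration: unit -> (standard unit, factor).
CONVERSIONS = {"Q828224": ("Q11573", Decimal("1000"))}
STANDARD_UNITS = {"Q11573"}


@dataclass(frozen=True)
class Quantity:
    amount: Decimal
    upper: Decimal
    lower: Decimal
    unit: Optional[str]  # None stands for a unitless value


def normalize(q: Quantity) -> Optional[Quantity]:
    """Return the normalized value, or None if none can be produced."""
    if q.unit in STANDARD_UNITS:
        return q  # already normalized: the value is its own normalization
    rule = CONVERSIONS.get(q.unit)
    if rule is None:
        return None  # no unit or no known conversion: no psn/pqn/prn triple
    standard_unit, factor = rule
    return Quantity(q.amount * factor, q.upper * factor, q.lower * factor,
                    standard_unit)


original = Quantity(Decimal("123"), Decimal("124"), Decimal("122"), "Q828224")
normalized = normalize(original)
print(normalized.amount, normalized.unit)  # 123000 Q11573
```

Calling `normalize` on an already normalized value returns that same value, which mirrors the wikibase:quantityNormalized self-link in the example.
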
Pfps (talkcontribs)

It appears that this is already underway, at least for geospatial information. Is this the case?

Smalyshev (WMF) (talkcontribs)

Yes, we are starting to introduce unit conversions.

Reply to "Units representation and conversion"
Pfps (talkcontribs)

The RDF dumps have

Entity labels - the main name of the entity. Labels are defined as schema:name, rdfs:label and skos:prefLabel predicates with objects being language-tagged string literals.

Why say the same thing thrice, particularly as there are lots of labels for many items?

JanZerebecki (talkcontribs)

The idea was to be compatible with all 3 ontologies, so that things work if you support one of them.

Pfps (talkcontribs)

The problem is that this triple representation adds a *lot* of redundant information. Similarly for the multiple representation of values.

This would not be a problem if it were easy to ignore the "other" versions of the information one wants, but this dump has everything in one very large file. As the file is in Turtle format, simple textual methods cannot be trusted to find and remove the pieces that are not needed.
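
One way to drop the duplicates is to filter after parsing rather than on the raw text. The sketch below assumes that some Turtle parser (not shown) has already turned the dump into a stream of (subject, predicate, object) tuples. The function name is hypothetical, and keeping rdfs:label while dropping the other two predicates is an arbitrary choice made for the example.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
REDUNDANT_LABEL_PREDICATES = {
    "http://schema.org/name",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
}


def drop_redundant_labels(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Yield every triple except the duplicate label predicates."""
    for subject, predicate, obj in triples:
        if predicate not in REDUNDANT_LABEL_PREDICATES:
            yield (subject, predicate, obj)


sample = [
    ("http://www.wikidata.org/entity/Q3", RDFS_LABEL, '"life"@en'),
    ("http://www.wikidata.org/entity/Q3", "http://schema.org/name", '"life"@en'),
    ("http://www.wikidata.org/entity/Q3",
     "http://www.w3.org/2004/02/skos/core#prefLabel", '"life"@en'),
]
print(len(list(drop_redundant_labels(sample))))  # 1
```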

Reply to "Triple representation of labels"