Talk:Wikibase/DataModel


Units

Units can use different identifiers, different prefix systems and different composition techniques, and there can be uncertainty about parts of them. For example, a "foot" can differ a lot; it is not only a US unit, even if this is often assumed. Another example is "mile" vs "kilometer" (km). Sometimes a value has a unique and simple representation in one unit system, but no meaningful representation in another. An example is the "acre", which has no simple representation in "square kilometres" (km²). — Jeblad 07:05, 29 March 2012 (UTC)

And yet en:Template:convert somehow manages to work pretty well. What's your point? -- skierpage (talk) 10:52, 31 March 2012 (UTC)

Words

Words in one language might not have an equivalent with the same meaning in another language, or even a simple meaning at all. One example is the Norwegian støl, sel, seter and seterlag. The first three words are often used interchangeably but have different meanings: a place for milking used for transhumance, a shieling used for transhumance, and a small farm used for transhumance. The last one is a small hamlet of "seters". [1] Another example is Northern Sami geassesadji (approximately the same as seter, or actually "summer place") and siida (a kind of seterlag for reindeer herding with extended rights). The language links in Wiktionary or Wikipedia can hide such distinctions. — Jeblad 07:45, 29 March 2012 (UTC)

facts aren't always per Wikipedia article

The Title is the title of the Wikipedia article about the entity. The Title and LanguageId together are unique, i.e. every article in a Wikipedia can only be referenced once by a Wikidata page.

That doesn't work. Take a car page like en:Nissan Altima. That has multiple infoboxes giving detailed facts about the car, but they conflict because they give information about different models of the Altima over time. A human uses context to figure out the more detailed entity for which the infobox provides facts. Usually this comes from a section heading, in this case "First generation (U13, 1993–1997)", "Second generation (L30, 1998–2001)", etc. So there are going to be several entities in Wikidata for the different Nissan Altima models, but each will reference this one Wikipedia article.

Now sometimes there's a separate page for each model covered on the page (e.g. en:Toyota Prius), so each entity can correspond 1-to-1 with a more detailed individual page; then you could imagine each section on the general page pulling in its facts. But not always. And I can imagine that one language will have a single page with multiple infoboxes, while another will have a separate page for each variant. -- skierpage (talk) 11:12, 31 March 2012 (UTC)

entities aren't unique per Wikipedia article either

The current Wikidata/Glossary says that an "entity is what a page in Wikidata is about. An entity is identified by its label and description (see Data model) and by the Wikipedia articles linked to the page." The last sentence does not imply a one-to-one relationship between entities and Wikipedia articles. I think this should be fixed in the model. An entity may have one single article in language A, multiple articles in language B, and be part of an article in language C. You can model this, for instance, with SKOS mapping relations (exactMatch, broadMatch, narrowMatch). -- JakobVoss (talk) 11:42, 31 March 2012 (UTC)

Hi, Jakob, Skierpage (long time no see, both of you!). A few comments here: in phase 2 there will be a way to create pages for items that do not have a Wikipedia article yet. So the things that Skierpage mentioned will be possible by simply creating new items. Jakob, I am extremely reluctant to allow anything other than 1:1 relationships between entries in Wikipedia and Wikidata, and among Wikidata articles. We will get some numbers to estimate the effect of sticking with the simple model. Regarding a more complex model, I would need to see an actual proposal of exactly how the suggestion would work. --Denny Vrandečić (WMDE) (talk) 14:06, 2 April 2012 (UTC)
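
To make the SKOS-style proposal in this thread more concrete, here is a minimal Python sketch of entity-to-article links that carry a mapping relation. It is only an illustration: the `Match` and `ArticleLink` types, the entity identifiers and the chosen relations are assumptions made for this example and are not part of the Wikidata data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Match(Enum):
    """Mapping relations named after the SKOS mapping properties."""
    EXACT = "exactMatch"    # the article is about exactly this entity
    BROAD = "broadMatch"    # the article covers more than this entity
    NARROW = "narrowMatch"  # the article covers only part of this entity


@dataclass(frozen=True)
class ArticleLink:
    entity: str    # hypothetical entity identifier
    language: str  # language code of the Wikipedia edition
    title: str     # article title in that edition
    match: Match


# Hypothetical data: two model generations share one broader English article.
links = [
    ArticleLink("altima-gen1", "en", "Nissan Altima", Match.BROAD),
    ArticleLink("altima-gen2", "en", "Nissan Altima", Match.BROAD),
    ArticleLink("prius", "en", "Toyota Prius", Match.EXACT),
]


def entities_for(language: str, title: str) -> List[str]:
    """Return every entity linked to one article, whatever the relation."""
    return [link.entity for link in links
            if (link.language, link.title) == (language, title)]


print(entities_for("en", "Nissan Altima"))
```

With links like these, the (Title, LanguageId) pair is no longer unique per entity, which is exactly the trade-off against the simple 1:1 model discussed here.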

Language codes

After reading the description of "LanguageCode", I think one thing is not very clear: will the LanguageCode accept any of the language codes defined at languages/Names.php? (e.g. "en-ca" and "en-gb") Helder 18:44, 3 April 2012 (UTC)

This is indeed not clear yet. It could be either 'only those codes that have a Wikipedia' or 'all the codes in the language names list' that you linked. If you know more about this, we would like to hear your thoughts.
Furthermore, do you know if the latter list is a proper superset of the former, or are there codes that have conflicting languages? --Denny Vrandečić (WMDE) (talk) 21:21, 3 April 2012 (UTC)
There are five fallbacks in the list, but at least one of them is the primary site. I.e., no (Norwegian macrolanguage) is a fallback for nb, but the site is at http://no.wikipedia.org. Ideally no should be a supersite for nb (Norwegian Bokmål Wikipedia) and nn (Norwegian Nynorsk Wikipedia). Note also that some of the codes are somewhat inaccurate. For example, the table says that se is Sámegiella (Sami macrolanguage) while it should be Davvisámegiella (Northern Sami language). For a better example of the Sami language names, check the map in w:se:Sámegielat. The situation with no and se is somewhat similar; both should perhaps be a supersite of local variants. Unlike the Norwegian variants, the Sami language variants aren't easily understood by speakers of the other variants; it's more like the difference between Norwegian and German than between Norwegian and Swedish.
Not sure if this really clarifies anything. ;) — John Erling Blad (WMDE) (talk) 13:01, 9 April 2012 (UTC)

Metadata

A metadata description of a set of properties of items is missing, such as how the length of a river should be measured or how the number of people in a city should be counted. 93.220.104.127 10:18, 4 April 2012 (UTC)

Would be represented by a qualifier in a statement, e.g. "Population 86,000,000 (Method: Estimate)". Also, we think a qualifier footnote will probably be widely used. --Denny Vrandečić (WMDE) (talk) 22:01, 5 April 2012 (UTC)
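
As a rough illustration of the qualifier idea in the reply above, the following Python sketch attaches a "Method" qualifier to a statement and renders it in the same form as the quoted example. The `Statement` class, its field names and the item label are hypothetical and only serve this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Statement:
    item: str    # hypothetical item label
    prop: str    # property name
    value: object
    qualifiers: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Format the statement as 'Property value (Qualifier: value)'."""
        shown = f"{self.value:,}" if isinstance(self.value, int) else str(self.value)
        text = f"{self.prop} {shown}"
        if self.qualifiers:
            notes = ", ".join(f"{k}: {v}" for k, v in self.qualifiers.items())
            text += f" ({notes})"
        return text


statement = Statement("ExampleCountry", "Population", 86000000,
                      {"Method": "Estimate"})
print(statement.render())
```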

positiveInteger

I hope I'm not painting the bikeshed when I say that

 positiveInteger: an integer number of arbitrarily large value greater than or equal to 0

should probably be changed to

 nonNegativeInteger: an integer number of arbitrarily large value greater than or equal to 0

That's what XML Schema does. Chrisahn (talk) 16:19, 7 April 2012 (UTC)

Changed. Thanks, well spotted. --Denny Vrandečić (WMDE) (talk) 16:34, 10 April 2012 (UTC)

Snak rank

Not sure if I have any really good example, but an Item (article in Wikipedia) can describe something that has several interpretations, and each interpretation might have different main properties in each domain. This leads to conflicting main ranks of Snaks. One example is Kristiansand, which is a city, a municipality and the county capital of Vest-Agder county. The main snak might change in each domain. That is, it is a city, it is a municipality, and it is a county capital. — Jeblad 00:52, 11 April 2012 (UTC)

Rank for alternate values for a property makes sense. It's on the element, not the snak. I'm confused. — Jeblad 05:34, 11 April 2012 (UTC)
If an Item has very different meanings that are conflicting, then it should probably be split into two or more Items. The snak rank will only be a coarse selection of statements for the purpose of simple queries (which data to show in infoboxes), UI (which data to show by default when visiting a page), and export (which data to export in certain smaller export sets). --Markus Krötzsch (talk) 14:54, 16 April 2012 (UTC)
The best example I know of is the municipalities and cities in Norway (I believe it is the same in many other Wikipedias as well) and the duality between a «kommune» and a «fylkeskommune». A kommune is the municipality, but some of them have additional responsibilities on behalf of the «fylke» or county. It's not easy to split the articles about them, but at the same time this makes it difficult to identify the correct snak to assign the rank to.
Probably these are only very special corner cases anyhow. — Jeblad 07:08, 17 April 2012 (UTC)

Why reinvent the wheel?

I think it would be a good thing to explain why a completely new data model is being created instead of reusing something else, for example RDF.

This question has been discussed in detail on the mailing list, not just for RDF but also for various other data models and metadata standards, but I will give a short answer here for reference. The main problem is that the term "data model" is a bit overloaded. The Wikidata data model specifies application-level data structures that are important for Wikidata. In contrast, data models like RDF, XML, or JSON are data encoding formats that facilitate the exchange of data across application boundaries. The two types of data model are not mutually exclusive: Wikidata will export its data in RDF, OWL, JSON, and in whatever other format is deemed useful. However, none of these formats can fully capture the intended meaning of Wikidata in its native semantics (which is not properly formalized yet). For example, RDF cannot state that a property has no value. The available data formats are continuously improved, and I am confident that some day there will be an easy and natural way to express all content in Wikidata in an existing format that is widely supported. The current design will allow us to support new standards as they become available. If we committed to one encoding in, say, RDF (2004) now, then the meaning of Wikidata content would be determined (and limited) by this choice, and it would be difficult to improve the support in the future (e.g., when RDF named graphs become standard). Please refer to the wikidata-l mailing list archives for further aspects of this discussion (I will not answer here). --Markus Krötzsch (talk) 13:56, 16 April 2012 (UTC)

Times

The data model requires the value for seconds to be in the range 0 to 59.999.... but ISO 8601 allows the range 0 to 60.999... which accommodates leap seconds. This is a contradiction.

Also, the method to represent time zones should allow a value that in some way indicates the longitude of the occurrence of the event for events that occurred before time zones were established.

Finally ISO 8601 states

This International Standard does not assign any particular meaning or interpretation to any data element that uses representations in accordance with this International Standard. Such meaning will be determined by the context of the application.

But since Wikidata will contain statements about specific events, there will be a particular meaning or interpretation in the source from which the value is obtained. For example, the time may be local apparent solar time, Coordinated Universal Time, or Terrestrial Time. Either a requirement should be established to convert the time into a specific time scale, or a field should be available to indicate which time scale is intended. Jc3s5h (talk) 13:32, 12 April 2012 (UTC)
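
The leap-second point in this comment can be illustrated with a small range check. This is a sketch only; the function name and the `allow_leap_second` switch are assumptions made for illustration and do not come from the data model.

```python
def valid_second(second: float, allow_leap_second: bool = True) -> bool:
    """Check a seconds field against the two ranges discussed above.

    With allow_leap_second the upper bound is 61 (exclusive), so 60.5 is
    accepted; without it the bound is 60 (exclusive), as in the draft model.
    """
    upper = 61.0 if allow_leap_second else 60.0
    return 0.0 <= second < upper


print(valid_second(60.5))                           # leap second accepted
print(valid_second(60.5, allow_leap_second=False))  # rejected by the narrower range
```

Note that Python's own `datetime.time` only accepts seconds from 0 to 59, so a real implementation could not simply delegate the check to it.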

The point about leap seconds is interesting. We need to support this, obviously.
Regarding time zones, I currently can only think about zones of the form "-7" or "+4.5". Since we need to support decimal values there anyway, it would be possible to support "geographic time" according to the level of the sun. The UI should obviously have good support for this. If we do not support this immediately, then it could still be introduced in the future even with the current data model.
The point about time scales is also important. Clearly, Wikidata must specify the exact meaning of all times that it stores. I am not an expert in this, but my current hope is that we can use one reasonably common reference system for time (e.g., UTC) and merely offer conversions from other relevant formats. In particular, the system could support sun-based geographic time (defining 12:00 p.m. to be the time when the sun reaches the zenith at a certain location, and extrapolating the time points between two such events; I am sure that there is a proper name for this and a Wikipedia article on how to calculate it ;-)). To simplify the editing of such values, one could at least store the original input method -- it should then always be possible to recover the input data from the uniform representation. The question is which input methods should be supported. --Markus Krötzsch (talk) 15:08, 16 April 2012 (UTC)
UTC is not suitable as "one reasonably common reference system for time" because it was created in 1961 and cannot be meaningfully extrapolated before about 1955. UT1 can be used throughout the period when reasonably accurate clocks have been used. Local apparent solar time was used before reasonably accurate clocks became widespread (18th to 19th century). Astronomers use various time scales, and the ability to accurately convert among them degrades in the ancient past or distant future (for some purposes the error might be about 1 day for ±10,000 years).
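
For the longitude-based "geographic time" mentioned in this thread, a decimal offset can be derived from the longitude. The sketch below covers local mean solar time only; it deliberately ignores the equation of time, so it does not reproduce apparent solar time (the sun-at-zenith definition). The function name is an assumption for illustration.

```python
def mean_solar_offset_hours(longitude_deg: float) -> float:
    """Offset of local mean solar time from UT, in hours (east positive)."""
    if not -180.0 <= longitude_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180] degrees")
    return longitude_deg / 15.0  # the Earth turns 15 degrees per hour


print(mean_solar_offset_hours(-105.0))  # a "-7" style offset
print(mean_solar_offset_hours(67.5))    # a "+4.5" style offset
```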

ReferenceRecord

The ReferenceRecord will indicate the source. Will it also be possible to briefly specify how the datum was generated in the first place (e.g. "date: 50k years ago [Carbon-14 dating]" or "clade: Some biological group [mitochondrial DNA analysis]")? --Zolo (talk) 15:58, 13 April 2012 (UTC)

This will be possible by giving auxiliary snaks, so it will be completely user-controlled which kind of additional information should be attached to a value. The reference record will have a largely fixed structure (and a special UI) in order to make it easy/convenient to edit and use references. --Markus Krötzsch (talk) 13:16, 16 April 2012 (UTC)

Level of measurement

From a statistician's point of view it would be beneficial to have something like a level of measurement assigned to the possible realization (German: Merkmalsausprägung; I guess it's the property and its value in your terms). In its simplest presentation this would be nominal, ordinal and metric. An integer, for example, can represent all three levels. Say ‘2’ can be the second questionnaire that was answered (when the succession is of no interest and ‘2’ is treated as a name rather than a rank); the rank of Berlin among European cities by inhabitants; or a price in USD. (These might not be the best examples, but I hope they show my point.) --Alexander Sommer (talk) 12:27, 14 April 2012 (UTC)

I don't envision Wikidata being used to store questionnaire answers, but rather to store the results of a study which employed questionnaires. I discuss what I perceive as missing in the "Confidence interval" section below. Jc3s5h (talk) 20:34, 14 April 2012 (UTC)
Sorry, my examples seem to be misleading. This is not at all about questionnaires. It was just the first example that came to my mind where a digit represents a name, not a number. If the Value of the Property population of the Item Berlin is 3499879, I think it would ease the reuse, if one knows that 3499879 is a metric datum (or a ratio datum, count datum, … with a more detailed classification). --Alexander Sommer (talk) 10:45, 15 April 2012 (UTC)
In general, the meaning of a value (even if it is an integer number) will always be informal, i.e., understood by the user (based on the property description) but not by the system (which will only consider it as a number in a plain mathematical sense). For example, logarithmic scales (e.g., decibel) have a different metric than linear ones, so there is not just "metric or not?", but also "which metric?". Wikidata cannot answer this, but it should be clear to the user. In general, I don't think that it is a good idea to capture nominal values by numbers; I would actually suggest using a (non-translated) String in such a case, even if it has values like "1" or "3". I would even extend this to ordinals, where one could also use letters A, B, C in most cases (how often have we seen standard deviations being computed for ordinal scales?); but this all seems to be something that has to be considered on the application level, and that cannot be built into Wikidata. --Markus Krötzsch (talk) 15:19, 16 April 2012 (UTC)
Sure, the meaning of a value is only understood by humans. Still, those levels make reuse easier. For example, very good, good, bad, very bad is, in alphabetic order, bad, good, very bad, very good, and it is just tedious to reorder them. Storing ordinal data as letters seems a good idea, but is not enforceable when it comes to ranks (and more than 26 possible values). I already hinted that there is no consensus on what kinds of levels of measurement exist. E.g., for metric data anything between ‘there is at least one metric that is applicable’ (metric in general) and ‘that kind of metric is applicable’ (ratio, logarithmic, …) is possible. I do not understand the data model well enough to judge whether or not it is possible to build such a feature, but once again, from a statistician's point of view it would be extremely useful. I am sorry that I do not have any introductory literature at hand (en:Level of measurement is not introductory; EMILeA-stat as an online reference might give a first glimpse, but it is in German); maybe having a look at how different statistical software packages handle these levels might clarify things a bit. --Alexander Sommer (talk) 16:40, 16 April 2012 (UTC)

As far as I know (my statistics is primarily biometry...) the correct term for what Alexander refers to is "measurement scale". Typically recognized scales are nominal, ordinal, interval and ratio scales. Knowing the measurement scale of a value is essential when analyzing data. Compared to this, the difference between integer and real numeric is usually secondary (statistical measures like the mean of integer values are real numeric...). I would value it if Wikidata supported knowledge about the measurement scale of values (as well as supporting a wide range of statistical measures...). G.Hagedorn (talk) 13:24, 15 September 2012 (UTC)
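
To show why a declared measurement scale helps reuse, here is a minimal Python sketch. The `Scale` enum, the `allows_mean` rule and the rating order are assumptions chosen for illustration; they are not a proposal for how Wikidata should store this.

```python
from enum import IntEnum
from typing import List


class Scale(IntEnum):
    """Measurement scales, ordered from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def allows_mean(scale: Scale) -> bool:
    """An arithmetic mean is usually considered meaningful from interval scale upward."""
    return scale >= Scale.INTERVAL


# Ordinal labels carry a declared order that alphabetic sorting would destroy.
RATING_ORDER = ["very good", "good", "bad", "very bad"]


def sort_ratings(values: List[str]) -> List[str]:
    """Sort ordinal labels by their declared rank instead of alphabetically."""
    return sorted(values, key=RATING_ORDER.index)


ratings = ["bad", "very good", "very bad", "good"]
print(sorted(ratings))        # alphabetic order, as complained about above
print(sort_ratings(ratings))  # order implied by the ordinal scale
```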

Missing values

Yet another nice feature (again from a statistician's point of view) would be the ability to mark different realisations as missing, say ‘not answered’ and ‘not applicable’. --Alexander Sommer (talk) 12:30, 14 April 2012 (UTC)

The data model just defines general-purpose data structures that can be used by editors to encode information. On this technical level, we only distinguish whether there is a value or not. However, it is possible to give additional information about this statement by means of auxiliary snaks. Thus it would be possible to encode that a value is missing, and to (optionally) specify the reason for this. It would not be useful to have a special construct on the level of the data model to express every possible reason for why a value is missing (in every conceivable application area), hence the design gives this power to the user. --Markus Krötzsch (talk) 13:23, 16 April 2012 (UTC)
Yes, comments (or “auxiliary snaks”) are useful. But giving the user the power to fill in any arbitrary text makes things hard to analyse. Imagine you want to make a statement like “the average unemployment rate is x, y countries do not publish their figures in general, z countries sample this figure only every two years and not this year”; it would just be nice to have a feature that allows for analysis of missing values. And an additional comment can still give the details. --Alexander Sommer (talk) 16:55, 16 April 2012 (UTC)
I think the data model should not prescribe the reasons why data are missing, but it could provide an attribute (like "reason") which points to a URI defining reasons. The difference between "not applicable" (data cannot possibly exist) and "unknown" (no source for the data could yet be found) is important for community data curation. It should be machine-readable and analyzable. Other frequent reasons for missing data are that data are withheld for various legal or moral reasons (e.g. privacy protection of children). Please consider adding a "reason" attribute pointing to an instance resource (would that be a data item?) curated by the community. --G.Hagedorn (talk) 12:09, 15 August 2012 (UTC)
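
The kind of analysis asked for in this thread becomes straightforward once the reason for a missing value is a machine-readable identifier rather than free text. The following Python sketch assumes a hypothetical `Observation` record whose `reason` field holds such an identifier (plain strings stand in for the URIs suggested above); the country labels and figures are made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    country: str                  # hypothetical label
    rate: Optional[float]         # None means "no value"
    reason: Optional[str] = None  # hypothetical identifier for why it is missing


def summarize(observations):
    """Return the mean of the present values and a count per missing reason."""
    present = [o.rate for o in observations if o.rate is not None]
    missing = Counter(o.reason or "unspecified"
                      for o in observations if o.rate is None)
    mean = sum(present) / len(present) if present else None
    return mean, dict(missing)


data = [
    Observation("A", 4.0),
    Observation("B", 6.0),
    Observation("C", None, "not-published"),
    Observation("D", None, "not-sampled-this-year"),
]
print(summarize(data))
```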

Confidence intervals for numbers

The data model presently supports a value (presumably the mean) and a variance. But if the mean and variance are obtained by sampling, there is uncertainty in the mean, which can be described by the confidence interval. But there is no support for confidence interval. Jc3s5h (talk) 20:49, 14 April 2012 (UTC)

Confidence intervals depend on assumptions about the distribution. The most commonly used confidence interval for the mean assumes the normal distribution. A confidence interval for the mean is then $\bar{x} \pm z_{1-\alpha/2} \, s / \sqrt{n}$. The term $z_{1-\alpha/2}$ is the quantile of the standard normal distribution and a fixed value (e.g. $z_{0.975} \approx 1.96$). $\bar{x}$ and $s^2$ are given according to your declaration. So what is really missing is $n$, the sample size. And if we are talking about values obtained from sampling, this is in many ways a meaningful value. So I would prefer support for the sample size instead of support for a confidence interval. --Alexander Sommer (talk) 11:18, 15 April 2012 (UTC)
I agree that a person able to appreciate the meaning of a confidence interval would find the sample size equally useful. However, a person transcribing data from a source might not have sufficient knowledge to make the calculation described by Mr. Sommer, so perhaps both confidence interval and sample size should be supported. Jc3s5h (talk) 12:42, 15 April 2012 (UTC)
Once we've got mean, variance and sample size it is fairly simple to compute that confidence interval (and, of course, to show it at the „human interface“) without actually storing it. At that stage I would leave it up to the database experts whether it is more efficient to store those two values or to compute them every time they are needed. The problem with confidence intervals is that they are not as unambiguous as they seem in the example above. Besides, it is possible to calculate a confidence interval for any estimator other than the mean, one can make any other assumption about the underlying distribution, the calculation can vary in the case of truncated or censored data, etc. So I would not recommend just copying a confidence interval from a study, as the pure value is quite meaningless without information about the calculations made. A confidence interval is no more than a guess that the interval overlaps the true parameter with a certain probability. Studies should state their sample size and, yes, some do not. But as we are going more and more into detail, I become unsure whether Wikidata is aiming to store results from scientific studies at all. PS: Alex or Alexander will do. --Alexander Sommer (talk) 13:49, 15 April 2012 (UTC)
Seems to me that what's in the model is a simplified combined upper and lower bound? 92.229.55.223 15:16, 15 April 2012 (UTC)
The remarks made elsewhere also apply to this case: using auxiliary snaks, it will be possible to have arbitrary additional information for every property value. This could be used to express more elaborate forms of confidence and uncertainty. The question of course is to what extent such information should be an integral part of the data model. The data model should be as simple as possible and mainly cover aspects that the software of Wikidata should take into account (e.g., for query answering in later phases). For example, the data model needs some information about value precision in order to support automated unit conversions in a sane way. For query answering, in contrast, it is intended to use only one value (even if it is not fully certain). There are various ways to take uncertainty into account in query answering, but this tends to make the computation of query answers much more difficult, and it leads to fewer answers (since many things are not sure). Already when allowing statements like "the property has a value between 10 and 20", it is unclear how to, e.g., sort by this value (or is sorting impossible then?). Since a vast majority of values is uncertain to some degree, this would seriously restrict the capabilities of the system. If we would ignore such information and treat the interval like one value for the purpose of ordering, we are again back at the current data model. Anyway, the exact handling/modeling of imprecision is subject to further discussion, which should take place on the mailing list. --Markus Krötzsch (talk) 14:09, 16 April 2012 (UTC)
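
The claim in this thread that a confidence interval can be computed on demand from mean, variance and sample size can be sketched as follows. This assumes the normal-approximation interval for the mean discussed above; the function name and the sample numbers are illustrative only, and other estimators or distributional assumptions would need a different calculation.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, variance, n, level=0.95):
    """Normal-approximation confidence interval for a mean.

    Uses the standard normal quantile z_(1 - alpha/2) with alpha = 1 - level.
    """
    if n <= 0 or variance < 0:
        raise ValueError("need a positive sample size and a non-negative variance")
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(variance / n)
    return mean - half_width, mean + half_width


low, high = mean_confidence_interval(mean=10.0, variance=4.0, n=100)
print(round(low, 3), round(high, 3))
```

Storing the sample size alongside mean and variance is enough to recompute the interval for any confidence level, which is the argument made above for preferring `n` over a stored interval.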

Expression of Core Wikidata Concepts in RDF and JSON-LD

The proposed Wikidata model includes several concepts that are closely related to RDF, and it is clear that an (eventual) goal is to express Wikidata Statements in a form which is at least compatible with RDF.

One requirement is that Statements contain claim and reference components. A claim can basically be represented as an RDF triple, while a reference is basically provenance information about that triple. In RDF this can be represented either using triple reification, which is a deprecated mechanism for asserting information about triples, or using named graphs, which is the current direction of the RDF Working Group.

Named Graphs can be used to make statements about one or more triples, such as provenance information, which can be used to describe a Wikidata reference. In TriG, this might look like the following:

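 # Placeholder prefixes so the snippet parses; the wd: and default namespaces are assumptions.
 @prefix : <http://example.org/> .
 @prefix wd: <http://example.org/wd#> .
 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
 # Named graph holding the claim itself.
 :Snak {
   <http://en.wikipedia.org/wiki/Berlin> a wd:Claim;
     :population "3499879"^^xsd:integer .
 }
 # Default graph holding provenance about the named graph.
 {
   :Snak a wd:Snak;
     :assertedBy <http://www.statistik-berlin-brandenburg.de/> .
 }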

In this case, :Snak is a stand-in for the URI of this particular Snak.

Yes, we hope that some such mechanism will become available in the future for RDF export. Note, however, that many RDF database management systems use named graphs like "databases", i.e., they optimize for cases where there are relatively few of them. Having as many graphs as triples would seriously hurt many of these systems, and may not really be in line with the intentions of RDF graphs (note that SPARQL relates named graphs to dataset selection, which seems to suggest another usage than the one proposed here). Moreover, some things that can be said in Wikidata can not be expressed in RDF but in OWL (especially this is the case for PropertyNoValueSnak). Hence, even a new RDF standard might not capture as much of Wikidata as the current OWL standard does already. Until the interplay of OWL and named graphs is specified in some standard, we will probably need to have multiple RDF and OWL export versions to cater for different users. Hopefully, there will be a standard that supports all of Wikidata in some future. --Markus Krötzsch (talk) 14:16, 16 April 2012 (UTC)
I understand this is how DBpedia tracks information in Virtuoso. And this is certainly in the realm of issues being discussed for RDF 1.1. If the data model requires keeping provenance information at the Snak level, this requires either reification (not well supported), other graph models (e.g., Property Graphs), or the RDF named graph model together with de-coupling from the dependencies on triple stores where this is a problem. Gkellogg (talk) 00:21, 17 April 2012 (UTC)

HTTP Range 14

As is noted in the model, it is important to distinguish between statements about Items and statements about the pages themselves. This is known as the HTTP Range 14 problem [citation needed] (due to its being issue #14 of the W3C TAG in 2002). Basically, when does a URI refer to a page, and when to the thing that page describes? There are a couple of mechanisms used to deal with it:

DBpedia uses HTTP 303 (See Other) redirection, so that the Item would have a URI of something like <http://dbpedia.org/resource/Berlin>, which results in an HTTP 303 (See Other) referencing <http://dbpedia.org/page/Berlin>, where the information is actually represented. This mechanism has a lot of history, but is not universally loved, due to the large cost of constantly performing redirects. An alternative mechanism uses fragids to represent Items, for example <http://en.wikipedia.org/wiki/Berlin#item>. There is some recent discussion on re-opening the issue to provide some other means of specifying that an Item URI actually refers to the subject of the page, and not the page itself.

The model does say that there is intended to be a correspondence between Wikidata items and Wiki pages, but that they are not the same thing. This is certainly an opportunity to address this issue.

Answer: Since Wikidata will support many export formats, it will require content negotiation anyway. Other data endpoints for specific formats are likely, but they will not be the main IRI that is used in exports. --Markus Krötzsch (talk) 14:20, 16 April 2012 (UTC)
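
The fragment-identifier alternative described in this section works because a client strips the fragment before making the HTTP request, so the Item URI and the page URL stay distinct without a redirect. A minimal Python sketch using the standard library (the `document_url` helper is a name chosen for this example):

```python
from urllib.parse import urldefrag


def document_url(item_uri: str) -> str:
    """Return the URL a client actually requests for a fragment-style Item URI."""
    url, _fragment = urldefrag(item_uri)
    return url


item = "http://en.wikipedia.org/wiki/Berlin#item"
print(item)                # identifies the Item (the thing described)
print(document_url(item))  # identifies the page that describes it
```

The 303 approach instead needs a second round trip per lookup, which is the cost mentioned above.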

Datatypes

The typical mechanism for declaring the range of a given property uses rdfs:range to reference a particular XSD datatype (although it's not limited to XSD). However, in RDF, this does not actually limit the values that can be used with that property, but rather allows you to infer that a value has that datatype. Of course, this could lead to a falsehood when performing inference operations. Another way of enforcing the use of properties is to use OWL property restrictions. For example:

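 # Placeholder prefixes for wd: and the default namespace (assumptions); the others are the standard W3C namespaces.
 @prefix : <http://example.org/> .
 @prefix wd: <http://example.org/wd#> .
 @prefix owl: <http://www.w3.org/2002/07/owl#> .
 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
 wd:Claim a owl:Class;
   rdfs:subClassOf [
     a owl:Restriction;
     owl:allValuesFrom xsd:integer;
     owl:onProperty :population
   ] .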

However, given that the data model does not impose such a hard relationship between Property and Datatype, this restriction is probably overkill.

Answer: it is not intended to support ranges in this sense. Wikidata will attempt to ensure in software that all values of a property correspond to its declared (Wikidata) datatype. However, it will not be possible to ensure that every piece of data is immediately updated whenever a property declaration changes, so it would be a source of inconsistency to claim a range. Moreover, this is about the only thing that a range statement would allow to be inferred; it cannot help to turn a value of the wrong type into a correct one. Furthermore, most Wikidata types are complex values that do not correspond to one RDF literal or resource, but are exported as a substructure. Hence, a more complex description would be required to axiomatize the expected structure of a property value in RDF/OWL. Again, there does not seem to be a practical reason to do this. --Markus Krötzsch (talk) 14:26, 16 April 2012 (UTC)

PropertyNoValueSnak

RDF does not have a general mechanism to state that some value does not exist, but the rdf:nil convention for empty lists could be re-purposed for this case. For example:

 :S { :AngelaMerkel :child () . }

Could be used to represent such information.

Answer: This would not be semantically correct. Having an empty list as a value is different from having no value. In particular, the statement you give there would entail ":AngelaMerkel :child []" which is the natural RDF encoding for "Merkel has some child" (i.e., the opposite of what you try to express). Since this follows even under plain RDF entailment, the suggested encoding would break RDF quite thoroughly. Fortunately, OWL provides vocabulary to express negation, so expressing this is not a problem. In general, our exports will use all available language features to encode as much information as possible, and use "proprietary" encodings for the rest, so that applications that are aware of the meaning can exploit the information, while applications that are not aware of the meaning would at least not get any wrong information. --Markus Krötzsch (talk) 14:33, 16 April 2012 (UTC)
Note that JSON-LD does allow you to express the information that a property has no value; the information is only lost in the RDF transformation. {"child": []} makes such an assertion. Gkellogg (talk) 00:21, 17 April 2012 (UTC)
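
The entailment argument in the answer above can be made concrete with a toy check. In Turtle and TriG, `()` is shorthand for the resource `rdf:nil`, so the proposed encoding is an ordinary triple with an object, and a pattern with a blank node in object position matches it. The Python sketch below models a graph as a set of string triples; it is an illustration of simple entailment for this one pattern, not an RDF library.

```python
RDF_NIL = "rdf:nil"  # the empty list, written as () in Turtle and TriG

# The proposed "no value" encoding is still a triple with an object.
graph = {(":AngelaMerkel", ":child", RDF_NIL)}


def entails_some_value(triples, subject, predicate) -> bool:
    """True if the graph simply entails `subject predicate []`.

    A blank node in object position is matched by any object at all,
    including rdf:nil.
    """
    return any(s == subject and p == predicate for s, p, _ in triples)


print(entails_some_value(graph, ":AngelaMerkel", ":child"))
```

The check succeeds, which is why a consumer unaware of the convention would read the triple as "has some child".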

PropertySomeValueSnak[edit]

This could possibly be supported either using simple BNodes, or by using more complex OWL restrictions to describe some constraints on the unknown:

 :S { :AmbroseBierce :dateOfDeath [] . }
Answer: Yes, both would work. There are some technical issues (RDF tools do not understand OWL restrictions, many OWL tools do not allow bnodes for representing data literals), but at least the processing of such information is well supported in principle. --Markus Krötzsch (talk) 14:35, 16 April 2012 (UTC)

InstanceOfSnak[edit]

rdf:type

SubclassOfSnak[edit]

rdfs:subClassOf

JSON-LD Representation[edit]

While TriG is quite useful for modeling information, it is not as useful as a representation language intended to be processed directly. JSON-LD provides a full-fidelity representation of Named Graphs and other RDF concepts in JSON, which makes it convenient both for storing and for working with the data. For example, MongoDB uses a JSON-like basic document representation (called BSON) that is compatible with storing JSON-LD documents directly. The Snak example used above could be represented as follows in JSON-LD:

 {
   "@id": "http://Snak",
   "@type": "wd:Snak",
   "assertedBy": "http://www.statistik-berlin-brandenburg.de/",
   "@graph": {
     "@id": "http://en.wikipedia.org/wiki/Berlin",
     "@type": "wd:Claim",
     "population": 3499879
   }
 }

Relative date[edit]

In describing historic events it is often known that something happened before or after some other event, often with some degree of certainty and resolution, but without knowing the absolute date. One example is Høre stave church, which is dated by a runic inscription reading

Þá, um þat sumar [létu] þeir brœðr Erlingr ok Auðun hôggva till kirkju þessar, er Erlingr ja[rl fe]ll í Niðarósi

which translates:

The brothers Erling and Audun had the timber for this church felled, the summer that Erling Jarl fell in Nidaros

This refers to the Battle of Kalvskinnet in 1179. In this case the referred event is known, but in many cases it might not be known. Perhaps this can be solved if some other event could be the reference instead of UTC? Or would this be too difficult? — John Erling Blad (WMDE) (talk) 15:42, 15 April 2012 (UTC)

Strictly speaking it is impossible to state a date or time in UTC before 1961, because that is when it was created. I realize this is somewhat off your point of relative dating, and to the precision of a season, it would be easy to extrapolate UTC back to 1179. Jc3s5h (talk) 11:47, 16 April 2012 (UTC)
Yes, we will properly document the intended meaning whenever we allow such "proleptic" uses of time reference systems (same for Gregorian calendar). Fortunately, there are standard ways of doing this. In some cases, the system will also refuse to handle a certain level of detail, e.g., there will most likely not be months and days for 10000000BCE. --Markus Krötzsch (talk) 14:43, 16 April 2012 (UTC)
Wikidata will not support relative dates of this form, but it is possible to specify a temporal distance using a number with a temporal unit (time spans are important for many applications, e.g., to specify how long a planet takes to orbit the sun). So if both events ("church building" and "death of Erling Jarl at Nidaros") are described in Wikidata, then one could (if one really wanted) record the time between them. --Markus Krötzsch (talk) 14:43, 16 April 2012 (UTC)

Relative position[edit]

Same problem as for date, a position could be relative. — John Erling Blad (WMDE) (talk) 16:37, 15 April 2012 (UTC)

It is not intended to support this with a special type, but again there will usually be ways to encode such information with the means provided. --Markus Krötzsch (talk) 14:44, 16 April 2012 (UTC)

Altitude[edit]

Altitude can be referred both to geographic position (i.e. height in meters or feet) and to air pressure (i.e. bar, mmHg or feet). Likewise with depth, but with water pressure. And with relative altitude. — John Erling Blad (WMDE) (talk) 16:41, 15 April 2012 (UTC)

Wikidata will specify in detail which geographic positioning system is used. This will clarify the notion of altitude as well. For other uses, numeric properties can be used to record values of arbitrary meaning. --Markus Krötzsch (talk) 14:45, 16 April 2012 (UTC)
("Elevation" may be the correct term for altitude, see en:Altitude; "Geodetic System" (like WGS84) for Geographic positioning system, see en:Geodetic system G.Hagedorn (talk) 13:24, 31 May 2012 (UTC))

Speedvectors[edit]

In addition to fixed times and positions there is the whole spatiotemporal representation problem. I'm not sure if this is really necessary in Wikipedia articles. — John Erling Blad (WMDE) (talk) 16:44, 15 April 2012 (UTC)

Even this can already be captured by the existing data model, since a vector can usually be represented by providing a list of numerical coordinates. Hence, vectors can be stored. However, we will not be able to implement complex forms of spatiotemporal reasoning in Wikidata, so it is not necessary that the system knows that this is a speed vector (rather than "a list of three numbers" or the like). But storing such data will be possible. --Markus Krötzsch (talk) 14:48, 16 April 2012 (UTC)

Item versus Topic[edit]

Reading Wikidata/Notes/Inclusion syntax I again experienced a confusion on what the item is: The topic, the snak, property value? I believe Item is overly generic. As a result there seems to be a tendency to use it as "data item" rather than on its own. This then produces "data item" and "item data" (with "item data" seemingly being an abbreviation for "data item data" ;-) Wikidata/Notes/Inclusion_syntax).

While technically irrelevant, I believe it would help adoption to use a more mnemonic term. "Topic" seems to be an excellent replacement for "data item" – with the added benefit of tying into topic maps. --G.Hagedorn (talk) 08:05, 31 May 2012 (UTC)

German[edit]

Please can somebody translate the data model to German? Thanks, --Markuss (talk) 06:31, 25 July 2012 (UTC)

It'd be great if someone would do that. I unfortunately can't do it in the foreseeable future. --Lydia Pintscher (WMDE) (talk) 13:12, 25 July 2012 (UTC)
Maybe you can help by giving the structure and starting the German page... --91.8.139.103 09:25, 26 July 2012 (UTC)

Datatype number, "variance"[edit]

The model at present says: "The variance specifies how far the true value of the represented quantity could possibly deviate from the number in positive or negative direction. This allows to capture expressions such as 12300 +/- 50". The attribute name "variance" seems to be a misnomer, compare en:Variance and en:Accuracy. "Margin_of_error" may be a more appropriate attribute name. --G.Hagedorn (talk) 11:52, 15 August 2012 (UTC)

It seems this "variance" number will be indeed upper and lower values, according to this page.--Djiboun (talk) 22:13, 2 January 2015 (UTC)
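
To make the difference between a symmetric margin and explicit bounds concrete, here is a minimal sketch in Python. The class and field names (Quantity, lower_bound, upper_bound) are assumptions made for illustration only and are not taken from the data model; the only value borrowed from the discussion is the "12300 +/- 50" example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    """Hypothetical container; field names are illustrative, not the Wikibase schema."""
    amount: Decimal
    lower_bound: Decimal
    upper_bound: Decimal


def from_margin(amount: str, margin: str) -> Quantity:
    """Turn 'amount +/- margin' into explicit lower and upper bounds."""
    a, m = Decimal(amount), Decimal(margin)
    if m < 0:
        raise ValueError("margin must be non-negative")
    return Quantity(amount=a, lower_bound=a - m, upper_bound=a + m)


# The example quoted from the data model: 12300 +/- 50
q = from_margin("12300", "50")
print(q.lower_bound, q.amount, q.upper_bound)  # prints: 12250 12300 12350
```

Storing the two bounds separately would also allow asymmetric intervals, which a single "variance" or "margin of error" number cannot express.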

Language neutral and language dependent text types[edit]

1. The examples that come to my mind when thinking of language neutral strings (post code of a UK city, scientific organism names) all break down as soon as spoken text is considered. The pronunciation will be language (and culture) dependent.

2. The need for http://wikidata.org/vocabulary/datatype_multitext as a grouping of translation within possibly repeated values is understood. However, what is a convincing use case to introduce http://wikidata.org/vocabulary/datatype_monotext ? It seems that this might be redundant.

3. ISO language codes provide a value for text that is language neutral: zxx = "no linguistic content". It may therefore be possible to abandon the differentiation between http://www.w3.org/2001/XMLSchema#string and the language-specific strings. Given that the language neutral strings are relatively rare cases, the wikidata type model might be simplified to a single string type: http://wikidata.org/vocabulary/datatype_multitext.

(4. Aside: If keeping the current three different types, I believe that

are confusingly abbreviated: the abbreviated form contains no indication of language, only of cardinality (occurs one or multiple times). Proposal:

) --G.Hagedorn (talk) 12:21, 15 August 2012 (UTC)

Examples from PropertyValueSnak[edit]

Wikibase/DataModel#PropertyValueSnak:

   Berlin (subject) has a population (property) of 3499879 (value).
   Georgia (subject) has the capital (property) Tbilisi (value).
   Gandhi (subject) was born on (property) 2 October 1869 (value).

BTW, in this example, what classes will the values have? 3499879 is obviously a DataValue, Tbilisi is obviously an Item, and 2 October 1869 is what? It could be a datetime datatype (which is not listed as supported; will it be a string or something like Unix time?), but in the Wikipedias there are actually articles on 2 October and on the year 1869. I have now read that there is a Time DataValue, but it may still be necessary to connect it to the corresponding entities automatically. Ignatus (talk) 20:01, 5 December 2012 (UTC)

Which phase for InstanceOfSnak and SubclassOfSnak?[edit]

Are InstanceOfSnak and SubclassOfSnak going to be fully available already in phase II, or is the support scheduled for phase III? I guess these concepts are fundamentally required for lists, but I also see they could be very helpful to the community when defining the mandatory and optional properties for phase II for different classes of items. In addition, an early availability would enable the community to define the class hierarchies and start to classify the subjects. I expect both tasks to face many obstacles and thus to be time-consuming. --Spischot (talk) 11:22, 8 December 2012 (UTC)

The implementation has been put on hold. --Spischot (talk) 11:46, 5 January 2013 (UTC)

Qualifiers vs. Auxiliary snaks[edit]

In some parts of the docs for Wikibase we talk about qualifiers and in other parts we talk about auxiliary snaks that qualify the statement. The word qualifier actually doesn't exist in the data model at all, yet the concept is used in the API and UI. It would be nice to have a clarification and more concise wording. 93.220.105.254 10:39, 22 March 2013 (UTC)

It's the same. Qualifiers is easier to understand than Auxiliary Snak in some way, but they are really just synonyms of each other. --denny (talk) 11:26, 22 March 2013 (UTC)

Broken links?[edit]

Hi, I was doing a bit of link surfing and clicked on [2], found at [3], which leads to a 404. Should that be the case? Actually, all URIs of that type lead to 404s.

In the dates and times section, //www.wikidata.org/wiki/Vocabulary/datatype_time is broken. --Alex brollo (talk) 07:51, 26 January 2014 (UTC)

These IRIs are not meant to be usable links. They are just identifiers. I made them non-clickable for now. --Thiemo Mättig (WMDE) 15:27, 25 May 2014 (UTC)

Date/time Year Part Contradicts XSD[edit]

@Markus Krötzsch: The date & time representation is said to follow ISO 8601. I don't know ISO 8601, but given the intention to integrate wikidata to the semantic web (by producing RDF exports), I think it's better to conform to RDF date & time representation, which means XSD. The latest spec XSD 1.1 part 2 says this about the year part:

  [56]   yearFrag ::= '-'? (([1-9] digit digit digit+)) | ('0' digit digit digit))

I.e. it allows any number of digits, leading zeros only up to 4 total digits, leading -, no leading +. The wikidata normalization to "11 digits" and "always a sign" contradicts the XSD spec.

It's unclear whether the "time" part of the structure can allow a timezone other than "Z". It should not, since that may contradict the "timezone" part, and it's better to state this explicitly.

  • In XSD, the allowed timezone offsets are from -840 to +840 minutes (-14:00 to +14:00). Please state this interval for the "timezone" part.
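
As an illustration of the point above, the following Python sketch checks a year string against the yearFrag production as quoted, and a timezone offset against the -840 to +840 minute interval. The function names are hypothetical, and the regular expression is my reading of the quoted production, not code from any XSD library.

```python
import re

# Year part following the XSD 1.1 yearFrag production quoted above:
# an optional '-', then either four or more digits with no leading zero,
# or exactly four digits starting with '0'. A leading '+' is not allowed.
YEAR_FRAG = re.compile(r"-?(?:[1-9][0-9]{3,}|0[0-9]{3})")


def is_xsd_year(text: str) -> bool:
    """Return True if the whole string matches the quoted yearFrag rule."""
    return YEAR_FRAG.fullmatch(text) is not None


def is_xsd_timezone_offset(minutes: int) -> bool:
    """Return True if the offset lies within -14:00 to +14:00."""
    return -840 <= minutes <= 840


for candidate in ["2014", "-0044", "12345", "+2014", "02014", "+00000002014"]:
    print(candidate, is_xsd_year(candidate))
```

Under this reading, "2014", "-0044" and "12345" are accepted, while "+2014", "02014" and a signed, zero-padded 11-digit year such as "+00000002014" are rejected, which is the incompatibility described above.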

Cheers! --Vladimir Alexiev (talk) 13:22, 31 December 2014 (UTC)

Vladimir Alexiev, the current direction of development, which I am skeptical of, is being discussed at phabricator:T88437 and would require the date to be written in the calendar specified by the URI that is also part of the TimeValue. This would of course be incompatible with XSD and ISO 8601, so everything would have to be revised to avoid mentioning ISO 8601. Jc3s5h (talk) 20:36, 21 February 2015 (UTC)

Geographic locations[edit]

This document and the JSON document indicate that latitude and longitude may be specified with 9 digits to the right of the decimal point. But for that degree of precision to be meaningful, it is essential to specify metadata for the coordinates, such as NAD 83(2011). Is there any provision for specifying the metadata for the coordinates, or alternatively, is there an exact coordinate system which is required, considered the default, or preferred? Jc3s5h (talk) 00:36, 26 March 2015 (UTC)
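
To give a sense of scale for the nine decimal places mentioned above, here is a rough back-of-the-envelope calculation in Python. It assumes a spherical Earth with a mean radius of 6,371 km, which is a simplification chosen for illustration and not something specified by the data model.

```python
import math

# Assumption: spherical Earth with the commonly used mean radius.
EARTH_RADIUS_M = 6_371_000


def metres_per_degree_latitude(radius_m: float = EARTH_RADIUS_M) -> float:
    """Arc length on the sphere covered by one degree of latitude."""
    return 2 * math.pi * radius_m / 360


def resolution_m(decimal_places: int) -> float:
    """Ground distance corresponding to one unit in the last decimal place."""
    return metres_per_degree_latitude() * 10 ** -decimal_places


print(f"{resolution_m(9) * 1000:.3f} mm")  # prints: 0.111 mm
```

One unit in the ninth decimal place therefore corresponds to roughly a tenth of a millimetre on the ground, which is typically far finer than the differences between geodetic reference systems. This is why the choice of reference system matters at that precision.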

Lack of geographic shape data type causing trouble[edit]

See d:Wikidata:Project chat#Coordinates precision. Since the geographic shape data type seems to never have been finished, there is very widespread use of datatype Geographic location, value GlobeCoordinateValue. If the size of the geographic feature is much smaller than the uncertainty in finding the location, one might say "who cares?" An example of this would be a monument that is 1 meter square which is located by scaling from a United States Geological Survey map.

But what if the geographic feature is much larger than the uncertainty in locating it? A case in point is this Phabricator thread in which User:Multichill advocates entering the precision as the size of the object. That's one plausible way to go, but it isn't the least bit obvious to a consumer of data from the database that this is the meaning of the precision.

In older versions of the data model (this one, for example: https://www.mediawiki.org/w/index.php?title=Wikibase/DataModel&oldid=1425786#Geographic_locations) there was a value for the size of the thing whose location was being stated. But that is gone from the present version of the data model.

So for the time being, we should decide what a geographic position means when applied to something of non-negligible size. Some possibilities:

  1. The point is believed to fall within the boundaries of the thing, but in reality might be outside the boundaries of the thing due to measurement uncertainty.
  2. The point is as close to the center of the thing as the measurement procedures allow.
  3. The precision value is the greatest of 1/2 the object's north-south extent, 1/2 the object's east-west extent, or the uncertainty due to the measurement method used.

A later consideration is whether the dim value should be added back so uncertainty in measurement can be handled separately from the size of the thing. Jc3s5h (talk) 21:52, 8 July 2015 (UTC)
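
Option 3 in the list above can be written down as a one-line rule. The sketch below is only an illustration of that rule: the function and parameter names are hypothetical, all three inputs are assumed to be in the same unit, and the example figures are made up.

```python
def precision_for_feature(
    north_south_extent: float,
    east_west_extent: float,
    measurement_uncertainty: float,
) -> float:
    """Option 3: the greatest of half the N-S extent, half the E-W extent,
    and the uncertainty of the measurement method (all in the same unit)."""
    return max(north_south_extent / 2, east_west_extent / 2, measurement_uncertainty)


# A 1 m square monument located from a map with an assumed 12 m uncertainty:
# the measurement uncertainty dominates.
print(precision_for_feature(1, 1, 12))        # prints: 12

# A hypothetical lake 4000 m by 1500 m, same method: the size dominates.
print(precision_for_feature(4000, 1500, 12))  # prints: 2000.0
```

The two calls show why a data consumer cannot tell from the precision alone whether it reflects the size of the object or the measurement method, which is the argument for handling the two separately.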

Use WKT for GeoShapeValue[edit]

This has gone on too long. I propose that Wikibase (and Wikidata) use WKT for GeoShapeValue. It was defined in ISO 13249-3 in 1999, but I believe it's used internally in the industry standard Shapefile format (the de facto GIS exchange standard), in use since the 1990s. It is a basic data format; if something like GML is desired later, we can always change the datatype and convert the data (no worse for the projects than having nothing for years).

The lack of this data type is causing serious harm to Wikidata, Wikipedia, and other WMF projects. We can't start doing anything until this is done. There are many, many important things that cannot be discussed or described without reference to their geography, and hence they cannot be discussed or described at all.

What needs to be done to formally propose this? esbranson (talk) 21:10, 22 November 2015 (UTC)

And if there isn't an affinity to the data type name, we can use WKTGeoShapeValue. esbranson (talk) 21:15, 22 November 2015 (UTC)
I've opened task T119346 on phabricator. esbranson (talk) 22:20, 22 November 2015 (UTC)
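
For readers unfamiliar with WKT, the sketch below shows how a simple polygon could be serialized into its text form in Python. The function name and the sample coordinates are assumptions for illustration; this is not a proposal for how Wikibase would store or validate such values, and it only covers a single outer ring without holes.

```python
from typing import List, Sequence, Tuple

# (x, y) pairs; for geographic data this is commonly (longitude, latitude),
# but the axis order ultimately depends on the coordinate reference system.
Point = Tuple[float, float]


def polygon_to_wkt(ring: Sequence[Point]) -> str:
    """Serialize one outer ring as a WKT POLYGON, closing the ring if needed."""
    if len(ring) < 3:
        raise ValueError("a polygon ring needs at least three points")
    points: List[Point] = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])  # WKT rings repeat the first point at the end
    coords = ", ".join(f"{x} {y}" for x, y in points)
    return f"POLYGON (({coords}))"


print(polygon_to_wkt([(30, 10), (40, 40), (20, 40), (10, 20)]))
# prints: POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
```

The appeal of the format in this context is that the whole shape is a single string, so it could be stored in an ordinary string-like value until a richer datatype is available.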

Snak definition[edit]

I'm unsure whether I understand how the article defines a snak.

“For lack of a better name, any such basic assertion that one can make in Wikidata is called a Snak (...)” (Wikibase/DataModel#Snak)

In particular I am uncomfortable with what the “such” is supposed to refer to here. Is the point that a snak can be either a property-value pair or a property statement declaring no value ?

Regards,

Tinm (talk) 17:48, 24 August 2015 (UTC)

Could one start a translation of this page?[edit]

As the title of the section says.
Ogoorcs (talk) 16:02, 4 March 2017 (UTC)