Links mentioned too early
- Indeed, the word "connect" was used instead of "link". Good catch, changed that. --Denny Vrandečić (WMDE) (talk) 11:17, 23 June 2012 (UTC)
On the special values none and unknown
- COMMENT 1: "This should be not mixed up with the notion that it is unknown whether an item has a value for a specific property, e.g. if a person had children." is confusing – the situation is essentially the same as with the year, i.e. in theory it should be possible to add either "none" or a specific value to the property. Not being able to do so, just as in the case of the year, can only be a consequence of limited knowledge. The real distinction seems to be between "unknown" as a consequence of not having invested sufficient effort in answering the question, and "unknown" after all realistically feasible effort has been invested. Proposal: Unknown means that the property either has a value or may be none, but that an exhaustive search of the relevant literature or data makes it likely that the answer is not known to science or mankind. It should not be confused with "unknown to the present individual editor". G.Hagedorn (talk) 11:42, 20 June 2012 (UTC)
- Remember that Wikidata is a secondary data-base. A value of unknown simply means "the reference says that the value for the given property is unknown" not "the value for the given property is unknown". It is not about the editor searching for the truth, but reporting what the source at hand says. If Britannica says Shakespeare's date of birth is unknown, then we can report this, even though a new book may include the date. --Denny Vrandečić (WMDE) (talk) 11:28, 23 June 2012 (UTC)
- COMMENT 2: DELTA (not applicable, variable, and unknown) and SDD (http://www.tdwg.org/standards/116/) had developed related vocabularies, SDD named it "status code" to distinguish it from normal values. Some of the codes in SDD may be too specific or may be too much internal management, the SDD list is:
- "To be checked" Explicit indicator to revisit a property later. This may occur when data are missing (known to exist, but not at hand for entering) or together with data (check entered data against additional information source): To be coded, Probably exists
- "Not to be coded" A decision was made not to enter data, in order to set priorities and conserve effort and resources: Not to be coded, Probably exists
- "Not applicable" Data are assumed to be impossible to exist: Cannot be coded, Cannot exist
- "Data unavailable" Data could not be obtained despite that an effort was made: Cannot be coded, May exist
- "Not interpretable" Data are known to exist, but are purposely not coded because not even an interpretation with probability modifiers was deemed possible: Cannot be coded, Exists
- "Data withheld" Data are present, but are not disclosed (e. g., because private or confidential).
- Of these, I consider "Data withheld" to be essential for the purposes of Wikidata (an example: children of a prominent person are known to exist, but privacy laws do not allow providing an image, birthdates, etc. for them).
- Personally, I would furthermore consider "Data unavailable" a useful status indicator for cases where data may exist, but a reasonable effort to obtain them failed.
- G.Hagedorn (talk) 11:42, 20 June 2012 (UTC)
- As far as I understand it, "Not applicable" is none and "Data unavailable" is unknown. "Not interpretable" should always be possible to formulate using qualifiers. "To be checked", "Not to be coded", and "Data withheld" should be handled by community processes. Thank you for the pointer to DELTA and SDD, indeed interesting! --Denny Vrandečić (WMDE) (talk) 11:28, 23 June 2012 (UTC)
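The mapping proposed in the reply above can be written down as a small lookup table. This is only an illustrative sketch: the names `SddStatus`, `HANDLING`, and `handling_for`, as well as the handling labels, are hypothetical and are not part of SDD or of the Wikidata data model.

```python
from enum import Enum


class SddStatus(Enum):
    """The six SDD status codes listed in COMMENT 2."""
    TO_BE_CHECKED = "To be checked"
    NOT_TO_BE_CODED = "Not to be coded"
    NOT_APPLICABLE = "Not applicable"
    DATA_UNAVAILABLE = "Data unavailable"
    NOT_INTERPRETABLE = "Not interpretable"
    DATA_WITHHELD = "Data withheld"


# Hypothetical labels summarising how each code would be handled,
# following the reply above (not an official mapping).
HANDLING = {
    SddStatus.NOT_APPLICABLE: "special value: none",
    SddStatus.DATA_UNAVAILABLE: "special value: unknown",
    SddStatus.NOT_INTERPRETABLE: "qualifiers",
    SddStatus.TO_BE_CHECKED: "community process",
    SddStatus.NOT_TO_BE_CODED: "community process",
    SddStatus.DATA_WITHHELD: "community process",
}


def handling_for(label: str) -> str:
    """Look up the proposed handling for an SDD status label."""
    return HANDLING[SddStatus(label)]
```

Under this reading only two of the six codes become special values in the data itself; the rest stay outside the stored statement.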
While confirming the importance of qualifiers (= modifiers in SDD), I would prefer "annotation" over "footnote". Footnote is a strongly non-semantic formatting notion, and while footnotes may be the most appropriate means of display when presenting some Wikidata in tabular format, this is certainly not so in many other cases, like natural language text or many infoboxes, where at least short statement annotations like "as of 2004" may be better served by adding them in parentheses after the value. G.Hagedorn (talk) 11:47, 20 June 2012 (UTC)
- Changed, good point. I hope that "as of 2004" would not be an "annotation" but rather a qualifier "as of" with the value "2004". --Denny Vrandečić (WMDE) (talk) 11:14, 23 June 2012 (UTC)
- Proposal: Split the reference into the main reference object, which ideally should have a URI identifier (doi, isbn, url, etc.) and a free-form text reference detail. The latter may be used to refer to individual chapters, pages, page ranges, figures, tables, appendices, track numbers, seconds since start of video, or other fragment identifiers within unpaginated or unnumbered items. G.Hagedorn (talk) 11:42, 20 June 2012 (UTC)
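A minimal sketch of the proposed split, assuming a two-field record: an identifier for the main reference object plus an optional free-form detail. The class `Reference`, its field names, the helper `cite`, and the `urn:example:` identifiers are all hypothetical placeholders, not part of any existing Wikidata structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    """A reference split into a main object and a free-form detail."""
    source_uri: str               # identifier of the main reference object (doi, isbn, url, ...)
    detail: Optional[str] = None  # e.g. chapter, page range, figure, seconds into a video


def cite(ref: Reference) -> str:
    """Render a reference, appending the detail only when one is given."""
    if ref.detail is None:
        return ref.source_uri
    return f"{ref.source_uri}, {ref.detail}"


# Two references to the same (placeholder) work that differ only in the detail.
whole_work = Reference("urn:example:book-1")
one_page = Reference("urn:example:book-1", "p. 42")
```

Keeping the identifier separate means that references to different pages of one work still share the same `source_uri` and can be grouped by it.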
On moving the discussion here
Note: the above, as moved here, no longer makes much sense, because it referred to the context in which it was originally posted. To get the context, see http://meta.wikimedia.org/w/index.php?oldid=3845480 G.Hagedorn (talk) 11:03, 22 June 2012 (UTC)
- Yeah, but I want the primer to be short, that is why I asked Lydia to move the comments here. Discussions should be on the discussion page. --Denny Vrandečić (WMDE) (talk) 11:14, 23 June 2012 (UTC)
Doug Lenat's comments
Overall, this is very clear and well organized. But let me just sketch (1) the aspect that I'm worried about, and (2) an idea for how to fix that:
(1) To a large degree this perpetuates the simple <property item value> triples that the mainstream has (d)evolved into, plus one preconceived meta-level qualifier "escape valve" (from my POV, reference is just another qualifier). As you know, we started Cyc with a frame-and-slot representation much like this, and had to keep expanding that representation language, CycL, kicking and screaming at every step, in order to represent the everyday things that are said in, e.g., a newspaper. By now CycL has turned into a higher order logical language, but well short of that are issues about representing inherently ternary relations (such as "between") and higher-arity ones; negation (the various pragmatically important species of such); disjunction; statements about statements (maybe you are already allowing a statement to be an item -- the current data model document doesn't make that obvious); and so on.
(2) One idea for having our cake and eating it too would be to define the data model to be a nested set of layers, with one layer being more or less like the current data model. There would be additional expressiveness enabled and represented formally at each higher layer, with relatively well-understood/explicit semantics and algorithms for lifting upward and projecting downward. For example, some of the meta-level properties at one layer might be projected down as qualifiers at the next inner layer. The motivation for using the more elaborate upper layers (which would have increasing representation machinery) is increased expressivity, increased ability to have mechanical inference derive useful conclusions automatically (e.g., reasoning about information which at the lower layer would be relegated to qualifiers). The motivation for using the less elaborate layers is increased efficiency in knowledge acquisition and inference (it's easier for more people to learn that language and use it properly, and it's much faster for a program to derive conclusions from assertions all of which can be expressed in that simpler language.)
This idea is pragmatically useful because most of the world's data really is simple enough to fit into your current data model, most of the rest could fit into an OWL-level model, most of the rest could fit into second order predicate calculus, most of the rest... etc. And, analogously, most of the people in the world can learn to read/write the current statements, with increasingly smaller fractions able to (learn to) read/write the increasingly more expressive formal languages.
The impact of this might not be too large on the current design and near-term efforts, since the innermost layer needs to exist anyway, and will be the most useful one; but if we start out with a plan for where we want to end up, we might avoid cutting corners that later have to get uncut and redone or, even worse, as with the original HTML, are so successful and heavily used that making even the obvious improvements and extensions takes forever.
- +1 for "References are just another type of qualifier". I think I will extend this comment later --✓ (talk) 19:59, 8 September 2012 (UTC)
Ambiguous fields and data records
Following the Berlin example, even assuming the same Berlin, a data field such as the area or population of Berlin can be ambiguous. Berlin has an effective urban area, but Land Berlin, an artificially defined administrative area, has an area and population which may differ from those of the city itself. For Bremen, is it the city, the Bundesland, or Bundesland Bremen minus Bremerhaven? Each one is a legitimate statistical area, depending on the spatial reference.
In Britain we have a constant clash of geographies: what is the area of "Derbyshire"? The name could refer to the historic county, or to the local government county of the same name or to the "ceremonial county", all of which have very different boundaries and any one of which might be a legitimate geographical reference. One might have a separate data record for each or one might accommodate them all in one with three sets of data fields. Derbyshire is but one random example: there are many more.
In a similar manner, for a town infobox, there might be a "county" field but it would depend on which type of county: Peterborough could be in Northamptonshire, City of Peterborough or Cambridgeshire, depending on the spatial reference.
For instance, the population of Berlin changes, and you will get a number from the statistical office every year. How can this be included in Wikidata? Certainly one can add hundreds of Claims named „Population (1900)“, „Population (1901)“, …, „Population (2012)“. I doubt that this is „clever“. Is there something like a Datatype „Table“?
E.g.: Berlin has the Claim „Population“ with the Datatype „table“. The Value of Population holds two columns of different datatypes, the date and the integer (population at this date). Now a good question is how this could be referenced in a clever fashion. As I understand it, one can add more than one reference to one Claim. What happens if different rows of the table have to be referenced differently?
- This is what qualifiers are in the data model. It'd be the same property that is used (in the case you mention probably number). But each of them would have a qualifier saying for which year it is. --Lydia Pintscher (WMDE) (talk) 17:03, 12 March 2013 (UTC)
- It is three-dimensional at least: [data field] x [date] x [which meaning of „Berlin“]
- To take the Bremen example again, one can legitimately say Bremen and mean Land Bremen, or Land Bremen ohne Bremerhaven, or the actual urban area of the town of Bremen regardless of Bundesland borders; the latter is what most ordinary people, if not the political class, would mean by Bremen. We would need provision for all these in the table, and I see no difficulty with this beyond presentation. Visitor from Wikishire (talk) 08:46, 3 April 2013 (UTC)
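The qualifier approach described in the replies above can be sketched as follows: one "population" property is reused, and each claim carries qualifiers for the year and for which meaning of "Berlin" is intended. The `Claim` class, the `select` helper, the qualifier keys `year` and `area`, the reference strings, and all numbers are hypothetical placeholders, not real statistics and not the actual Wikidata structures.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One value of a property, with qualifiers and its own references."""
    prop: str
    value: int
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)


def select(claims, prop, **qualifiers):
    """Return the claims for `prop` whose qualifiers match all given ones."""
    return [
        c for c in claims
        if c.prop == prop
        and all(c.qualifiers.get(k) == v for k, v in qualifiers.items())
    ]


# Placeholder values only; each claim (each "row") has its own reference.
claims = [
    Claim("population", 100, {"year": 1900, "area": "Land Berlin"}, ["source A"]),
    Claim("population", 110, {"year": 1901, "area": "Land Berlin"}, ["source B"]),
    Claim("population", 90, {"year": 1901, "area": "urban area"}, ["source C"]),
]

land_1901 = select(claims, "population", year=1901, area="Land Berlin")
```

Because every claim holds its own reference list, the question of referencing different rows of a "table" differently does not arise: each row is a separate claim.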