For the Sense gloss, I'd rather use schema:description than rdfs:label. I don't think a gloss makes sense as a label: it's descriptive text, not a name.
Extension talk:WikibaseLexeme/RDF mapping
I totally see your point, Stas. But one big advantage of having an rdfs:label is that the rdfs:label is used by many tools for default rendering. Thus having an rdfs:label is better than not having it.
This leads to the question whether it is better to have something else as the rdfs:label. The one other candidate I can think of is the associated Lexeme's label. But this means that all Senses of a Lexeme have the same label, which also might be confusing.
What is your counterproposal: 1) have no rdfs:label, 2) have the gloss be both the rdfs:label and the schema:description, or 3) use the Lexeme's lemma as the rdfs:label on the Sense? Or 4) something else entirely?
I agree that the mapping is not perfect, but it seems pragmatically OK. No?
I agree with Denny. For me, we add rdfs:label alongside skos:definition only for compatibility with rendering tools. But imho the Wikidata Query Service should only index skos:definition in order to keep a clean schema, for saving space, and in order to avoid affecting the results of the existing queries that are using rdfs:label to look for items.
I think I'm ok with having rdfs:label if it helps real-life use cases. I probably won't keep it in WDQS but if it's useful for other tools then fine with me.
I am quite curious, why is it rdfs:label rather than lemon:SenseDefinition?
Hey Noé. We are using skos:definition for sense glosses and additionally rdfs:label for compatibility with external tools. In query.wikidata.org only skos:definition is loaded. I don't see the definition of lemon:SenseDefinition in the Ontolex draft we are following, and in the Lemon model it seems to be a class and not a property.
Hi Tpt. You're right, lemon:SenseDefinition is a class; the property is lemon:definition and, interestingly, it is not part of the Ontolex model (but Lexinfo 2.0 still uses it, as it is not up to date). So skos:definition makes sense. rdfs:label seems too vague to me, but I get the point: compatibility is important.
Ok, I thought the Lexical data will provide glosses and no definitions, but the property used for now is definition. How do you map both glosses and definitions? It is not very clear to me.
Lexical data indeed provides only glosses as part of the native data model. We mapped the relation between a sense and its gloss(es) to skos:definition. It looks fine to me because skos:definition is, I think, much broader than the linguistic "definition" (cf. the SKOS spec). This usage was suggested by the sentence "A definition can be added to a lexical concept as a gloss by using the skos:definition property" from the Ontolex draft.
I agree skos:definition may include both glosses and definitions, as defined by Ontolex, but I don't think it's a good option to mix both under one property.
This may cause an incompatibility with other lexical ontologies that use the same property for definitions. And if at some point the Wikidata community decides to include both glosses and definitions, they may have to change the property used for the former, won't they?
Finally, for people with no knowledge of lexicography or ontology, it may make it unclear what a gloss is in Wikidata.
Well, I have no better solution to offer right now, but it is puzzling.
Number of statements in data nodes
FYI, in T195387 I added support for writing the number of statements to the page_props table, which means it will probably also end up in the query service. Since the granularity of page data is the page, not the entity, I decided to count all the statements of the page there, including the statements on forms and senses. This might be a bit awkward on the query service, especially if we merge data nodes and entities as we currently do for items. Given a lexeme like
wd:L64723 wdt:P2 wd:Q3;
    ontolex:lexicalForm wd:L64723-F1;
    ontolex:sense wd:L64723-S1.
wd:L64723-F1 wdt:P2 wd:Q4.
wd:L64723-S1 wdt:P2 wd:Q5.
you would get
wd:L64723 wikibase:statements 3.
instead of, as might be expected,
wd:L64723 wikibase:statements 1.
wd:L64723-F1 wikibase:statements 1.
wd:L64723-S1 wikibase:statements 1.
Do you think that’s acceptable?
Given that we don't have other page data on forms, I think it is. Also, these markers are most useful as a kind of gauge of item quality and such. Since the real "item" here is the Lexeme, and Forms and Senses are just sub-structures that exist only in the context of a Lexeme and only have IDs for technical reasons, I think it is OK.
There is the use case where you may want to see Forms/Senses without statements, but it's easy to check for with SPARQL.
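A minimal sketch of such a check, for Forms. It assumes that statements on Forms are exported through the same p: predicates as on items (linked from the property via wikibase:claim) and that Forms are reachable from their Lexeme via ontolex:lexicalForm, as in the example above. The LIMIT is only there to keep the result small.

```sparql
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Forms that carry no statements at all.
SELECT ?lexeme ?form WHERE {
  ?lexeme ontolex:lexicalForm ?form .
  FILTER NOT EXISTS {
    # Any predicate that some property declares as its claim predicate
    # counts as a statement link.
    ?form ?claim ?statement .
    ?property wikibase:claim ?claim .
  }
}
LIMIT 100
```

Swapping ontolex:lexicalForm for ontolex:sense would give the same check for Senses.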
only lexemes have data nodes; now documented on proposal itself
The proposal currently doesn’t mention data nodes (e. g. wdata:Q2). I don’t think there’s too much to discuss (we shouldn’t need any special triples beyond the standard a schema:Dataset, and schema:about wd:Q2 that we already have for items), but I do have one question: should we have data nodes for sub-entities (forms and senses) as well, or only for lexemes?
Yes, I omitted them because imho they should be the same as for items/properties and not depend on the entity type. The wdata:... nodes are about the "storage unit" (a.k.a. wiki page) that contains the data and could have a revision timestamp... So I believe it makes sense to have them only for lexemes, and not for forms and senses, which belong to the same "storage unit" as their lexeme. It will also stay consistent with the current behaviour of outputting some page properties there. If you are OK with it, I could add a section about it to the proposal.
I agree that it makes more sense to only have data nodes for the lexemes, but I think it would be nice to mention this somewhere in the proposal.
I agree that Forms and Senses don't need to have separate data nodes, but I think they should be connected to the Lexeme's data node via schema:about.
Yes, it's indeed useful. But it adds a lot of triples for what I believe is a use case that could be easily covered by a property path (schema:about/(ontolex:lexicalForm|ontolex:sense)?). @Smalyshev (WMF) what do you think about it?
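As a sketch of how that property path could be used, the query below starts from the data node of the example lexeme L64723 and returns the lexeme itself plus its forms and senses. It assumes a store loaded from the RDF export in which wdata: nodes are kept as separate resources, and it assumes the usual Special:EntityData namespace for the wdata: prefix.

```sparql
PREFIX schema: <http://schema.org/>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX wdata: <http://www.wikidata.org/wiki/Special:EntityData/>

# Everything stored on the page behind one data node:
# the lexeme (zero extra steps) or one of its forms/senses (one step).
SELECT ?entity WHERE {
  wdata:L64723 schema:about/(ontolex:lexicalForm|ontolex:sense)? ?entity .
}
```

The trailing ? makes the last step optional, which is why the lexeme itself is included in the results.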
The wdata: node is merged into wd:, so I don't think an extra schema:about makes much sense for querying: you can easily get from a Form/Sense to the Lexeme node, which will have the schema:about. Long story short, I think the scheme is good as it is now.
WDQS data differences
I’m curious which differences Stas will want to introduce for WDQS? :)
I would prefer if we could keep the a wikibase:Lexeme (:Sense) triples (though we could perhaps drop the redundant ontolex: versions) – the fact that we drop a wikibase:Item is occasionally inconvenient for queries, since it leaves you with no obvious generic way to select “any item”.
And similar to how we drop skos:prefLabel for items (redundant with rdfs:label), perhaps we’ll drop one part of the rdfs:label pairs? Though I’m not so sure about that – all of them seem valuable in a way, rdfs:label as the generic predicate and lemma/presentation/definition as the one specific to a certain entity type. (For example, when writing a query that’s specifically about senses and I want to show the gloss to the user, I think using skos:definition would make the query more readable.)
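A sketch of the kind of sense-specific query meant here, using the example lexeme L64723 from earlier on this page. It assumes senses are linked from the lexeme via ontolex:sense and that glosses are language-tagged literals on skos:definition, as in the current proposal.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# English glosses of all senses of one lexeme.
SELECT ?sense ?gloss WHERE {
  wd:L64723 ontolex:sense ?sense .
  ?sense skos:definition ?gloss .
  FILTER(LANG(?gloss) = "en")
}
```

With rdfs:label in place of skos:definition the query would return the same rows under the proposed mapping, but it would no longer say that the text is a gloss.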
If we have ontolex:LexicalEntry anyway, why also keep wikibase:Lexeme? It just doubles the data and I don't see much advantage. In the dump, sure, but I don't see how it would help querying.
I agree. We should not have wikibase:Sense in the query service if we add them alongside the ontolex: classes in the RDF export.
I agree to keep classes for lexemes, forms and senses. I would rather drop the Wikibase flavor than the ontolex: version, in order to have content that is as compatible as possible with ontolex in the query service.
I agree with you on readability. I would drop rdfs:label instead of wikibase:lemma, ontolex:representation and skos:definition in order to have more readable queries. It would also avoid cases like lemmas unexpectedly appearing in existing SPARQL queries that are using rdfs:label.
I recommend keeping rdfs:label for Lexemes, Forms, and Senses, for compatibility with generic tools like Protege.
@Duesentrieb could you explain a bit more what the relation to Protege is in this context? I.e. how it is used there, etc.? I am not too familiar with these tools.
- Not sure I see the point for having wikibase:lemma, since rdfs:label would do exactly the same thing.
- I would still make schema:inLanguage with language code, even if that is derived data and not primary. It may simplify querying a lot. And we already have config for this anyway. Of course if the language has no ISO code that one would be empty.
- dct:language has a range of dcterms:LinguisticSystem. We don't have this class on our language items, so it might be wrong.
- From the two above, I think we need to use schema:inLanguage and wikibase:language, unless we find a good language predicate with unrestricted range.
- Maybe we change wikibase:grammaticalFeature to wikibase:partOfSpeech? To keep it similar to lexinfo.
- Not super-happy about having both ontolex:representation and rdfs:label but I see how it could be useful
- wikibase:grammaticalFeature sounds fine
- skos:definition seems to be closer to description than to label... But depends on usage.
Thank you very much for your feedback. Some comments:
1. wikibase:lemma has the advantage of being specific to lemmas, and so allows queries like "get all lemmas with the label "foo"@en" without having to filter on entity types. I would keep wikibase:lemma in the Query Service and filter out
2. Big +1 to it. I'm adding it to the document as derived data.
3. I don't think it's a big problem. The triple dct:language rdfs:range dcterms:LinguisticSystem means that RDFS entailment on our data derives that all items used as a language for Lexemes are also dcterms:LinguisticSystem. It does not look wrong to me. We already have such behaviour with, e.g., the use of cc:license, which has cc:Work as its domain. If we want to be safe and avoid using this term, we should probably also avoid reusing ontolex: terms, which come with a quite expressive OWL ontology: http://www.w3.org/ns/lemon/ontolex
4. See 3.
5. If we use wikibase:partOfSpeech we move a bit away from the wording used by the abstract data model and the JSON specification. And it seems to me that "grammatical feature" is a bit broader than "part of speech". But I am not very familiar with computational linguistics, so I may be wrong.
6. I believe we should just drop rdfs:label from the SPARQL endpoint.
7. It is the property that is suggested by the ontolex: specification to encode glosses. The SKOS primer presents it as: "skos:definition supplies a complete explanation of the intended meaning of a concept".
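To illustrate the point about a lemma-specific predicate made at the start of this reply, here is a minimal sketch. It assumes wikibase:lemma is exported as a language-tagged literal on the lexeme, as in the current proposal.

```sparql
PREFIX wikibase: <http://wikiba.se/ontology#>

# Lexemes whose lemma is "foo" in English.
# No entity-type filter is needed because only lexemes carry wikibase:lemma.
SELECT ?lexeme WHERE {
  ?lexeme wikibase:lemma "foo"@en .
}
```

With rdfs:label alone, the pattern ?entity rdfs:label "foo"@en would also match items and any other labelled entity, so an extra type triple such as ?entity a ontolex:LexicalEntry would be needed to restrict the results to lexemes.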
+1 on 3. and 4. as described by Tpt.
If range is not an issue for dct:language then maybe we should use it. Having an extra prefix is not a huge deal: as soon as we add one, we can add more. I don't think it is an issue then.
Just a note, all the concerns I had were already expressed here by Smalyshev - I also prefer just using rdfs:label but from the discussion there seem to be other people who don't like it; not sure we can fully resolve it, but count me as another vote to at least keep rdfs:label as noted for the Lexemes and Forms (you could drop the Senses case though as it's not really a label). On the language labeling, Dublin Core has a very common problem of inconsistent usage but I suppose if Wikidata is using it consistently it's not a problem here. Count me as another vote for a wikibase:language predicate though if you're thinking about it. Otherwise, congratulations on this proposal, it seems pretty simple and well thought-out!
More general questions
1. Would Lexemes be in their own dump or included in the main Wikidata dump?
2. When is it planned to enable RDF and dumps (a rough estimate is OK)? It will need some work on the WDQS side (esp. due to the Lexeme/Form thing: right now the Updater wouldn't understand that part properly), so I'd like to plan for it.
3. Any estimates of what lexical data growth will look like? That depends on whether we do any mass imports, of course. This is important because some performance tweaks for new data will require a full DB reload, but if there's not a lot of data we can wait with it for a while.
1. Unless we do something I think they're in the main Wikidata dumps for now. Given that the dumps are a problematic area we should probably think about separate dumps for Lexemes. @Leszek Manicki (WMDE) might be able to say more about which data will be added to which dump at the beginning. (We turned off a lot of exports or stripped them down to the bare minimum for now.)
2. RDF is blocked on finishing this mapping. Once we have agreed on the mapping we can start implementing. I can hold it off on whatever other work needs to be done first in the Query Service. Just let me know. Then we can start creating blocker tickets etc.
3. We'll be asking people to take it slow - especially at the beginning. I hope editors will follow that wish. Beyond that I have a hard time making predictions.
We don't need to block anything on WDQS I think - until we enable Lexeme namespace on WDQS, Updater will ignore it and if it's separate dump then it will live separately and we can enable it whenever we're ready. I just wanted to know when to plan the work on it.