Topic on Extension talk:WikibaseLexeme/Data Model

LA2 (talkcontribs)

In the current text, the "English noun bank" is used as an example. The text reads: "(e.g. "financial institution" and "edge of a body of water" for the English noun bank)". But if you look in en.wiktionary for "bank", the entry is structured as four different etymologies, many of which take the form of both a noun and a verb. The first etymology derives from Italian banca, meaning bench, and refers to a financial institution, where there is a noun (a bank) and a verb (to put money in the bank). The second etymology refers to physical geography such as a beach, where there is again a noun and a verb. But here, in the WikibaseLexeme data model, nothing is mentioned about etymologies, only the triplet of language, lemma, and part of speech. Why? Is this something that was forgotten by mistake, or is it a deliberate design? In languages other than English, the same lemma and part of speech might have different inflections for different etymologies. In Swedish, the plural is banker (financial institutions) and bankar (beaches), respectively.

Jpgibert (talkcontribs)

Hi,

Furthermore, in French (and this is true of every language) there are many words whose etymology is uncertain. For example, the word Macabre has three hypotheses about its etymology, the word Galimatias has two, etc. I think it is important to take this into account in the model. Moreover, the hypotheses do not all have the same level of credibility, so it would be useful to provide a mechanism for ranking them, IMHO.

Daniel Kinzler (WMDE) (talkcontribs)

Etymology is not mentioned in the model, because we expect it to be represented using Wikidata-Style "Statements". Statements give you exactly the power and flexibility you are asking for: you can have multiple competing statements with different sources, you can attach them to Lexemes or to individual Forms or Senses, you can mark them as preferred or deprecated, or qualify them using any property you like, use them to refer to Wikidata Items, etc.
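To make this concrete, here is a minimal Python sketch of how competing etymology Statements might hang off a Lexeme, its Forms, or its Senses. It is only an illustration: the class names, the `derived_from` property, the `certainty` qualifier, and the hypothesis and source strings are all hypothetical placeholders, not the actual Wikibase data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass
class Statement:
    prop: str                                        # hypothetical property name
    value: str                                       # placeholder value
    rank: Rank = Rank.NORMAL
    qualifiers: dict = field(default_factory=dict)   # extra detail about the claim
    references: list = field(default_factory=list)   # sources backing the claim


@dataclass
class Form:
    representation: str
    statements: list = field(default_factory=list)


@dataclass
class Sense:
    gloss: str
    statements: list = field(default_factory=list)


@dataclass
class Lexeme:
    language: str
    lemma: str
    lexical_category: str
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    statements: list = field(default_factory=list)


# Three competing hypotheses, each with its own (placeholder) source.
macabre = Lexeme("fr", "macabre", "adjective")
macabre.statements += [
    Statement("derived_from", "hypothesis A", Rank.PREFERRED,
              references=["placeholder source 1"]),
    Statement("derived_from", "hypothesis B",
              references=["placeholder source 2"]),
    Statement("derived_from", "hypothesis C", Rank.DEPRECATED,
              qualifiers={"certainty": "disputed"},
              references=["placeholder source 3"]),
]
```

The same `statements` list exists on `Form` and `Sense`, so a claim can be attached at whichever level it actually applies to.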

Etymology is of course a complex topic, and I don't expect it to be covered exhaustively using Statements. The etymological information represented on Wikidata will be the machine-readable minimum. For a thorough explanation, we'd still need text, on Wiktionary, I expect.

As to the same lemma having different inflections based on etymology: if the inflection is different, it's not the same Lexeme in the sense of this model. In the proposed model, a Lexeme does not correspond directly to what is now on a Wiktionary page: a Wiktionary page would cover the lemma "bank" in all languages, all word classes, and all morphologies. In Wikidata, there will be one Lexeme for every morphology, and thus at least one for each combination of language and word class. But in some cases, there would even be multiple Lexemes for the same language and word class, if they differ in morphology. In German, for instance, there would be two distinct Lexemes modeling the nouns "die See" (the sea) and "der See" (the lake), because they differ in morphology, since they have different grammatical genders (to add to the confusion, "die Seen" is the plural form of both "der See" and "die See").
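A small Python sketch of the "See" case may help. It assumes a simplified record in which grammatical gender stands in for "morphology"; the `Lexeme` class and the `find_lexemes` helper are hypothetical names invented for this illustration, not part of WikibaseLexeme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    language: str
    lemma: str
    lexical_category: str
    gender: str   # stands in for "morphology" in this sketch
    plural: str
    gloss: str


LEXEMES = [
    Lexeme("de", "See", "noun", "masculine", "Seen", "lake"),  # der See
    Lexeme("de", "See", "noun", "feminine", "Seen", "sea"),    # die See
]


def find_lexemes(language, lemma, lexical_category):
    """Return every Lexeme matching the (language, lemma, category) triplet."""
    return [
        lx for lx in LEXEMES
        if (lx.language, lx.lemma, lx.lexical_category)
        == (language, lemma, lexical_category)
    ]


matches = find_lexemes("de", "See", "noun")
glosses = sorted(lx.gloss for lx in matches)
```

Here `matches` contains two records: the triplet alone does not identify a single Lexeme, and only the differing gender separates the two, since the plural is the same for both.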

Tropylium (talkcontribs)

This issue is basically an abridged form of the homonymy versus polysemy problem, for which there is not always an unambiguous solution. Wiktionary draws one fairly hard line: different etymologies are taken as proof that e.g. bank is at least three homonymous words (the 'bench' ~ 'row' senses could probably be argued to be polysemic) instead of one or two. Other criteria could be used, such as difference in meaning + inflection. For Wikidata's uses, etymology is probably not the best choice, since Wikidata, IIUC, is not planning on formalizing etymology too much. (Speaking as an etymologist, this is a good idea. Etymologies are theories rather than facts, and any exhaustive formal model of them would have to operate on a probabilistic rather than binary logic.)

Note though that inflection alone does not work as a sufficient distinction between homonyms, given variation such as shit : past tense either shitted or shat. Moreover, note that this is not necessarily looser than the etymological condition either. By this criterion, e.g. grind : grinded 'to gather experience points in a video game' is a different word from grind : ground 'to make into powder', while by the etymological criterion it's a single word with variable inflection.
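The difference between the two criteria can be shown with a short Python sketch that counts how many "words" each criterion yields for the examples above. The etymology labels are placeholders that only encode "same origin", and `count_words` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# (lemma, past tense, etymology label); labels are placeholders meaning "same origin"
USAGES = [
    ("grind", "grinded", "grind-origin"),  # video-game sense
    ("grind", "ground", "grind-origin"),   # 'to make into powder'
    ("shit", "shitted", "shit-origin"),    # same verb, variant past tense
    ("shit", "shat", "shit-origin"),
]


def count_words(usages, criterion):
    """Count distinct words per lemma under the given splitting criterion."""
    groups = defaultdict(set)
    for lemma, past, origin in usages:
        groups[lemma].add(criterion(past, origin))
    return {lemma: len(keys) for lemma, keys in groups.items()}


by_inflection = count_words(USAGES, lambda past, origin: past)
by_etymology = count_words(USAGES, lambda past, origin: origin)
```

Under the inflection criterion both lemmas split into two words, which wrongly separates shitted from shat; under the etymological criterion each lemma stays a single word, which merges the two senses of grind.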

Psychoslave (talkcontribs)

I completely support your point about the theoretical nature of etymology.

However, I'm rather confident that this kind of information can be structured in a model that doesn't force everything under "one hypothesis set to rule them all".

Looking at the Wikidata glossary, I think that the way claims and statements are defined, with the ability to add qualifiers, ranks and sources, forms a good framework for presenting multiple theories. Maybe the rank labels wouldn't fit well, but with qualifiers you should be able to say whether a theory has active supporters or not, and whether it was proven or invalidated by some practical means… Actually, you might think about making statements about a statement and so on. I don't know if that is currently possible within Wikidata; maybe someone like @Lydia Pintscher (WMDE): could confirm or refute that.

Lydia Pintscher (WMDE) (talkcontribs)

Yeah, there is a lot you can do with qualifiers. Have a look at the item for Barack Obama, for example, and how his country of citizenship is modeled. Or the country statements for Jerusalem. It is not possible to make statements on statements, but there usually is a way to do it with qualifiers, ranks and references.
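As a rough sketch of that approach in Python: information that might otherwise be a "statement about a statement" is stored as qualifiers and references on the statement itself, and the rank decides which statements count as the best ones. The dictionaries below are a simplified stand-in rather than the real Wikibase JSON, and the `supported_by` and `status` qualifier names, the values, and the sources are hypothetical placeholders.

```python
def best_statements(statements):
    """Return the preferred statements, or the normal ones if none is preferred."""
    for rank in ("preferred", "normal"):
        chosen = [s for s in statements if s["rank"] == rank]
        if chosen:
            return chosen
    return []  # only deprecated statements, or none at all


# Competing hypotheses; the meta-information lives in qualifiers and references.
hypotheses = [
    {"value": "hypothesis A", "rank": "normal",
     "qualifiers": {"supported_by": "placeholder group"},
     "references": ["placeholder source 1"]},
    {"value": "hypothesis B", "rank": "preferred",
     "qualifiers": {},
     "references": ["placeholder source 2"]},
    {"value": "hypothesis C", "rank": "deprecated",
     "qualifiers": {"status": "refuted"},
     "references": ["placeholder source 3"]},
]

best = [s["value"] for s in best_statements(hypotheses)]
fallback = [s["value"] for s in best_statements(hypotheses[::2])]
```

With all three hypotheses present, only the preferred one is returned; if no statement is preferred, the normal ones are used, and deprecated statements are never selected.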

Psychoslave (talkcontribs)

Thank you @Lydia Pintscher (WMDE): for these examples.

More of a side note, out of curiosity: was it purposefully chosen NOT to allow statements on statements, or is it an idea that was never raised? In the former case, if you did have some discussion on the topic, I would be interested to read it. Also, while I'm at it: is it technically possible to make a Wikidata item about a Wikidata statement? And is that allowed, forbidden, undiscussed, or something else within Wikidata?

As I said, this is really just curiosity, and I do agree that qualifiers already offer a lot of flexibility.

Lydia Pintscher (WMDE) (talkcontribs)

It was a design choice at the very beginning of the project because it would introduce a huge amount of complexity in the UI/API/data model for what seems like very little gain. I don't think there was ever much discussion about it. Items about statements: In theory possible but... not sure again how useful it is. I guess we'll need to discuss concrete cases and go from there.

Psychoslave (talkcontribs)

I think I should contribute more to Wikidata to get a feel for it, so that I can either offer concrete cases that are actually relevant or realize that I cannot come up with any.
