Extension talk:WikibaseLexeme/Data Model

Pronunciation representation and audios

BoldLuis (talkcontribs)

Newbie question: where can I see them for a word (in a language)? (For the moment, the simplest question.)

Yug (talkcontribs)

Hello BoldLuis. For audio recordings of words, the best place to start could be Wikimedia Lingualibre.org's language categories, such as Commons:Category:Lingua_Libre_pronunciation-fra. Then find the (most prolific) speakers of your target language, and dive in. Lili currently has 400,000+ audios, mostly in 10 to 20 Western and Indic languages. We do our best to diversify and find new languages. You may also come and ask on Lingualibre, where some users have SPARQL experience and can help you further.

BoldLuis (talkcontribs)

Thank you a lot.

Yug (talkcontribs)

Also, FYI, Lingualibre's Recording Studio has a little-known feature to video-record sign languages via the computer's camera. We have already recorded a few FSL (French Sign Language) signs. We strongly recommend a clean background when filming. If interested, please ping us on our main forum so we can add Spanish Sign Language as well, which will allow people to video-record it.

Language learning features

Ajoposor (talkcontribs)

Is it possible to include fields at the Lemma or Senses level that may later support language learning?

I'm a daily user of Wiktionary, mainly because I've been learning languages for many years. I would say that my use of Wiktionary for my mother language is a tenth or a hundredth of my use of it for second language learning.

In learning a language it could be helpful to be able to query Wiktionary for senses by their frequency. There are frequency lists already, but I've not yet seen one that does so by sense.

In addition to examples by sense, there could be a section, also by sense, supported by the learners' community to include, for instance, mnemonic sentences and/or images that may help while learning a word. These mnemonics could be rated by users, so that the most voted are displayed. Here there may be an option to set the mother language, since a mnemonic could be very good if your mother language is French but make no sense for someone who speaks Japanese.

With this data, some sister projects may be developed as learning tools, like learning cards, improved from what is already available outside wiki, using not only repetition learning, but good mnemonics.

Denny (talkcontribs)

I like the ideas a lot. Regarding the sense frequency, we would need some source to get this data from - is there something like this?

Tropylium (talkcontribs)

I'm not sure I follow entirely, but this is not data that is possible to record "about a word" in isolation. Frequencies always apply to a specific corpus, e.g. prose, newspapers, technical writing, online forums…, not to a language as a whole. It therefore seems inappropriate for Wikidata. I agree though that the Wiktionary coverage could be improved a lot, but that's something to take up at the individual Wiktionaries you're interested in.

Ajoposor (talkcontribs)

Hi, yes, a frequency is specific to a corpus, but that doesn't mean we must discard its use. We may need to agree on a corpus; for instance, the whole of Wikipedia could be used as a proxy. There are some dictionaries that already display a frequency. Frequency counting could be done with algorithms, so that frequencies may be adjusted over time.

But going beyond the usual frequency, I would suggest having a methodology so that frequencies are assigned to each meaning (there are many words with multiple meanings, some of them rarely used, and that poses a problem to language learners). In order to do so, there could be an algorithm that takes sample texts containing a word; this sample would be left to users to analyze, assigning a meaning number to each appearance. In this way, frequencies could be calculated by meaning. It could be done through a gamification of this task.
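The counting step of this proposal can be sketched in a few lines. This is only an illustration, assuming each occurrence of a word in the sample texts has already been given one sense label by volunteers; the function name `sense_frequencies` and the labels such as `bank-S1` are hypothetical and not part of any existing Wikidata or Wiktionary tool.

```python
from collections import Counter


def sense_frequencies(annotations):
    """Return the relative frequency of each sense label.

    `annotations` holds one sense label per occurrence of the word in the
    sample texts, as assigned by volunteer annotators (an assumption of
    this sketch, not an existing data format).
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        # No annotated occurrence yet: nothing to report.
        return {}
    return {sense: n / total for sense, n in counts.items()}


# Hypothetical annotations for four occurrences of the English word "bank".
sample = ["bank-S1", "bank-S1", "bank-S2", "bank-S1"]
freqs = sense_frequencies(sample)
print(freqs)
```

The result is only meaningful relative to the corpus the samples were drawn from, which is the point raised above about frequencies being corpus-specific.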

I tried to find out who the users of dictionaries are but couldn't find an answer. We may be so used to dictionaries that we take them as a given. But it is important to know who is using the dictionary and what their needs are.

I'm interested in addressing the needs of language learners. Learning a language is a task that takes A LOT of time. So it is a perfect target for improvement and optimization.

BoldLuis (talkcontribs)

I agree with using this frequency data.

Senses connected to Wikidata Items

Sander Stolk (talkcontribs)

Dear all, I've noticed the editorial note on this page that there's still a need to address how Senses can be related to Wikidata Items without implying synonymy between senses that are related to the same Item. (See quote below.)

  • Editorial Note: We should find a good place to address a common source of misunderstandings: Senses can be connected to Wikidata Items via an appropriate Statement they evoke or denote (compare lemon:denotes and lemon:evokes). However, such a connection should not be interpreted as the lexeme actually representing the concept defined by the item (compare lemon:LexicalSense and lemon:LexicalConcept). In particular, if two lexemes have senses that refer to the same concept in this way, this does not imply that the two lexemes are synonyms. Example: The lexemes for the English adjectives "hot" and "cold" could both have a sense that refers to Q11466 (temperature), even though they are antonyms.

This issue has been recognised and addressed in the lemon-tree vocabulary. There, the property http://w3id.org/lemon-tree#isSenseInConcept fits the purpose and has been used for topical thesauri. Perhaps it is worth considering using this approach here, too.

Features

Deryck Chan (talkcontribs)

I see "features" as separate from "statements" in the proposed data model. Will they be modelled as property-value pairs or a different data structure?

Over at d:Wikidata:Property proposal/Lexemes, a number of properties are being proposed, like "person", "gender", "number", which will fit into the "features" component of the Lexeme data model.

Lea Lacroix (WMDE) (talkcontribs)

Hello Deryck, I hope I understand your question right.

The so-called features are for example: the lemma, the language of the lemma, the language of the lexeme, the lexical category. In the forms, the representation and its language. These pieces of information are not represented by triples; each is going to be a simple field (a bit like the label and description in items). Some of these fields will have autocompletion from Wikidata items.

If you want to look at what it will look like, you can try the demo system (information is not necessarily correctly modeled there, it's mostly a sandbox to try the interface).

Let us know if you have further questions :)

Deryck Chan (talkcontribs)

Yes that makes sense. We aren't separating grammatical features by category (or properties).

Deryck Chan (talkcontribs)

Will the lemma (the Lexeme itself) have a "grammatical features" field? I only see that forms have "grammatical features" but it seems that the Lexeme doesn't. For example, how do we represent the fact that "chien" (fr) is masculine regardless of form?

Lea Lacroix (WMDE) (talkcontribs)

No, the grammatical features are only included in the Forms. If you want to indicate something about the lexeme, you can decide to have a dedicated property and add it in a statement.
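As a rough illustration of this split, the sketch below keeps grammatical features on the forms and records the gender of "chien" as a statement on the lexeme. The class and field names are assumptions made for the example; they are not the actual WikibaseLexeme JSON layout, and plain labels stand in for the Wikidata items and properties that would be used in practice.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    representation: str
    # Grammatical features live on the form only.
    grammatical_features: List[str] = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str
    language: str
    lexical_category: str
    forms: List[Form] = field(default_factory=list)
    # Anything that holds for the whole lexeme goes into statements
    # (hypothetical property label -> list of values).
    statements: Dict[str, List[str]] = field(default_factory=dict)


chien = Lexeme(
    lemma="chien",
    language="French",
    lexical_category="noun",
    forms=[
        Form("chien", ["singular"]),
        Form("chiens", ["plural"]),
    ],
    statements={"grammatical gender": ["masculine"]},
)

print(chien.statements["grammatical gender"])
```

With this layout the gender is stated once for the lexeme rather than repeated on every form.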

An alternative model proposal: logomer

Psychoslave (talkcontribs)

The more I read and think about it, the more I'm inclined to consider that the model is trying to impose too rigid a framework.

What we are interested in documenting in Wiktionaries is chunks of discourse, and what is claimed about those chunks in such and such theories.

A lexeme is an abstract structure which is already far too committed to a closed theory of language; that is, it doesn't provide space for presenting language analyses which don't fit a lexemic structuring.

The mission of Wiktionaries is documenting all languages. Side note: this doesn't state written language, spoken language, or in fact even human languages, so depending on local consensus, you might see bee language documented.

What is aimed here, as far as I understand, is to propose a structured database model to support this aim.

So, the model must allow lexemes to be documented, sure. But that could be done as a lexemic relationship. For example cat and cats, Baum and Bäumen are two pairs in lexemic relationships, which could be recorded as 4 distinct entities.

To really support the goal of Wiktionary, the model must also allow documenting w:lexical item, morphs, w:morphemes, etymons and whatever discourse chunk a contributor might want to document and relate to other discourse chunks. A lexeme class can't do that, or you must come up with such a distant definition of lexeme that it won't match any of the already too many existing ones in the linguistic literature.

I'm not aware of any consensual term for the "discourse chunk" in the sense I'm suggesting here (token doesn't fit either). So, in the rest of this message I'll use logomer (see wikt:en:logo- and wikt:en:-mer).

A discourse is any sign flow[note 1].

A glyph is any non-segmentable[note 2] sign that can be stored/recorded.

A logomer is a data structure which pertains to parts of a sequence of glyphs representing a discourse.

A logomer must have one or more representations.

A representation must have one or more forms.

A single form must be elected as label.

A representation should indicate which representational systems it pertains to.[note 3]

A logomer must be related to one or more meanings.[note 4]

A logomer form must be extractable from a glyph sequence that represents a discourse.[note 5]

The extraction process of a logomer form must keep every unfiltered glyph.

The extraction process must not add any glyph.[note 6]

The extraction process must not alter any glyph.

A logomer form must include one or more glyph sequences (thereafter named "segment").

A segment must provide a glyph string.

A form including more than one segment must provide an ordinal for each segment.

A segment ordinal must indicate the position of a segment relative to the other segments of the form, counted from the beginning of the discourses where it appears.

A segment might be void.

A void segment might serve as boundary marker, indicating possible positions for other segments which are not part of the current logomer.

All logomer forms of a single representation must be congruent under permutation.[note 7]

An indistinguishable logomer form might appear in multiple discourses.[note 8]

Distinct occurrences of the same logomer forms with distinct meanings must induce distinct logomers.

Distinct meanings attributed to the same discourse parts should appear in a single logomer.

A logomer form might be taken as a discourse of its own.

  1. More criteria regarding meaning are purposefully set aside
  2. That is, with regard to the sign system used. For example a code point of a character encoding system could be segmented into several bits, but a bit is not a sign of the encoding system itself, even if a discourse using this system can make references to such a sign.
  3. For example, through statements. Accuracy of this information might be left to the community. It could be things as vague as "casual oral retranscription" and "direct matching of written document", or more precise like "phonemic system of the International Phonetic Alphabet" and "official orthography in the Dutch spelling reform of 1996"
  4. Or definition, or whatever indication of its sense
  5. Discourses that can't be represented as a glyph sequence are not considered
  6. So boundary markers such as the hyphen in morphs, like logo-, aren't part of a logomer
  7. That is, all forms have the exact same set of segments; only the ordinals of these segments can change.
  8. But hapaxes are logomer forms too, though
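
The rule that all forms of a representation must be congruent under permutation (note 7) can be made concrete with a small sketch. The names `LogomerForm` and `congruent` are hypothetical, and representing a void segment as an empty string is an assumption of this example rather than part of the proposal.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LogomerForm:
    # (ordinal, glyph string) pairs; "" stands for a void segment.
    segments: Tuple[Tuple[int, str], ...]

    def linearize(self) -> List[str]:
        """Return the glyph strings ordered by their ordinal."""
        return [text for _, text in sorted(self.segments)]


def congruent(forms: List[LogomerForm]) -> bool:
    """True if every form has the same multiset of segment strings,
    i.e. the forms differ only in the ordinals of their segments."""
    bags = [Counter(text for _, text in form.segments) for form in forms]
    return all(bag == bags[0] for bag in bags)


a = LogomerForm(((0, "je me mis"), (1, "la tête"), (2, "à l’envers")))
b = LogomerForm(((0, "à l’envers"), (1, "je me mis"), (2, "la tête")))
c = LogomerForm(((0, "je me mis"), (1, "la tête")))

print(congruent([a, b]))  # same segments, different order
print(congruent([a, c]))  # c lacks a segment
```

Here `a` and `b` could be two forms of one representation, while `c` would have to belong to a different one.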
Psychoslave (talkcontribs)

Actually, it's not yet a fixed model, clearly. In fact I already slimmed it deeply while creating the following graphical representation:

[Figure: Visualization of an alternative to the Lexeme data model for Wikibase support of Wiktionary]

However it might be too slim. Maybe keeping at least one mandatory field related to meaning (which could take a null value) would be better, whether on the logomer or on the logomer form.

This way it's possible to indicate a difference between "wikt:fr:grand homme" and "homme grand", the former being (in the French variants I'm aware of) always used to indicate a famous person, while the latter indicates that a person is tall.

But I'll already wait for feedback, especially from Noé, Benoît Prieur, Delarouvraie, Lyokoï, Jberkel, psychoslave, Lydia Pintscher, Thiemo Mättig, Daniel Kinzler, Epantaleo, Ariel1024, Otourly, VIGNERON, Shavtay, TaronjaSatsuma, Rodelar, Marcmiquel, Xenophôn, Jitrixis, Xabier Cañas, Nattes à chat, LaMèreVeille, GastelEtzwane, Rich Farmbrough, Ernest-Mtl, tpt, M0tty, Nemo_bis, Pamputt, Thibaut120094, JackPotte, Trizek, Sebleouf, Kimdime, S The Singer, Amqui, LA2, Satdeep Gill, Micru, Vive la Rosière, Malaysiaboy and Stalinjeet

LA2 (talkcontribs)

When you float away in your abstractions, you attract dreamers who like such abstractions, but repulse people who are able to sit down and do real and concrete work. Wikipedia is a success not because it is a perfect and abstract ideal of a theoretical model of knowledge, but because it is a simple tool for processing ASCII text.

Lyokoï (talkcontribs)

Sorry, but I don't understand what you want to do...

Amqui (talkcontribs)

I am pretty confused about the intent here as well...

Tofeiku (talkcontribs)

I was surprised that I'm listed here but I'm pretty confused as well with this.

Rich Farmbrough (talkcontribs)

It is certainly true that we want to document things of unknown or even no meaning "AOI", "Nautron respoc lorni virch" or the archetypal meaningless phrases of philosophers such as "hig mig cig". Even then there is context - there is always context.

Rich Farmbrough 11:45, 31 August 2017 (UTC).

Psychoslave (talkcontribs)

Ok, it seems I need to explain more what I aim to provide here.

In short, a data structure which aims to carry less abstract data while allowing relationships useful for Wiktionaries.

So let's take the English adjective "hard" as a first example, so one might compare with the current model's examples.

Example 1: hard

In this model the string (glyph sequence) "hard" might be recorded as follows:

Logomer: L64723

  • Label: hard (Form 1)
    • (that is, the linearization of the segments, which here is a single element)
  • used in discourse expressed in: English (Q1860)
  • lexical category: adjective (Q34698)
  • Derived from: heard (L112) Old English adjective
  • other statements might also add registers, glosses, definitions, synonyms, antonyms, translations, related concepts and so on

Form 1

segments: hard

    • segments in detail: (0, hard)

  • used within representation system of: written utterances (Q98123723736661235)
  • pronounced as: hɑːd (L387483) (the logomer itself can indicate corresponding representation systems)
    • Qualifiers:
      • Region: Scotland (Q22)
    • References: ...
  • pronounced as: hɑɹd
    • Qualifiers:
      • Region: Scotland (Q22)
    • References: ...
  • pronounced as: hard.ogg
    • Qualifiers:
      • Region: United States of America (Q30)

(Rhymes should be inferred from associated phonetic logomers, which is even more important in cases with regional differences)

Form 2

There is no other indisputable form for hard in this model. But one might suggest that hard- in hardcore is just another form of the present logomer. As said, that's disputable, but for the sake of the example, here is how this second affixal form would be represented with this model (so possibly in a distinct logomer):

segments: "hard", "-"

    • segments in detail: (0, "hard"), (1, AGGLUTINATIVE_MARK)
      • The AGGLUTINATIVE_MARK might be a special value, or a string containing a single soft hyphen for example.


Example 2: je me mis la tête à l’envers

Now, here is a second example which doesn't come from those provided for the Lexeme model, but which might shed light on what I had in mind while trying to outline a design for logomer.

So, in French, "je me mis la tête à l’envers" is an inflected form of the phrase "fr:se mettre la tête à l’envers". In the logomer model, each inflection has its own separate instance. That is, "je me mis la tête à l’envers", "tu te mis la tête à l’envers", and "se mettre la tête à l’envers" are three different logomers. Possibly they could group common statements in another entity, but that's another topic.

Forms in logomers are here only to carry permutations and related statements such as grammatical acceptability in a given frame.

For example "je me mis la tête complètement à l’envers", "je me mis gravement la tête à l’envers" and "à l’envers, je me mis la tête" are all less commonly heard but grammatically acceptable to my French native mind, and clearly are using instances of "je me mis la tête à l’envers".

Thus "je me mis gravement la tête à l’envers" might be described as the following form:

  • segments: "je me mis", " ", "la tête", " ", "à l’envers"
    • segments in detail: (0, "je me mis"), (1, SPECIFIER_MARK), (2, "la tête"), (3, SPECIFIER_MARK), (4, "à l’envers")
      • The SPECIFIER_MARK might be a concept entity such as "adjective", linearized as a simple space or "[…]" for display purposes.

And "à l’envers, je me mis la tête" might be described as the following form:

  • segments: "à l’envers", " ", "je me mis", " ", "la tête"
    • segments in detail: (0, SPECIFIER_MARK), (1, "à l’envers"), (2, "je me mis"), (3, SPECIFIER_MARK), (4, "la tête")

Note that something like "me je tête l’envers mis la à", which certainly wouldn't be recognized as grammatical by a French speaker, doesn't fit any permutation of the segments proposed here, but nothing in the model prevents documenting it in another logomer.
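
One possible reading of the first form above can be sketched as a matcher. It assumes, purely for illustration, that a SPECIFIER_MARK stands for whitespace optionally containing extra words; the names `SPECIFIER_MARK`, `FILLER` and `matches` are hypothetical, and the proposal itself does not fix these semantics.

```python
import re

# Sentinel for a slot where specifiers (e.g. adverbs) may be inserted.
SPECIFIER_MARK = object()

# Assumption of this sketch: a slot is whitespace, optionally
# containing any number of extra words.
FILLER = r"\s+(?:\S+\s+)*"


def matches(segments, utterance):
    """True if `utterance` is the ordered segments, with optional
    extra words at each SPECIFIER_MARK position."""
    pattern = "".join(
        FILLER if seg is SPECIFIER_MARK else re.escape(seg)
        for seg in segments
    )
    return re.fullmatch(pattern, utterance) is not None


form = ["je me mis", SPECIFIER_MARK, "la tête", SPECIFIER_MARK, "à l’envers"]

print(matches(form, "je me mis la tête à l’envers"))
print(matches(form, "je me mis gravement la tête à l’envers"))
print(matches(form, "je me mis la tête complètement à l’envers"))
print(matches(form, "me je tête l’envers mis la à"))
```

Under this reading the first three utterances fit the form and the scrambled one does not; a permuted form such as the one starting with "à l’envers" would need its own segment ordering.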

I hope it helps @LA2:, @Lyokoï:, @Amqui: and @Malaysiaboy: to grasp my approach.

Lyokoï (talkcontribs)

Sorry for the French:

Wait, it's the Wikidata version of Wiktionary that you're trying to build here, isn't it? Look, I have never taken the time to do anything there. I don't want to spend time on it and, in any case, I think it is not the right solution. Please leave me out of this. I'll get involved when I see a benefit in it for the Wiktionnaire.

Lyokoï (talkcontribs)

(I'll add that, on top of that, it's in English, and I only half understand it...)

Denny (talkcontribs)

@Psychoslave, thanks for the effort in trying to create a better model. I want to point out that the current proposal for Wikidata's Lexicographic model is not just thought up by the Wikidata team, but an adaptation of lexicographic data models that have been developed over the last century starting with TC 37, later under ISO as the Lexical Markup Framework, and then captured in RDF under the Lemon model. Wikidata is very much in that tradition, which means it is distilling literally the knowledge of hundreds of linguists over a century of work.

Just to raise three points with your model:

1) whereas you claim that it also extends to Bee language, I am wondering whether this is actually a requirement. Wikidata's first (although not only) priority is to support its sister Wikimedia projects. Is there any Wiktionary that actually captures bee language? If we move too far away from our requirements we might create a solution that is more complex than necessary.

2) whereas you claim that the Bee language is a requirement, your model later is restricted to languages represented with glyphs. This seems contradictory to me? Did I miss something?

3) in your example for hard, you state that meanings and antonyms could be expressed via statements on the level of the Logomer. But antonyms are not pertaining to a specific Logomer, if I understand Logomers and antonyms correctly, but usually to a specific sense of the Logomer, i.e. to a specific definition. But I don't seem to be able to express the antonym relation on the definition. Maybe I am just missing something.

Again, thank you for your proposal. It is unfortunate that it comes so late - the discussions about the data model were held years ago, fully in the open, and with wide invitations. It is not easy to fully appreciate such a contribution just a few months before the planned roll out of the project.

Psychoslave (talkcontribs)

0) I read the whole talk page of Wikidata4Wiktionary, so I was aware of the important analysis work you have done and used. I haven't yet read all the documentation about Lemon though. Anyway, my concern is not about the Lemon model, or the currently proposed Lexeme model, as a useful tool in many contexts, but about the very precise context of Wikidata4Wiktionary. If tradition seems a good fit for grounding our goals, great, let's leverage it in our work. Otherwise, let's set it aside, rather than sink under the weight of its unsuitable hypotheses.

1) If that's the case, I'm not aware of it. The bee language was of course an extreme example. I'm all for a simpler model: one which distances itself as much as possible from any linguistic theory while keeping the ability to express such theories through its internal mechanisms. My current proposal still seems far too complicated and confusing for other contributors, so to my mind, it is not good enough either. Sticking to our requirements is great, but what are our requirements? I haven't seen a document clearly laying out these requirements and how they were produced, so if such a document does exist, please let me know. To my mind, requirement 0 is a class designed to store strings, going from d to wikt:bayerischer gebirgsschweisshund, but also including affixes such as wikt:-in-, morphs, and any sequence of characters one might encounter in the world. I tried to go further with the "ordered segments" of utterance, but that's maybe already too complex a model for our goals. Then requirement 1 is to be able to document those strings, so those who encounter them can type them into a Wiktionary and discover where each is suspected to come from, and whether it might mean something, its contrary or nothing at all depending on context. Yes, even strings with no actual (publicly known) meaning are worth documenting, so people who encounter them can grasp the knowledge of this substantiated absence of sense. And finally, requirement 2 is to be able to glue all these pieces together through relationships, preferably in a way that allows as many automated inferences as possible. Those are the basic requirements I would expect from a Wikidata4Wiktionary.

2) I think it more probable that I didn't explain my idea clearly enough than that you missed something I said distinctly. So my idea is that the data model is not about an utterance performance, but about a recordable representation of such a performance. The representation only refers to the performance. Maybe a theater analogy would be more telling here: there is a written script before the show and you might have a recorded video, but the performance itself is gone. So, do I think that a glyph sequence can be used to represent a bee utterance? Yes, definitely, just as w:fr:DanceWriting can be used to represent dance. I used glyph rather than character, because – at least to my mind – glyph represents a larger set. But if "character strings" is clearer, let's use that.

3) I think you have a very good point, but I'm afraid that as I'm writing this I'm far too tired to provide a relevant answer right now. So I'll delay until I've had some rest, sorry.

4) Well, I'm sorry, I do agree I'm late. I did attempt to participate in the past, but never found the occasion to give feedback earlier. In the meantime I expanded my knowledge of linguistics and practiced in various other ways as a Wikimedian…

Denny (talkcontribs)

0) The use case and requirements are to support the Wiktionaries. So the coverage is kinda given by "whatever the Wiktionaries have now", and the model has to be good enough to cover that. Going beyond that is nice and well, but only if it doesn't get more complicated. As simple as possible, as complex as required to serve the Wiktionaries - that's the primary requirement. If at the same time we can follow best practices from research - just as we did for Wikidata and the research in Knowledge Representation - all the better - that would be the secondary requirement. So if there is a widely agreed on data model from linguistic research which at the same time fulfills the needs of the Wiktionaries, then I am super happy to just adopt it instead of inventing something new. Because in this case the likelihood of third parties donating data or working with the data grows by a huge amount, since we are not inventing new stuff but building on existing stuff that is already established. This is why I think an alternative model doesn't only have to be strictly better, but strictly better by a sufficiently wide margin to justify jeopardizing external adoption. I hope that makes sense.

Basically, I would ask anyone who brings up an alternative model to show what exact use case in Wiktionary would not be served by the current proposed model and how their model serves it - and at the same time ensuring that all other use cases are still covered.

3) I'd be curious to hear, as I think that is one of the main use cases the data model has to fulfill.

(I'm skipping 1), 2) and 4), as I think they are not so central and won't contribute too much to a result. Let me know if you disagree)

Psychoslave (talkcontribs)

I'm ok with skipping 1), 2) and 4).

Regarding 3), I think that you are simply right about the flaw of the Logomer model.

I'm still wondering what the Lexeme class of the current model is supposed to encompass. Should it store affixes, stems, etymons, clitics, morphemes (and possibly monemes), glossemes, compounds, lexical items, phrases, and other lexical units which don't even have an English equivalent such as wikt:fr:lexie? If so, I wonder if the term lexeme is still appropriate.

Psychoslave (talkcontribs)

Concerning requirements and examples of data that the model should be able to encompass, I will write a dedicated page. Maybe this week, but I'll have to allocate more time to local chapter concerns, so I can't promise any progress on this side in the coming days.

Denny (talkcontribs)

I don't care that much about what the structures are named in the data model, and I wouldn't put too much weight on a definition of Lexeme - just as we never clearly defined what an Item is in Wikidata. In the end, everything that has a Wiktionary page will lead to one or more Lexemes, just as everything with a page in the other Wikimedia projects leads to Items. The important thing is whether the structure works as a data structure - not what a Lexeme is. The word 'Lexeme' is merely a rough approximation, to convey a rough idea. 'Word' would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases.

Psychoslave (talkcontribs)

"I don't care that much about what the structures are named in the data model"

Well, it's very sad to hear you are careless about terminology, especially within a project that is aimed at helping lexicographers. If the model keeps this data structure, then it definitely should use "word" instead of "lexeme".

"just as we never clearly defined what an Item is in Wikidata"

Isn't the Wikidata glossary entry about item a clear definition? Maybe it was done as an afterthought, but it's there as far as I can see.

"The important thing is, whether the structure works as a data structure - not what a Lexeme is."

The important thing is whether the structure is helpful for Wiktionary contributors, and using clearly defined classes is a requirement of such a goal. Otherwise this model could just as well use "class1" instead of "lexeme", "class2" instead of "form", "attribute1" instead of lemma, and so on. As a data structure per se it would work just as well.

"'Word' would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases."

Linguists use "lexeme" precisely to avoid the vagueness of "word" (although depending on the linguistic school it will carry different specified meanings). Using "lexeme" this way is counter-intuitive, or at least in complete opposition to the intent of the term. It favors the false impression that the model intends to treat the topic with carefully chosen terminology, when in fact the term was carelessly and arbitrarily chosen through unspecified criteria. Also, it seems very inconsistent that on the one hand you say that the model should be built on the solid body of linguistic knowledge founded over the last century, and on the other hand that you just don't care about using appropriate terminology regarding this same body of knowledge.

Denny (talkcontribs)

These are good points, and in hindsight, what I wrote sounds more dismissive than I meant it to be. Yes, Item has a definition that you point to - but if you really look at Wikidata you will find that this definition is not true. There are plenty of Items which are far from fulfilling that definition. And yet it is, I think, a good thing to have such a definition, as it helps with understanding the model. It's a Wittgenstein's ladder.

The same I would hold for the terminology here. In fact, I do think that the model should work as well if we would use attribute1 instead of lemma. But the latter is helpful in discussing the UI, the code, the model. Not because it is true.

The data model must fulfill the use cases, and if it is able to model solid linguistic theories, all the better. But the exact terminology should be treated as a Wittgenstein's ladder - useful to gain an intuition, and for having evocative labels, but it should not (and won't) restrain the community in getting their work done. If something - like the letter 'd' - is not regarded as a Lexeme in mainstream linguistic theories, that should not (and won't) stop the community from adding the letter 'd' as a Lexeme - just as the ontological status of many, say, Wikinews articles or Wikisource pages did not stop the community from creating items for them. And that's OK.

In the end, the exact labeling of the elements of the data structure won't be as important as what the community actually does with them and how they use it inside the Wiktionaries. In fact, a lot of the terminology is even hidden from most casual contributors - they might never see the term 'Lexeme' in the first place. Just as the word 'item' is not prominent in the Wikidata UI. But it is still useful to have a shared vocabulary for development and talking about the system.

I hope that makes sense and I am not contradicting myself too much.

Yet another model, with vocable as central class

Psychoslave (talkcontribs)

So, the logomer proposal being inadequate, and being still concerned about the lexeme-class-centric model, here is another model. This time I didn't come up with any original fancy neologism, and used terms with existing ISO definitions when I found one. In all cases I gave online sources, but I also used some books to guide my reflection, especially Le dictionnaire de linguistique et des sciences du langage, Larousse, 2012 (ISBN 9782035888457, OCLC 835329846).


Here are the pertaining definitions for this new proposed model.

A unit of thought.
French: notion (Toute unité de pensée. Le contenu sémantique d'une notion peut être ré-exprimé en combinant d'autres notions qui peuvent être différentes d'une langue à l'autre.)
word or standalone expression for an entity that has linguistic, semantic and grammatical integrity
French: terme (mot ou expression isolée pour une entité qui a une intégrité linguistique, sémantique et grammaticale)
one of the meanings of a word
entity extraction
process that seeks to locate, classify, and tag atomic elements in text into predefined categories
word form (vocable)
contiguous or non-contiguous entity from a speech or text sequence identified as an autonomous lexical item
modification or marking of a lexeme that reflects its morpho-syntactic properties
inflected form (form)
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as grammatical number and case.
lexeme that has, as a minimal property, a part of speech
abstract unit generally associated with a set of forms sharing a common meaning
process of making a linguistic unit function as a word
Note 1 to entry: Such a linguistic unit can be a single morph, e.g. “laugh,” a sequence of morphs, e.g. “apple pie” or even a phrase, such as “kick the bucket”, that forms an idiomatic phrase.
physical embodiment of a given concept
Inflectional paradigm
A class of words with similar inflection rules.
representation of the graphic characters of a source script by the graphic characters of a target script
representation of the sounds of a source language by graphic characters associated with a target language
primary data (vocable)
electronic representation of language data
linguistic information added to primary data
format in which the annotation is rendered, independent of its content
act of observing a property, with the goal of producing an estimate of the value of the property
act of measuring or otherwise determining the value of a property
method of data collection in which the situation of interest is watched and the relevant facts, actions and behaviours are recorded
statement of fact made during an audit or review and substantiated by objective evidence
instance of applying a measurement procedure to produce a value for a base measure

The model

I used the PlantUML class diagram generator to produce this picture, so it may not look as nice, but it's far more flexible. Actually, if someone dares to install the corresponding MediaWiki extension, I could just copy/paste the text format here.

Note that vocable here is used as a mix of the two definitions above whose terms have "(vocable)" appended in brackets.

A vocable-centered data model
Reply to "Yet an other model, with ''vocable'' as central class"

Relationships between representations

TJones (WMF) (talkcontribs)

It's not clear how relationships between different representations of different forms of a word will be represented. For example, "color" and "colour" are two representations of color. Similarly, "colors" and "colours" are two representations of colors, which is the plural of color. However, "colours" is not the plural of "color", and "colors" is not the plural of "colours".

Similarly, in Serbo-Croatian, Latin "pȁs" and Cyrillic "пас" (meaning "dog") are two representations of one lexeme, and "psȉ" and "пси̏" are two representations of another, related lexeme (the plural). How can the more specific relationship between pȁs/psȉ and пас/пси̏ be represented?

On a related note, would there be any explicit representation of the fact that "color" is an AmE variant and "colour" is a BrE/CanE variant, while "pȁs" is the Latin variant and "пас" the Cyrillic variant?
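One way to picture the question, purely as an illustrative sketch and not the proposal's actual design: if each form carried its representations keyed by a variant code, then "colours" could be found as the plural counterpart of "colour" by matching on that code. The `Form` class, the `counterpart` helper, and the variant codes below are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    # Hypothetical structure: one form, several representations,
    # keyed by a variant code (the codes used here are illustrative).
    feature: str
    representations: dict = field(default_factory=dict)


singular = Form("singular", {"en-US": "color", "en-GB": "colour"})
plural = Form("plural", {"en-US": "colors", "en-GB": "colours"})

pas = Form("singular", {"sh-Latn": "pȁs", "sh-Cyrl": "пас"})
psi = Form("plural", {"sh-Latn": "psȉ", "sh-Cyrl": "пси̏"})


def counterpart(source: Form, target: Form, text: str) -> str:
    """Return the representation of `target` in the same variant as `text`."""
    for variant, representation in source.representations.items():
        if representation == text:
            return target.representations[variant]
    raise KeyError(text)


print(counterpart(singular, plural, "colour"))
```

Under these assumptions the variant code answers both questions at once: it pairs "colour" with "colours" rather than "colors", and it records which variant (AmE/BrE, Latin/Cyrillic) each spelling belongs to.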

Psychoslave (talkcontribs)

As I understand it, forms are directly embedded properties (I'm not sure "property" matches the Wikidata terminology here, though), not independent items that the proposed lexeme structure can link to.

However, as I understand the glossary definition of property, "Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value", it should be possible to express plural forms and other relations between forms as statements on the lexeme item, shouldn't it?

TJones (WMF) (talkcontribs)

Sounds plausible. As long as there is some way to indicate the relationship. This is going to be such an awesome resource for computational language nerds.

Denny (talkcontribs)

Forms will have identity and can be referred and linked to directly. They are not independent of the Lexemes (they always belong to one and only one Lexeme and depend on the existence of that Lexeme), but they still get an identity and can be directly linked to, which is useful for such properties as "rhymes with" or "anagram of".
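A minimal sketch of what "forms have identity but depend on their Lexeme" could look like. The ID scheme (`L1-F1`), the class, and the helper names are assumptions for illustration only, not a statement of how the extension implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    forms: dict = field(default_factory=dict)  # form ID -> representation
    next_form_number: int = 1

    def add_form(self, representation: str) -> str:
        # The form ID embeds the lexeme ID, so a form cannot exist
        # without (or belong to more than) one lexeme.
        form_id = f"{self.lexeme_id}-F{self.next_form_number}"
        self.next_form_number += 1
        self.forms[form_id] = representation
        return form_id


def owning_lexeme(form_id: str) -> str:
    """Recover the lexeme ID from a form ID of the assumed shape 'L1-F1'."""
    return form_id.split("-F")[0]


cat = Lexeme("L1", "cat")
hat = Lexeme("L2", "hat")
cat_form = cat.add_form("cat")
hat_form = hat.add_form("hat")

# A statement can then point from one form directly to another form.
statements = [(cat_form, "rhymes with", hat_form)]
print(statements)
```

Because the target of "rhymes with" is a form ID rather than a lexeme ID, the link stays precise even when other forms of the same lexeme do not rhyme.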

Psychoslave (talkcontribs)
Denny (talkcontribs)

Possible? Technically: yes. Practically: probably not. The data model is not the main problem in this case. The community would need to agree on a language to use for the Voynich manuscript, approve that language code for inclusion, then we would need to add items for "unknown grammatical function", enter every occurrence of a token as a form with unknown grammatical markers, and connect them to unlabeled lexemes. Technically there is no problem (the data model is certainly flexible enough to accommodate the use case); practically, I don't see that happening in Wikidata itself.

But I could totally see it happening in an external instance, and in fact it would be a great use case, as the statements model is very flexible and would make it possible to add competing theories and references, to point to the occurrences, etc. This all could be a rather nice collaborative tool for people trying to decipher the Voynich manuscript and to collect all that is currently known and theorized.

But unless the Wiktionaries actually already try to cover the language of the Voynich manuscript, I do have to wonder whether this is actually a requirement for the data model, or whether this is merely a theoretical question.

Reply to "Relationships between representations"
LA2 (talkcontribs)

In the current text, the "English noun bank" is used as an example. The text reads: "(e.g. "financial institution" and "edge of a body of water" for the English noun bank)". But if you look in en.wiktionary for "bank", the entry is structured as 4 different etymologies, many of which take both the form of a noun and a verb. The first etymology derives from Italian banca, meaning bench, and refers to a financial institution, where there is a noun (a bank) and a verb (to put money in the bank), the second etymology refers to physical geography such as a beach, where there is again a noun and a verb. But here, in the WikibaseLexeme data model, nothing is mentioned about etymologies, only the triplet language, lemma, and part of speech. Why? Is this something that was forgotten by mistake, or is it a deliberate design? In other languages than English, the same lemma and part of speech might have different inflections for different etymologies. In Swedish, the plural is banker (financial institutions) and bankar (beaches), respectively.

Jpgibert (talkcontribs)


Furthermore, in French (though this is true of all languages), there are many words whose etymology is uncertain. For example, the word Macabre has three hypotheses about its etymology, the word Galimatias two, etc. I think it is important to take this into account in the model. Moreover, the hypotheses do not all have the same level of credibility. Thus, I think it would be interesting to provide a mechanism for ranking the hypotheses, IMHO.

Daniel Kinzler (WMDE) (talkcontribs)

Etymology is not mentioned in the model, because we expect it to be represented using Wikidata-Style "Statements". Statements give you exactly the power and flexibility you are asking for: you can have multiple competing statements with different sources, you can attach them to Lexemes or to individual Forms or Senses, you can mark them as preferred or deprecated, or qualify them using any property you like, use them to refer to Wikidata Items, etc.
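As a rough sketch of how competing etymology hypotheses might sit side by side as ranked, sourced statements: the `Statement` class, the property name, and the placeholder values below are hypothetical, and the selection rule (prefer "preferred", otherwise fall back to "normal", never return "deprecated") is an assumption modeled on the usual reading of ranks.

```python
from dataclasses import dataclass, field

RANK_ORDER = {"preferred": 0, "normal": 1, "deprecated": 2}


@dataclass
class Statement:
    prop: str
    value: str
    rank: str = "normal"
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)


def best_statements(statements, prop):
    """Return the highest-ranked non-deprecated statements for a property."""
    candidates = [s for s in statements if s.prop == prop and s.rank != "deprecated"]
    if not candidates:
        return []
    top = min(RANK_ORDER[s.rank] for s in candidates)
    return [s for s in candidates if RANK_ORDER[s.rank] == top]


# Placeholder hypotheses; no real etymologies are asserted here.
etymologies = [
    Statement("etymology", "hypothesis A", rank="preferred", references=["source 1"]),
    Statement("etymology", "hypothesis B", references=["source 2"]),
    Statement("etymology", "hypothesis C", rank="deprecated"),
]

print([s.value for s in best_statements(etymologies, "etymology")])
```

All three hypotheses remain stored; the rank only decides which one a consumer sees by default.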

Etymology of course is a complex topic, and I don't expect it to be covered exhaustively using Statements. The etymological information represented on Wikidata will be the machine readable minimum. For a thorough explanation, we'd still need text -- on Wiktionary, I expect.

As to the same lemma having different inflection based on etymology: if the inflection is different, it's not the same Lexeme in the sense of this model. In the proposed model, a Lexeme does not correspond directly to what is now on a Wiktionary page: A Wiktionary page would cover the lemma "bank" in all languages, all word classes, and all morphologies. In Wikidata, there will be one Lexeme for every morphology -- and thus, at least one for each combination of language and word class. But in some cases, there would even be multiple Lexemes for the same language and word class, if they differ in morphology. In German for instance, there would be two distinct Lexemes modeling the nouns "die See" (the sea) and "der See" (the lake), because they differ in morphology, since they have different grammatical genders (to add to the confusion, "die Seen" is the plural form of both "der See" and "die See"...).
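To make the "die See" / "der See" case concrete, here is a small sketch in which grammatical gender stands in for "morphology". The `LexemeKey` tuple and its field names are assumptions for this example; the model itself does not define such a key.

```python
from typing import NamedTuple


class LexemeKey(NamedTuple):
    language: str
    lemma: str
    lexical_category: str
    gender: str  # stands in for "morphology" in this sketch


sea = LexemeKey("de", "See", "noun", "feminine")    # die See
lake = LexemeKey("de", "See", "noun", "masculine")  # der See

# Same language, lemma and word class, yet two separate lexemes,
# each with its own list of forms (which happen to coincide).
lexemes = {
    sea: ["See", "Seen"],
    lake: ["See", "Seen"],
}

print(len(lexemes))
```

The two entries share every surface string but remain distinct because the key differs in gender, which is the point of the paragraph above.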

Tropylium (talkcontribs)

This issue is basically an abridged form of the homonymy versus polysemy problem, for which there is no unambiguous solution always. Wiktionary draws one fairly hard line: different etymologies are taken as proof that e.g. bank is at least three homonymous words (the 'bench' ~ 'row' senses could probably be argued to be polysemic) instead of one or two. Other criteria could be used, such as difference in meaning + inflection. For Wikidata's uses, etymology is probably not the best choice, since Wikidata, IIUC, is not planning on formalizing etymology too much. (Speaking as an etymologist, this is a good idea. Etymologies are theories rather than facts, and any exhaustive formal model of them would have to operate on a probabilistic rather than binary logic.)

Note though that inflection alone does not work as a sufficient distinction between homonyms, given variation such as shit : past tense either shitted or shat. Moreover, note that this is not necessarily looser than the etymological condition either. By this criterion, e.g. grind : grinded 'to gather experience points in a video game' is a different word from grind : ground 'to make into powder', while by the etymological criterion it's a single word with variable inflection.
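The difference between the two criteria can be shown with the grind example. This is only a sketch: the sense records and the etymology label `"etym-1"` are placeholders invented for the illustration, and `split_into_words` is a hypothetical helper.

```python
from collections import defaultdict

SENSES = [
    {"gloss": "to make into powder", "past": "ground", "etymology": "etym-1"},
    {
        "gloss": "to gather experience points in a video game",
        "past": "grinded",
        "etymology": "etym-1",
    },
]


def split_into_words(senses, criterion):
    """Group senses into 'words' according to the chosen criterion."""
    groups = defaultdict(list)
    for sense in senses:
        groups[sense[criterion]].append(sense["gloss"])
    return dict(groups)


by_inflection = split_into_words(SENSES, "past")
by_etymology = split_into_words(SENSES, "etymology")
print(len(by_inflection), len(by_etymology))
```

Grouping by inflection yields two words, grouping by etymology yields one, so neither criterion is simply a stricter version of the other.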

Psychoslave (talkcontribs)

I completely support your indication about the theoretical quality of etymology.

However, I'm rather confident that this kind of information can be structured into a model that doesn't force everything under "one hypothesis set to rule them all".

Looking at the Wikidata glossary, I think that the way claims and statements are defined, with the ability to add qualifiers, ranks and sources, forms a good framework for presenting multiple theories. Maybe the rank labels wouldn't fit well, but with qualifiers you should be able to say whether a theory has active supporters or not, whether it was proven or invalidated by some practical means… Actually, you might think about making statements about a statement and so on. I don't know if that is currently possible within Wikidata; maybe someone like @Lydia Pintscher (WMDE): could confirm or refute that.

Lydia Pintscher (WMDE) (talkcontribs)

Yeah, there is a lot you can do with qualifiers. Have a look at the item for Barack Obama for example and how his country of citizenship is modeled. Or the country statements for Jerusalem. It is not possible to make statements on statements but there usually is a way to do it with qualifiers, ranks and references.

Psychoslave (talkcontribs)

Thank you @Lydia Pintscher (WMDE): for these examples.

More of a side note out of curiosity, but was it purposefully chosen NOT to allow statements on statements, or is it an idea which wasn't raised? In the former case, if you did have some discussion on the topic, I would be interested to read it. Also, while I'm at it: - is it technically possible to make a Wikidata item about a Wikidata statement? - is it allowed/forbidden/undiscussed/other within Wikidata?

As said, that is really curiosity, and I do agree that qualifiers already offer much flexibility.

Lydia Pintscher (WMDE) (talkcontribs)

It was a design choice at the very beginning of the project because it would introduce a huge amount of complexity in the UI/API/data model for what seems like very little gain. I don't think there was ever much discussion about it. Items about statements: In theory possible but... not sure again how useful it is. I guess we'll need to discuss concrete cases and go from there.

Psychoslave (talkcontribs)

I think I should contribute more to Wikidata to get a feel for it, so that I can either give concrete cases that are actually relevant, or realize that I cannot come up with any.

Reply to "Etymologies"
Psychoslave (talkcontribs)

The given definition of lexeme begins with "A Lexeme is a lexical element of a natural language" and the model indeed has a language field.

A more flexible way to represent this information would be to put that in statements like "appears in discourses categorized as being expressed in ${language}". Once again, flexibility of claims could help to render the fact that in such or such discourse, the lexeme is considered endogenous of the surrounding discourse, loan word, an exogenous verbatim quote, and so on.

Thus the "language" might always be inferred, but wouldn't be engraved in the model with a rigidity which doesn't reflect versatility of language practices.

On a side note, the term "natural language" itself carries very biased assumptions about languages; see this Wikiversity research article for more on this topic.

Lydia Pintscher (WMDE) (talkcontribs)

Can you point me to a case that would be problematic? I'll need to understand this better and an example might help.

VIGNERON (talkcontribs)

I removed the « natural » adjective, which seems unnecessary and confusing to me (plus, there are a lot of « un-natural » lexemes, whatever that is supposed to mean, on the Wiktionaries).

Psychoslave (talkcontribs)

Some examples were already given in Wikidata talkpage on Wiktionary by @Tropylium:.

I could add another example: fixed locutions borrowed from Latin, such as in situ, which might have various pronunciations but share the same written form and (at least one) meaning. Funnily, that's also a case where several lexical categories share the exact same meaning in a given language (at least if you trust the English Wiktionary).

Psychoslave (talkcontribs)

Any feedback on these examples @Lydia Pintscher (WMDE):?

(Sorry to ping you that much, but as the interface doesn't provide a way to indicate that a message was well received, I don't have many other ways to know whether it was just missed. That might be a suggestion to make in Phabricator; I like the "Thanks" button, and having a "Mark read" and a "Read later" button next to it would seem interesting, now that I'm thinking about it.)

Lydia Pintscher (WMDE) (talkcontribs)
Reply to "Language"