Extension talk:WikibaseLexeme/Data Model
This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Remove two sentences
Thanks for the draft! It looks great. I would suggest removing, or turning into a side note, the following two sentences in the section on statements of senses:
"However, such a connection should not be interpreted as the lexeme actually representing the concept defined by the item (compare lemon:LexicalSense and lemon:LexicalConcept). In particular, if two lexemes have senses that refer to the same concept in this way, this does not imply that the two lexemes are synonyms."
Not because I think it is wrong, but because I think that the semantics of properties, and thus of statements instantiating these properties, is up to the community; i.e. if the community wants to state that a certain property is meant to express synonymy, it is OK for them to state so. The data model should be agnostic to such decisions, in my opinion. Denny (talk) 16:36, 15 March 2017 (UTC)
- Thanks for the feedback, Denny! You are right, it is basically a usage note, and not normative. I added it to address a common source of confusion and misunderstandings.
- It would be better suited for a page dedicated to statement level modeling of lexemes. But there is no such page, and we are reluctant to create one, exactly because such modeling should be left to the community.
- I'm a bit torn here - on the one hand, I think we should share our ideas and concerns about that level of modeling; on the other hand, we should be careful not to impose our ideas on the community. Daniel Kinzler (WMDE) (talk) 17:09, 15 March 2017 (UTC)
- I agree. But the Data Model page should be as normative and crisp as possible, imho. One can still share their opinions and ideas in many other venues, but one should be very careful about using a privileged position to put such ideas into normative documents. Just my 2c. Denny (talk) 17:12, 15 March 2017 (UTC)
Grammatical features
I was wondering whether it is required that each Form has at least one Grammatical Feature. I'd tend to say so, as it is morphology that actually founds the form, kind of. Should this be required, do we want to mention it in the data model description?
Related to that: Should there always be just a single Form of the Lexeme with a given set of Grammatical Features? Again, I'd tend to think so (I am not entirely sure, though). Should the data model description mention this kind of "restriction"?
And regarding the text in this section: I am not very convinced by the sentence "Multiple grammatical features can be combined to express that a form shall be used when all these features apply.", namely the "a form shall be used" part of it. I'd expect a model of language data to be descriptive, not normative. How about simply "A Form can express multiple grammatical features" (meh, I am not really happy about this phrase either)? I'm not sure though if it is just a personal preference. Leszek Manicki (WMDE) (talk) 09:47, 16 March 2017 (UTC)
- Just realized that saying each Form has to have at least one Grammatical Feature was quite a statement, but not really thought through :)
- While I believe this would be valid for some lexical categories (word classes) in some languages (e.g. nouns or verbs in some languages I know), it does not necessarily hold for forms of lexemes of other lexical categories in other languages. And what's more obvious when thinking of "prepositions" or "conjunctions" in some European languages, say in English: those do not really carry grammatical markers, do they? Or at least it seems pretty possible to me to model those without them having any grammatical feature.
- So I think the data model should not require each Form to have at least one Grammatical Feature. Leszek Manicki (WMDE) (talk) 08:32, 22 March 2017 (UTC)
- Good point regarding (some) function words (in some languages) not really having grammatical features. So perhaps this should not be a hard requirement after all... Similarly, I do not think that the set of features is always unique. We can't require that.
- Let's take "a" as an example:
- Lemma: "a"
- Language: -> English
- Category: -> article
- Statements:
- Syntact function: -> indefinite article
- Forms:
- "a"
- features: n/a
- "an"
- features: n/a
- "a"
- Senses: n/a
- There are no senses, and two forms which both have no grammatical features...
- Regarding the "shall be used": I was trying to say here that the language's grammar requires the use of the given form in the context described by the features. But I'll try to re-phrase. Daniel Kinzler (WMDE) (talk) 13:35, 23 March 2017 (UTC)
- Fixed, I hope: https://www.mediawiki.org/w/index.php?diff=2429324&oldid=2429252&title=Extension%3AWikibaseLexeme%2FData_Model&type=revision Daniel Kinzler (WMDE) (talk) 13:46, 23 March 2017 (UTC)
- Thanks Daniel. It now makes your point clear.
- Still, one could argue that grammar just describes the language, it does not require the use of A or B (descriptive grammar vs. normative grammar). My linguist alter ego is a bit more in favour of this stance, but I am fine with having it like you said it. It is clear and correct. Let's see what readers say. Leszek Manicki (WMDE) (talk) 08:00, 28 March 2017 (UTC)
- Per the "a" example. Yeah, it's seems good to not require some grammatical feature to be present. The way "a" is modelled seems fine. There could be alternative modelling where "a"/"an" would have a gram. feature related to it signifying the "indefinitiness" of the related word (as opposed to "the" signifying "definedness").
- But: data model should allow both ways of modelling. I am pretty sure the "a" problem is still fairly compared to what can be found in other languages. So I am all for making DM flexible and not imposing. Thanks for discussing it a bit further. I am pretty convinced now. Leszek Manicki (WMDE) (talk) 08:04, 28 March 2017 (UTC)
- I agree that there should be no such field in the lexeme structure, all the more as these could be expressed as statements (more specifically as "claims", as far as I understand), which offer more flexibility, like allowing one to specify in which theory the grammatical feature is proposed. Psychoslave (talk) 04:19, 26 August 2017 (UTC)
Paradigms
The current description seems to suggest that the forms of a word should be enumerated separately for each word. This is not how it has been done in Wiktionary until now. In Wiktionary, each word calls a template that creates the forms and many words use the same template. The German words Band and Land use the same template because their declension follows the same pattern (called a 'paradigm' among linguists), but Hand and Wand use another template since their declension follows another pattern. If all the forms should be enumerated for each word, it is likely that one form will be wrong by mistake. This risk is minimized if a limited set of standard patterns or paradigms (templates) are used as an intermediary. LA2 (talk) 01:41, 25 March 2017 (UTC)
- The Data Model as proposed does not close off the possibility of "generating" Forms based on an inflection (etc.) paradigm. Actually, this is the way I'd expect them to be added in many cases.
- If the concern here was that some particular form could be e.g. changed to be wrong (i.e. not match the form dictated by the paradigm), I strongly believe the Community and developers can come up with ways of ensuring such cases are identified and fixed (could be a bot, a gadget, probably multiple other ways I cannot think of just now). I don't think it is the concern of the data model, though. Leszek Manicki (WMDE) (talk) 08:11, 28 March 2017 (UTC)
- Saying that it is not a problem and that bots could detect and fix any inconsistencies is similar to saying that interwiki/interlanguage links between Wikipedia articles can be detected and fixed by bots. That is indeed how it used to work before Wikidata was created, precisely because the bots did not succeed in fixing all inconsistencies. Wikidata exists because this is a problem with the data model. LA2 (talk) 16:24, 28 March 2017 (UTC)
- It would be great to be able to represent and capture paradigms. But I think that this is a bit more complex and should be left for later. I indeed would think that there will be a future development stage, where a way to type a Lexeme with a certain paradigm will be possible, and then the system will execute some (Lua?) code and create the forms automatically.
- Whereas I agree that this would be great to have from the beginning, I think it would make the initial system too complex to start with. Quite consciously, the first version of Wiktionary support is very dumb and simple, in order to figure out and fix the possible errors that happen at this stage already. Once this is settled, we will have reached a point where it makes sense to plan, design and implement paradigm support.
- So, yes, I think you are right that it is important and should be done as soon as possible, but I am afraid we are not smart enough to figure out how to do this right from the get-go, and it has traditionally been good practice for Wikimedia software projects to do things incrementally.
- Note that there is nothing telling the Wiktionaries themselves to stop using their existing solutions for paradigms. In fact, I do not expect those to become obsolete until Wikidata implements native support for paradigms.
- I know that Daniel Kinzler has been thinking along these lines too for quite a while. Denny (talk) 16:58, 6 April 2017 (UTC)
- I have been working for some time on the generation of verb forms for Lithuanian, which is a highly inflected language, and I am going to publish the data as LOD soon. You can find an example here (the RDF is not complete yet and some parts are still experimental). Is that what you are after when you talk about "generating from the paradigm"? Nvitucci (talk) 23:30, 6 April 2017 (UTC)
- Even if we have automatic generation, we will need the ability to explicitly model forms, for odd cases, and for cases where we want to make statements about these forms.
- I hope that "soon" after Lexemes go live, there will be a way to write Lua code that would simply take the entire Lexeme object as input, and generates Form objects as output, which are then shown on the Lexeme page. And when you edit such an "automatic" form, it becomes "real". Duesentrieb ⇌ 21:04, 11 May 2017 (UTC)
- What about having some item for each "paradigm", and a claim that lexeme forms are related by this paradigm? Thus you can both have explicit forms produced by whatever means, and, if you want to (re)generate them using a paradigm on the lemma, you also have all the data required.
- Now, surely what such a paradigm item should contain is another point. Should it contain code implementations, for example? What about the nomenclature? Psychoslave (talk) 04:34, 26 August 2017 (UTC)
- I like the idea of making an explicit connection between a word and its paradigm, independently of whether it is used to create the Forms or not. Such explicit information will be useful for other reasons too. So, yes, fully agreed - in my opinion, there should be a property connecting Lexemes with Paradigm items. Denny (talk) 15:52, 28 August 2017 (UTC)
When multiple words form one word
In some Indian languages, like Gujarati, several dictionaries use a model of prefix + suffix, OR two words, OR more words to define words or suggest their origin. This applies especially to words originating from Sanskrit. For example, અંત્યોદય is formed from two words, અંત્ય and ઉદય. It is represented as અંત્ય+ઉદય, followed by the meaning of the word they form.
In English, it is something like the word netizen = net + citizen: a citizen who uses the net, a user of the internet.[1] Another example: presummer = pre + summer: before summer.[2]
So how will the current data model cover this? Regards, Nizil Shah (talk) 18:44, 9 April 2017 (UTC)
- I expect this would be modeled using Statements, via a Property like "composed of", or the more general "derived from". So, for instance, "classroom" would have two statements like this:
- composed of: "class" (English noun, L786347)
- compound ordinal: 1
- composed of: "room" (English noun, L255348)
- compound ordinal: 2
- Duesentrieb ⇌ 20:58, 11 May 2017 (UTC)
- Wow. Thank you for clarifying. It would certainly work. Nizil Shah (talk) 04:47, 12 May 2017 (UTC)
Etymologies
In the current text, the "English noun bank" is used as an example. The text reads: "(e.g. "financial institution" and "edge of a body of water" for the English noun bank)". But if you look in en.wiktionary for "bank", the entry is structured as 4 different etymologies, many of which take both the form of a noun and a verb. The first etymology derives from Italian banca, meaning bench, and refers to a financial institution, where there is a noun (a bank) and a verb (to put money in the bank); the second etymology refers to physical geography such as a beach, where there is again a noun and a verb. But here, in the WikibaseLexeme data model, nothing is mentioned about etymologies, only the triplet language, lemma, and part of speech. Why? Is this something that was forgotten by mistake, or is it a deliberate design? In languages other than English, the same lemma and part of speech might have different inflections for different etymologies. In Swedish, the plural is banker (financial institutions) and bankar (beaches), respectively. LA2 (talk) 23:21, 9 May 2017 (UTC)
- Hi,
- Furthermore, in French (but this is true in all languages) there are a lot of words for which the etymology is not certain. For example, the word macabre has 3 hypotheses about its etymology, the word galimatias 2, etc. It is important to take this into account in the model, I think. Moreover, the hypotheses do not have the same level of credibility. Thus, I think it would be interesting to provide a mechanism allowing the hypotheses to be ranked, IMHO. Jpgibert (talk) 07:17, 10 May 2017 (UTC)
- Etymology is not mentioned in the model, because we expect it to be represented using Wikidata-Style "Statements". Statements give you exactly the power and flexibility you are asking for: you can have multiple competing statements with different sources, you can attach them to Lexemes or to individual Forms or Senses, you can mark them as preferred or deprecated, or qualify them using any property you like, use them to refer to Wikidata Items, etc.
- Etymology of course is a complex topic, and I don't expect it to be covered exhaustively using Statements. The etymological information represented on Wikidata will be the machine-readable minimum. For a thorough explanation, we'd still need text -- on Wiktionary, I expect.
- As to the same lemma having different inflection based on etymology: if the inflection is different, it's not the same Lexeme in the sense of this model. In the proposed model, a Lexeme does not correspond directly to what is now on a Wiktionary page: a Wiktionary page would cover the lemma "bank" in all languages, all word classes, and all morphologies. In Wikidata, there will be one Lexeme for every morphology -- and thus, at least one for each combination of language and word class. But in some cases, there would even be multiple Lexemes for the same language and word class, if they differ in morphology. In German for instance, there would be two distinct Lexemes modeling the nouns "die See" (the sea) and "der See" (the lake), because they differ in morphology, since they have different grammatical genders (to add to the confusion, "die Seen" is the plural form of both "der See" and "die See"...). Daniel Kinzler (WMDE) (talk) 11:57, 10 May 2017 (UTC)
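- To illustrate that point, here is a sketch of the two "See" entries as plain data. All field names are hypothetical, not the actual WikibaseLexeme serialization; the point is only that the differing genitive forms force two separate Lexemes despite the shared lemma:
```lua
-- Two sketched Lexemes sharing the lemma "See"; field names are hypothetical.
local derSee = {  -- "der See", the lake (masculine)
    lemma = "See", language = "German", lexicalCategory = "noun",
    statements = { grammaticalGender = "masculine" },
    forms = {
        { representation = "See",  grammaticalFeatures = { "nominative", "singular" } },
        { representation = "Sees", grammaticalFeatures = { "genitive", "singular" } },
        { representation = "Seen", grammaticalFeatures = { "nominative", "plural" } },
    },
}
local dieSee = {  -- "die See", the sea (feminine)
    lemma = "See", language = "German", lexicalCategory = "noun",
    statements = { grammaticalGender = "feminine" },
    forms = {
        { representation = "See",  grammaticalFeatures = { "nominative", "singular" } },
        { representation = "See",  grammaticalFeatures = { "genitive", "singular" } },
        { representation = "Seen", grammaticalFeatures = { "nominative", "plural" } },
    },
}
-- Same lemma, language and category, but different morphology: two Lexemes.
assert(derSee.lemma == dieSee.lemma and derSee ~= dieSee)
```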
- This issue is basically an abridged form of the homonymy versus polysemy problem, for which there is not always an unambiguous solution. Wiktionary draws one fairly hard line: different etymologies are taken as proof that e.g. bank is at least three homonymous words (the 'bench' ~ 'row' senses could probably be argued to be polysemic) instead of one or two. Other criteria could be used, such as difference in meaning + inflection. For Wikidata's uses, etymology is probably not the best choice, since Wikidata, IIUC, is not planning on formalizing etymology too much. (Speaking as an etymologist, this is a good idea. Etymologies are theories rather than facts, and any exhaustive formal model of them would have to operate on a probabilistic rather than binary logic.)
- Note though that inflection alone does not work as a sufficient distinction between homonyms, given variation such as shit : past tense either shitted or shat. Moreover, note that this is not necessarily looser than the etymological condition either. By this criterion, e.g. grind : grinded 'to gather experience points in a video game' is a different word from grind : ground 'to make into powder', while by the etymological criterion it's a single word with variable inflection. Tropylium (talk) 20:20, 28 August 2017 (UTC)
- I completely support your point about the theoretical nature of etymology.
- However, I'm rather confident that this kind of data can be structured into a model which doesn't constrain us to subsume everything under "one hypothesis set to rule them all".
- Looking at the Wikidata glossary, I think that the way claims and statements are defined, with the ability to add qualifiers, ranks and sources, forms a good framework for presenting multiple theories. Maybe rank labels wouldn't fit well, but with qualifiers you should be able to say that a theory has active supporters or not, whether it was proven or invalidated by some practical means… Actually, you might think about making statements about a statement and so on; I don't know if it is currently possible within Wikidata, maybe someone like @Lydia Pintscher (WMDE): could confirm or deny that. Psychoslave (talk) 07:56, 29 August 2017 (UTC)
- Yeah, there is a lot you can do with qualifiers. Have a look at the item for Barack Obama, for example, and how his country of citizenship is modeled. Or the country statements for Jerusalem. It is not possible to make statements on statements, but there usually is a way to do it with qualifiers, ranks and references. Lydia Pintscher (WMDE) (talk) 09:03, 29 August 2017 (UTC)
- Thank you @Lydia Pintscher (WMDE): for these examples.
- More on a side note, out of curiosity: was it purposefully chosen NOT to allow statements on statements, or is it an idea which wasn't raised? In the former case, if you did have some discussion on the topic, I would be interested to read it. Also, while I'm at it:
- - is it technically possible to make a Wikidata item about a Wikidata statement?
- - is it allowed/forbidden/undiscussed/other within Wikidata?
- As I said, that is really just curiosity, and I do agree that qualifiers already offer much flexibility. Psychoslave (talk) 13:05, 29 August 2017 (UTC)
- It was a design choice at the very beginning of the project because it would introduce a huge amount of complexity in the UI/API/data model for what seems like very little gain. I don't think there was ever much discussion about it.
- Items about statements: In theory possible but... not sure again how useful it is. I guess we'll need to discuss concrete cases and go from there. Lydia Pintscher (WMDE) (talk) 14:20, 29 August 2017 (UTC)
- I think I should contribute more to Wikidata to get a feeling for it, in order to give concrete cases that might be relevant, or to realize I cannot come up with any relevant case. Psychoslave (talk) 15:08, 29 August 2017 (UTC)
Link to labels
Every label should have its equivalent here. How is this accomplished? GerardM (talk) 19:20, 9 June 2017 (UTC)
- By people. Daniel Kinzler (WMDE) (talk) 09:07, 28 June 2017 (UTC)
Relationships between representations
It's not clear how relationships between different representations of different forms of a word will be represented. For example, "color" and "colour" are two representations of color. Similarly, "colors" and "colours" are two representations of colors, which is the plural of color. However, "colours" is not the plural of "color", and "colors" is not the plural of "colours".
Similarly, in Serbo-Croatian, Latin "pȁs" and Cyrillic "пас" (meaning "dog") are two representations of one lexeme, and "psȉ" and "пси̏" are two representations of another, related lexeme (the plural). How can the more specific relationship between pȁs/psȉ and пас/пси̏ be represented?
On a related note, would there be any explicit representation of the fact that "color" is an AmE variant and "colour" is a BrE/CanE variant, while "pȁs" is the Latin variant and "пас" the Cyrillic variant? TJones (WMF) (talk) 20:29, 9 August 2017 (UTC)
- As I understand it, forms are directly embedded properties (I'm not sure "property" matches the Wikidata terminology here, though), not independent items to which the proposed lexeme structure can link.
- However, as I understand the glossary definition of property, "Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value", it should be possible to express plural forms and other relations between forms as statements on the lexeme item, shouldn't it? Psychoslave (talk) 04:06, 26 August 2017 (UTC)
- Sounds plausible. As long as there is some way to indicate the relationship. This is going to be such an awesome resource for computational language nerds. TJones (WMF) (talk) 13:41, 28 August 2017 (UTC)
- Forms will have identity and can be referred to and linked to directly. They are not independent of the Lexemes (they always belong to one and only one Lexeme and depend on the existence of that Lexeme), but they still get an identity and can be directly linked to, which is useful for such properties as "rhymes with" or "anagram of". Denny (talk) 22:11, 1 September 2017 (UTC)
- Would it be possible to add text chunks of the Voynich manuscript with this Lexeme model? Psychoslave (talk) 23:59, 1 September 2017 (UTC)
- Possible? Technically: yes. Practically: probably not. The data model is not the main problem in this case. The community would need to agree on a language to use for the Voynich manuscript, approve that language code for inclusion, then we would need to add items for "unknown grammatical function", enter every occurrence of a token as a form with unknown grammatical markers, and connect them to unlabeled lexemes. Technically there is no problem - the data model is certainly flexible enough to accommodate the use case - but practically I don't see that happening in Wikidata itself.
- But I could totally see it happening in an external instance, and in fact it would be a great use case, as the statements model is very flexible and would allow adding competing theories and references, pointing to the occurrences, etc. This could all be a rather nice collaborative tool for people trying to decipher the Voynich manuscript and to collect all that is currently known and theorized.
- But unless the Wiktionaries actually already try to cover the language of the Voynich manuscript, I do have to wonder whether this is actually a requirement for the data model, or whether this is merely a theoretical question. Denny (talk) 18:56, 2 September 2017 (UTC)
Relations to non-text data
First, should this lexeme structure be text-centric? That's a bias toward written languages which can be assumed and defended, but then it should be made explicit.
So, given this bias, how should we relate a lexeme to miscellaneous media? For example, should "apple" be linked to the Commons category c:Category:Apples (or even c:Category:Apple) through statements? Should the model include a canonical picture of an apple, as there is a canonical textual form (the lemma)?
What about illustration of verbs, for example "to fall"? Psychoslave (talk) 05:03, 26 August 2017 (UTC)
- That is up to the editors to decide. As far as I can see the examples you mention can be solved with statements. Lydia Pintscher (WMDE) (talk) 15:06, 27 August 2017 (UTC)
- I would love to see non-textual lexemes, but I don't see how it would be feasible (that's a complex and complicated question that has been running for years in the Wiktionaries and sadly, AFAIK, no realistic solution has been found).
- For your second question, there is something very strange in your explanation: the L-item "apple" would be about the English lexeme "apple", whereas the category on Commons does not depend on any language.
- It remains to be decided, I'm not sure how, but I think the solution will be something along these lines: all the L-items ("apple", "Apfel", "pomme", "aval", etc.) will be linked somehow to Q89, the Q-item for the concept of apple, which is already linked to Commons pages.
- Finally and obviously: yes, pictures should be included. VIGNERON (talk) 17:50, 27 August 2017 (UTC)
Lemma
What is the point of having a lemma in this model? The lemma can be given as a statement related to a form, and in this case you even have the additional flexibility to specify in which tradition it would preferably be used as a lemma. Actually, it would be even more parsimonious, comprehensive and neutral to put a statement like "verbs use the infinitive form as lemma" (which is not the case for Latin and Greek, but a claim could specify in which context it holds) in a separate item, wouldn't it?
If the lexeme provides all derivable forms, then that is quite enough for querying them using whatever canonical form one is accustomed to. At most, this should be considered in the query interface design, but to my mind it doesn't have any place in the model. But I might be missing something, of course. Psychoslave (talk) 05:20, 26 August 2017 (UTC)
- We need it for example for things like listings and selecting other lexemes in drop-down menus when making statements. For this we need some part of the model to be special and fixed unlike some of the other parts that can be much more flexible. Lydia Pintscher (WMDE) (talk) 15:09, 27 August 2017 (UTC)
- That's an interesting point. It doesn't really solve the neutrality concern, but it does inspire me with other proposals that (hopefully) should address both the neutrality concern and the ability to display relevant information in the drop-down menus. I will create a dedicated thread once I have matured these ideas a bit. Psychoslave (talk) 13:19, 29 August 2017 (UTC)
Language
The given definition of lexeme begins with "A Lexeme is a lexical element of a natural language", and the model indeed has a language field.
A more flexible way to represent this information would be to put it in statements like "appears in discourses categorized as being expressed in ${language}". Once again, the flexibility of claims could help render the fact that in a given discourse the lexeme is considered endogenous to the surrounding discourse, a loan word, an exogenous verbatim quote, and so on.
Thus the "language" might always be inferred, but wouldn't be engraved in the model with a rigidity which doesn't reflect the versatility of language practices.
On a side note, the term "natural language" itself carries very biased assumptions about languages; see this Wikiversity research article for more development on this topic. Psychoslave (talk) 08:37, 26 August 2017 (UTC)
- Can you point me to a case that would be problematic? I'll need to understand this better and an example might help. Lydia Pintscher (WMDE) (talk) 15:05, 27 August 2017 (UTC)
- I removed the « natural » adjective, which seems unnecessary and confusing to me (plus, there are a lot of « un-natural » - whatever that is supposed to mean - lexemes on the Wiktionaries). VIGNERON (talk) 17:32, 27 August 2017 (UTC)
- Some examples were already given on the Wikidata talk page on Wiktionary by @Tropylium: .
- I could add another with examples like fixed locutions borrowed from Latin, such as in situ, which might have miscellaneous pronunciations but share the same written form and (at least one) meaning. Funnily, that's also a case where several lexical categories share the exact same meaning in a given language (at least if you trust the English Wiktionary). Psychoslave (talk) 21:40, 27 August 2017 (UTC)
- Any feedback on these examples @Lydia Pintscher (WMDE): ?
- (Sorry to ping you that much, but as the interface doesn't provide a way to indicate that a message was well received, I don't have many other ways to know whether it was just missed. That might be a suggestion to make in Phabricator, maybe; I like the "Thanks" button, and having a "Mark read" and "Read later" next to it would seem interesting, now that I'm thinking about it.) Psychoslave (talk) 13:10, 29 August 2017 (UTC)
- Sounds like something @Daniel Kinzler (WMDE): can say something about more than me. Lydia Pintscher (WMDE) (talk) 14:22, 29 August 2017 (UTC)
Lexical category
(Note that redundant remarks on distinct fields are purposefully separated into dedicated threads so they can be discussed separately.)
I think that the lexical category should also be tracked as a statement in this model. This is because a lexical category only has legitimacy within a grammar theory, and a language can be analyzed with different, possibly conflicting, grammatical theories.
To give a concrete example, look at Lojban grammar, which uses endogenous terms to describe its own grammar, with concepts which don't completely match those of classical scholarship grammars. But it doesn't mean that you couldn't try to interpret sentences using the vocabulary of such a grammar. More generally, it shouldn't be assumed that there is a single theory that commands general consensus as to its ability to describe every utterance of every language that could ever be produced. Some go as far as questioning "Are there any languages that appear to have no grammar?". And what if someone comes up with a theory where grammatical categories of lexemes are dynamic?
So at a minimum, this field should allow multiple values which are as flexible as the Wikidata claim structure. But then, why should it be rendered as a specific field rather than as a statement? Psychoslave (talk) 10:17, 26 August 2017 (UTC)
Writing visual representation centrism
Maybe it's not what is intended, but as described, the forms seem to be designed to convey only strings representing visual written performances.
But the same data structure could be used to carry other speech performances, such as oral performances represented as IPA strings (and other miscellaneous phonemic and phonetic encodings). That would be all the more interesting as some grammatical features only show up orally (see w:Alternation (linguistics), w:Variation (linguistics), w:Sandhi, w:Liaison (French), w:fr:Phonème éphelcystique inter alia).
I'm not skilled in this domain, but maybe w:SignWriting might be used to render sign language representations of a lexeme. I know that there are links between spoken and sign languages, but I don't know if there is an equivalent correspondence as between spoken and written language.
Another kind of representation that could be stored along with each lexeme is its Braille notation. Psychoslave (talk) 04:58, 27 August 2017 (UTC)
- Yeah that is something we have to think about more but I'd like us to concentrate on what we have first.
- Some of the things you mentioned might be solvable by adding statements or finding a written form. I'll keep it in mind as we make progress. Lydia Pintscher (WMDE) (talk) 15:04, 27 August 2017 (UTC)
Grammatical Feature
- A form's grammatical features specify under which conditions or in which syntactic role that form is used
Ok, that seems fine, but maybe give a few more examples of conditions (syntactic role being itself an example of situation/circumstance), for example:
- A form's grammatical features specify under which circumstances that form is used, like its syntactic role, the grammatical theory backing its usage, the regions where it does and doesn't apply
As the grammatical feature is an independent item, as far as I understand, it should be possible to add all that kind of information on the said item as statements anyway, but stating that this is expected wouldn't hurt, I guess. Psychoslave (talk) 05:25, 27 August 2017 (UTC)
An alternative model proposal: logomer
The more I'm reading and thinking about it, the more I'm inclined to consider that the model is trying to give an overly rigid framework.
What we are interested in documenting in Wiktionaries is chunks of discourse, and what is claimed about those chunks in such and such theories.
A lexeme is an abstract structure which is already far too committed to a closed theory of language; that is, it doesn't provide space for presenting language analyses which don't fit a lexemic structuring.
The mission of Wiktionaries is documenting all languages. Side note: this doesn't say written language, spoken language, or in fact even human languages, so depending on local consensus, you might see bee language documented.
What is aimed at here, as far as I understand, is to propose a structured database model to support this aim.
So, the model must allow documenting lexemes, sure. But that could be done as a lexemic relationship. For example, cat and cats, and Baum and Bäumen, are two couples in lexemic relationships that could be recorded as 4 distinct entities.
To really support the goal of Wiktionary, the model must also allow documenting w:lexical items, morphs, w:morphemes, etymons and whatever discourse chunk a contributor might want to document and relate to other discourse chunks. A lexeme class can't do that, or you must come up with such a distant definition of lexeme that it won't match any of the already too many existing ones in the linguistic literature.
I'm not aware of any consensual term for the "discourse chunk" in the sense I'm suggesting here (token doesn't fit either). So, in the rest of this message I'll use logomer (see wikt:en:logo- and wikt:en:-mer).
A discourse is any sign flow[note 1].
A glyph is any non-segmentable[note 2] sign that can be stored/recorded.
A logomer is a data structure which pertains to parts of a sequence of glyphs representing a discourse.
A logomer must have one or more representation.
A representation must have one or more form.
A single form must be elected as label.
A representation should indicate which representational systems it pertains to.[note 3]
A logomer must be related to one or more meaning.[note 4]
A logomer form must be extractable from a glyph sequence that represents a discourse.[note 5]
The extraction process of a logomer form must keep every unfiltered glyph.
The extraction process must not add any glyph.[note 6].
The extraction process must not alter any glyph.
A logomer form must include one or more glyph sequences (thereafter named "segment").
A segment must provide a glyph string.
A form including more than one segment must provide an ordinal for each segment.
A segment ordinal must indicate the relative position of a segment with respect to other segments of the form, relative to the beginning of the discourses where it appears.
A segment might be void.
A void segment might serve as boundary marker, indicating possible positions for other segments which are not part of the current logomer.
All logomer forms of a single representation must be congruent under permutation.[note 7]
An indistinguishable logomer form might appear in multiple discourses.[note 8]
Distinct occurrences of the same logomer form with distinct meanings must induce distinct logomers.
Distinct meanings attributed to the same discourse parts should appear in a single logomer.
A logomer form might be taken as a discourse of its own.
- ↑ More criteria regarding meaning are purposefully set aside
- ↑ That is, with regard to the sign system used. For example, a code point of a character encoding system could be segmented into several bits, but a bit is not a sign of the encoding system itself, even if a discourse using this system can make references to such a sign.
- ↑ For example, through statements. The accuracy of this information might be left to the community. It could be things as vague as "casual oral retranscription" and "direct matching of written document", or more precise like "phonemic system of the International Phonetic Alphabet" and "official orthography in the Dutch spelling reform of 1996"
- ↑ Or definition, or whatever indication of its sense
- ↑ Discourses that can't be represented as a glyph sequence are not considered
- ↑ So boundary markers such as the hyphen in morphs, like logo-, aren't part of a logomer
- ↑ That is, all forms have the exact same set of segments; only the ordinals of these segments can change.
- ↑ But hapaxes are logomer forms too, though
Psychoslave (talk) 20:55, 30 August 2017 (UTC)
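- For readers who prefer data to prose, here is one possible reading of the constraints above as a concrete structure, sketching the morph logo- with a void boundary segment. Every field name here is a guess at the proposal's intent, not part of any agreed model:
```lua
-- One possible reading of the logomer constraints as a concrete structure;
-- every field name is a guess at the proposal's intent, not an agreed model.
local logomer = {
    meanings = { "word, speech (as a combining form)" },
    representations = {
        {
            system = "written English orthography",  -- indicated via statements
            forms = {
                {
                    isLabel = true,  -- exactly one form is elected as label
                    segments = {
                        { ordinal = 0, glyphs = "logo" },
                        { ordinal = 1, glyphs = "" },  -- void segment: a boundary marker
                    },
                },
            },
        },
    },
}

-- Linearize a form by concatenating its segments in ordinal order.
local function linearize(form)
    table.sort(form.segments, function(a, b) return a.ordinal < b.ordinal end)
    local out = {}
    for _, segment in ipairs(form.segments) do
        out[#out + 1] = segment.glyphs
    end
    return table.concat(out)
end

print(linearize(logomer.representations[1].forms[1]))  -- prints "logo"
```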
- Actually, it's not yet a fixed model, clearly. In fact I already slimmed it down deeply while creating the following graphical representation:
Visualization of an alternative to the Lexeme data model for Wikibase support of Wiktionary
- However, it might be too slim. Maybe keeping at least one mandatory field related to meaning (but allowing a null value) would be better, whether on the logomer or on the logomer form.
- This way it's possible to indicate a difference between "wikt:fr:grand homme" and "homme grand", the former being (in the French variants I'm aware of) always used to indicate a famous person, while the latter indicates that a person is tall.
- But I'll already wait for feedback, especially from Noé, Benoît Prieur, Delarouvraie, Lyokoï, Jberkel, psychoslave, Lydia Pintscher, Thiemo Mättig, Daniel Kinzler, Epantaleo, Ariel1024, Otourly, VIGNERON, Shavtay, TaronjaSatsuma, Rodelar, Marcmiquel, Xenophôn, Jitrixis, Xabier Cañas, Nattes à chat, LaMèreVeille, GastelEtzwane, Rich Farmbrough, Ernest-Mtl, tpt, M0tty, Nemo_bis, Pamputt, Thibaut120094, JackPotte, Trizek, Sebleouf, Kimdime, S The Singer, Amqui, LA2, Satdeep Gill, Micru, Vive la Rosière, Malaysiaboy and Stalinjeet Psychoslave (talk) 21:14, 30 August 2017 (UTC)
- When you float away in your abstractions, you attract dreamers who like such abstractions, but repulse people who are able to sit down and do real and concrete work. Wikipedia is a success not because it is a perfect and abstract ideal of a theoretical model of knowledge, but because it is a simple tool for processing ASCII text. LA2 (talk) 21:26, 30 August 2017 (UTC)
- Sorry, but I don't understand what you want to do... Lyokoï (talk) 22:18, 30 August 2017 (UTC)
- I am pretty confused about the intent here as well... Amqui (talk) 22:58, 30 August 2017 (UTC)
- I was surprised that I'm listed here but I'm pretty confused as well with this. Tofeiku (talk) 06:08, 31 August 2017 (UTC)
- It is certainly true that we want to document things of unknown or even no meaning: "AOI", "Nautron respoc lorni virch", or the archetypal meaningless phrases of philosophers such as "hig mig cig". Even then there is context - there is always context.
- Rich Farmbrough (talk) 11:45, 31 August 2017 (UTC)
- Ok, it seems I need to explain more what I aim to provide here.
- In short, a data structure which targets carrying less abstract data while allowing relationships useful for Wiktionaries.
- So let's take the English adjective "hard" as a first example, so one might compare with the current model's examples.
- == Example 1: hard ==
- In this model the string (glyph sequence) "hard" might be recorded as follows:
- Logomer: L64723
- Statements
- Label: hard (Form 1)
- (that is, the linearization of the segments, which here is a single element)
- used in discourse expressed in: English (Q1860)
- lexical category: adjective (Q34698)
- Derived from: heard (L112), Old English adjective
- other statements might also add registers, glosses, definitions, synonyms, antonyms, translations, related concepts and so on
- === Form 1 ===
- segments: hard
- segments in detail: (0, hard)
- Statements
- used within representation system of: written utterances (Q98123723736661235)
- pronounced as: hɑːd (L387483) (the logomer itself can indicate corresponding representation systems)
- Qualifiers:
- Region: Scotland (Q22)
- References: ...
- pronounced as: hɑɹd
- Qualifiers:
- Region: Scotland (Q22)
- References: ...
- pronounced as: hard.ogg
- Qualifiers:
- Region: United States of America (Q30)
- (Rhymes should be inferred from associated phonetic logomers, which is even more important in cases with regional differences)
- === Form 2 ===
- There is no other indisputable form for "hard" in this model. But one might suggest that hard- in hardcore is just another form of the present logomer. As said, that's disputable, but for the sake of the example, here is how this second affixal form would be represented with this model (so possibly in a distinct logomer):
- segments: "hard", "-"
- segments in detail: (0, "hard"), (1, AGGLUTINATIVE_MARK)
- The AGGLUTINATIVE_MARK might be a special value, or a string containing a single soft hyphen, for example.
- Statements
- …
- == Example 2: je me mis la tête à l’envers ==
- Now, here is a second example which doesn't come from those provided for the Lexeme model, but which might shed light on what I had in mind while trying to outline a design for logomers.
- So, in French, "je me mis la tête à l’envers" is an inflected form of the phrase "fr:se mettre la tête à l’envers". In the model of logomers, each inflection has a single separate instance. That is, "je me mis la tête à l’envers", "tu te mis la tête à l’envers", and "se mettre la tête à l’envers" are three different logomers. Possibly they could group common statements in another entity, but that's another topic.
- Forms in logomers are here only to carry permutations and related statements, such as grammatical acceptability in a given frame.
- For example, "je me mis la tête complètement à l’envers", "je me mis gravement la tête à l’envers" and "à l’envers, je me mis la tête" are all less commonly heard but grammatically acceptable to my French native mind, and they clearly use instances of "je me mis la tête à l’envers".
- Thus "je me mis gravement la tête à l’envers" might be described as the following form:
- segments: "je me mis", " ", "la tête", " ", "à l’envers"
- segments in detail: (0, "je me mis"), (1, SPECIFIER_MARK), (2, "la tête"), (3, SPECIFIER_MARK), (4, "à l’envers")
- The SPECIFIER_MARK might be a concept entity such as "adjective", linearized as a simple space or "[…]" for display purposes.
- And "à l’envers, je me mis la tête" might be described as the following form:
- segments: "à l’envers", " ", "je me mis", " ", "la tête"
- segments in detail: (0, SPECIFIER_MARK), (1, "à l’envers"), (2, "je me mis"), (3, SPECIFIER_MARK), (4, "la tête")
- Note that something like "me je tête l’envers mis la à", which certainly wouldn't be recognized as grammatical by a French speaker, doesn't fit any permutation of the segments proposed here, but nothing in the model prevents documenting it in another logomer.
- I hope this helps @LA2: , @Lyokoï: , @Amqui: and @Malaysiaboy: to grasp my approach. Psychoslave (talk) 08:25, 1 September 2017 (UTC)
- Sorry for the French:
- Wait, it's the Wikidata version of Wiktionary that you're trying to build here, right? Listen, I have never taken the time to do anything with it. I don't feel like spending time on it, and in any case, I think it is not the right solution. Please leave me out of this. I will get involved when I see a benefit in it for the Wiktionnaire. Lyokoï (talk) 15:00, 1 September 2017 (UTC)
- (I would add that, on top of that, it's in English, and I only half understand it...) Lyokoï (talk) 15:13, 1 September 2017 (UTC)
- @Psychoslave, thanks for the effort in trying to create a better model. I want to point out that the current proposal for Wikidata's Lexicographic model is not just thought up by the Wikidata team, but an adaptation of lexicographic data models that have been developed over the last century starting with TC 37, later under ISO as the Lexical Markup Framework, and then captured in RDF under the Lemon model. Wikidata is very much in that tradition, which means it is distilling literally the knowledge of hundreds of linguists over a century of work.
- Just to raise three points with your model:
- 1) whereas you claim that it also extends to bee language, I am wondering whether this is actually a requirement. Wikidata's first (although not only) priority is to support its sister Wikimedia projects. Is there any Wiktionary that actually captures bee language? If we move too far away from our requirements, we might create a solution that is more complex than necessary.
- 2) whereas you claim that bee language is a requirement, your model later is restricted to languages represented with glyphs. This seems contradictory to me? Did I miss something?
- 3) in your example for hard, you state that meanings and antonyms could be expressed via statements on the level of the Logomer. But antonyms do not pertain to a specific Logomer, if I understand Logomers and antonyms correctly, but usually to a specific sense of the Logomer, i.e. to a specific definition. But I don't seem to be able to express the antonym relation on the definition. Maybe I am just missing something.
- Again, thank you for your proposal. It is unfortunate that it comes so late - the discussions about the data model were held years ago, fully in the open, and with wide invitations. It is not easy to fully appreciate such a contribution just a few months before the planned roll out of the project. Denny (talk) 22:06, 1 September 2017 (UTC)
- 0) I read the whole talk page of Wikidata4Wiktionary, so I was aware of the important analysis work you have done and used. I didn't yet read all the documentation about Lemon, though. Anyway, my concern is not about the Lemon model, or the current proposed Lexeme model as a useful tool in many contexts, but about the very precise context of Wikidata4Wiktionary. If tradition seems a good fit for grounding our goals, great, let's leverage our work with it. Otherwise, let's set it aside, rather than sink under the weight of its unsuitable hypotheses.
- 1) If that's the case, I'm not aware of it. The bee language was of course an extreme example. I'm all for a simpler model: one which removes as much as possible of any linguistic theory while retaining the ability to express such theories through its internal mechanisms. My current proposal seems still far too complicated and confusing for other contributors, so to my mind it is not good enough either. Sticking to our requirements is great, but what are our requirements? I didn't see the document exposing these requirements clearly, and how they were produced, so if such a document does exist, please let me know. To my mind, requirement 0 is a class designed to store strings, going from d to wikt:bayerischer gebirgsschweisshund, but also including affixes such as wikt:-in-, morphs, and any sequence of characters one might encounter in the world. I tried to go further with the "ordered segments" of utterance, but that's maybe already a too complex model for our goals. Then requirement 1 is to be able to document those strings, so those who encounter them can type them into a Wiktionary and discover where they are suspected to come from, and whether they might mean something, their contrary, or nothing at all depending on context. Yes, even strings with no actual (publicly known) meaning are worth documenting, so people who encounter them can grab the knowledge of this substantiated absence of sense. And finally, requirement 2 is to be able to glue all these pieces together through relationships, preferably in a way that allows as much automated inference as possible. Those are the basic requirements I would expect from a Wikidata4Wiktionary.
- 2) I think it more probable that I didn't explain my idea clearly enough, rather than that you missed something I said distinctly. So my idea is that the data model is not about an utterance performance, but about a recordable representation of such a performance. The representation only refers to the performance. Maybe a theater analogy would be more significant here: there is a written script before the show performance, and you might have a recorded video, but the performance itself is gone. So, do I think that a glyph sequence can be used to represent a bee utterance? Yes, definitely, just as w:fr:DanceWriting can be used to represent dance. I used glyph rather than character because - at least to my mind - glyph represents a larger set. But if "character strings" is clearer, let's use that.
- 3) I think you have a very good point, but I'm afraid that, as I'm writing this, I'm far too tired to provide a relevant answer right now. So I'll delay until I've had some rest, sorry.
- 4) Well, I'm sorry, I do agree I'm late. I did attempt to participate in the past, but never found an occasion to give more feedback earlier. All the more so as I have since expanded my knowledge about linguistics and practiced in various other ways as a Wikimedian… Psychoslave (talk) 01:29, 2 September 2017 (UTC)
- 0) The use case and requirements are to support the Wiktionaries. So the coverage is kinda given by "whatever the Wiktionaries have now", and the model has to be good enough to cover that. Going beyond that is nice and well, but only if it doesn't get more complicated. As simple as possible, as complex as required to serve the Wiktionaries - that's the primary requirement. If at the same time we can follow best practices from research - just as we did for Wikidata and the research in Knowledge Representation - all the better; that would be the secondary requirement. So if there is a widely agreed-on data model from linguistic research which at the same time fulfills the needs of the Wiktionaries, then I am super happy to just adopt it instead of inventing something new. Because in this case the likelihood of third parties donating data or working with the data grows by a huge amount, since we are not inventing new stuff but building on existing stuff that is already established. This is why I think an alternative model doesn't have to be merely strictly better, but strictly better by a margin wide enough to justify jeopardizing external adoption. I hope that makes sense.
- Basically, I would ask anyone who brings up an alternative model to show what exact use case in Wiktionary would not be served by the currently proposed model and how their model serves it - and at the same time to ensure that all other use cases are still covered.
- 3) I'd be curious to hear, as I think that is one of the main use cases the data model has to fulfill.
- (I'm skipping 1), 2) and 4), as I think they are not so central and won't contribute too much to a result. Let me know if you disagree) Denny (talk) 19:05, 2 September 2017 (UTC)
- I'm ok with skipping 1), 2) and 4).
- Regarding 3), I think that you are simply right about the flaw of the Logomer model.
- I'm still wondering what the Lexeme class of the current model is supposed to encompass. Should it store affixes, stems, etymons, clitics, morphemes (and possibly monemes), glossemes, compounds, lexical items, phrases, and other lexical units which don't even have an English equivalent, such as wikt:fr:lexie? If so, I wonder if the term lexeme is still appropriate. Psychoslave (talk) 13:57, 3 September 2017 (UTC)
- Concerning requirements and examples of data that the model should be able to encompass, I will write a dedicated page. Maybe this week, but I'll have to allocate more time to local chapter concerns, so I can't promise any progress on this side in the forthcoming days. Psychoslave (talk) 21:32, 3 September 2017 (UTC)
- I don't care that much about what the structures are named in the data model, and I wouldn't put too much weight on a definition of Lexeme - just as we never clearly defined what an Item is in Wikidata. In the end, everything that has a Wiktionary page will lead to one or more Lexemes, just as everything with a page in the other Wikimedia projects leads to Items. The important thing is, whether the structure works as a data structure - not what a Lexeme is. The word 'Lexeme' is merely a rough approximation, to convey a rough idea. 'Word' would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases. Denny (talk) 21:22, 4 September 2017 (UTC)
- I don't care that much about what the structures are named in the data model
- Well, it's very sad to hear you are careless about terminology, especially within a project that is aimed at helping lexicographers. If the model keeps this data structure, then it should definitely use "word" instead of "lexeme".
- just as we never clearly defined what an Item is in Wikidata
- Isn't the Wikidata glossary entry about item a clear definition? Maybe it was done as an afterthought, but it's there as far as I can see.
- The important thing is, whether the structure works as a data structure - not what a Lexeme is.
- The important thing is whether the structure is helpful for Wiktionary contributors, and using clearly defined classes is a requirement of such a goal. Otherwise this model could just as well use "class1" instead of "lexeme", "class2" instead of "form", "attribute1" instead of "lemma", and so on. As a data structure per se it would work just as well.
- Word would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases.
- Linguistic use "lexeme" precisely to avoid the vagueness of "word" (although depending on the linguistic school it will carry different specified meanings). Using "lexeme" is counter-intuitive, or at least, in complete opposition with the intent of the term. It favors the false impression that the model intent to treat the topic with carefully chosen terminology, when in fact it was carelessly arbitrarily elected through unspecified criteria. Also, it's seems very incompatible that on the one hand you say that the model should be erected on the solid body of linguistic knowledge founded over the last century, and on the other hand that you just don't care about using appropriate terminology regarding this same body of knowledge. Psychoslave (talk) 09:07, 5 September 2017 (UTC)
- These are good points, and in hindsight, it sounds more dismissive than I meant it to be. Yes, Item has a definition that you point to - but if you really look at Wikidata you will find that this definition is not true. There are plenty of Items which are far from fulfilling that definition. And yet it is, I think, a good thing to have such a definition, as it helps with understanding the model. It's a Wittgenstein ladder.
- I would hold the same for the terminology here. In fact, I do think that the model would work just as well if we used attribute1 instead of lemma. But the latter is helpful in discussing the UI, the code, the model. Not because it is true.
- The data model must fulfill the use cases, and if it is able to model solid linguistic theories, the better. But the exact terminology should be treated as a Wittgenstein's ladder - useful to gain an intuition, and for having evocative labels, but they should (and won't) restrain the community in getting their work done. If something - like the letter 'd' - is not regarded as a Lexeme in mainstream linguistic theories, that should not (and won't) stop the community from adding the letter 'd' as a Lexeme - just as the ontological status of many, say, Wikinews articles or Wikisource pages did not stop the community from creating items for them. And that's OK.
- In the end, the exact labeling of the elements of the data structure won't be as important as what the community actually does with them and how they use it inside the Wiktionaries. In fact, a lot of the terminology is even hidden from most casual contributors - they might never see the term 'Lexeme' in the first place. Just as the word 'item' is not prominent in the Wikidata UI. But it is still useful to have a shared vocabulary for development and talking about the system.
- I hope that makes sense and I am not contradicting myself too much. Denny (talk) 17:02, 5 September 2017 (UTC)
Yet another model, with vocable as central class
[edit]So, the logomer proposal being inadequate, but still being concerned about the lexeme-class-centric model, here is another model. This time I didn't come up with any fancy original neologism, and I used terms with existing ISO definitions whenever I found one. In all cases I gave online sources, but I also used some books to guide my reflection, especially Le dictionnaire de linguistique et des sciences du langage, Larousse, 2012 (ISBN 9782035888457, OCLC 835329846).
Definitions
[edit]Here are the pertinent definitions for this newly proposed model.
- concept
- unit of thought
- source
- https://www.iso.org/obp/ui/#iso:std:iso:5963:ed-1:v1:en:term:3.2
- translations
- French
- notion (Toute unité de pensée. Le contenu sémantique d'une notion peut être ré-exprimé en combinant d'autres notions qui peuvent être différentes d'une langue à l'autre.)
- term
- word or standalone expression for an entity that has linguistic, semantic and grammatical integrity
- source
- https://www.iso.org/obp/ui/#iso:std:iso:18542:-2:ed-1:v1:en:term:3.1.12
- translations
- French
- terme (mot ou expression isolée pour une entité qui a une intégrité linguistique, sémantique et grammaticale)
- sense
- one of the meanings of a word
- entity extraction
- process that seeks to locate, classify, and tag atomic elements in text into predefined categories
- word form (vocable)
- contiguous or non-contiguous entity from a speech or text sequence identified as an autonomous lexical item
- inflection
- modification or marking of a lexeme that reflects its morpho-syntactic properties
- inflected form (form)
- form that a word can take when used in a sentence or a phrase
- Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as grammatical number and case.
- word
- lexeme that has, as a minimal property, a part of speech
- lexeme
- abstract unit generally associated with a set of forms sharing a common meaning
- lexicalization
- process of making a linguistic unit function as a word
- Note 1 to entry: Such a linguistic unit can be a single morph, e.g. “laugh,” a sequence of morphs, e.g. “apple pie” or even a phrase, such as “kick the bucket”, that forms an idiomatic phrase.
- manifestation
- physical embodiment of a given concept
- inflectional paradigm
- class of words with similar inflection rules
- transliteration
- representation of the graphic characters of a source script by the graphic characters of a target script
- transcription
- representation of the sounds of a source language by graphic characters associated with a target language
- primary data (vocable)
- electronic representation of language data
- annotation
- linguistic information added to primary data
- representation
- format in which the annotation is rendered, independent of its content
- observation
- act of observing a property, with the goal of producing an estimate of the value of the property
- act of measuring or otherwise determining the value of a property
- method of data collection in which the situation of interest is watched and the relevant facts, actions and behaviours are recorded
- statement of fact made during an audit or review and substantiated by objective evidence
- instance of applying a measurement procedure to produce a value for a base measure
The model
[edit]I used the plantuml class diagram generator to produce this picture, so it's maybe not as nice, but it's far more flexible. Actually, if someone would dare to install the corresponding MediaWiki extension, I might just copy/paste the text format here.
Note that vocable here is used as a mix of the two definitions to which it is appended in brackets in the list of definitions above.

Psychoslave (talk) 21:28, 3 September 2017 (UTC)
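As a rough illustration only - not the diagram itself - here is a minimal sketch, using Python dataclasses, of how a few of the classes defined above might relate, with vocable as the central class. All attributes and relations here are assumptions drawn from the definitions, not from the original diagram.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    """One of the meanings of a word (per the definition above)."""
    concept: str  # e.g. an identifier for the "unit of thought" it evokes

@dataclass
class InflectedForm:
    """Form that a word can take when used in a sentence or a phrase."""
    representation: str
    grammatical_features: List[str] = field(default_factory=list)

@dataclass
class Lexeme:
    """Abstract unit associated with a set of forms sharing a meaning."""
    forms: List[InflectedForm] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)

@dataclass
class Vocable:
    """Central class: an entity identified in speech or text as an
    autonomous lexical item (mixing the two 'vocable' definitions)."""
    surface: str
    lexemes: List[Lexeme] = field(default_factory=list)
```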
Form types
[edit]@Daniel Kinzler (WMDE): regarding the Form Types suggested here, doesn't a "no value" fulfill the same function? Denny (talk) 15:59, 10 January 2018 (UTC)
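For reference, a "no value" assertion in the Wikibase JSON serialization is a snak of type novalue, which records explicitly that the property has no value (rather than the value being unknown). A minimal sketch, with a placeholder property id:

```python
# Illustrative only: P9999 stands in for whatever property would be used.
no_value_claim = {
    "mainsnak": {"snaktype": "novalue", "property": "P9999"},
    "type": "statement",
    "rank": "normal",
}
```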
Features
[edit]I see "features" as separate from "statements" in the proposed data model. Will they be modelled as property-value pairs or a different data structure?
Over at d:Wikidata:Property proposal/Lexemes, a number of properties are being proposed, like "person", "gender", "number", which will fit into the "features" component of the Lexeme data model. Deryck C.Meta 12:53, 25 April 2018 (UTC)
- Hello Deryck, I hope I understand your question right.
- The so-called features are, for example: the lemma, the language of the lemma, the language of the lexeme, and the lexical category; in the forms, the representation and its language. These pieces of information are not represented by triples; each is going to be a simple field (a bit like the label and description in items). Some of these fields will have autocompletion from Wikidata items.
- If you want to see what it will look like, you can try the demo system (the information there is not necessarily correctly modeled; it's mostly a sandbox to try the interface).
- Let us know if you have further questions :) Lea Lacroix (WMDE) (talk) 06:20, 26 April 2018 (UTC)
- Yes that makes sense. We aren't separating grammatical features by category (or properties). Deryck C.Meta 09:18, 26 April 2018 (UTC)
- Will the lemma (the Lexeme itself) have a "grammatical features" field? I only see that Forms have "grammatical features", but it seems that the Lexeme doesn't. For example, how do we represent the fact that "chien" (fr) is masculine regardless of form? Deryck C.Meta 20:42, 29 April 2018 (UTC)
- No, the grammatical features are only included in the Forms. If you want to indicate something about the lexeme, you can decide to have a dedicated property and add it in a statement. Lea Lacroix (WMDE) (talk) 07:41, 30 April 2018 (UTC)
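To make this split concrete, here is a minimal sketch of how such a lexeme might serialize, assuming a JSON shape similar to Wikibase items. All Q_/P_ ids are placeholders: in particular, the "grammatical gender" property is hypothetical, as no such property existed at the time of this discussion.

```python
# Illustrative sketch, not the actual serialization.
lexeme = {
    "lemmas": {"fr": {"language": "fr", "value": "chien"}},
    "language": "Q_FRENCH",        # language of the lexeme: a plain field
    "lexicalCategory": "Q_NOUN",   # lexical category: a plain field
    "forms": [
        {
            "representations": {"fr": {"language": "fr", "value": "chiens"}},
            "grammaticalFeatures": ["Q_PLURAL"],  # features live on forms only
        }
    ],
    # Lexeme-wide facts, such as grammatical gender, go into statements:
    "claims": {
        "P_GENDER": [
            {"mainsnak": {"snaktype": "value", "property": "P_GENDER",
                          "datavalue": {"type": "wikibase-entityid",
                                        "value": {"id": "Q_MASCULINE"}}}}
        ]
    },
}
```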
Language learning features
[edit]Is it possible to include fields at the Lemma or Senses level that may later support language learning?
I'm a daily user of Wiktionary, mainly because I've been learning languages for many years already. I would say that my use of Wiktionary for my mother language is a tenth or a hundredth of my use of it for second-language learning.
When learning a language, it could be helpful to be able to query Wiktionary for senses by frequency. Frequency lists already exist, but I've not yet seen one that works by sense.
In addition to examples by sense, there could be a section, also by sense, maintained by the learner community to include, for instance, mnemonic sentences and/or images that may help while learning a word. These mnemonics could be rated by users, so that the most highly voted are displayed. There could also be an option to set the mother language, since a mnemonic that is very good in one mother language, say French, may make no sense to someone who speaks Japanese.
With this data, sister projects could be developed as learning tools, like flashcards, improving on what is already available outside the wikis by using not only repetition but also good mnemonics. Ajoposor (talk) 15:08, 28 April 2018 (UTC)
- I like the ideas a lot. Regarding the sense frequency, we would need some source to get this data from - is there something like this? Denny (talk) 03:55, 29 April 2018 (UTC)
- I'm not sure I follow entirely, but this is not data that can be recorded "about a word" in isolation. Frequencies always apply to a specific corpus - e.g. prose, newspapers, technical writing, online forums - not to a language as a whole. It therefore seems inappropriate for Wikidata. I agree, though, that the Wiktionary coverage could be improved a lot, but that's something to take up at the individual Wiktionaries you're interested in. Tropylium (talk) 15:43, 29 April 2018 (UTC)
- Hi, yes, a frequency is specific to a corpus, but that doesn't mean we must discard its use. We would need to agree on a corpus; for instance, the whole of Wikipedia could be used as a proxy. Some dictionaries already display a frequency. Frequency counting could be done with algorithms, so that frequencies may be adjusted over time.
- But going beyond the usual frequency, I would suggest a methodology for assigning frequencies to each meaning (many words have multiple meanings, some of them rarely used, and that poses a problem for language learners). To do so, an algorithm could collect sample texts containing a word; these samples would then be left to users to analyse, assigning a sense number to each occurrence. In this way, frequencies could be calculated per meaning. The task could be made engaging through gamification.
- I tried to find out who the users of dictionaries are, but couldn't find an answer. We may be so used to dictionaries that we take them for granted. But it is important to know who is using the dictionary and what their needs are.
- I'm interested in addressing the needs of language learners. Learning a language is a task that takes A LOT of time. So it is a perfect target for improvement and optimization. Ajoposor (talk) 14:55, 7 May 2018 (UTC)
- I agree with using this frequency data. BoldLuis (talk) 04:30, 9 March 2021 (UTC)
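As a sketch of the per-sense counting step described above, assuming a hypothetical corpus whose word occurrences have already been tagged with sense numbers by users:

```python
from collections import Counter

# Hypothetical input: (word, sense number) pairs produced by users
# tagging each occurrence of a word in corpus samples.
tagged_occurrences = [
    ("bank", 1), ("bank", 1), ("bank", 2),
    ("bank", 1), ("run", 3), ("run", 1),
]

sense_counts = Counter(tagged_occurrences)
word_totals = Counter(word for word, _ in tagged_occurrences)

# Relative frequency of each sense within its word, for ranking senses.
for (word, sense), n in sense_counts.most_common():
    print(f"{word} sense {sense}: {n / word_totals[word]:.2f}")
```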
Senses connected to Wikidata Items
[edit]Dear all, I've noticed the editorial note on this page that there's still a need to address how Senses can be related to Wikidata Items without implying synonymy between senses that are related to the same Item. (See quote below.)
- Editorial Note: We should find a good place to address a common source of misunderstandings: Senses can be connected to Wikidata Items via an appropriate Statement they evoke or denote (compare lemon:denotes and lemon:evokes). However, such a connection should not be interpreted as the lexeme actually representing the concept defined by the item (compare lemon:LexicalSense and lemon:LexicalConcept). In particular, if two lexemes have senses that refer to the same concept in this way, this does not imply that the two lexemes are synonyms. Example: The lexemes for the English adjectives "hot" and "cold" could both have a sense that refers to Q11466 (temperature), even though they are antonyms.
This issue has been recognised and addressed in the lemon-tree vocabulary. There, the property http://w3id.org/lemon-tree#isSenseInConcept fits the purpose and has been used for topical thesauri. Perhaps it is worth considering using this approach here, too.
Sander Stolk (talk) 16:58, 6 April 2019 (UTC)
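For illustration, here is a minimal sketch of the lemon-tree approach using Python's rdflib. The sense URIs are placeholders; the point is that two senses sharing a concept does not, by itself, assert synonymy.

```python
from rdflib import Graph, Namespace, URIRef

# Namespace from the lemon-tree vocabulary cited above; the sense URIs
# below are illustrative placeholders.
TREE = Namespace("http://w3id.org/lemon-tree#")
WD = Namespace("http://www.wikidata.org/entity/")

g = Graph()
hot_sense = URIRef("http://example.org/lexeme/hot#sense1")
cold_sense = URIRef("http://example.org/lexeme/cold#sense1")

# Both senses are placed in the same concept (Q11466, temperature)...
g.add((hot_sense, TREE.isSenseInConcept, WD.Q11466))
g.add((cold_sense, TREE.isSenseInConcept, WD.Q11466))
# ...yet no triple here asserts that "hot" and "cold" are synonyms;
# synonymy would require an explicit, separate relation between senses.

print(g.serialize(format="turtle"))
```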
Pronunciation representation and audios
[edit]Newbie question: where can I see them for a word (in a language)? (For the moment, the simplest question.) BoldLuis (talk) 04:32, 9 March 2021 (UTC)
- Hello BoldLuis. For audio recordings of words, the best place to start could be Wikimedia Lingualibre.org's language categories, such as Commons:Category:Lingua_Libre_pronunciation-fra. Then find the most prolific speakers of your target language, and dive in. Lili currently has 400,000+ audio recordings, mostly in 10-20 Western and Indic languages. We do our best to diversify and find new languages. You may also come and ask on Lingualibre, where some users have SPARQL experience and can help you further. Yug (talk) 06:45, 9 March 2021 (UTC)
- Thank you a lot. BoldLuis (talk) 06:55, 9 March 2021 (UTC)
- Also, FYI, Lingualibre's Recording Studio has a little-known feature to video-record sign languages via the computer's camera. We have already recorded a few FSL (French Sign Language) signs. A clean background is strongly recommended when filming. If interested, please ping us on our main forum so we can add Spanish Sign Language as well, which will allow people to video-record its signs. Yug (talk) 11:03, 9 March 2021 (UTC)
- Good. Thanks. I have seen w:Draft:Lingua Libre. I hope to see the article in the English Wikipedia soon. BoldLuis (talk) 18:15, 11 March 2021 (UTC)
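For readers who want to script this, here is a minimal sketch that lists a few files from the Lingua Libre category mentioned above, using the standard MediaWiki API on Wikimedia Commons (the User-Agent string is a placeholder):

```python
import requests

# List a few audio files from the Lingua Libre French pronunciation
# category via the MediaWiki API on Wikimedia Commons.
resp = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Lingua Libre pronunciation-fra",
        "cmtype": "file",
        "cmlimit": "10",
        "format": "json",
    },
    headers={"User-Agent": "lexeme-audio-demo/0.1"},
)
for member in resp.json()["query"]["categorymembers"]:
    print(member["title"])
```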
red JSON link in 'Wikibase data model in JSON'
[edit]Editorial Note: The linked page does not exist. Please document this topic. Thanks.