Extension:WikibaseLexeme/Data Model

This is a living document, describing the conceptual data model used by WikibaseLexeme. It is not a specification of any concrete binding, implementation, mapping, or serialization. The Wikibase Lexeme extension provides improved modeling for lexical entities such as words and phrases. While it would be theoretically possible to model these things using Items, a more expressive specialized model helps to reduce complexity, and improve re-use and mappings to other vocabularies.
 * Lexeme:
   * Lemma
   * Language
   * Lexical category
   * Statements
   * Forms:
     * Representation
     * Syntactic Markers
     * Statements
   * Senses:
     * Gloss
     * Statements

The data model of WikibaseLexeme describes the structure of the data that is handled as Lexemes in Wikibase. In particular, it specifies which kind of information users can contribute to the system. The data model is conceptual ("Which information do we have to support?") and does not specify how this data should be represented technically ("Which data structures should the software use?") or syntactically ("How should the data be expressed in a file?"). Separate documents describe the serialization of the Wikibase data model in JSON and in RDF (Resource Description Framework).

The Lexeme data model defines an ontology for lexicographical entities. In particular, it defines a vocabulary of entities and relationships (classes and predicates) for describing lexemes. The Lexeme data model is based on the Wikibase data model, so the Wikidata glossary and the Wikibase data model primer may be helpful in understanding this document. The Lexeme data model aims to align with the LEMON model by the Ontolex W3C community group, where useful and practical. However, in the spirit of Wikibase, the Lexeme model is designed to be simple and flexible enough for casual collaborative editing, as opposed to the more formalized approach taken by LEMON.

Lexeme
A Lexeme is a lexical element of a natural language, such as a word, a phrase, or a prefix (see Lexeme on Wikipedia). Lexemes are Entities in the sense of the Wikibase data model.

A Lexeme is described using the following information:


 * An ID. Lexemes have IDs starting with an "L" followed by a natural number in decimal notation, e.g. "L1". These IDs are unique within the repository that manages the Lexeme.
 * A Lemma for use as a human readable representation of the lexeme, e.g. "run".
 * The Language to which the lexeme belongs. This is a reference to a concrete Item on Wikidata, e.g. Q1860 for English.
 * The Lexical category to which the lexeme belongs. This is given as a reference to a concrete Item on Wikidata, e.g. Q34698 for adjective.
 * A list of Statements to describe properties of the lexeme that are not specific to a Form or Sense (e.g. derived from, grammatical gender, or syntactic function)
 * A list of Forms, typically one for each relevant combination of syntactic markers, such as 2nd person / singular / past tense.
 * A list of Senses, describing the different meanings of the lexeme (e.g. "financial institution" and "edge of a body of water" for the English noun bank).
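The conceptual structure above might be sketched as plain Python dataclasses. This is purely illustrative: the field names, types, and the use of dicts for Statements are assumptions of this sketch, not part of any official WikibaseLexeme binding or serialization.

```python
from dataclasses import dataclass, field

# A MultilingualTextValue maps language/variant codes to spellings,
# e.g. {"en-GB": "colour", "en-US": "color"}. Illustrative only.
MultilingualTextValue = dict[str, str]

@dataclass
class Form:
    representation: MultilingualTextValue
    syntactic_markers: list[str] = field(default_factory=list)  # Item IDs, e.g. "Q814722"
    statements: list[dict] = field(default_factory=list)

@dataclass
class Sense:
    gloss: MultilingualTextValue
    statements: list[dict] = field(default_factory=list)

@dataclass
class Lexeme:
    id: str                       # e.g. "L1"
    lemma: MultilingualTextValue
    language: str                 # Item ID, e.g. "Q1860" (English)
    lexical_category: str         # Item ID, e.g. "Q34698" (adjective)
    statements: list[dict] = field(default_factory=list)
    forms: list[Form] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)
```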

TBD: we may have to add a field for grammatical gender here. It's probably sufficient to model that via a statement, though.

TBD: use just "category" instead of "lexical category"? Can we use "word class"? Or "part of speech"?

Lemma
The lemma is a human readable representation of the lexeme (see Lemma on Wikipedia). Typically, the canonical form of the lexeme (e.g. the infinitive form of verbs) will be used as the lemma (see also lemon:canonicalForm).

Lemmas are not simple strings, but MultilingualTextValues, since the same lemma may have multiple spellings.

For example, the Lemma for the English noun color would include "colour" for British English as well as "color" for American English. This is especially important for languages that use multiple alphabets, such as Serbian.

Note: Lemmas are not unique, nor is the combination of Lemma, Language, and Lexical category. For example, there are two German nouns with the Lemma "See", differing only in gender: "der See" meaning "the lake", and "die See" meaning "the sea". These two meanings cannot be understood as a single Lexeme, since based on their gender they differ in morphology, and they thus have different forms.
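As a sketch, a lemma given as a MultilingualTextValue could be thought of as a mapping from language/variant codes to spellings. The language codes here are illustrative, not a statement about which codes Wikibase accepts:

```python
# Illustrative sketch: the lemma of the English noun "color" with
# its British and American spellings (codes are assumptions).
lemma_color = {
    "en-US": "color",
    "en-GB": "colour",
}
```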

Form
The morphology of the lexeme is understood as a set of Forms. Each form defines how a lexeme changes based on a specific syntactical role or mode it may take in a sentence (see also lemon:Form). For example, the English verb run becomes "running" as a gerund and "runs" in 3rd person singular.

A Form is described using the following information:


 * A representation, spelling out the Form as a string.
 * A list of syntactical markers that define for which syntactical role the given form applies. These are given as references to concrete Items on Wikidata, e.g. Q814722 for participle.
 * A list of Statements further describing the Form or its relations to other Forms or Items (e.g. pronunciation audio, rhymes with, used until, used in region)
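A single Form could be sketched as a plain mapping like the following. The structure is illustrative only; Q814722 (participle) is taken from the example above:

```python
# Hypothetical sketch of one Form of the English verb "run".
# Field names are assumptions, not an official serialization.
form_running = {
    "representation": {"en": "running"},
    "syntactic_markers": ["Q814722"],  # participle
    "statements": [],
}
```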

Representation
A form's representation is its written form, as used in a text (compare lemon:writtenRep). Just like Lemmas, Representations are not simple strings, but MultilingualTextValues, since the same form may have multiple spellings, possibly in multiple scripts.

Syntactic Marker
TBD: The term "Syntactic Marker" has been removed from LEMON; the "marker" predicate seems to relate to the syntactic frame for a sense, rather than specifying when and where a specific form applies. The lexinfo ontology provides the "markers" we want here, such as case, number, tense, or gender, via "morphosyntactic properties". According to Wikipedia, a good name for this could be (syntactic) feature. Grammatical category may also fit, but may be too general, and easily confused with the lexical category.

A form's syntactical markers together define the syntactical role or mode in which a form applies (see lexinfo:morphosyntacticProperty). If each syntactical marker can be thought of as defining a set of forms that it applies to, then the set of forms associated with the role is the intersection of the sets of forms defined by the markers given.

For example, the role 1st person present tense plural can be defined by three markers, represented by Wikidata Items: Q192613 (present tense), Q5397000 (first person), and Q146786 (plural). The set of forms for that role is the intersection of the sets of forms that have the respective markers.
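The intersection semantics described above can be illustrated with Python sets. The form spellings and their marker assignments below (German "laufen", to run) are hypothetical example data; the marker Item IDs are the ones named in the text:

```python
# Hypothetical example data: which forms of the German verb "laufen"
# carry which marker, keyed by the Items named in the text:
# Q192613 (present tense), Q5397000 (first person), Q146786 (plural).
forms_by_marker = {
    "Q192613": {"laufe", "läufst", "läuft", "laufen", "lauft"},
    "Q5397000": {"laufe", "laufen"},
    "Q146786": {"laufen", "lauft"},
}

def forms_for_role(markers: list[str]) -> set[str]:
    """The forms for a role are the intersection of the form sets
    defined by each of the role's markers."""
    return set.intersection(*(forms_by_marker[m] for m in markers))

# 1st person present tense plural: {"laufen"} ("wir laufen")
role_forms = forms_for_role(["Q192613", "Q5397000", "Q146786"])
```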

Sense
The senses of a lexeme are the different meanings which it may represent in a text. The senses are given as natural language definitions or glosses (compare intensional definitions on Wikipedia).

A sense is described using the following information:
 * A Gloss, defining the meaning of the Sense using natural language.
 * A list of Statements further describing the Sense and its relations to Senses and Items (e.g. translation, synonym, antonym, connotation, register, refers to concept).

Note that senses can be connected to Wikidata Items via an appropriate Statement, but such a connection should not be interpreted as the lexeme actually representing the concept defined by the item (compare lemon:LexicalSense and lemon:LexicalConcept). In particular, if two lexemes have senses that refer to the same concept in this way, this does not imply that the two lexemes are synonyms. For example, the lexemes for the English adjectives "hot" and "cold" would both have a sense that refers to Q11466 (temperature), even though they are antonyms.


 * TBD: we may have to add a field for a function description (what's the definition of "to"? it doesn't have a definition, just a function); alternatively, we could treat functions as a third type of nested entity, on the same level as senses.
 * TBD: we may have to add a field for the syntactic frame ("ask for", "ask about", "ask to", "ask out", "ask oneself", ...). See synsem:marker and synsem:syntactic-frame.

Gloss
A sense's gloss gives a natural language definition of the sense (see Gloss on Wikipedia and skos:definition). Similar to Lemmas, Glosses are not simple strings, but MultilingualTextValues. However, the reason here is not to support spelling variants, but to allow the gloss to be given in entirely different languages. E.g. it would be quite useful for a German speaker learning French to have a German gloss for a French word.
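Sketched as a MultilingualTextValue, the glosses for one sense might look like this. The sense is the "financial institution" reading of the English noun bank mentioned earlier; the gloss texts and language codes are illustrative:

```python
# Illustrative sketch: one sense of the English noun "bank",
# glossed in several languages (texts are assumptions).
gloss_bank = {
    "en": "financial institution",
    "de": "Finanzinstitut",
    "fr": "établissement financier",
}
```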