Extension:WikibaseLexeme/Data Model

This is a living document, describing the conceptual data model used by WikibaseLexeme. It is not a specification of any concrete binding, implementation, mapping, or serialization.

The data model of WikibaseLexeme describes the structure of the data that is handled as Lexemes in Wikibase. In particular, it specifies which kinds of information users can contribute to the system. The data model is conceptual ("Which information do we have to support?") and does not specify how this data should be represented technically ("Which data structures should the software use?") or syntactically ("How should the data be expressed in a file?"). Separate documents describe the serialization of the Wikibase data model in JSON and in RDF (Resource Description Framework).

The Lexeme data model is based on the Wikibase data model. The Wikidata glossary and the Wikibase data model primer may be helpful in understanding this document. The Lexeme data model aims to align with the LEMON model where useful and practical. However, in the spirit of Wikibase, the Lexeme model is designed to be simple and flexible enough for casual collaborative editing, as opposed to the more formalized approach taken by LEMON.

Motivation
The Wikibase Lexeme extension provides improved modeling for lexical entities such as words and phrases. While it would be theoretically possible to model these things using Items, a more expressive, specialized model helps to reduce complexity and to improve re-use and mapping to other vocabularies.

Overview
TBD

Lexeme
A Lexeme is an Entity that represents a lexical element of a natural language, such as a word, a phrase, or a prefix (see Lexeme on Wikipedia). It consists of the following components:


 * An ID. Lexemes have IDs starting with an "L" followed by a natural number in decimal notation. These IDs are unique within the repository that manages the Lexeme.
 * A Lemma for use as a human readable representation of the lexeme, e.g. "run".
 * The Language to which the lexeme belongs. This is a reference to a concrete Item on Wikidata, e.g. Q1860 for English.
 * The Lexical category to which the lexeme belongs. This is a reference to a concrete Item on Wikidata, e.g. Q34698 for adjective.
 * A set of Forms, typically one for each relevant combination of syntactic markers, such as 2nd person / singular / past tense.
 * A set of Senses, describing the different meanings of the lexeme (e.g. "financial institution" and "edge of a body of water" for the English noun bank).
 * Multiple Statements to describe properties of the lexeme that are not specific to a form or sense (e.g. "derived from" or "grammatical gender").
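
The components listed above can be pictured as a simple record. The following Python sketch is purely illustrative and is not a binding or serialization of the model: the class name, the field names, the use of plain dictionaries for Forms, Senses and Statements, the ID "L1", the language code "en" and the example lemma "hard" are all assumptions made for this example. It also assumes that "natural number" means a positive integer without leading zeros.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: a Lexeme ID is "L" followed by a positive integer, no leading zeros.
LEXEME_ID_PATTERN = re.compile(r"L[1-9][0-9]*")


@dataclass
class Lexeme:
    id: str                  # hypothetical ID such as "L1"
    lemmas: Dict[str, str]   # spelling per language code (assumed keying)
    language: str            # Item ID, e.g. "Q1860" for English
    lexical_category: str    # Item ID, e.g. "Q34698" for adjective
    forms: List[dict] = field(default_factory=list)
    senses: List[dict] = field(default_factory=list)
    statements: List[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject IDs that do not follow the "L" + number shape.
        if not LEXEME_ID_PATTERN.fullmatch(self.id):
            raise ValueError(f"not a valid Lexeme ID: {self.id!r}")


# A hypothetical English adjective with no Forms, Senses or Statements yet.
hard = Lexeme(
    id="L1",
    lemmas={"en": "hard"},
    language="Q1860",
    lexical_category="Q34698",
)
```

Language and Lexical category are kept as Item IDs rather than labels, mirroring the description of both as references to Items.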

TBD: We may have to add a field for grammatical gender here. It is probably sufficient to model that via a Statement, though.

Lemma
The lemma provides a human readable representation of the lexeme as a whole (see Lemma on Wikipedia). Typically, the canonical form of the lexeme (e.g. the infinitive form of verbs) will be used as the lemma (see also lemon:canonicalForm).

Lemmas are not simple strings, but MultilingualTextValues, since the same lemma may have multiple spellings.

For example, the Lemma for the English noun color would contain "colour" for British English as well as "color" for American English. This is especially important for languages that are written in multiple alphabets, such as Serbian.

Note: Lemmas are not unique, nor is the combination of Lemma, Language, and Lexical category. For example, there are two German nouns with the Lemma "See", differing only in gender: "der See" meaning "the lake", and "die See" meaning "the sea". These two meanings cannot be modeled as the same Lexeme, since based on their gender they differ in morphology, and thus have different forms.
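
As a rough illustration of the two points above, the Python sketch below treats a Lemma as a mapping from a language code to a spelling, and shows that several Lexemes can share the same combination of Lemma, Language and Lexical category. The language codes ("en-gb", "en-us", "de"), the Lexeme IDs "L1" and "L2", the dictionary layout and the "gender" key are assumptions made for this example. Q188 and Q1084 are assumed here to be the Items for German and noun.

```python
from collections import defaultdict

# A Lemma with one spelling per language variant (codes are assumed).
color_lemma = {"en-gb": "colour", "en-us": "color"}

# Two German nouns with the same Lemma, differing only in gender.
# IDs are hypothetical; gender is shown as a plain value for brevity.
lexemes = [
    {"id": "L1", "lemma": {"de": "See"}, "language": "Q188",
     "category": "Q1084", "gender": "masculine"},   # "der See", the lake
    {"id": "L2", "lemma": {"de": "See"}, "language": "Q188",
     "category": "Q1084", "gender": "feminine"},    # "die See", the sea
]

# Group Lexeme IDs by (spelling, language, lexical category).
by_key = defaultdict(list)
for lexeme in lexemes:
    for spelling in lexeme["lemma"].values():
        key = (spelling, lexeme["language"], lexeme["category"])
        by_key[key].append(lexeme["id"])

# The key is shared by both Lexemes, so only the ID identifies a Lexeme.
shared = by_key[("See", "Q188", "Q1084")]
```

Any lookup by Lemma therefore has to be prepared to return more than one Lexeme.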

Form

 * One Representation (the actual string)
 * Multiple Grammatical markers
 * Multiple Statements (e.g. region, period, pronunciation, etc.)
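
A minimal Python sketch of a Form, under the assumption that grammatical markers can be shown as plain strings (in practice they might be references to Items instead). The class name, field names, the helper `find_forms` and the example forms of the English verb "run" are hypothetical and only serve to illustrate the three components listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    representation: str  # the actual string
    grammatical_markers: List[str] = field(default_factory=list)
    statements: List[dict] = field(default_factory=list)


# Hypothetical forms of the English verb "run".
forms = [
    Form("run", ["infinitive"]),
    Form("runs", ["3rd person", "singular", "present tense"]),
    Form("ran", ["past tense"]),
]


def find_forms(candidates: List[Form], *markers: str) -> List[str]:
    """Return the representations of all forms carrying every given marker."""
    wanted = set(markers)
    return [
        form.representation
        for form in candidates
        if wanted <= set(form.grammatical_markers)
    ]
```

For example, `find_forms(forms, "past tense")` selects the form "ran", while calling it without markers returns every representation.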

Sense

 * One Gloss per language (i.e. a definition)
 * TBD: we may have to add a field for a function description (what's the definition of "to"? it doesn't have a definition, just a function)
 * TBD: we may have to add a field for the syntactic frame ("ask for", "ask about", "ask to", "ask out", "ask oneself", ...)
 * Multiple Statements (e.g. translations, synonyms, connotation, register, usage example, refers-to-concept)
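
A minimal Python sketch of a Sense, assuming glosses are keyed by a language code so that each language has at most one gloss. The class name, field names, the `set_gloss` helper, the language codes and the German glosses are hypothetical. The two English glosses are the ones given for the noun bank in the Lexeme section.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    glosses: Dict[str, str] = field(default_factory=dict)  # language code -> gloss
    statements: List[dict] = field(default_factory=list)

    def set_gloss(self, language_code: str, text: str) -> None:
        """Set the gloss for a language, replacing any existing one."""
        self.glosses[language_code] = text


# The two senses of the English noun "bank".
bank_senses = [
    Sense({"en": "financial institution"}),
    Sense({"en": "edge of a body of water"}),
]

# Adding a gloss in another language, then replacing it (hypothetical texts).
bank_senses[0].set_gloss("de", "Geldinstitut")
bank_senses[0].set_gloss("de", "Kreditinstitut")
```

Because the glosses live in a dictionary, the second call for "de" overwrites the first instead of adding a second German gloss.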