User:Tpt/WikibaseMediaInfo RDF Dump Format

From MediaWiki.org

This is an experiment in representing MediaInfo Wikibase entities in RDF.

It extends the Wikibase RDF model and proposes to reuse the schema.org vocabulary as much as possible. Schema.org is already used extensively in the Wikibase RDF representation and provides a substantial set of properties and types for media content. This proposal also aims to be consistent with the Wikibase Lexeme RDF model.

Basic representation

This section proposes a "basic" representation of the MediaInfo entities, aiming at providing a full mapping of the entity data but without information derived from other sources (MediaWiki file metadata...).

Example:

wd:M222222 a wikibase:MediaInfo , schema:MediaObject ;
     # caption
     schema:caption "a boat"@en ;
     rdfs:label "a boat"@en ;

     # statements
     wdt:P2 wd:Q3 ;
     wdt:P7 "value1" , "value2" ;
     p:P2 wds:Q3-4cc1f2d1-490e-c9c7-4560-46c3cce05bb7 ;
     p:P7 wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37 ,
          wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c .

Comments:

Classes
The MediaInfo concept of Wikibase aligns well with schema:MediaObject. Having a class wikibase:MediaInfo would be convenient for consistency with the other entity types (wikibase:Item, wikibase:Lexeme...). It would be meaningful to state wikibase:MediaInfo rdfs:subClassOf schema:MediaObject in the ontology definition.
Caption
The closest schema.org relation is schema:caption, which has the advantage of sharing its name with the Wikibase feature and of being specific to media content. It would allow writing SPARQL queries that look for media files based on their caption without having to filter out other entities, e.g. Wikidata items. It is also worth adding rdfs:label to the RDF output (but probably not to the query service) for interoperability, similarly to what has been done for lexemes.
Statements
For consistency and simplicity we could use the same schema as the other entity types.
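
As an illustration of the caption mapping described above, here is a minimal Python sketch that serializes the captions of one entity to Turtle, emitting both schema:caption and rdfs:label. The function names, the with_labels switch and the escaping helper are hypothetical and not part of Wikibase; the prefixes are the ones used in the example above.

```python
from typing import Mapping


def escape_turtle_literal(text: str) -> str:
    """Escape characters that cannot appear raw in a quoted Turtle string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def caption_turtle(entity_id: str, captions: Mapping[str, str],
                   with_labels: bool = True) -> str:
    """Serialize the captions of one MediaInfo entity (hypothetical helper).

    `captions` maps a language code to the caption text. When `with_labels`
    is true, each caption is duplicated as rdfs:label, as proposed for the
    RDF output.
    """
    lines = [f"wd:{entity_id} a wikibase:MediaInfo , schema:MediaObject"]
    for lang, text in sorted(captions.items()):
        literal = f'"{escape_turtle_literal(text)}"@{lang}'
        lines.append(f"     schema:caption {literal}")
        if with_labels:
            lines.append(f"     rdfs:label {literal}")
    # Predicates of the same subject are separated by ";" and closed by ".".
    return " ;\n".join(lines) + " .\n"


if __name__ == "__main__":
    print(caption_turtle("M222222", {"en": "a boat"}))
```

Passing with_labels=False would give the variant without rdfs:label, which is what the query service might load if the duplicated labels are not wanted there.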

Extended representation

This section proposes to extend the basic representation with other metadata already stored in the MediaWiki database to enable more SPARQL queries. Some of the properties proposed here only apply to some file types and should not appear on the other files.

Example (all properties are displayed here, even though some, like schema:numberOfPages and schema:duration, would never appear together):

wd:M222222 a wikibase:MediaInfo , schema:MediaObject , schema:VideoObject ;
     # basic file metadata
     schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/f/f7/Boat_movie.webm> ; # URL to the file itself
     schema:encodingFormat "video/webm" ; # File mime type
     schema:contentSize 123445 ; # File size in bytes
     schema:height 1024 ; # Image/video height in px
     schema:width 2048 ; # Image/video width in px
     schema:duration "PT123S"^^xsd:duration ; # Video duration
     schema:numberOfPages 12 ; # Number of pages in a multi-page document

     # caption
     schema:caption "a boat sailing"@en ;
     rdfs:label "a boat sailing"@en ;

     # statements
     wdt:P2 wd:Q3 ;
     wdt:P7 "value1" , "value2" ;
     p:P2 wds:Q3-4cc1f2d1-490e-c9c7-4560-46c3cce05bb7 ;
     p:P7 wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37 ,
          wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c .

<https://commons.wikimedia.org/wiki/File:Boat_movie.webm> a schema:Article ;
     schema:about wd:M222222 ;
     schema:isPartOf <https://commons.wikimedia.org/> .

Comments:

Classes
In addition to the schema:MediaObject and wikibase:MediaInfo classes, we could add the classes schema:AudioObject, schema:ImageObject and schema:VideoObject to allow easy querying of only images, audio or video files. These classes would be assigned based on the MediaWiki media type returned by File::getMediaType().
schema:contentUrl
would provide the direct canonical URL of the file itself. Could be provided by File::getFullUrl().
schema:encodingFormat
would provide the MIME type of the file, making it possible to query only files of a given MIME type, compute statistics based on it, and so on. Could be provided by File::getMimeType().
schema:contentSize
would provide the size of the file in bytes. This would allow statistics on file size joined with data stored in statements (e.g. the total size of all the uploads from a given partnership). Could be provided by File::getSize().
schema:height and schema:width
would provide the height and width of the file if it is an image or a video. Could be provided by File::getHeight() and File::getWidth().
schema:duration
would provide the duration of a video. Could be provided by File::getLength(). We need to choose whether to use the xsd:duration datatype, as suggested by schema.org, or just an integer containing the number of seconds.
schema:numberOfPages
would provide the number of pages of a multi-page file. Could be provided by File::pageCount(). Using this property here is a slight abuse, since in schema.org it is supposed to be used on schema:Book.
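
The mapping discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the keys of the meta dictionary are hypothetical stand-ins for the values returned by the File methods, the media type strings (BITMAP, DRAWING, AUDIO, VIDEO) are assumed to be what File::getMediaType() returns, and treating both BITMAP and DRAWING as schema:ImageObject is one possible choice, not a decision made in this proposal. The duration is emitted as xsd:duration, one of the two options mentioned above.

```python
from typing import Any, List, Mapping, Tuple

# Assumed mapping from MediaWiki media type strings to schema.org classes.
# Mapping both BITMAP and DRAWING to images is an illustrative choice.
SCHEMA_CLASS_BY_MEDIA_TYPE = {
    "BITMAP": "schema:ImageObject",
    "DRAWING": "schema:ImageObject",
    "AUDIO": "schema:AudioObject",
    "VIDEO": "schema:VideoObject",
}


def seconds_to_xsd_duration(seconds: float) -> str:
    """Format a non-negative number of seconds as an xsd:duration lexical form."""
    if seconds < 0:
        raise ValueError("duration must not be negative")
    # Keep at most millisecond precision and drop trailing zeros.
    text = f"{seconds:.3f}".rstrip("0").rstrip(".")
    return f"PT{text}S"


def extended_statements(meta: Mapping[str, Any]) -> List[Tuple[str, str]]:
    """Return (predicate, object) pairs for one file (hypothetical helper).

    Properties that do not apply to the file type are simply left out.
    """
    classes = ["wikibase:MediaInfo", "schema:MediaObject"]
    extra = SCHEMA_CLASS_BY_MEDIA_TYPE.get(meta.get("media_type", ""))
    if extra:
        classes.append(extra)

    mime = meta["mime"]
    pairs = [
        ("a", " , ".join(classes)),
        ("schema:contentUrl", f"<{meta['url']}>"),
        ("schema:encodingFormat", f'"{mime}"'),
        ("schema:contentSize", str(meta["size"])),
    ]
    if meta.get("height") and meta.get("width"):
        pairs.append(("schema:height", str(meta["height"])))
        pairs.append(("schema:width", str(meta["width"])))
    if meta.get("duration"):
        duration = seconds_to_xsd_duration(meta["duration"])
        pairs.append(("schema:duration", f'"{duration}"^^xsd:duration'))
    if meta.get("page_count"):
        pairs.append(("schema:numberOfPages", str(meta["page_count"])))
    return pairs


if __name__ == "__main__":
    example = {
        "media_type": "VIDEO",
        "url": "https://upload.wikimedia.org/wikipedia/commons/f/f7/Boat_movie.webm",
        "mime": "video/webm",
        "size": 123445,
        "height": 1024,
        "width": 2048,
        "duration": 123.0,
    }
    for predicate, obj in extended_statements(example):
        print(predicate, obj)
```

If the integer option were chosen for the duration instead, only seconds_to_xsd_duration and the datatype suffix would need to change.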