Extension talk:WikibaseMediaInfo/RDF mapping

Mapping from filenames to MediaObjects

Dipsacus fullonum (talkcontribs)

The example

<https://commons.wikimedia.org/wiki/File:Boat_movie.webm> a schema:Article ;
     schema:about sdoc:M222222 ;
     schema:isPartOf <https://commons.wikimedia.org/> .

should give a way to get from a filename to a MediaObject. But in WCQS there are no triples with the predicate schema:about and no objects of type schema:Article.

Is the description here outdated, or are the triples missing in WCQS?

What is the recommended way to get a MediaObject if you have the filename? You can go the opposite way with schema:contentUrl by taking the last part of the URI and decoding it, but you cannot construct the content URL from the filename.
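The "opposite way" described above (content URL to filename) can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the directory segments in the usage example are made up, and it assumes the last path segment of the content URL is the percent-encoded file name with underscores standing in for spaces.

```python
from urllib.parse import unquote, urlsplit


def filename_from_content_url(content_url: str) -> str:
    """Return the file name encoded in the last path segment of a content URL.

    Assumption: the last path segment is the percent-encoded file name,
    with underscores in place of spaces (the usual MediaWiki title form).
    """
    path = urlsplit(content_url).path
    last_segment = path.rsplit("/", 1)[-1]
    # Decode percent-escapes, then map underscores back to spaces.
    return unquote(last_segment).replace("_", " ")
```

For example, `filename_from_content_url("https://upload.wikimedia.org/wikipedia/commons/x/xy/Mapa_de_Palomino.jpg")` yields `Mapa de Palomino.jpg` (the `x/xy` directories are placeholders, not the real path).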

Lucas Werkmeister (WMDE) (talkcontribs)

Hm, strange. The only way I found to get the entity is this terrible hack (do not use) to get the page ID via MWAPI:

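# Look up the page ID of the file page through the MediaWiki API.
SELECT * WHERE {
  SERVICE wikibase:mwapi {
    # Restrict the allpages generator to exactly one title in the File namespace (6).
    bd:serviceParam wikibase:endpoint "commons.wikimedia.org";
                    wikibase:api "Generator";
                    mwapi:generator "allpages";
                    mwapi:gapfrom "Mapa de Palomino.jpg";
                    mwapi:gapto "Mapa de Palomino.jpg";
                    mwapi:gapnamespace "6".
    ?pageId wikibase:apiOutput "@pageid".
  }
  # The MediaInfo entity ID is "M" followed by the page ID.
  BIND(IRI(CONCAT(STR(sdc:M), ?pageId)) AS ?entity)
}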


I think I’ll leave the better answering of this to the WCQS team :)
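The last step of the query above, turning a page ID into an entity URI, can be sketched outside SPARQL as well. This is a minimal Python illustration, not a recommended method: the function name is hypothetical, and the entity namespace is an assumption taken from the namespace suggested elsewhere on this page (with https).

```python
# Assumption: MediaInfo entity URIs live under this namespace.
ENTITY_NAMESPACE = "https://commons.wikimedia.org/entity/"


def mediainfo_entity_uri(page_id: int) -> str:
    """Build a MediaInfo entity URI as "M" followed by the file page's ID."""
    if page_id <= 0:
        raise ValueError("page_id must be a positive integer")
    return f"{ENTITY_NAMESPACE}M{page_id}"
```

With the page ID from the first example on this page, `mediainfo_entity_uri(222222)` gives `https://commons.wikimedia.org/entity/M222222`.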


Link between file description page and MediaInfo entity

GZWDer (talkcontribs)

I don't think we should reuse schema:about to link the file description page and the MediaInfo entity, as it is already used by sitelinks. In theory, a file description page may have its own item (though that is not allowed on Wikidata), which would make schema:about ambiguous.

Tpt (talkcontribs)

That's indeed a good point. I believe the relation between a file description page and its MediaInfo entity is roughly the same as the one between, e.g., Wikipedia articles and items. So I don't think it is bad to reuse the same relation, but using a more specialized one might indeed make querying easier. Do you have an idea of another relation, from schema.org or elsewhere, to use instead?

Summary by Lucas Werkmeister (WMDE)

our use of schema:caption is slightly more general than the one of schema.org, but this seems acceptable; improvement suggested on GitHub

Lucas Werkmeister (WMDE) (talkcontribs)

While I agree that schema:caption seems like a good predicate to use, the schema.org folks seem to have defined it a bit oddly: it’s not used on the general MediaObject class, but only on the AudioObject, ImageObject and VideoObject classes. A MusicVideoObject or a DataDownload, for example, apparently shouldn’t have a caption.

With the extended representation, I guess that should be fine, because each MediaInfo would be an instance of AudioObject, ImageObject or VideoObject. (Well… what happens for other media types?) But in the basic representation, it’s only an instance of MediaObject… is that an issue?

Marsupium (talkcontribs)

Before this ticket schema:caption was only for schema:VideoObject apparently. Maybe it could be expanded further?

Tpt (talkcontribs)

Hey @Lucas Werkmeister (WMDE). Sorry for the late answer. Thank you very much for raising this point. I believe that schema.org allows extending the scope of properties. I also proposed using schema:numberOfPages, which is meant to be used on Books, on multi-page files.

There are some media types for which there seems to be no good MediaObject subclass, for example PDFs or DjVus. I believe that in this case the easiest thing to do, at least for now, is to use only the MediaInfo class. What do you think about it?

I have opened a ticket on the schema.org GitHub about expanding the scope of schema:caption.

Lucas Werkmeister (WMDE) (talkcontribs)

That makes sense to me, thanks for your answer :)

Smalyshev (WMF) (talkcontribs)

Looks good to me. One thing that seems to be missing is the link to the file URL - either the wiki page or the actual media file (or maybe both?).

Smalyshev (WMF) (talkcontribs)

Oops, missed extended part - that looks good, I just wonder if we want the wiki page too - it probably can be derived from content URL (or not?) but may be useful to have it explicitly maybe?

Tpt (talkcontribs)

Thank you for having taken a look at it. Indeed having the wikipage would be great. What about using the same structure as item sitelinks?

<https://commons.wikimedia.org/wiki/File:Wikidata_time-latitude_visualization_-_2016-10-24.png> a schema:Article ;
     schema:about wd:M22222 ;
     schema:isPartOf <https://commons.wikimedia.org/> ;
     schema:name "Wikidata time-latitude visualization - 2016-10-24.png"@und .

I also missed some important points:

  1. We need a URI prefix for MediaInfo entity URIs. We should probably use the namespace http://commons.wikimedia.org/entity/ but we need a name for the prefix.
  2. It would be nice to have a relation between the MediaInfo entity and the file name. schema:name seems to be the right schema.org property for that, but it's already used for item labels, which are more similar to MediaInfo captions.
  3. In my proposal, none of the entity URI, the wikipage URI, or the URI of the file itself is the one used for the commons media datatype, which makes it very hard to, e.g., run a query on both MediaInfo and the Wikidata "image" property. We could change the URI of the file in my proposal to be the one used by the commons media datatype, but that is not very nice because it is not the actual final file URL (the URI the commons media datatype uses redirects to it). I believe it would be better to introduce a normalized value for the commons media datatype that would give the URI of the MediaInfo entity. Generating it would require a SQL query to the images metadata table. Tpt (talk) 12:33, 8 March 2019 (UTC)
Smalyshev (WMF) (talkcontribs)

Not sure we need schema:name - does it add anything really? It's basically repeating the URL.

  1. Note that the /entity/ URL requires some redirect setup too - look at how it works on Wikidata. But good point that Commons entities live on Commons, so they can't use Wikidata URLs.
  2. I think schema:about already links page to URL, which essentially is the file name, so not sure it's necessary. Depends on use cases - as string manipulation slows things down, it ultimately depends on queries we're going to need. Also note that filenames, unlike most article names, tend to be long, so duplicating them would have non-negligible performance costs (we have 50M of them!).
  3. That's an excellent point, we probably want to harmonize this one way or another. Ideally, sitelink URL and commons media URL should be the same.
Tpt (talkcontribs)

Thank you for your feedback!

Indeed, schema:name does not add any extra information. But I believe it is nice to have it on the sitelink description or the file description because it is highly likely that external tools using Commons content rely on it. What we could do is leave it out of the first version and see if there are requests for it. What do you think about it?

  1. Indeed. Thanks!
  2. Ok. At least there is no point in having it in both the sitelink node and the MediaInfo node. Do you have a preference for the prefix, so that I can update the example?
  3. The commons media URL currently points to a special page that redirects to the file itself (the target of schema:contentUrl in my proposal). So it would be a big breaking change. If you plan to do that, I would tend to prefer pointing to the MediaInfo entity, because it's the root of the structured data description of the file and so enables more interesting queries without needing an extra triple pattern in the query. But, with the current way data is stored, it would make the RDF dump generation more costly, so it's maybe not a good idea.
  4. Another minor point. What do you think the datatype of the value of schema:duration should be? xsd:integer or xsd:duration?
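To make the two options in the last point concrete: with xsd:integer the value would be a plain number of seconds, while xsd:duration uses the ISO 8601 lexical form (for example PT1H2M5S). A minimal Python sketch of the conversion, assuming the duration is known as a whole number of seconds (the function name is hypothetical and nothing here reflects what the extension actually emits):

```python
def seconds_to_xsd_duration(total_seconds: int) -> str:
    """Format a non-negative number of seconds as an xsd:duration literal."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if minutes:
        result += f"{minutes}M"
    # Always emit at least one component so that zero becomes "PT0S".
    if seconds or result == "PT":
        result += f"{seconds}S"
    return result
```

An integer is easier to compare and sum in queries, whereas the duration form carries its unit in the value itself.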
Reply to "File URL?"
Jheald (talkcontribs)

This looks like a very useful piece of work.

One thought that occurred to me -- should we make any provision for information about previous image revisions?

Some applications (e.g. georeferencing) may have produced data related to a particular revision of the image. Is it worth storing any information about the old image (e.g. dimensions), or about any best-effort transformation to map its content to the current image or the reverse?

Smalyshev (WMF) (talkcontribs)

I am not sure it makes sense to mix historical data with current data. I think for now it makes sense to focus on representing the current state, and deal with history later, if the need arises.

Tpt (talkcontribs)

Thank you for your review! Indeed, it would be useful to be able to query data from previous revisions, but that is not something specific to image metadata. It would be nice to be able to query all data from previous revisions. I have made a first attempt at it for Wikidata item contents. This system does not care about how the actual content is represented (items, properties, mediainfo...) as long as it is RDF triples from a MediaWiki revision. So generalizing it, or another approach, to Commons should not be hard and should cover the use case you are mentioning.

Reply to "Old image revisions?"