Topic on Extension talk:CirrusSearch

Suggestion:Expose all useful file metadata to search engine

197.218.80.182 (talkcontribs)

While there have been great advancements in the image metadata that has been exposed (e.g. filewidth:, filetype:), some very useful metadata is still missing.

For example, one can't search for:

Video or audio of a certain "playtime"

{
      "name": "playtime_seconds",
      "value": 113.72532879819
},

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:%22FZ_Side_E_-_Trail%22_by_Disasterpeace.wav

Framecount, looped images, duration

{
      "name": "frameCount",
      "value": 16
},
{
      "name": "looped",
      "value": true
},
{
      "name": "duration",
      "value": 15
}

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Missing_square_edit.gif

Frame rate and creation date

 "bandwidth": 204608,
 "framerate": 15

https://www.mediawiki.org/w/api.php?action=query&titles=File:Folgers.ogv&prop=videoinfo&viprop=derivatives

https://www.mediawiki.org/w/api.php?action=query&titles=File:Folgers.ogv&prop=imageinfo&iiprop=metadata

In some cases the location where an image was taken may be very relevant. When it is stored in the file's Exif data, it also appears in the metadata of some files.

The use cases are numerous. For instance, when writing an article about World War One, one may want to filter images from that period. When looking for videos to add to a page, one may want short animations that showcase a concept (e.g. a moving hurricane) and not be interested in very long videos. The same applies to animated images: in some cases they illustrate a concept better than still images and in some cases they don't, so it might be good to be able to filter for or against them.

Generally it might be good to evaluate what the API exposes, and to surface the most useful metadata.
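As a rough illustration of evaluating what the API exposes, the sketch below builds the same kind of imageinfo query as the URLs above and pulls the named fields out of a response, then filters by playtime on the client side. The function names, the file title File:Example.wav and the trimmed sample response are made up for illustration, and the JSON nesting (query, pages, imageinfo, metadata) is assumed from the API's default JSON output. It says nothing about how the search index itself would store these fields.

```python
from urllib.parse import urlencode


def metadata_query_url(api_base, title):
    """Build an imageinfo metadata query like the URLs quoted above."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def extract_metadata(response):
    """Map each page title to a {name: value} dict of its file metadata."""
    result = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        fields = {}
        for info in page.get("imageinfo", []):
            # "metadata" may be null for some files, hence the "or []".
            for item in info.get("metadata") or []:
                fields[item["name"]] = item["value"]
        result[page.get("title", "")] = fields
    return result


def shorter_than(metadata_by_title, max_seconds):
    """Titles whose playtime_seconds is present and at most max_seconds."""
    return [
        title
        for title, fields in metadata_by_title.items()
        if isinstance(fields.get("playtime_seconds"), (int, float))
        and fields["playtime_seconds"] <= max_seconds
    ]


# Hypothetical, trimmed response; the field values are the ones quoted above.
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "File:Example.wav",
                "imageinfo": [{"metadata": [
                    {"name": "playtime_seconds", "value": 113.72532879819},
                ]}],
            },
            "2": {
                "title": "File:Missing square edit.gif",
                "imageinfo": [{"metadata": [
                    {"name": "frameCount", "value": 16},
                    {"name": "looped", "value": True},
                    {"name": "duration", "value": 15},
                ]}],
            },
        }
    }
}

metadata = extract_metadata(sample)
short_clips = shorter_than(metadata, 120)
```

Filtering like this only works after fetching every candidate file, which is exactly why having the fields available in the search engine would be preferable.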

197.218.80.182 (talkcontribs)

It might also be a good idea to expose metadata from the CommonsMetadata API, which includes interesting data such as:

  • GPSLatitude - latitude
  • GPSLongitude - longitude
  • LicenseShortName - short human-readable license name
  • LicenseUrl
  • DateTimeOriginal

Extension:CommonsMetadata
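For reference, these fields are requested through the same imageinfo module with iiprop=extmetadata. The sketch below assumes each field arrives as an object with a "value" key. The helper names and the sample data are hypothetical and are not taken from a real file.

```python
from urllib.parse import urlencode

# The fields listed above.
WANTED = [
    "GPSLatitude",
    "GPSLongitude",
    "LicenseShortName",
    "LicenseUrl",
    "DateTimeOriginal",
]


def extmetadata_query_url(api_base, title):
    """Build an imageinfo query that asks for the extmetadata property."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def pick_fields(extmetadata, wanted=WANTED):
    """Return {field: value} for the wanted fields that are present."""
    return {
        name: extmetadata[name]["value"]
        for name in wanted
        if name in extmetadata and "value" in extmetadata[name]
    }


# Hypothetical extmetadata object for a single file (assumed shape).
sample_extmetadata = {
    "LicenseShortName": {"value": "CC BY-SA 4.0"},
    "LicenseUrl": {"value": "https://creativecommons.org/licenses/by-sa/4.0"},
    "GPSLatitude": {"value": "52.5"},
    "ObjectName": {"value": "Example"},
}

picked = pick_fields(sample_extmetadata)
```

Fields that a file does not carry (here GPSLongitude and DateTimeOriginal) are simply absent from the result, so any search integration would need to treat them as optional.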

197.218.80.182 (talkcontribs)
CKoerner (WMF) (talkcontribs)

I created a task for your specific request. This might overlap with the Structured data on Commons project, but it might not! Hopefully we can get some of the engineers/managers to take a look at it.

(Bonus points for using a Disasterpeace clip in your example. I'm a big fan). :)

EBernhardson (WMF) (talkcontribs)

We can certainly expand the amount of metadata included, but limitations of the supporting server we use (Elasticsearch) prevent us from including all the arbitrary metadata that is possible. Thanks for bringing up a few specific pieces of metadata that are useful. Anyone wishing to expand the list of explicit metadata to be included is welcome to do so on the ticket.

197.218.81.64 (talkcontribs)

> limitations of the supporting server we use (elasticsearch) prevents us from including all the arbitrary metadata that is possible

Sure, that's why it mentions "all useful". Adding the whole dump of metadata isn't useful, as regular readers wouldn't need some of it even if it were exposed, and it may mostly benefit editors.

To give a bit more supporting evidence, there are already tools that have been waiting for years for this to become available, see:

https://phabricator.wikimedia.org/T51662

You can gauge its usefulness by evaluating these sites:

Search parameter      flickr  Pixabay  Google  Youtube  IA*  Cirrussearch
Resolution            Yes     Yes      Yes     No       No   Yes
Category / tag        Yes     Yes      No      Yes      Yes  Yes
License               Yes     N/A      Yes     No       Yes  No
Location              No      No       No      No       Yes  No
Date taken (created)  Yes     No       No      No       Yes  No
Upload date           Yes     No       Yes     Yes      Yes  No
Color                 Yes     Yes      Yes     No       No   No
Author                No      No       No      No       Yes  No
Uploader              No      No       No      No       Yes  No
File type             Yes     Yes      Yes     Yes      Yes  Yes
Orientation           Yes     Yes      No      No       No   No
Duration              No      No       No      Yes      Yes  No
Sort by upload date   Yes     No       No      Yes      Yes  No

IA* = Internet Archive
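The comparison can also be tallied mechanically. The sketch below transcribes the table above by hand (so any transcription slip is mine) and counts the "Yes" cells per site. The helper names are illustrative only, and "N/A" is counted as not supported.

```python
SITES = ["flickr", "Pixabay", "Google", "Youtube", "IA", "Cirrussearch"]

# One row per search parameter, one cell per site, in SITES order.
TABLE = {
    "Resolution":           ["Yes", "Yes", "Yes", "No",  "No",  "Yes"],
    "Category / tag":       ["Yes", "Yes", "No",  "Yes", "Yes", "Yes"],
    "License":              ["Yes", "N/A", "Yes", "No",  "Yes", "No"],
    "Location":             ["No",  "No",  "No",  "No",  "Yes", "No"],
    "Date taken (created)": ["Yes", "No",  "No",  "No",  "Yes", "No"],
    "Upload date":          ["Yes", "No",  "Yes", "Yes", "Yes", "No"],
    "Color":                ["Yes", "Yes", "Yes", "No",  "No",  "No"],
    "Author":               ["No",  "No",  "No",  "No",  "Yes", "No"],
    "Uploader":             ["No",  "No",  "No",  "No",  "Yes", "No"],
    "File type":            ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
    "Orientation":          ["Yes", "Yes", "No",  "No",  "No",  "No"],
    "Duration":             ["No",  "No",  "No",  "Yes", "Yes", "No"],
    "Sort by upload date":  ["Yes", "No",  "No",  "Yes", "Yes", "No"],
}


def supported_counts(table, sites):
    """Number of search parameters each site supports ("Yes" cells only)."""
    return {
        site: sum(1 for cells in table.values() if cells[i] == "Yes")
        for i, site in enumerate(sites)
    }


def missing_for(table, sites, site):
    """Search parameters that the given site does not support."""
    i = sites.index(site)
    return [param for param, cells in table.items() if cells[i] != "Yes"]


counts = supported_counts(TABLE, SITES)
missing = missing_for(TABLE, SITES, "Cirrussearch")
```

With the values as transcribed, this gives 10 of 13 parameters for IA, 9 for flickr, 5 each for Pixabay, Google and Youtube, and 3 for Cirrussearch.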

Sample rate might be useful for editors, as requested by the VisualEditor developers. But clearly, despite having more raw metadata, the wikis are behind the other popular sites. License in particular is very important and is missing from the search.

Clearly the Internet Archive is the winner hands down, despite probably having fewer resources than some of the other entities in that list. Its tooling is something to strive for.

CKoerner (WMF) (talkcontribs)

This is hands-down the most well-formatted feature request I've seen. You even did a comparison between other sites and formatted that in a table. Bravo and thank you for taking the time to do so.
