Help:MediaSearch

From mediawiki.org

Special:MediaSearch is a new search back end and front end for finding files on Commons. Images appear in a shelf-like layout similar to the one used by web image search engines. Feedback for MediaSearch can be left at the talk page on Commons.

To maximize the chances of files being found with Special:MediaSearch:

  • Add a relevant and descriptive title
  • Add relevant captions in as many languages as possible, describing what the file is about
  • Add a detailed description, describing what the file is about and any other relevant context

Below is an overview of the kind of data that is used, and in what way it contributes to finding files. There are two main types of data used for finding files:

  1. Full text
  2. Statements and structured data

Full text search

How

This is traditional text-based search: if text contains the words being searched for, the file matches.

The ranking is influenced in two ways:

  • Frequency of terms
  • Position of terms

Frequency of terms

The search algorithm will try to estimate how relevant a result is based on the frequency of the search terms.

The more often the search terms occur in a document, the more relevant it appears to be (for example: if one document mentions "Mona Lisa" more than another, it's likely more relevant).

The more often the search term occurs in all documents, the less relevant that term will be (for example: common words like "does" will not contribute much to the score because so many documents have that word).

For a "Mona Lisa" search term in wikitext on the English-language Wikipedia, this helps us discover that the "Mona Lisa" article (184 mentions of the term) is likely a better result than the "Louvre museum" article (7 occurrences).
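The two frequency effects described above can be sketched with a plain TF-IDF score. This is only an illustration: the scoring function, the naive tokenizer, and the three-document corpus are assumptions made for this example, and the actual scoring used by Special:MediaSearch is not specified on this page.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def inverse_document_frequency(term, corpus):
    """Terms found in many documents get a lower weight."""
    containing = sum(1 for doc in corpus if term in tokenize(doc))
    # Smoothed so that a term found in no document does not divide by zero.
    return math.log((len(corpus) + 1) / (containing + 1)) + 1.0


def tf_idf_score(query, doc, corpus):
    """Sum of (term frequency in doc) x (inverse document frequency)."""
    counts = Counter(tokenize(doc))
    return sum(
        counts[term] * inverse_document_frequency(term, corpus)
        for term in tokenize(query)
    )


# Hypothetical three-document corpus, for illustration only.
corpus = [
    "mona lisa portrait the mona lisa by leonardo",
    "the louvre museum houses the mona lisa",
    "the museum does open daily",
]

for doc in corpus:
    print(round(tf_idf_score("mona lisa", doc, corpus), 3), doc)
```

In this sketch the first document outranks the second because it mentions both query words twice, while a word such as "the", which occurs in every document, receives the lowest possible weight.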

The problem that applies to Commons, however, is that this frequency often doesn't mean as much when it comes to comparing relevance: these are not long articles, but short descriptions. Terms tend to occur no more than once or twice, and there is little other content to compare them against. That is why we also incorporate the position of terms into the ranking.

Position of terms

There are multiple ways to input information about a file. They each contribute to the final relevance score, but in a different way.

Wikitext descriptions are historically considered most important in presenting file information, but they sometimes contain so much information that significant terms often don't stand out as much when it comes to search relevance. Alternatively, they sometimes contain very little information, which gives search little to work with to determine relevance.

For example, details like the author, the place or date that a media file was created, what museum it belongs to, or what license it is published under are important, but they are often not the terms that people will search for. Furthermore, significant parts of a description are often "contextual" information, not pertaining directly to the main subject.

Also, while descriptions often contain a lot of information that can be very important in order to find the file, it can be hard to make out exactly what the file is about based on the terms in the description alone. Descriptions can be long (and even contain multiple languages and information that’s irrelevant to the search term). In other words, it is hard to determine relevance with descriptions.

Additional data that describes things in a more succinct way (such as titles, captions, categories) is often focused on highly specific information, which helps determine what's important in a media file. In other words, this data makes determining relevance easier. This is why the position of terms is important.

For example: when searching for "Mona Lisa," a file that contains "Mona Lisa" in the description alone will usually be ranked lower in search results than one that also includes that term as part of the title and/or caption, and/or is added to (one of) the Mona Lisa categories.
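A minimal sketch of this idea, assuming each field carries a fixed weight. The weights, the field names, and the two sample files are hypothetical and chosen only to show the effect; the weights actually used by search are not given on this page.

```python
# Hypothetical weights: short, focused fields count for more than descriptions.
FIELD_WEIGHTS = {
    "title": 3.0,
    "caption": 2.5,
    "category": 2.0,
    "description": 1.0,
}


def contains_all_terms(text, query):
    """True if every query word occurs as a word in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())


def position_score(query, fields, weights=FIELD_WEIGHTS):
    """Add up the weight of every field that matches the query."""
    return sum(
        weights.get(name, 0.0)
        for name, text in fields.items()
        if contains_all_terms(text, query)
    )


# Two hypothetical files: one mentions the subject in the description only,
# the other names it in the title, caption, and category.
description_only = {
    "title": "IMG 1234",
    "description": "Crowd of visitors in front of the Mona Lisa",
}
well_described = {
    "title": "Mona Lisa by Leonardo da Vinci",
    "caption": "The Mona Lisa",
    "category": "Mona Lisa",
    "description": "Oil painting on a poplar panel",
}

print(position_score("Mona Lisa", description_only))
print(position_score("Mona Lisa", well_described))
```

The second file scores higher even though its description never mentions the search term, because the term appears in the fields that say most directly what the file is about.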

However, note that duplicating information across fields in wikitext may also have the unintended consequence of lowering frequency-based relevance scores. Be sure to accurately describe the file by adding a relevant title, a detailed description, a caption (ideally in multiple languages), and the appropriate categories, without repeating the same information in multiple places.

Caveats

The aforementioned full-text search algorithm is very good, but it also has some issues, especially in our context:

Language

In a traditional text-based search, users likely don't want to see results in languages other than the one they are searching in (the assumption is that the user wouldn't understand other languages). That's different on Commons, because people are not really looking for the descriptions; they want the file.

So if a user searches for pictures of cars, ideally search would also find and return files that match in other languages, such as "auto" in Dutch or "voiture" in French. But unless every image's descriptions and/or captions have translations for every language, text-based search will not find results in other languages.

An additional issue here is that while some words look the same in multiple languages, they may have different meanings. For example, "gift" means different things in English and German, as does "chat" in English and French; a text-based search for such a word will return wildly different results depending on the language of the matching text.

Synonyms

Similarly, when searching for "bat" in text-based search, search will not find images where bats are referred to by their scientific name, Chiroptera. The same applies to acronyms: searching for "New York City" will not find files that only mention "NYC".

Word matches, not concepts

Similarly, a text description might contain a lot more implicit information that simply cannot be captured by scanning wikitext.

A British shorthair is also a cat and a Volvo V40 is a car, but unless their descriptions also explicitly mention cat or car, they won't be found under those terms in a traditional text-based search.

Statements and structured data

Wikidata statements have the potential of solving many of the aforementioned caveats of traditional text-based searches: they are multilingual, have aliases, and are linked to all sorts of related concepts.

How

Since the addition of the "Structured data" tab on file pages, it has been possible to attach Wikidata entities to a file, including statements about what the file "depicts."

Given a search term (like "anaconda"), we'll also search Wikidata for relevant entities. In this case, here are some of the top results:

  • Anaconda (Q483539): town in Montana
  • Eunectes (Q188622): genus of snakes

In addition to full text matching, search will also include results that have a "depicts" statement for one or more of these entities. It will also include results that have a "digital representation of" statement, which is used for artwork.

This has the potential of drastically expanding the number of results returned, because entities already cover synonyms (via Wikidata aliases) and language differences (via labels and aliases in multiple languages): a file only needs to be tagged with one depicts statement per item, and search will be able to find that statement through any of the item's labels, aliases, or translations.

And when translations or aliases get added to those entities later on, files tagged with them will automatically benefit from it by now being discoverable under those terms as well. This is why it’s important to continue to enrich the entities added to depicts statements on Commons with more aliases, labels, and other information on Wikidata.
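The lookup described above can be sketched in two steps: match the search term against the labels and aliases of each entity, then return the files whose depicts statements point at a matching entity. The entity IDs are the ones from the "anaconda" example; the labels, aliases, file names, and function names are simplified stand-ins invented for this sketch, not a copy of the Wikidata items or of the search implementation.

```python
# Illustrative entity data. The IDs come from the "anaconda" example; the
# labels and aliases are simplified stand-ins, not real Wikidata content.
ENTITIES = {
    "Q483539": {"labels": {"en": "Anaconda"}, "aliases": {}},
    "Q188622": {
        "labels": {"en": "Eunectes"},
        "aliases": {"en": ["anaconda", "anacondas"]},
    },
}

# Hypothetical files mapped to the entities in their "depicts" statements.
DEPICTS = {
    "File:Green snake in river.jpg": {"Q188622"},
    "File:Main street in Montana.jpg": {"Q483539"},
    "File:Unrelated.jpg": set(),
}


def entity_names(entity):
    """Yield every label and alias of an entity, in every language."""
    yield from entity["labels"].values()
    for alias_list in entity["aliases"].values():
        yield from alias_list


def find_entities(term, entities=ENTITIES):
    """Return IDs of entities with a label or alias equal to the term."""
    wanted = term.casefold()
    return {
        qid
        for qid, entity in entities.items()
        if any(name.casefold() == wanted for name in entity_names(entity))
    }


def find_files(term, depicts=DEPICTS):
    """Return files with a depicts statement for any matching entity."""
    matched = find_entities(term)
    return sorted(name for name, qids in depicts.items() if qids & matched)


print(find_files("anaconda"))
```

Because the lookup goes through the entity rather than the file, appending a new alias to an entry in `ENTITIES` makes every file tagged with that entity findable under the new term, without touching the files themselves.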

Note: not all entities are considered equally in search ranking. When searching for "iris", users are likely expecting to find multimedia that depicts the genus of plants (Q156901), or maybe the part of an eye (Q178748), but probably not Iris Murdoch, the British writer and philosopher (Q217495).

Based on the similarity to the search term and the importance/popularity of the entity, MediaSearch will boost multimedia with certain entities more than others.
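One way such a boost could be computed is sketched below. The formula, the use of `difflib` for string similarity, and the popularity numbers are all assumptions made for illustration; this page does not describe how the real boost is calculated or how popularity is measured.

```python
import math
from difflib import SequenceMatcher


def similarity(term, label):
    """Rough string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, term.casefold(), label.casefold()).ratio()


def entity_boost(term, label, popularity):
    """Hypothetical boost: closer and more popular entities score higher."""
    # log1p dampens popularity so that very popular entities do not dominate.
    return similarity(term, label) * (1.0 + math.log1p(popularity))


# Candidate entities for the search term "iris". The IDs come from the text;
# the labels are simplified and the popularity values are placeholders,
# not real statistics.
candidates = [
    ("Q156901", "Iris", 120),
    ("Q178748", "iris", 80),
    ("Q217495", "Iris Murdoch", 60),
]

for qid, label, popularity in candidates:
    print(qid, label, round(entity_boost("iris", label, popularity), 2))
```

With equal popularity, a label that matches the search term exactly receives a larger boost than a longer label that merely contains it, which is the behaviour the "iris" example calls for.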

Caveats

Wikidata entities are an excellent signal to help discover additional relevant multimedia:

  • There is less noise (e.g. text descriptions often contain false positives, like "iris" being the first name of the photographer, not the subject of the file).
  • They contain a lot more information (aliases and translations) than individual file descriptions ever can.
  • They can be enriched in one central location (Wikidata).

But they are also a poor indicator for relative ranking:

  • In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant
  • Wikidata has many entities at varying levels of detail

Relative ranking

In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant.

Are both equally important, or is one of them the obvious subject and the other a less relevant background detail? If so, which? Is a depicts statement on one file more prominent than the same depicts statement on another?

Consider the "Pale Blue Dot" photographs: even though the Earth makes up less than a pixel in the image set, it's a significant feature of the images.

Statements essentially only have two states: something is in the file, or it is not. There is no further detail about just how relevant something is in that file.

The “mark as prominent” feature for statements is provided to address some of these issues, but it is not currently being used consistently. Additionally, the use of qualifiers like 'applies to part' could help improve ranking, but those qualifiers are currently rarely used at all on Commons, though they have precedent on Wikidata. For example, on the Wikidata item for Mona Lisa, the depicted elements have 'applies to part' qualifiers that specify foreground or background, which could provide additional signals to the search ranking algorithm if used on Commons.

While depicts statements are tremendously useful in helping surface additional relevant results, it's hard to use them as a ranking signal: textual descriptions often convey the relative importance of subjects better than these simple statements can.

Level of detail

Wikidata has many entities at varying levels of detail. While we are currently working towards being able to include "child concepts" in search results, it’s important to be careful in the weight we give to certain entities, especially when compared to full text search.

For example, the statements bridge (Q12280), suspension bridge (Q12570), Golden Gate Bridge (Q44440) or tourist attraction (Q570116) could probably all be used to describe a picture of the Golden Gate Bridge, but the Golden Gate Bridge (Q44440) statement already implies all of the others via its various related entities.

However, there are examples where it's not this simple.

German Shepherd dog (Q38280) is a subclass of dog (Q144), which is a subclass of pet (Q39201). In theory, we should be able to find pictures tagged with "German Shepherd dog" when one searches for "pet."

However, some photos tagged as "German Shepherd dog" likely depict working dogs (Q1806324), not pets.
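The "child concept" lookup and its pitfall can be sketched as a walk up the subclass links. The map below contains only the two links mentioned in the example above; the function names are hypothetical, and the real implementation is not described on this page.

```python
# "Subclass of" links taken from the example above (child -> parents).
SUBCLASS_OF = {
    "Q38280": ["Q144"],   # German Shepherd dog -> dog
    "Q144": ["Q39201"],   # dog -> pet
}


def ancestors(qid, subclass_of=SUBCLASS_OF):
    """Return every concept reachable by following subclass links upward."""
    seen = set()
    stack = [qid]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, []):
            if parent not in seen:  # guards against cycles in the data
                seen.add(parent)
                stack.append(parent)
    return seen


def matches_concept(depicted_qid, searched_qid):
    """True if the depicted entity is the searched one or a descendant of it."""
    return depicted_qid == searched_qid or searched_qid in ancestors(depicted_qid)


# A file tagged "German Shepherd dog" (Q38280) matches a search for "pet" (Q39201).
print(matches_concept("Q38280", "Q39201"))
```

The traversal treats every file tagged "German Shepherd dog" as a match for "pet", including photos of working dogs, because the hierarchy says nothing about the individual animal in the picture. This is why results found through child concepts need a carefully chosen weight.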