MediaSearch is a new search frontend and backend introduced on Commons. It displays images in a tiled layout, much like web image search engines. Feedback about MediaSearch can be posted on its Commons talk page.
The more often the search term occurs in all documents, the less relevant that term will be (for example: common words like "does" will not contribute much to the score because so many documents have that word).
For example, when searching the wikitext of English Wikipedia for "Mona Lisa", this helps determine that the "Mona Lisa" article (184 hits) is probably a better result than the "Louvre museum" article (7 hits).
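The idea above (terms that occur everywhere contribute little; rare terms contribute a lot) is classic TF-IDF weighting. The following is a minimal sketch with a toy corpus and the textbook formula; production engines such as Elasticsearch use BM25, a refinement of the same idea:

```python
import math

def tf_idf(term, doc, corpus):
    """Score one document for one term: term frequency weighted by
    how rare the term is across the corpus (inverse document frequency)."""
    tf = doc.count(term)
    docs_with_term = max(1, sum(1 for d in corpus if term in d))
    # A term that appears in every document gets an IDF of zero.
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Toy corpus: a common word scores zero, a distinctive one scores high.
corpus = [
    ["the", "mona", "lisa", "painting"],
    ["the", "louvre", "museum"],
    ["the", "eiffel", "tower"],
]
print(tf_idf("the", corpus[0], corpus))   # common word → 0.0
print(tf_idf("mona", corpus[0], corpus))  # rare word → ≈1.10
```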
There are multiple ways to input information about a file. They each contribute to the final relevance score, but in a different way.
Wikitext descriptions have historically been considered the most important way of presenting file information, but they sometimes contain so much information that significant terms don't stand out when it comes to search relevance. At the other extreme, they sometimes contain very little information, which gives search little to work with to determine relevance.
For example, details like the author, the place or date a media file was created, the museum it belongs to, or the license it is published under are important, but they are often not the terms people search for. Furthermore, significant parts of a description are often "contextual" information that does not pertain directly to the main subject.
Also, while descriptions often contain a lot of information that can be very important in order to find the file, it can be hard to make out exactly what the file is about based on the terms in the description alone. Descriptions can be long (and even contain multiple languages and information that’s irrelevant to the search term). In other words, it is hard to determine relevance with descriptions.
Additional data that describes things more succinctly (such as titles, captions, and categories) is often focused on highly specific information, which helps determine what's important in a media file; in other words, this data makes determining relevance easier. This is why the position of terms matters.
For example: when searching for "Mona Lisa," a file that contains "Mona Lisa" in the description alone will usually be ranked lower in search results than one that also includes that term as part of the title and/or caption, and/or is added to (one of) the Mona Lisa categories.
However, note that duplicating information across wikitext fields may also have the unintended consequence of lowering frequency-based relevance scores. So be sure to describe the file accurately by adding a relevant title, a detailed description, a caption (ideally in multiple languages), and the appropriate categories, without repeating the same information in multiple places.
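One way to picture the per-field weighting described above: a file's score can be thought of as a weighted sum of the fields that match, with succinct fields like title and caption boosted more than the free-form description. The field names and weights below are illustrative assumptions for the sketch, not MediaSearch's actual configuration:

```python
# Illustrative field boosts (assumed values): succinct fields
# outweigh the free-form description.
FIELD_WEIGHTS = {"title": 3.0, "caption": 2.0, "category": 2.0, "description": 1.0}

def field_score(term, file_fields):
    """Sum a weight for every field that contains the search term."""
    term = term.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in file_fields.get(field, "").lower()
    )

described_only = {"description": "A photo of the Mona Lisa in the Louvre."}
well_tagged = {
    "title": "Mona Lisa",
    "caption": "The Mona Lisa by Leonardo da Vinci",
    "category": "Mona Lisa",
    "description": "A photo of the Mona Lisa in the Louvre.",
}
print(field_score("mona lisa", described_only))  # 1.0 (description only)
print(field_score("mona lisa", well_tagged))     # 8.0 (all four fields)
```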
The aforementioned full-text search algorithm is very good, but it also has some issues, especially in our context:
In traditional text-based search, users likely don't want to see results in languages other than the one they are searching in (the assumption is that the user wouldn't understand other languages). That's different on Commons, because people are not really looking for the descriptions: they want the file.
So if a user searches for pictures of cars, ideally search would also find and return files that match in other languages, such as "auto" in Dutch or "voiture" in French. But unless every image's descriptions and/or captions have translations for every language, text-based search will not find results in other languages.
An additional issue here is that while some words look the same in multiple languages, they may have different meanings. For example "gift" in English versus German, or "chat" in English as compared to French; these differences in language will return wildly different results in text-based search due to the change in meaning.
Similarly, when searching for a bat in text-based search, search will not find images where bats are referred to by their scientific name, Chiroptera. The same applies to acronyms, such as NYC when searching for New York City.
- Word matches, not concepts
Similarly, a text description might contain a lot more implicit information that simply cannot be captured by scanning wikitext.
A British shorthair is also a cat and a Volvo V40 is a car, but unless their descriptions also explicitly mention cat or car, they won't be found under those terms in a traditional text-based search.
Statements and structured data
Wikidata statements have the potential of solving many of the aforementioned caveats of traditional text-based searches: they are multilingual, have aliases, and are linked to all sorts of related concepts.
Given a search term (like "anaconda"), we'll also search Wikidata for relevant entities. In this case, here are some of the top results:
- Anaconda (Q483539): town in Montana
- Eunectes (Q188622): genus of snakes
- "Anaconda" (Q17485058): Nicki Minaj song
This has the potential of drastically expanding the number of results returned, because entities already cover synonyms (via Wikidata aliases) and language differences (via labels & aliases in multiple languages): a file only needs to be tagged with one depicts statement per item, and search will be able to find that statement via any of its aliases or translations.
And when translations or aliases get added to those entities later on, files tagged with them will automatically benefit from it by now being discoverable under those terms as well. This is why it’s important to continue to enrich the entities added to depicts statements on Commons with more aliases, labels, and other information on Wikidata.
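A toy illustration of how depicts statements sidestep the language problem: match the query against entity labels and aliases (in any language), then return files tagged with the matching item. The entity table, alias sets, and file names below are hand-made for the example; real lookups go through Wikidata:

```python
# Hand-made stand-in for Wikidata labels/aliases (normally fetched live;
# the alias sets here are illustrative, not complete).
ENTITIES = {
    "Q1420": {"car", "automobile", "auto", "voiture"},  # car + Dutch/French
    "Q28425": {"bat", "chiroptera"},                    # bat + scientific name
}

# Files are tagged with depicts (P180) item IDs, not with words.
FILES = {
    "Car_in_Paris.jpg": {"Q1420"},
    "Fruit_bat.jpg": {"Q28425"},
}

def entity_search(query):
    """Find files whose depicts statements match the query via any
    label or alias, regardless of the language the query is in."""
    query = query.lower()
    matching_items = {qid for qid, terms in ENTITIES.items() if query in terms}
    return sorted(f for f, tags in FILES.items() if tags & matching_items)

print(entity_search("voiture"))     # French query still finds the car photo
print(entity_search("chiroptera"))  # scientific name finds the bat photo
```

Adding a new alias to the entity makes every file tagged with it discoverable under that term, without touching any file description.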
Note: not all entities are considered equally in search ranking. When searching for "iris", users are likely expecting to find multimedia that depicts the genus of plants (Q156901), or maybe the part of an eye (Q178748), but probably not Iris Murdoch, the British writer and philosopher (Q217495).
Based on the similarity to the search term and the importance/popularity of the entity, MediaSearch will boost multimedia with certain entities more than others.
Wikidata entities are an excellent signal to help discover additional relevant multimedia:
- there is less noise (e.g. text descriptions often contain false-positives like "iris" being the first name of the photographer, not the subject of the file).
- they contain a lot more information (aliases & translations) than individual file descriptions ever can.
- they can be enriched in one central location (Wikidata)
But they are also a poor indicator for relative ranking:
- In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant
- Wikidata has many entities at varying levels of detail
- Relative ranking
In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant.
Are they all equally important, or is one of them the obvious subject and the others less relevant background details? If so, which? Is a depicts statement on one file more prominent than the same depicts statement on another?
Consider the "Pale Blue Dot" photographs: even though the earth makes up less than a pixel in the image set, it's a significant feature of the images.
Statements essentially only have two states: something is in the file, or it is not. There is no further detail about just how relevant something is in that file.
The “mark as prominent” feature for statements is provided to address some of these issues, but it is not currently used consistently. The use of qualifiers like 'applies to part' could also help improve ranking, but those qualifiers are rarely used on Commons, though they have precedent on Wikidata. For example, on the Wikidata item for Mona Lisa, the depicted elements have 'applies to part' qualifiers that specify foreground or background, which could provide additional signals to the search ranking algorithm if used on Commons.
While depicts statements are tremendously useful in helping surface additional relevant results, it's hard to use them as a ranking signal: textual descriptions often convey the relative importance of subjects better than these simple statements can.
The dog breed German Shepherd (Q38280) is a subclass of dog (Q144), which in turn has pet (Q39201) as a superclass. In theory, a search for pictures of "pet" should therefore also find pictures tagged with German Shepherd.
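In principle, that chain can be followed by walking the subclass-of (P279) hierarchy upward from a file's depicts tags. A sketch over a hand-made fragment of that hierarchy (assumed local data, not a live Wikidata query):

```python
# Tiny hand-made fragment of Wikidata's subclass-of (P279) graph.
SUBCLASS_OF = {
    "Q38280": {"Q144"},   # German Shepherd → dog
    "Q144": {"Q39201"},   # dog → pet
}

def ancestors(qid):
    """All items reachable by repeatedly following subclass-of links."""
    seen, stack = set(), [qid]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def matches(query_qid, depicts_qid):
    """A file tagged with depicts_qid matches a query for query_qid if the
    tag is the query item itself or a (transitive) subclass of it."""
    return depicts_qid == query_qid or query_qid in ancestors(depicts_qid)

print(matches("Q39201", "Q38280"))  # True: German Shepherds are pets
```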