Help:MediaSearch

From mediawiki.org

MediaSearch is a new search frontend and backend introduced on Commons. It displays images in a tiled layout, similar to that of web image search engines. Please post any feedback about MediaSearch on its talk page on Commons.

To maximize the chance that a particular image will be found through Special:MediaSearch:

  • Give the file a suitably descriptive title
  • Add appropriate captions, in as many languages as possible, describing the file's contents
  • Write a detailed description covering the subject of the file and any other substantive information
  • Add relevant categories to the file
  • Add detailed depicts statements for what can be seen in the image

Below is a summary of what data is used and how it helps in finding files. Image search mainly draws on two kinds of data:

  1. Full text
  2. Statements and structured data

Full-text search

How

This is the traditional text-based search that has long been in use: if a file's descriptive text contains the search term, that file is a match.

Two factors influence how results are ranked:

  • Term frequency
  • Term position

Term frequency

The search algorithm uses the frequency of the search terms as a basis for judging the relevance of results.

The more often a search term occurs in a document, the more relevant that document is considered (for example: a document that mentions "Mona Lisa" more often than another is likely to be more relevant).

The more often the search term occurs in all documents, the less relevant that term will be (for example: common words like "does" will not contribute much to the score because so many documents have that word).

When searching the wikitext of English Wikipedia for "Mona Lisa", this is a clue that the "Mona Lisa" article (184 matches) is probably a better result than the "Louvre museum" article (7 matches).

On Commons, however, files have short descriptions rather than long articles, and term frequency is not always useful for comparing relevance: a given search term typically occurs only once or twice, and there is little else to tell results apart. For that reason, ranking also takes the position of terms into account.
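The frequency logic described above is essentially TF-IDF scoring. Below is a minimal, hand-rolled sketch for illustration only; the smoothed IDF formula and the toy corpus are assumptions, and the production search backend uses a far more sophisticated ranking function:

```python
import math

def tf_idf(term, doc, corpus):
    """Score one document for one term: raw frequency in the document,
    discounted by how common the term is across the whole corpus."""
    tf = doc.count(term)                              # term frequency
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # rarer terms weigh more
    return tf * idf

docs = [
    ["mona", "lisa", "painting", "mona", "lisa"],  # mentions "mona" twice
    ["louvre", "museum", "mona", "lisa"],          # mentions "mona" once
    ["eiffel", "tower", "paris"],                  # no mention at all
]
scores = [tf_idf("mona", doc, docs) for doc in docs]
# The document repeating the term outranks the one mentioning it once;
# the document without the term scores zero.
```

Common words like "does" appear in almost every document, so their IDF factor stays at its minimum and contributes little to differentiating documents, matching the behavior described above.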

Term position

There are multiple ways to input information about a file. They each contribute to the final relevance score, but in a different way.

Wikitext descriptions are historically considered most important in presenting file information, but they sometimes contain so much information that significant terms often don't stand out as much when it comes to search relevance. Alternatively, they sometimes contain very little information, which gives search little to work with to determine relevance.

For example, details like the author, the place or date a media file was created, the museum it belongs to, or the license it is published under, while important, are often not the terms that people will search for. Furthermore, significant parts of a description are often "contextual" information that does not pertain directly to the main subject.

Also, while descriptions often contain a lot of information that can be very important in order to find the file, it can be hard to make out exactly what the file is about based on the terms in the description alone. Descriptions can be long (and even contain multiple languages and information that’s irrelevant to the search term). In other words, it is hard to determine relevance with descriptions.

Additional data that describes things more succinctly (such as titles, captions, and categories) is often focused on highly specific information, which helps determine what's important in a media file; in other words, this data makes determining relevance easier. This is why the position of terms is important.

For example: when searching for "Mona Lisa," a file that contains "Mona Lisa" in the description alone will usually be ranked lower in search results than one that also includes that term as part of the title and/or caption, and/or has been added to (one of) the Mona Lisa categories.

However, note that duplicating information across fields in wikitext may also have the unintended consequence of lowering frequency-based relevance scores. So be sure to describe the file accurately by adding a relevant title, a detailed description, a caption (ideally in multiple languages), and the appropriate categories, without repeating the same information in multiple places.
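The field-position logic above can be sketched as a weighted sum over metadata fields. The weights below are invented purely for illustration; the real ranking profile is different and tuned separately:

```python
# Hypothetical per-field weights: a match in a succinct, focused field
# (title, caption, category) counts for more than one buried in a
# long description.
FIELD_WEIGHTS = {"title": 3.0, "caption": 2.0, "category": 2.0, "description": 1.0}

def field_score(query, file_fields):
    """Sum the weights of every metadata field that contains the query."""
    q = query.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if q in file_fields.get(field, "").lower()
    )

description_only = {"description": "A 17th-century copy of the Mona Lisa."}
fully_tagged = {
    "title": "Mona Lisa",
    "caption": "The Mona Lisa by Leonardo da Vinci",
    "category": "Mona Lisa",
    "description": "Oil painting in the Louvre.",
}
# The file matching in title, caption, and category outranks the one
# matching only in its description.
```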

Caveats

The full-text search algorithm described above is very good, but it also has some issues, especially in our context:

Language

In a traditional text-based search, users likely don't want to see results in languages other than the one they are searching in (the assumption is that the user wouldn't understand other languages). That's different on Commons, because people are not really looking for the descriptions; they want the file.

So if a user searches for pictures of cars, search would ideally also find and return files that match in other languages, such as "auto" in Dutch or "voiture" in French. But unless every image's descriptions and/or captions have translations for every language, text-based search will not find results in other languages.

An additional issue here is that while some words look the same in multiple languages, they may have different meanings. For example "gift" in English versus German, or "chat" in English as compared to French; these differences in language will return wildly different results in text-based search due to the change in meaning.

Synonyms

Similarly, when searching for a bat in text-based search, search will not find images where bats are referred to only by their scientific name, Chiroptera. The same applies to acronyms, such as NYC when searching for New York City.

Word matches, not concepts

Similarly, a text description might contain a lot more implicit information that simply cannot be captured by scanning wikitext.

A British shorthair is also a cat and a Volvo V40 is a car, but unless their descriptions also explicitly mention cat or car, they won't be found under those terms in a traditional text-based search.

Statements and structured data

Wikidata statements have the potential of solving many of the aforementioned caveats of traditional text-based searches: they are multilingual, have aliases, and are linked to all sorts of related concepts.

How

Thanks to the "Structured data" tab added to file pages, Wikidata entities can now be attached to files, including statements about what a file "depicts".

Given a search term (like "anaconda"), we'll also search Wikidata for relevant entities and take the top results into account.

In addition to the full-text search, results will also include files that carry a depicts statement for (one or more of) these entities. The same goes for "digital representation of" statements, which are used for artworks.

This has the potential of drastically expanding the amount of results returned, because entities already cover synonyms (via Wikidata aliases) and language differences (via labels & aliases in multiple languages): a file only needs to be tagged with one depicts statement per item, and search will be able to find that statement and any of its aliases or translations.

And when translations or aliases get added to those entities later on, files tagged with them will automatically benefit from it by now being discoverable under those terms as well. This is why it’s important to continue to enrich the entities added to depicts statements on Commons with more aliases, labels, and other information on Wikidata.
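How a single depicts statement makes a file findable under all of an entity's labels, aliases, and translations can be sketched like this. The item data below is a tiny hand-copied fragment and the matching logic is illustrative only; a real implementation would look entities up through the Wikidata API rather than a hard-coded dictionary:

```python
# Hand-copied fragment of Wikidata item data, for illustration only.
ENTITIES = {
    "Q144": {  # dog
        "labels": {"en": "dog", "nl": "hond", "fr": "chien"},
        "aliases": {"en": ["Canis familiaris", "domestic dog"]},
    },
    "Q44440": {  # Golden Gate Bridge
        "labels": {"en": "Golden Gate Bridge"},
        "aliases": {"en": ["GGB"]},
    },
}

def matching_entities(term):
    """Return the IDs of entities whose label or alias matches the term,
    in any language."""
    t = term.lower()
    hits = []
    for qid, data in ENTITIES.items():
        names = list(data["labels"].values())
        for alias_list in data["aliases"].values():
            names.extend(alias_list)
        if any(t == name.lower() for name in names):
            hits.append(qid)
    return hits

# A file tagged with a single "depicts: Q144" statement is now findable
# as "dog", "hond", "chien", or "Canis familiaris" with no extra captions.
```

Adding a new label or alias to Q144 on Wikidata would immediately widen what this lookup matches, with no edits to the tagged files themselves.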

Note: not all entities are considered equally in search ranking. When searching for "iris", users are likely expecting to find multimedia that depicts the genus of plants (Q156901), or maybe the part of an eye (Q178748), but probably not Iris Murdoch, the British writer and philosopher (Q217495).

Based on the similarity to the search term and the importance/popularity of the entity, Media Search will boost multimedia with certain entities more than others.

Caveats

Wikidata entities are an excellent signal to help discover additional relevant multimedia:

  • there is less noise (e.g. text descriptions often contain false positives, like "iris" being the first name of the photographer rather than the subject of the file)
  • they contain a lot more information (aliases & translations) than individual file descriptions ever can
  • they can be enriched in one central location (Wikidata)

But they are also a poor indicator for relative ranking:

  • In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant
  • Wikidata has many entities at varying levels of detail
Relative ranking

In a file with multiple depicts statements, it's hard to know which statements are the most important or relevant.

Say a file depicts two different things: are both equally important, or is one of them the obvious subject and the other a less relevant background detail? If so, which? And is a depicts statement on one file more prominent than the same depicts statement on another?

Consider the "Pale Blue Dot" photographs: even though the earth makes up less than a pixel in the image set, it's a significant feature of the images.

Statements essentially only have two states: something is in the file, or it is not. There is no further detail about just how relevant something is in that file.

The “mark as prominent” feature for statements is provided to address some of these issues, but it is not currently being used consistently. Additionally, the use of qualifiers like 'applies to part' could help improve ranking, but those qualifiers are currently rarely used at all on Commons, though they have precedent on Wikidata. For example, on the Wikidata item for Mona Lisa, the depicted elements have 'applies to part' qualifiers that specify foreground or background, which could provide additional signals to the search ranking algorithm if used on Commons.

While depicts statements are tremendously useful in helping surface additional relevant results, it's hard to use them as a ranking signal: textual descriptions often convey the relative importance of subjects better than these simple statements can.

Levels of detail

Wikidata contains a wide range of entities at varying levels of detail. While we are working on enabling search to incorporate such "nested" concepts into its results, entities are weighted cautiously, especially in comparison with full-text search.

As an example, while it is true that a photo could be described using bridge (Q12280), suspension bridge (Q12570), Golden Gate Bridge (Q44440), and tourist attraction (Q570116), the Golden Gate Bridge (Q44440) item itself already links to these related entities.

In practice, however, there are cases that cannot be resolved so simply.

The dog breed German Shepherd (Q38280) is a subclass of dog (Q144), which in turn has pet (Q39201) among its superclasses; in theory, searching for photos of "pets" should therefore find photos tagged German Shepherd.

However, some photos titled "German Shepherd" may actually show a working dog (Q1806324) rather than a pet.
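The subclass reasoning in this section can be sketched as a walk up Wikidata's "subclass of" (P279) chain. The graph below is a tiny hand-built fragment; Wikidata's actual hierarchy is far larger, and, as just noted, taxonomy alone cannot settle whether a given German Shepherd photo actually shows a pet:

```python
# Tiny hand-built fragment of the "subclass of" (P279) hierarchy.
SUBCLASS_OF = {
    "Q38280": ["Q144"],   # German Shepherd -> dog
    "Q144": ["Q39201"],   # dog -> pet (simplified; dog has other superclasses too)
}

def is_subclass_of(item, target):
    """Follow P279 edges upward to see whether `item` falls under `target`."""
    stack = [item]
    seen = set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SUBCLASS_OF.get(current, []))
    return False

# In theory, a search for "pet" (Q39201) could use this traversal to
# surface files tagged with German Shepherd (Q38280).
```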