WikiPics

From mediawiki.org

WikiPics is a system for searching and navigating images in Wikimedia Commons in the user's native language. A prototype is available at http://toolserver.org/~daniel/wikipics/ and currently covers 8 languages (en, de, fr, it, nl, pl, es, pt).

WikiPics is an ongoing research and development project run by Wikimedia Deutschland e.V.

Functionality

Here's a rough outline of the main use case:

  1. The user selects a language and enters a search term or phrase.
  2. WikiPics finds the possible meanings (topics) for that phrase and lists them. Each meaning is shown with a title and a definition in the user's language (taken from the corresponding Wikipedia article), along with a few images for the topic.
  3. The user picks one of the topics.
  4. WikiPics then shows a detailed view for this topic, with many more pictures and links to related topics (broader, narrower, similar). Images are ranked according to usage on wiki pages and quality markers (valued image, etc.).
  5. The user chooses images or navigates to the detail views of the related topics.
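
The ranking in step 4 could be sketched as follows. This is only an illustration in Python, assuming a simple additive score: the `Image` fields, the `QUALITY_WEIGHT` constant and the sample file names are hypothetical and are not taken from the WikiPics code.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    usage_count: int          # number of wiki pages that use the image
    quality_markers: int = 0  # e.g. 1 if the image is tagged as a valued image


# Assumption: one quality marker counts as much as ten uses on wiki pages.
QUALITY_WEIGHT = 10


def score(image: Image) -> int:
    """Combine usage and quality markers into a single rank value."""
    return image.usage_count + QUALITY_WEIGHT * image.quality_markers


def rank_images(images: list[Image]) -> list[Image]:
    """Return the images ordered from highest to lowest score."""
    return sorted(images, key=score, reverse=True)


images = [
    Image("Jaguar_car.jpg", usage_count=12),
    Image("Jaguar_cat.jpg", usage_count=5, quality_markers=1),
    Image("Jaguar_logo.svg", usage_count=2),
]
ranked = rank_images(images)
for image in ranked:
    print(image.name, score(image))
```

With these sample values the image carrying a quality marker is listed first even though it is used on fewer pages; how the two signals should actually be weighted is an open design choice.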

What WikiPics is not

  • WikiPics is not a full text search engine. In fact, it does not analyse image description pages at all.
  • WikiPics does not translate. Instead, it looks up the abstract "topic" a given phrase refers to, and then finds images for this topic.
  • WikiPics does not use query expansion. Again, WikiPics looks up the abstract "topic" a given phrase refers to, and then finds images for this topic.
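
The difference between topic lookup and translation can be illustrated with a small Python sketch. The two dictionaries below stand in for the real thesaurus and image data; the topic IDs, phrases and file names are invented for the example.

```python
# Hypothetical thesaurus: (language, lowercased phrase) -> topic IDs.
TERMS = {
    ("en", "horse"): ["topic:horse"],
    ("de", "pferd"): ["topic:horse"],
    ("en", "jaguar"): ["topic:jaguar-animal", "topic:jaguar-car"],
    ("de", "jaguar"): ["topic:jaguar-animal", "topic:jaguar-car"],
}

# Hypothetical image lists, keyed by language-independent topic ID.
TOPIC_IMAGES = {
    "topic:horse": ["Horse_in_field.jpg"],
    "topic:jaguar-animal": ["Jaguar_cat.jpg"],
    "topic:jaguar-car": ["Jaguar_car.jpg"],
}


def find_topics(lang: str, phrase: str) -> list[str]:
    """Look up the topics a phrase may refer to in the given language."""
    return TERMS.get((lang, phrase.strip().lower()), [])


def images_for_phrase(lang: str, phrase: str) -> dict[str, list[str]]:
    """Map each candidate topic of the phrase to its images."""
    return {topic: TOPIC_IMAGES.get(topic, []) for topic in find_topics(lang, phrase)}


# The German and the English phrase reach the same topic, hence the same images.
print(images_for_phrase("de", "Pferd"))
print(images_for_phrase("en", "horse"))
```

No phrase is ever rewritten into another language here: images are attached to the topic, so any phrase that resolves to the topic finds them.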

Future

  • Filter by license property (only images that don't require share-alike)
  • Filter by media type (only SVG, only video, etc.)
  • Multi-phrase searches (intersection)
  • Deep search (include subcategories)
  • Light box (basket), download as ZIP

Data Design

WikiPics needs the following information in order to function:

  • all meanings (topics) for a given term/phrase (ranked by likelihood, i.e. frequency)
  • for a given topic:
    • Wikipedia articles/categories in each language (if they exist), as well as the corresponding galleries/categories on Commons
    • definitions in each language (if the corresponding article exists)
    • images for the topic, that is, the images used on the articles/galleries or included in a category that belongs to the topic
    • popularity rank (sum of the in-degrees of all Wikipedia articles for the topic)
  • for each image:
    • quality assessment tags, feature tags
    • problem tags
    • license tags
    • popularity rank (number of wiki pages using it, whether it's in the corresponding Commons category, whether it has quality markers, etc.)
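
One possible shape for this data, sketched as Python dataclasses. All class names, field names and sample values are assumptions made for illustration; the page does not prescribe a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    name: str
    quality_tags: set[str] = field(default_factory=set)   # e.g. "valued image"
    feature_tags: set[str] = field(default_factory=set)
    problem_tags: set[str] = field(default_factory=set)
    license_tags: set[str] = field(default_factory=set)
    used_on_pages: int = 0             # number of wiki pages using the image
    in_commons_category: bool = False  # in the topic's Commons category?


@dataclass
class Topic:
    topic_id: str
    pages: dict[str, str]        # language code -> article/category title
    definitions: dict[str, str]  # language code -> definition text
    in_degree: dict[str, int]    # language code -> incoming links of the article
    images: list[ImageRecord] = field(default_factory=list)

    @property
    def popularity(self) -> int:
        """Sum of the in-degrees of the topic's articles in all languages."""
        return sum(self.in_degree.values())


horse = Topic(
    topic_id="topic:horse",
    pages={"en": "Horse", "de": "Hauspferd"},
    definitions={"en": "A domesticated hoofed mammal."},
    in_degree={"en": 250, "de": 100},
    images=[ImageRecord("Horse_in_field.jpg", used_on_pages=7)],
)
print(horse.popularity)
```

Definitions are stored only for languages in which an article exists, matching the "if the corresponding article exists" condition above.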

Architecture

  1. Build a thesaurus from XML dumps and/or the wiki databases.
  2. Build a Lucene index for finding topics for terms in each language.
  3. Attach the desired attributes (rank, related topics, etc.) to the topics represented in Lucene.
  4. Build a Lucene index for finding images for topics (using the thesaurus together with the wiki database, especially globalimagelinks).
  5. Attach the desired attributes (rank, tags) to the files.
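
The two indexes from steps 2 and 4 can be pictured with a toy Python sketch. Plain dictionaries stand in for Lucene here, and the thesaurus entries, the topic-to-page mapping and the simplified (wiki, page title, file name) usage rows are all hypothetical sample data, not the real schema of the thesaurus or of globalimagelinks.

```python
from collections import defaultdict

# Assumed output of step 1: topic ID -> {language: [terms]}.
THESAURUS = {
    "T1": {"en": ["horse"], "de": ["Pferd"]},
    "T2": {"en": ["jaguar"], "de": ["Jaguar"]},
}

# Assumed mapping of topics to the wiki pages that belong to them.
TOPIC_PAGES = {
    "T1": [("enwiki", "Horse"), ("dewiki", "Hauspferd")],
    "T2": [("enwiki", "Jaguar")],
}

# Simplified image-usage rows: (wiki, page title, file name).
IMAGE_LINKS = [
    ("enwiki", "Horse", "Horse.jpg"),
    ("dewiki", "Hauspferd", "Horse.jpg"),
    ("dewiki", "Hauspferd", "Pferd_Weide.jpg"),
    ("enwiki", "Jaguar", "Jaguar.jpg"),
]


def build_term_index(thesaurus):
    """Step 2: map (language, lowercased term) to the set of matching topics."""
    index = defaultdict(set)
    for topic_id, terms_by_lang in thesaurus.items():
        for lang, terms in terms_by_lang.items():
            for term in terms:
                index[(lang, term.lower())].add(topic_id)
    return index


def build_image_index(topic_pages, image_links):
    """Step 4: map each topic to its images, counting how many pages use each."""
    page_to_topic = {
        page: topic_id for topic_id, pages in topic_pages.items() for page in pages
    }
    index = defaultdict(lambda: defaultdict(int))
    for wiki, title, file_name in image_links:
        topic_id = page_to_topic.get((wiki, title))
        if topic_id is not None:
            index[topic_id][file_name] += 1
    return index


term_index = build_term_index(THESAURUS)
image_index = build_image_index(TOPIC_PAGES, IMAGE_LINKS)
print(term_index[("de", "pferd")])
print(dict(image_index["T1"]))
```

The per-image usage count collected while building the image index is one of the attributes that step 5 would attach to the files.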

Work Packages

  1. Thesaurus Creation (Wiki Mining)
    Already covered by the WikiWord project [1]. It could, however, be rewritten or modified with a focus on the information needed for WikiPics.
  2. Index generation
    Should be straightforward using information from the WikiWord thesaurus and the wiki database.
  3. Search frontend (Special Page)
    A simple frontend (as a MediaWiki extension) for this wouldn't be too hard, but a nice Ajaxy UI would need some more work.
  4. Deployment
    Integration into the Wikimedia infrastructure for Lucene index generation, and activation of the corresponding MediaWiki extension.