User:Johang~mediawikiwiki/Most popular related articles

Comments and feedback are welcome.

Identity
 * Name: Johan Gunnarsson
 * Email: johan.gunnarsson@gmail.com
 * Project title: Most popular related articles

Contact/working info
 * Timezone: CEST (UTC +2)
 * Typical working hours: 10:00-18:00
 * IRC or IM networks/handle(s): johang@freenode (IRC)

Abstract
This project aims to resolve bug 21921, which discusses how to encourage contributions to Wikipedia and its sister projects. The bug reporter proposes adding a sidebar listing the N most popular pages related to the page currently being viewed. This would introduce another way of navigating to articles the reader is likely to be interested in, and is therefore more likely to contribute to. Ranking by popularity helps draw attention to articles about events happening now.

Implementation details
Anything integrated into Wikipedia has strict performance requirements and must scale to its massive traffic. The front-end of my application (i.e. the list of links) must be as static and cacheable as possible. My plan to accommodate this is to pregenerate as much as possible in batch jobs running on an external server (such as Toolserver), and then include the result in MediaWiki through an extension.

SkinBuildSidebar looks like a useful hook for injecting a new sidebar section into MediaWiki. Several existing extensions use this hook and can serve as useful references.
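
The pregeneration step could look roughly like the sketch below. It is only an illustration under assumptions: the function name `write_related_files`, the one-JSON-file-per-page layout and the file naming scheme are hypothetical and not part of any existing Toolserver or MediaWiki API. The idea is that a batch job writes small static files which a web server can then serve and cache without running any ranking code per request.

```python
import json
from pathlib import Path
from urllib.parse import quote


def write_related_files(related, out_dir):
    """Write one small JSON file per page so it can be served statically.

    `related` maps a page title to its already ranked list of related titles.
    The layout (one file per page) is an assumption made for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for title, links in related.items():
        # Hypothetical naming scheme: spaces become underscores as in wiki
        # URLs, and everything else unsafe (such as "/") is percent-encoded.
        name = quote(title.replace(" ", "_"), safe="") + ".json"
        path = out / name
        # sort_keys keeps the output byte-identical between runs when the
        # data has not changed, which is friendly to HTTP caches.
        payload = json.dumps({"page": title, "related": links}, sort_keys=True)
        path.write_text(payload, encoding="utf-8")
        written.append(path)
    return written
```

A sidebar extension or client script would then only need to fetch the file for the current page and render its `related` list.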

About me
I'm a student in Computer Science and Engineering at Lund Institute of Technology, Lund, Sweden. I'm in my final year, currently working on my Master's thesis project and hopefully graduating this summer.

Relevant experience

 * Participant in Google Summer of Code 2007 for GNU phpGroupWare.
 * Author of Wikitrends, a data crunching project inspired by Google Trends and Twitter Trends that finds the pages with the greatest uptrend on Wikipedia right now. I am working on moving the project to Toolserver.
 * I have a toolserver.org account.
 * Fluent in computers and code.

Required deliverables

 * 1) Batch processing system to find and rank related pages using data sources such as wikilinks, categories, edit counts and page views. Probably to run at Toolserver as a batch job.
 * 2) System to serve related pages to clients. Probably to run at Toolserver as a web application.
 * 3) Client that fetches related pages and injects them into the Wikipedia article layout. This could take different forms: client-side JavaScript, a Greasemonkey script or an extension to MediaWiki.
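
As a rough illustration of deliverable 1, the sketch below ranks the pages a given page links to by their view counts and keeps the top N. It is one possible starting point rather than the chosen algorithm: the function `top_related` and the in-memory dictionaries are hypothetical stand-ins for data that a real batch job would read from database dumps and page view logs.

```python
def top_related(page, wikilinks, page_views, n=5):
    """Return up to n pages linked from `page`, most viewed first.

    wikilinks:  dict mapping a title to the titles it links to (assumed input)
    page_views: dict mapping a title to a view count (assumed input)
    """
    # Treat every outgoing wikilink as a candidate, except the page itself.
    candidates = set(wikilinks.get(page, ())) - {page}
    # Sort by descending views; ties are broken alphabetically so the
    # result is deterministic between batch runs.
    ranked = sorted(candidates, key=lambda t: (-page_views.get(t, 0), t))
    return ranked[:n]


links = {"Lund": ["Sweden", "Lund University", "Scania", "Lund"]}
views = {"Sweden": 900, "Lund University": 300, "Scania": 300}
print(top_related("Lund", links, views, n=2))
```

Pages missing from `page_views` are treated as having zero views, so they sort last instead of raising an error.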

If time permits

 * 1) Investigate different ways of finding related articles. One way could be to combine different sources, like categories and wikilinks, with weights. It would also be interesting to walk further down the wikilink/category graph. Articles in subcategories could be counted as related too, although this would most likely be more computationally intensive.
 * 2) Which articles count as related is of course subjective and depends on the reader. If time permits, I could produce sets of related articles using different algorithms and ask people which set they think is better.
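
Combining sources with weights, as mentioned in item 1, could be sketched as follows. The source names, the weights and the function `combined_scores` are assumptions made for illustration only; finding good weights is exactly what the investigation would have to determine.

```python
from collections import defaultdict


def combined_scores(sources, weights):
    """Combine per-source relatedness scores into one ranked list.

    sources: dict mapping a source name (e.g. "wikilinks") to a dict of
             candidate title -> score from that source (assumed input)
    weights: dict mapping a source name to its weight (assumed input)
    """
    scores = defaultdict(float)
    for name, per_page in sources.items():
        # A source without a configured weight contributes nothing.
        weight = weights.get(name, 0.0)
        for title, value in per_page.items():
            scores[title] += weight * value
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


sources = {
    "wikilinks": {"A": 1.0, "B": 0.0},
    "categories": {"A": 0.0, "B": 1.0, "C": 1.0},
}
print(combined_scores(sources, {"wikilinks": 2.0, "categories": 1.0}))
```

Running the same candidates through several weightings would also give the alternative sets needed for the comparison described in item 2.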

Project schedule
My project has four milestones:


 * Research, investigate and choose a few candidate algorithms for finding and ranking related articles.
 * Evaluate and decide which algorithm suits this project the best.
 * Implement the chosen algorithm in a scalable way.
 * Integrate with MediaWiki by building an extension that presents the list of links in the article sidebar.

Mockups

 * Mockup of UI element in the sidebar of a regular Wikipedia page.