User:Johang~mediawikiwiki/Most popular related articles
Comments and feedback are welcome.
Name: Johan Gunnarsson
Project title: Most popular related articles
Timezone: CEST (UTC +2)
Typical working hours: 10:00-18:00
IRC or IM networks/handle(s): johang@freenode (IRC)
This project aims to resolve bug 21921, which discusses how to encourage contributions to Wikipedia and its sister projects. The bug reporter proposes adding a sidebar listing the N most popular pages related to the page currently being viewed. This would introduce another way of navigating to articles the user is likely to be interested in, and therefore more likely to contribute to. Ranking by popularity helps bring attention to articles about events happening right now.
Anything integrated into Wikipedia has strict performance requirements and must be able to scale because of its massive traffic. The front-end of my application (i.e. the list of links) must be as static and cacheable as possible. My plan to accommodate this is to pregenerate as much as possible in batch jobs running on an external server (such as Toolserver), and then include the result in MediaWiki through an extension.
I'm a student in Computer Science and Engineering at Lund Institute of Technology, Lund, Sweden. I'm in my last year, currently working on my Master's thesis project and hopefully graduating this summer.
- Participant of Google Summer of Code 2007 for GNU phpGroupWare.
- Author of Wikitrends. Data crunching project inspired by Google Trends and Twitter Trends to find the pages with greatest uptrend on Wikipedia right now. Working on moving the project to Toolserver.
- I have a toolserver.org account.
- Fluent in computers and code.
- Batch processing system to find and rank related pages using data sources such as wikilinks, categories, edit counts and page views. Probably to run at Toolserver as a batch job.
- System to serve related pages to clients. Probably to run at Toolserver as a web application.
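A minimal sketch of what the batch step could look like, assuming related pages are taken from outgoing wikilinks only and ranked by page views. The function name `top_related` and the shape of the input dictionaries are hypothetical and chosen for illustration; the actual data sources and algorithm are still to be decided as part of the project.

```python
def top_related(links, views, n=5):
    """Return, for each page, its n most-viewed linked pages.

    links: dict mapping a page title to an iterable of titles it links to
           (hypothetical input format).
    views: dict mapping a page title to a recent page-view count
           (hypothetical input format).
    """
    result = {}
    for page, targets in links.items():
        # Drop duplicates and self-links before ranking.
        candidates = set(targets) - {page}
        # Sort by view count (descending), then by title, so the
        # output is deterministic when counts are equal.
        ranked = sorted(candidates, key=lambda t: (-views.get(t, 0), t))
        result[page] = ranked[:n]
    return result


if __name__ == "__main__":
    links = {"Lund": ["Sweden", "Lund University", "Lund"]}
    views = {"Sweden": 900, "Lund University": 300}
    print(top_related(links, views, n=2))
```

Because the result is a plain mapping from page title to a short list of titles, it could be written out once per batch run and served as static data, which fits the caching requirement described earlier.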
If time permits
- Investigate different ways of finding related articles. One way could be to combine different sources, like categories and wikilinks, with weights. It would also be interesting to walk further down the wikilink/category graph. Articles in subcategories could be counted as related too, although this would most likely be more computationally intensive.
- Which articles count as related is of course subjective and depends on the reader. If time permits, I could produce sets of related articles using different algorithms and ask people which set they think is better.
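The idea of combining sources with weights could be sketched as follows. This is only an illustration under stated assumptions: the weights `w_link` and `w_cat`, the use of shared-category counts, and the logarithmic damping of page views are all hypothetical choices, not results of the planned evaluation.

```python
import math


def related_scores(page, wikilinks, categories, views, w_link=1.0, w_cat=0.5):
    """Score candidate pages related to `page` (illustrative only).

    wikilinks:  dict title -> iterable of linked titles (assumed format).
    categories: dict title -> iterable of category names (assumed format).
    views:      dict title -> recent page-view count (assumed format).
    """
    scores = {}

    # Source 1: a direct wikilink contributes w_link.
    for target in set(wikilinks.get(page, ())):
        if target != page:
            scores[target] = scores.get(target, 0.0) + w_link

    # Source 2: each shared category contributes w_cat.
    page_cats = set(categories.get(page, ()))
    for other, cats in categories.items():
        if other == page:
            continue
        shared = len(page_cats & set(cats))
        if shared:
            scores[other] = scores.get(other, 0.0) + w_cat * shared

    # Scale relatedness by popularity; log1p dampens very large view
    # counts so that a few huge pages do not dominate every list.
    return {t: s * math.log1p(views.get(t, 0)) for t, s in scores.items()}


def rank(scores, n=5):
    """Return the n highest-scoring titles, ties broken by title."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]
```

Different weight settings would yield the different candidate lists that readers could then be asked to compare.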
My project has four milestones.
- Research, investigate and choose a few candidate algorithms for finding and ranking related articles.
- Evaluate and decide which algorithm suits this project the best.
- Implement the chosen algorithm in a scalable way.
- Integrate with MediaWiki by building an extension that presents the list of links in the article sidebar.