Google Summer of Code/2012

MediaWiki is participating in Google Summer of Code (GSoC) 2012. Sumana Harihareswara is managing MediaWiki's participation in GSoC and the backup administrator is Gregory Varnum.

Read the GSoC FAQ and the student guide to GSoC.

Relevant dates

 * Student applications begin March 26.
 * Student application deadline is April 6.
 * Announcement of accepted students: April 23.
 * Students start their full-time work: May 21.

Timeline for reference.

Student applications
Google will open student applications on March 26th. Dozens of students apply every year and we accept fewer than ten of them, so this is a selective program. You can choose one of the project ideas listed below, or invent your own project idea. (Here are tips on writing a proposal.) Write up a proposal following our proposal guidelines on a subpage of your userpage (example), and start discussing it with us.

To increase your chances of being accepted, start the intro steps to learn MediaWiki. Also do the diff and patch and git training missions, and get an account for our Git repository. If you have any trouble at all, please talk with us on IRC. If you're a bit shy and don't want to ask your question of the whole room, look for varnent or sumanah.

Also try to start contributing: fix an annoying little bug. This experience will help you write a good proposal that fits our guidelines, and improve your credibility as an applicant.

We will strongly prefer students who can demonstrate the ability to work with our existing codebases, communicate well, and participate actively in our development community even after August is over.

Spread the word
Please also help us publicize GSoC at your school!


 * A leaflet about student MediaWiki contribution to print out and hang on the wall
 * A quarter-page flyer about MediaWiki contribution, to hand out
 * A leaflet about GSoC to hang up or pass out

Project ideas
Applicants will need to write good proposals either based on their own ideas or expanding thoughtfully on the ideas below.

Wikipedia Corpus Tools
Goal: automate the extraction and cleanup of Wikipedia text in the leading languages into a corpus, and choose a suitable corpus format.
 * Deliverables:
 * a framework for handling different languages
 * trained sentence chunkers
 * integrated part-of-speech (POS) tagging
 * POS-tagged Wikipedia dumps
 * a heuristic for handling spam revisions (discard them or build a spam corpus)

Mentor: Oren Bochman

Lucene Lemma Analyzers based on Morphology Extraction from Wikipedia Text
Mentor: Oren Bochman
 * 1) Use and expand morphology induction software to process existing languages.
 * 2) Make a Lucene plugin to normalize search on a lemma level based on the above data.

Lucene Automatic Query Expansion from Wikipedia Text
Mentor: Oren Bochman
 * 1) Data-mine multiple Wiktionary dumps to build a highly multilingual word net.
 * 2) Make a Lucene filter which uses such a wordnet to expand search terms.
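
As a rough illustration of the query expansion idea, here is a minimal Python sketch. It is not Lucene code: the function name `expand_query` and the tiny `TOY_WORDNET` mapping are assumptions made up for this example, and a real word net would be mined from Wiktionary dumps.

```python
from typing import Dict, List, Set


def expand_query(terms: List[str], wordnet: Dict[str, Set[str]]) -> List[str]:
    """Return each term followed by its related words, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in terms:
        key = term.lower()
        # Keep the original term first, then its related words in sorted order.
        for candidate in [key, *sorted(wordnet.get(key, set()))]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Hypothetical toy word net, for illustration only.
TOY_WORDNET: Dict[str, Set[str]] = {
    "car": {"automobile", "voiture"},
    "fast": {"quick", "rapid"},
}
```

A real filter would also need to weight expanded terms lower than the terms the user actually typed, which this sketch does not attempt.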

Create a way to have “books” for wikisource/wikibooks
Component: New extension (or improve BookManager/Collection)

Expected results: A way to edit and manage groups of pages (books) efficiently, and a basic interface for reading books.

Short explanation: This implies we will be able to have (at least) these much-wanted features for editing and managing books, including the ability to “watch” books (sets of pages, or a category) instead of just single pages. It also means creating a basic interface for reading a book (or something a little better than basic).

Prerequisites: PHP, ajax, database, GUI Design is a plus.

Mentor signup: Raylton P. Sousa

Questions & suggestions:
 * Aashish Mittal is interested.
 * Does this imply that you want a feature like the one in Google Books, where we can preview a book? Or does this mean a watchlist feature for books, like the one we have for single pages? - chughakshay16
 * I believe this implies we will be able to have (at least) these much wanted features. See also Extension:BookManager. Helder 23:43, 8 February 2012 (UTC)
 * I believe that the long-time dreamed of idea that people could have watchlist "folders" is similar to this. Relatedly there is the idea that multiple people could share the same watchlist so they can collectively manage articles (this would be very useful for Wikiproject groups). Wittylama (talk) 23:35, 19 March 2012 (UTC)

"Who's been awesome?"
Functionality requested: after a user has made 100 edits to a wiki, show a link somewhere in the sidebar or top navigation asking, "who's been awesome?" The link takes the user to a page where she can specifically name other users of that wiki to praise them for their help and work. The data regarding users who have been named and thanked could be available publicly via an API, and potentially feed into the MoodBar dashboard or some other public venue or it could be kept private. (I want to use it to give away free merchandise for example ;) ) Suggestion from James Alexander, who offers to mentor.
 * Eranga Mapa is interested.

Backwards-compatibility extension
This would be a place to park deprecated features so that they don't clutter up the core software, but are still available for use by extensions that aren't in our repository. The extension would need to be bundled with the MediaWiki installer, and enabled by default (which is pretty easy as of 1.18). The initial version could declare deprecated global functions (e.g. wfUILang, wfViewPrevNext, wfDoUpdates, wfCreateObject, wfStreamFile, wfLoadExtensionMessages, and wfOut) and global variables (e.g. $wgArticle). The next phase can be about figuring out how to cleanly extend classes to include deprecated methods. Bonus feature would be to expose use of deprecated features in the UI for a logged-in admin. -- RobLa-WMF 20:23, 12 February 2012 (UTC)
 * Siebrand Mazeland is willing to mentor this.

OpenStackManager work
Extension:OpenStackManager has various defects and needs various enhancements -- Ryan Lane will mentor.

Take RefToolbar to the next level
RefToolbar is used by many WMF wikis. It needs usability improvements and we would like to see someone turn it into an extension. You would convert RefToolbar into an extension - it currently lives in en.wiki Common.js, so isn't really a "gadget" anymore, and needs extensionification! Ryan Kaldari wants this. See New_Editor_Engagement/Smaller_issues.
 * Shouldn't it be made a global gadget instead? Once we have global gadgets... Max Semenik (talk) 15:03, 20 March 2012 (UTC)

Integrate "upload from Flickr"
Add Flickr integration to UploadWizard. The UploadWizard should have an interface for automatically transferring an image from Flickr, including the image's metadata and license. There is already some code written for getting the license of an image on Flickr and translating it into a Commons license template. Suggestion by Ryan Kaldari.
 * Nitesh Kumar is interested.

Taxobox editing/management interface
Create a usable interface for editing and managing the automatic taxobox. This would probably be either a JavaScript gadget or a MediaWiki extension. Suggestion by Ryan Kaldari.

Green SMW
Component: Semantic MediaWiki (core)

Expected results: Reducing the carbon footprint of Semantic MediaWiki

Short explanation: You can think of good software performance as not just about scalability or fast user experience, but about energy consumption - so optimizing SMW can help the environment! SMW is "wasting energy" in many places. Data is stored even if it has not changed, in-memory caching facilities are ignored, expensive special pages do not cache results at all. This project is about improving this in as many places as possible. It will be split into smaller subtasks that can be solved independently. Participants should enjoy optimization and be willing to familiarize themselves with existing code. Extensive guidance will be provided.

''This project relates to all features of SMW and involves different technologies (caching, profiling, database optimization, etc.). Participants must have a strong interest in learning new techniques, understanding existing code, and finding new solutions.''
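
To make the "data is stored even if it has not changed" point concrete, here is a minimal Python sketch of a store that skips redundant writes. It does not reflect SMW's actual storage code: the class `ChangeAwareStore` and its in-memory dictionaries are assumptions invented for illustration.

```python
import hashlib
import json
from typing import Any, Dict


class ChangeAwareStore:
    """Toy key-value store that skips writes when the data is unchanged."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._digests: Dict[str, str] = {}
        self.writes = 0  # counts writes that actually reached the backend

    @staticmethod
    def _digest(values: Any) -> str:
        # sort_keys makes the digest independent of dictionary ordering.
        payload = json.dumps(values, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def save(self, page: str, values: Any) -> bool:
        """Store values for a page; return True only if a write happened."""
        digest = self._digest(values)
        if self._digests.get(page) == digest:
            return False  # nothing changed, so skip the expensive write
        self._data[page] = values
        self._digests[page] = digest
        self.writes += 1
        return True
```

Whether a digest comparison is cheaper than the write it avoids depends on the backend, so any real change of this kind should be checked with a profiler first.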

Prerequisites: quick grasp of complex software and technology, programming experience, basic knowledge of PHP, optional plus: knowledge of performance enhancing technologies (caches, profilers, database tools, etc.)

Mentor signup: Markus Krötzsch
 * Nischay Nahata is interested.
 * Yipeng Huang is interested.

SMW query management and smart updates
Component: SMW (core)

Expected results: New capabilities added to SMW

Short explanation: Query management is a proposed addition to the capabilities of Semantic MediaWiki that would allow automatic updating of queries and gathering of query statistics. This would work by storing query metadata as semantic properties, which can then be queried. Query management would allow automatic updating of query results when their source data is modified. This ensures up-to-date query results everywhere, without the need for more resource-intensive solutions like disabling the cache or rebuilding all pages via a cron job. This automatic updating is made possible by storing query dependencies among the query metadata. Query management would allow you to query various things about query usage, such as where queries are located, how many dependencies they have, how long or expensive they are, the time of their last update, etc. With this information you can get a better overview of how queries are used across your wiki and pinpoint inefficient usage.
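
The dependency idea can be sketched in a few lines of Python. This is only an illustration under assumed names (`QueryRegistry`, `register`, `property_changed`), not SMW's design: each query records the properties it reads, and a change to a property marks the dependent queries as stale.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class QueryRegistry:
    """Toy registry that tracks which queries depend on which properties."""

    def __init__(self) -> None:
        # property name -> ids of queries whose results depend on it
        self._dependents: Dict[str, Set[str]] = defaultdict(set)
        self.stale: Set[str] = set()

    def register(self, query_id: str, properties: Iterable[str]) -> None:
        """Record that a query reads the given properties."""
        for prop in properties:
            self._dependents[prop].add(query_id)

    def property_changed(self, prop: str) -> Set[str]:
        """Mark every query that depends on prop as stale and return them."""
        affected = set(self._dependents.get(prop, set()))
        self.stale |= affected
        return affected

    def dependency_count(self, query_id: str) -> int:
        """Return how many properties a query depends on."""
        return sum(1 for queries in self._dependents.values() if query_id in queries)
```

Only the stale queries need refreshing, which is the advantage over disabling the cache or rebuilding every page.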

Prerequisites: prior programming experience, working knowledge of PHP, decent database knowledge is a plus

Mentor signup: Jeroen De Dauw, Markus Krötzsch


 * Shakthi Velmani is interested.

More powerful result formatting for SMW
Component: SMW (core)

Expected results: Modifications to SMW and additions to SMW extensions

Short explanation: SMW and its extensions provide many features for displaying data in wiki pages. The goal of this project is to further improve these features to provide more dynamic and flexible result views for users. Concrete tasks include (but are not limited to):


 * Support for interactive result formats that allow you to filter and expand data (obviously each format will have to implement this itself, but a general infrastructure in SMW to handle the HTTP requests needed for this functionality would be very nice)
 * Syntax extension to group results
 * Syntax extension to allow modifying of the page name based on properties. Might make sense to implement this as a more general feature.
 * More control over the display of values for properties. For example when using #show to list all attendees of a single event, you can't really change the display right now.

Prerequisites: PHP, JavaScript

Mentor signup: Jeroen De Dauw

An elegant and simple database layer for SMW
Component: SMW (core)

Expected results: A simpler and cleaner database schema for storing SMW's data

Short explanation: The data that SMW stores is very simple, mainly properties and values using a dozen simple datatypes. The relational database storage code of SMW is not simple, mainly because the data model in SMW has been simplified only after the current code was written. The proposal is to start from scratch: look at the data model and the required access methods, and create a cleaner database access class that is easy to maintain. Due to the new RDF store connectors of SMW, the most complicated code for query answering is not needed, and one can focus on a simple data exchange layer. This project is for those who enjoy streamlining code to make it more elegant and efficient.

Prerequisites: desire to write simple code, prior programming experience, basic knowledge of PHP and SQL

Mentor signup: Markus Krötzsch
 * Yipeng Huang is interested.

Improving the interplay between Spark and SMW
Component: Semantic Result Formats and Spark

Expected results: Being able to use Spark with Semantic MediaWiki as the backend store easily, and using Spark within SMW with data from an external source

Short explanation: Spark is a JavaScript library that takes SPARQL query results and visualizes them within any HTML5 site. It is basically like inline queries in SMW, but against any SPARQL endpoint and with no required backend. The idea would be to extend Spark so that it can be used against SMW data and not only against SPARQL endpoints, explore whether the #ask syntax makes sense, and add a Semantic Result Format that integrates Spark into Semantic MediaWiki.

Prerequisites: JavaScript, PHP (a little)

Mentor signup: Jeroen De Dauw

Adding unit tests to SMW
Component: SMW core, possibly extensions

Expected results: Create unit tests (mainly PHPUnit, possibly also some QUnit) so we notice when something breaks.

Short explanation: SMW currently has no unit tests, and really could use some, as subtle behaviour changes get introduced over time that are not documented and often not intended. A good place to start adding tests is the DataValue classes, which can use testing for their parsing and their formatting methods.

Prerequisites: PHP

Mentor signup: Jeroen De Dauw

Semantic Drilldown improvements
Component: Semantic Drilldown

Expected results: Various improvements to Semantic Drilldown.

Short explanation: Semantic Drilldown is an extension that lets users drill down on pages via semantic properties. It is a popular extension (and one of only a handful of SMW extensions enabled on Wikia), but it has a number of important weaknesses:


 * Compound data defined within pages, using either subobjects or Semantic Internal Objects, cannot be filtered on.
 * "Concepts" cannot be filtered on.
 * Results can't be shown in multiple formats at the same time (like a map and a list).
 * The display of results in columns can be awkward (see here, for example).
 * When drilling through subcategories, the full path of subcategories isn't shown.
 * The interface currently doesn't offer flexibility between doing an "AND" and "OR" (or "NOT", for that matter) of different values.
 * It may be possible to improve the extension's performance, using some sort of caching system.

This project would try to make some or all of these improvements.

Mentor: Yaron Koren

Google Spreadsheet Result Format for SRF
Component: Semantic Result Formats

Expected results: Exports inline query results to a worksheet in a given Google Docs spreadsheet.

Short explanation: Knowledge workers use/abuse spreadsheets for a lot of their data processing needs. Having the ability to save query results to spreadsheets should increase adoption and has the added benefit of giving access to additional spreadsheet features for data post-processing (e.g. formulas, pivot tables, etc.). In addition, Google Charts can use a Google Spreadsheet as a data source. This makes it possible to create complex dynamic charts with full access to the Google Charts feature set that can live outside of SMW but still use data from the wiki. Currently, SRF visualizations are constrained to a very limited subset of their underlying visualization frameworks. Perhaps a complementary Google Charts extension could even be made to use transformed query result data that was post-processed by a Google Spreadsheet initially populated by this result format.

Finally, using Google APIs/resources should garner some brownie points from GSoC administrators.

Prerequisites: PHP, Google Spreadsheets API, Google Documents List API

Mentor: Joel Natividad

Semantic Forms Rules
Component: New extension

Expected results: Enable a much more dynamic behavior of Semantic Forms than currently possible

Short explanation: Currently dynamic behavior of Semantic Forms is achieved by adding additional parameters to inputs (or by having dedicated inputs). This is a good approach where the dynamic behavior only concerns one input. Autocompletion would be an example. It becomes awkward when more than one field is concerned, e.g. with show on select. And it fails altogether, if more than two inputs are involved or if the desired behavior is more complex. Generally any behavior involving two or more inputs should not be handled by one of the affected inputs, but by a dedicated controlling entity that exists once for every form. In fact, in many cases this would even be beneficial where only one input is involved, as it would enable one behavior for multiple input types without having to duplicate code.

The idea now is to define rules consisting of triggers, conditions and actions. The conditions are evaluated when certain triggers are detected. Depending on the result the associated actions would be performed. Individual rules could be kept very simple, but the result of the evaluation of a rule would be stored and could later be referenced in other rules. This way rules could be chained and more complex rules could be constructed from simple ones. See a preliminary spec.
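
The trigger/condition/action model with chained rules can be sketched as follows. This is a Python illustration only (the extension itself would be JavaScript and PHP), and every name in it is hypothetical: `Rule`, `RuleEngine`, the trigger strings such as `change:country`, and the example form fields are not taken from the preliminary spec.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

FormState = Dict[str, Any]
Results = Dict[str, bool]


@dataclass
class Rule:
    name: str
    trigger: str  # event that causes evaluation, e.g. "change:country"
    condition: Callable[[FormState, Results], bool]
    action: Callable[[FormState], None]


class RuleEngine:
    """One controlling entity per form, instead of logic inside each input."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.results: Results = {}  # stored outcome of each rule's last evaluation

    def fire(self, trigger: str, form: FormState) -> None:
        for rule in self.rules:
            if rule.trigger != trigger:
                continue
            outcome = rule.condition(form, self.results)
            self.results[rule.name] = outcome  # later rules can reference this
            if outcome:
                rule.action(form)


# Hypothetical rules: the second one chains on the stored result of the first.
EXAMPLE_RULES = [
    Rule(
        name="is_us",
        trigger="change:country",
        condition=lambda form, results: form.get("country") == "US",
        action=lambda form: form.update(state_visible=True),
    ),
    Rule(
        name="needs_zip",
        trigger="change:state",
        condition=lambda form, results: results.get("is_us", False) and bool(form.get("state")),
        action=lambda form: form.update(zip_required=True),
    ),
]
```

Because `needs_zip` reads the stored result of `is_us` rather than repeating its condition, simple rules can be combined into more complex behavior.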

Prerequisites: JavaScript (incl. jQuery), PHP

Mentor signup: Stephan Gambke

Improve the PDF Download tool of Wikipedia
Wikipedia has a great tool for downloading articles as PDF files. In addition, the book feature gives us the opportunity to combine and download multiple articles as a collection. The problem is that this tool does not function properly for Bengali and other Indic-language articles. I reported this issue on the wikitech-l mailing list. Based on the discussion, I have prepared a project proposal and I wish to apply to GSoC with this idea. -- Nasir Khan Saikat •  talk  • 08:28, 22 March 2012 (UTC)

Ideas suggested by others
If you are interested in these ideas, you should start looking for a potential mentor yourself, by asking on IRC or on our wikitech-l mailing list.

Pre-commit checks
Implement pre- or post-commit checks in our code repositories that automatically look for security vulnerabilities, broken coding conventions, broken code, etc, perhaps with a web interface to facilitate the process, or using our Jenkins continuous integration setup. Help MediaWiki developers save time in code review, and make our code more secure.
 * This should be done via Jenkins for sure. We already have a build project for it, but we haven't taken the time to actually write our coding conventions down in this format--that could make a good summer project!
 * Akash Nawani mentioned some interest in this.

Workflow management for Wikinews
Wikinews relies on several gadgets as primitive forms of "workflow management" in transitioning a news article from developing through to published. Regrettably, these are fragile, scary, scary javascript everywhere, and not at-all well-integrated with MediaWiki and the FlaggedRevisions extension in general. With FlaggedRevisions use expanding, and the multiple levels within it being suited to GA/FA article promotion in a Wikipedia-type environment, there is scope beyond that on Wikinews for an extension which handles reviewing articles and pushing them up or down (review pass/fail). I know this is particularly late in the day regarding potentially getting things on-the-table for GSoC, but will put some work over the next 24 hrs into documenting the current state-of-play, and a more-ideal workflow where certain components of an extension might fit in. Suggestion by Brian McNeil; to produce outlines on sub-pages of own userspace.

Index Transcluded text in Search
Search should index transcluded text — This is sorely wanted for Wiktionary, Wikibooks, Wikisource and the like that depend heavily on template expansion.

This is an important bug fix, but the actual bug fix is, IMHO, about a week's work. However, it is likely to have a negative impact on search engine performance (by doubling the index sizes). The work would require access to the production cluster, which WMF has to this date been reluctant to provide. However, this feature will be part of the first Solr release.

Suggest a mobile task
Improve our Android application -- integrate with SuggestBot to suggest a mobile task to a user.

Search embedded DjVu text
Extract embedded text from DjVu documents for search - may not be a big enough project for GSoC, so maybe you would add a few other formats as well? Sumana Harihareswara, Wikimedia Foundation Volunteer Development Coordinator (talk) 01:07, 23 February 2012 (UTC)

Note: many formats are supported by Apache Tika, which will be integrated via configuration once Solr is up and running. Adding DjVu support to Tika is tracked in their Jira as TIKA-513, but it is indeed a small project.

Groups in watchlists
Give editors a way to slice and dice their watchlists with groups (perhaps by grouping similar pages in their watchlists). Also support follow-up reminders in the watchlist (i.e., let's see what happens with this new page a week after a new user starts to work on it).


 * Vivek Bagaria (potter) is interested in this.

Convention extension
Write a Convention extension that will turn any MediaWiki wiki into a suitable website for a conference such as Wikimania. The extension would manage registration, talk submission/voting, etc. For comparison see OpenConferenceWare and WisConDB.
 * Akshay Chugh has expressed interest in this idea.

"Upload to Commons" plugins
Upload to Commons plugins for common photo management apps (iPhoto, Aperture, Lightroom, Picasa, Shotwell, F-spot, etc.), similar to existing plugins that allow synchronization with sites like Flickr and Facebook.
 * See mailing list discussion

Mass upload tools
Improve volunteer-written mass upload tools to make them more robust, especially for use by museums, galleries, libraries and archives. Use Python, PHP, or jQuery with the MediaWiki API and help thousands of people share free culture. For an overview of current tools, see https://outreach.wikimedia.org/wiki/GLAM/Newsletter/January_2012/Contents/Tool_testing_report -- also see the tool requests page.

Translation proofreading app
Mobile translation proofreading app for Translate extension

Translation spellchecking
Integrate spell checking into Translate extension

Use museum metadata
Find a way to use the new Cooper-Hewitt museum metadata (example) to improve Wikipedia articles and Wikimedia Commons metadata. (See the GLAM page for some thoughts.)

Easier planet configuration
Redo Wikimedia Planet blog aggregator. Make it easier to add languages and blogs without (or with minimum) developer involvement.
 * Perhaps instead of trying to stretch MediaWiki in this way we should just install Venus or something like it.

Fast category traversal
CatGraph: allowing fast category traversal (and thus deep category intersection) a) on the Toolserver and b) as part of Wikipedia search. This would allow us to use categories as tags, and would remove the need to maintain intersection-categories like "American Authors of the 20th Century" by hand. Daniel Kinzler is working on a toolserver-based prototype. Ops is a challenge: full integration needs a lot of RAM (but not much compared to what is currently used for Lucene).
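
For readers unfamiliar with deep category intersection, here is a minimal Python sketch of the idea. It is not CatGraph's implementation; the function names and the toy `SUBCATS` and `MEMBERS` data are assumptions made up for illustration. Each category's full subtree is walked, and the resulting page sets are intersected.

```python
from collections import deque
from typing import Dict, Set


def category_tree(root: str, subcats: Dict[str, Set[str]]) -> Set[str]:
    """Return root plus all of its descendant categories (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in subcats.get(current, ()):
            if child not in seen:  # guards against cycles in the category graph
                seen.add(child)
                queue.append(child)
    return seen


def pages_under(root: str, subcats: Dict[str, Set[str]], members: Dict[str, Set[str]]) -> Set[str]:
    """Return all pages in root or any of its descendant categories."""
    pages: Set[str] = set()
    for category in category_tree(root, subcats):
        pages |= members.get(category, set())
    return pages


def deep_intersection(cat_a: str, cat_b: str, subcats: Dict[str, Set[str]], members: Dict[str, Set[str]]) -> Set[str]:
    """Return pages that fall under both category trees."""
    return pages_under(cat_a, subcats, members) & pages_under(cat_b, subcats, members)


# Hypothetical toy data, for illustration only.
SUBCATS: Dict[str, Set[str]] = {
    "Authors": {"American authors", "French authors"},
    "20th-century people": {"20th-century writers"},
}
MEMBERS: Dict[str, Set[str]] = {
    "American authors": {"Ernest Hemingway", "Mark Twain"},
    "French authors": {"Albert Camus"},
    "20th-century writers": {"Ernest Hemingway", "Albert Camus"},
}
```

With this data, intersecting "American authors" with "20th-century people" yields the same pages that a hand-maintained intersection category would list. At Wikipedia scale the whole graph must be held in memory for this to be fast, which is the RAM concern mentioned above.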

Improve WikiPraise UI, integrate blame map
WikiPraise: this may be too small for GSoC, but it would be worthwhile to get the praise/blame/cite/reuse features fully integrated in the UI. This is mostly a jQuery / UX project. You would use WikiTrust data via the SURE service on Toolserver.org, which means you would not get production-quality data. But an additional subproject is full blame map integration, providing production-quality blame maps. Luca de Alfaro is working on a proposal for this. The development side looks pretty manageable; perhaps he would be willing to mentor this as a GSoC project? The main challenge here is operations, as this needs lots of storage if applied to the full body of Wikimedia projects. Tim Starling thinks it's feasible, though. (Suggested by Daniel Kinzler.)

Extend and improve RDBMS support of SMW
Component: SMW

Expected results: Additions to the relational database store support of SMW, and improvements to the existing implementation.

Short explanation: SMW currently has support for MySQL and, to some degree, PostgreSQL. This is done via a single "SQL store", which varies its behaviour slightly based on the type of database used. Having 2 separate stores would likely be better. This store currently also lacks support for special data types, such as geographical entities, which cannot be interacted with, since this would require putting SQL functions around field names or values in SQL statements, which is currently not possible. Adding support for the other RDBMSes (partially) supported by MediaWiki core (mainly Oracle and MSSQL) would also be nice.

Prerequisites: PHP, SQL, the RDBMS you want to work with

Mobile support for GUIs for SMW and SF
Component: SMW core, Semantic Forms, possibly other SMW extensions with GUIs

Expected results: Make versions of SMW's and related extensions' graphical user interfaces that are optimized for mobile devices. In particular, SMW's Special:Ask and Semantic Forms' Special:FormEdit.

Short explanation: Currently SMW lacks specific support for mobile. Although the interfaces are usable on mobile devices, they could be a lot more optimized.

Prerequisites: PHP, UI experience

Interactive Ontology Visualizer/Navigator
Component: New Extension

Expected results: Visualize the ontology using D3.js or a similar JavaScript framework, and use it to navigate the wiki

Short explanation: There is currently no graphical, interactive way to visualize the wiki ontology. In the Halo project, the Data Explorer (née Ontology Browser) and Semantic Treeview extension have served the purpose to a limited extent. Perhaps we can use the D3 Graph format, with each node represented by either specially declared instance images (perhaps through an instance avatar image property) and/or a category avatar image. The visualization also doubles as a navigator that automatically adjusts as you browse the wiki.

Prerequisites: PHP, JavaScript, D3.js

OData Result Format
Component: Semantic Result Formats

Expected results: Query Results are published using the OData specification at persistent URLs

Short explanation: Even though SMW primarily publishes RDF data, in compliance with true open standards, it would be great to support the Open Data Protocol as well to "embrace and co-opt" this Microsoft specification for Open Data as a lot of enterprise-class tools and customers use OData. Of primary interest is Tableau Public, a very powerful visualization tool, which supports OData data sources even for the Free Public version.

Prerequisites: PHP

API improvements to benefit Wikibooks/Wikiversity
I wonder about whether there are things we could do with regards to Wikibooks and Wikiversity to improve their reuse/integrability -- this may require new features in our web API.
 * In what sense are you (or MediaWiki) looking to make the Wikibooks and Wikiversity web API more reusable? Is it in terms of improving the existing API, or just developing a front end that exploits the API? ... Does that mean you were referring to creating a client that would leverage the potential of these two wikis? - chughakshay16
 * I was on an airplane next to an educator, and he asked when Wikipedia was going to get into the educational curriculum game. Obviously he did not know about Wikiversity and Wikibooks!  So I told him.  And then he basically asked to what extent they were reusable programmatically, that is, are the contents usable as building blocks for other content management and curriculum creation systems elsewhere?  I think a student who is interested in this question should investigate what is desired by the educational community, see what is technically feasible, and write a proposal. Sumana Harihareswara, Wikimedia Foundation Volunteer Development Coordinator 08:02, 8 February 2012 (UTC)
 * Are you talking about something like Moodle for use by universities and schools on Wikipedia? -nischayn22

Integration with education management tools
MediaWiki/Moodle or MediaWiki/Sakai or MediaWiki/Blackboard or MediaWiki/Desire2Learn integration.

Image recognition
Automatically tagging photos in Wikimedia Commons using computerized object recognition. Discussion on wikitech-l started by Maarten Dammers. Maarten's suggestions. Caution: some students are interested in this but no mentor has stepped forward. :-(

Miscellaneous places to find other project ideas

 * Small "new editor engagement" issues
 * the Wikisource wishlist
 * The GSoC ideas of 2011
 * past projects to help you think of similar ideas
 * Also accepted to GSoC this year: DBPedia