Google Summer of Code/2007
The Wikimedia Foundation is a mentoring organization for the 2007 Google Summer of Code.
Note: the project ideas list will receive a little more cleanup over the next few weeks. Additional ideas can be added here.
Heuristics for vandalism
Some heuristics for vandalism already exist for the IRC notification module. The next step would be to use those heuristics inside the web application itself, perhaps to flag suspicious edits or to support a workflow in which admins can “claim” a bad edit to work on. Extended heuristics may use Bayesian filters. This calls for clean hooks for gathering information, clean hooks for tagging edits, a way to make the tagged information available, and then tool-building on top of that, along the lines of SpamAssassin's rules engine.
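As a rough illustration of a SpamAssassin-style rules engine for edits, the sketch below scores an edit by summing the weights of the rules it triggers and returns the rule names as tags. The `Edit` fields, the individual rules, their weights, and the threshold are all hypothetical and are not taken from the existing IRC heuristics.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Edit:
    # Hypothetical view of an edit; a real hook would expose more fields.
    old_text: str
    new_text: str
    is_anonymous: bool


@dataclass
class Rule:
    name: str
    weight: float
    test: Callable[[Edit], bool]


def blanked_page(edit: Edit) -> bool:
    # A long page reduced to almost nothing.
    return len(edit.old_text) > 200 and len(edit.new_text) < 20


def shouting(edit: Edit) -> bool:
    # Mostly upper-case text, ignoring very short pages.
    letters = [c for c in edit.new_text if c.isalpha()]
    return len(letters) >= 20 and sum(c.isupper() for c in letters) / len(letters) > 0.8


def repeated_chars(edit: Edit) -> bool:
    # The same character ten or more times in a row.
    return re.search(r"(.)\1{9,}", edit.new_text) is not None


# Illustrative weights only; real values would be tuned on real data.
RULES: List[Rule] = [
    Rule("blanked_page", 3.0, blanked_page),
    Rule("shouting", 1.5, shouting),
    Rule("repeated_chars", 1.5, repeated_chars),
    Rule("anonymous", 0.5, lambda e: e.is_anonymous),
]


def score_edit(edit: Edit, rules: List[Rule] = RULES) -> Tuple[float, List[str]]:
    """Return the total score and the names of the rules that fired."""
    hits = [rule for rule in rules if rule.test(edit)]
    return sum(rule.weight for rule in hits), [rule.name for rule in hits]


def is_suspicious(edit: Edit, threshold: float = 3.0) -> bool:
    return score_edit(edit)[0] >= threshold
```

The returned tags are the kind of information a tagging hook could store alongside the edit, so that later tools can filter on them. A Bayesian filter could replace the hand-set weights without changing this interface.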
Article quality rating analysis
Work out ways of combining ratings of different versions of articles from different sources so as to estimate the "true" quality profile of an article as well as possible, whilst rejecting noise and outliers and resisting attempts to game or spam the ratings system. You may need to collect some real data to test against.
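One possible starting point, sketched below, is to count only the latest rating from each rater (so repeated submissions do not add weight) and to discard scores that lie far from the median, measured in units of the median absolute deviation. The function name, the input shape, and the cut-off of three deviations are assumptions for illustration, not a proposed design.

```python
from statistics import median


def robust_rating(ratings, max_deviations=3.0):
    """Combine (rater_id, score) pairs into one robust estimate.

    Returns None when there are no ratings.
    """
    # Keep one score per rater: later entries overwrite earlier ones.
    latest = {}
    for rater, score in ratings:
        latest[rater] = score
    scores = list(latest.values())
    if not scores:
        return None

    centre = median(scores)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(s - centre) for s in scores)
    if mad == 0:
        # At least half of the raters agree exactly; use that value.
        return centre

    kept = [s for s in scores if abs(s - centre) <= max_deviations * mad]
    return sum(kept) / len(kept)
```

This ignores the question of ratings attached to different versions of an article, which would need some form of weighting by how much the text has changed since the rating was given.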
Deletion queue system
A clean system where pages or media files can be nominated for cleanup and deletion, and that process followed through to completion with a minimum of fuss, could help streamline processes on both large and small wikis.
Notification to the author(s) of nominated pages and a discussion system that's easy to get involved with and track are a must.
Detailed page merges
- "The only real problem I see about merging page histories is that it messes up the history of the combined pages, where it is difficult (I would say virtually impossible) to sort out what page edit was from what earlier page."
Perhaps this could be remedied.
While MediaWiki has had a web-based installer for some time, there are issues with upgrades, maintenance, and other tasks for smaller sites which don't have shell access to their hosts.
A password-protected web interface to the maintenance and update tools, and perhaps even some site configuration options, would be very useful for such environments.
Wikimedia needs a counter program which will process ~30k log lines per second using a single processor. It should collect per-article page view counts and produce a report on demand. A web frontend should format the results and present them to the public.
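The counting logic itself is small; a minimal sketch follows. It assumes a hypothetical log format in which fields are separated by whitespace and the second field is the requested URL, and it treats only `/wiki/<title>` paths as article views. It makes no claim about meeting the throughput target, which would need measuring and probably a lower-level implementation.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit


def article_from_url(url):
    """Return the article title for a /wiki/ URL, or None for anything else."""
    path = urlsplit(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        return None
    title = unquote(path[len(prefix):])
    return title or None


def count_views(lines, url_field=1):
    """Count page views per article from an iterable of log lines.

    url_field is the index of the URL in each whitespace-separated line
    (an assumption about the log format).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= url_field:
            continue  # malformed or truncated line
        title = article_from_url(fields[url_field])
        if title is not None:
            counts[title] += 1
    return counts


def report(counts, top=10):
    """Return the most viewed articles as (title, views) pairs."""
    return counts.most_common(top)
```

Because `count_views` accepts any iterable, it can read from a file object or a stream without loading the whole log into memory, and `report` can be called at any point to serve the on-demand report.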
Wikimedia projects need a more realistic method of determining authorship of a given article, particularly for compliance with republication requirements of the GFDL section 2(B). While the GFDL doesn't give any specific rules about what might qualify somebody to be recognized as an author of any GFDL'd text that has been written using a Wiki interface, there are some interesting directions to go here beyond just a simple edit count of articles.
Some interesting potential exclusions would be to try and eliminate vandal edits and administrative actions (such as cleanup/AfD or VfD templates/edit war moderation) from any list of such authors. In theory (although impractical on larger articles from a CPU processing standpoint) you ought to be able to determine the exact authorship of each and every word in a MediaWiki environment. So is principal authorship determined by edit count or word count? It would be interesting to explore how much of a difference such distinctions might make. Does compliance with the GFDL in a 10,000 word article mean listing five authors who have each contributed only one word out of the 10,000 words in the article? What about the other 9,995 words and the people who contributed those words?
In the case of Wikibooks in particular and some similar multi-page projects elsewhere on Wikimedia projects, you have the additional complication of having to merge multiple page histories together in a systematic fashion. English Wikibooks has made formal page naming policies that help to simplify this process in an algorithmic approach (and this has been more or less duplicated elsewhere on other Wikimedia projects). The point here is that authorship of an aggregation of pages needs to be determined as well, and is critical information from a legal standpoint if we want to have anybody re-use Wikimedia content.
- A diff version which determines the author of each word (taking into account things like reverts) was presented months ago. See wikimedia-tech-l archives. Platonides 19:53, 29 May 2007 (UTC)
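To make the word-count approach concrete, here is a minimal sketch of per-word attribution using Python's standard `difflib`. It is not the diff version mentioned in the comment above: it walks the revisions oldest first, keeps the existing author for words that survive unchanged, and credits inserted or replaced words to the author of the revision. It does not handle reverts, so text restored after vandalism would be credited to whoever restored it. The function names and the input shape are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def attribute_words(revisions):
    """Attribute each word of the final text to an author.

    revisions is a list of (author, text) pairs, oldest first.
    Returns a list of (word, author) pairs for the latest revision.
    """
    words, authors = [], []
    for author, text in revisions:
        new_words = text.split()
        matcher = SequenceMatcher(a=words, b=new_words, autojunk=False)
        new_authors = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep whoever was credited before.
                new_authors.extend(authors[i1:i2])
            else:
                # Inserted or replaced words go to this revision's author;
                # deletions contribute nothing because j1 == j2.
                new_authors.extend([author] * (j2 - j1))
        words, authors = new_words, new_authors
    return list(zip(words, authors))


def words_per_author(revisions):
    """Count how many words of the final text each author is credited with."""
    return Counter(author for _, author in attribute_words(revisions))
```

Comparing `words_per_author` against a plain edit count on real page histories would give a first answer to how much the two measures differ. Excluding vandal edits and administrative actions would mean filtering `revisions` before attribution.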
Provide an API for extensions to add user preferences, so that e.g. a new tab shows up in the preferences or a new preference shows up in an existing tab. This would probably be good to do alongside the admin panel idea above. It could define a few basic types including strings, numbers, booleans and colours, as well as some simple rules for when to show/hide or enable/disable settings.
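The shape of such an API can be sketched independently of MediaWiki's own code. The example below is written in Python purely for illustration; every class, field, and type name in it is hypothetical. It shows a registry with the four basic types named above and a simple show/hide rule expressed as a function of the current preference values.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# One validator per basic type. bool is excluded from "number" on purpose,
# since Python treats bool as a subclass of int.
VALIDATORS: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "colour": lambda v: isinstance(v, str)
    and re.fullmatch(r"#[0-9a-fA-F]{6}", v) is not None,
}


@dataclass
class Preference:
    name: str
    tab: str
    kind: str
    default: Any
    # Optional rule: given the current values, should this setting be shown?
    visible_if: Optional[Callable[[Dict[str, Any]], bool]] = None


class PreferenceRegistry:
    def __init__(self) -> None:
        self._prefs: Dict[str, Preference] = {}

    def register(self, pref: Preference) -> None:
        """Called by an extension to add a preference to a tab."""
        if pref.kind not in VALIDATORS:
            raise ValueError(f"unknown preference type: {pref.kind}")
        if not VALIDATORS[pref.kind](pref.default):
            raise ValueError(f"default for {pref.name} is not a valid {pref.kind}")
        self._prefs[pref.name] = pref

    def validate(self, name: str, value: Any) -> bool:
        """Check a submitted value against the preference's declared type."""
        return VALIDATORS[self._prefs[name].kind](value)

    def visible(self, tab: str, values: Dict[str, Any]) -> List[str]:
        """Names of the preferences to show in a tab, in registration order."""
        return [
            p.name
            for p in self._prefs.values()
            if p.tab == tab and (p.visible_if is None or p.visible_if(values))
        ]
```

A tab that no core preference uses would simply appear once an extension registers a preference under it, which covers the "new tab" case without a separate call.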
No-install in-browser display of video and audio clips for Wikimedia Commons, using reasonably common Java and/or Flash components. Needs to be able to 1) play or transparently pre-convert Ogg Theora videos, 2) avoid use of patent-encumbered formats.
Some work was done on this last year, adapting Fluendo's Cortado Java player applet. Completing this and integrating into the primary code base would be a very valuable project.
Automated conversion of media formats on upload would also make it easier to get more media into the system.
- Conversion of uncompressed or FLAC audio formats to suitable streaming-bandwidth Ogg Vorbis
- Conversion of high-resolution, high-bitrate videos to lower-quality Ogg Theora suitable for streaming and casual download
- Conversion of possibly patent-encumbered audio and video formats (MP3, MPEG-4, etc) to free formats (Vorbis/Theora)
Conversion would have to work in an asynchronous queue to keep uploading snappy, with queue status reported back to the user interface (e.g. to indicate that a recompressed version is not yet available, or has become available).
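The queue-plus-status idea can be sketched with a single background worker. In the example below the actual conversion is a caller-supplied function, so nothing is assumed about which encoder would be used; the class name and the status strings are hypothetical.

```python
import queue
import threading


class ConversionQueue:
    """Run conversions in the background and track a status per file."""

    def __init__(self, convert):
        # convert is any callable taking a filename; it stands in for
        # the real transcoding step.
        self._convert = convert
        self._jobs = queue.Queue()
        self._status = {}
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, filename):
        """Queue a file and return immediately, keeping the upload fast."""
        self._set(filename, "queued")
        self._jobs.put(filename)

    def status(self, filename):
        """Status for the user interface: queued, converting, available, failed."""
        with self._lock:
            return self._status.get(filename, "unknown")

    def wait(self):
        """Block until every queued job has finished (useful in tests)."""
        self._jobs.join()

    def _set(self, filename, state):
        with self._lock:
            self._status[filename] = state

    def _worker(self):
        while True:
            filename = self._jobs.get()
            self._set(filename, "converting")
            try:
                self._convert(filename)
                self._set(filename, "available")
            except Exception:
                self._set(filename, "failed")
            finally:
                self._jobs.task_done()
```

A real deployment would need the queue and status to survive restarts and to be shared between web servers, so an in-process queue like this one only illustrates the control flow.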
Upload form improvements
Integration into MediaWiki. These include user statistics, article characteristics, and graphs of how articles develop. Most tools are currently external and do not yet work together.
The English Wikipedia has a variety of useful reports which editors use to zero in on problem articles. There is an existing Perl code base for some of these (see Toolserver/Reports) but it needs bug-fixes, integration with the Toolserver (or other platform, so they can be run automatically), the possibility for editors to mark false positives and "dealt-with" articles, improvements based on editor suggestions, and expansion to cover more reports.