Outreach programs/Possible projects

We are using this list of projects as a master branch for Mentorship programs such as Google Summer of Code and Outreach Program for Women. The projects listed are good for students and first time contributors but they require a good amount of work. They might also be good candidates for Individual Engagement Grants.


 * Featured project ideas usually have mentors ready for you to jump in.
 * Raw projects are interesting ideas that have been proposed but might lack definition, consensus or mentors, and therefore we can't feature them. If you're interested in one of those, wonderful! You'll need to work a bit more to improve their fundamentals.

If you are looking for smaller tasks check the Annoying little bugs. For a more generic introduction check How to contribute.



Be part of something big
We believe that knowledge should be free for every human being. We prioritize efforts that empower disadvantaged and underrepresented communities, and that help overcome barriers to participation. We believe in mass collaboration, diversity and consensus building to achieve our goals.

Wikipedia has become the fifth most-visited site in the world, used by more than 400 million people every month in more than 270 languages. We have other content projects including Wikimedia Commons, Wikidata and the most recent one, Wikivoyage. We also maintain the MediaWiki engine and a wide collection of open source software projects around it.

But there is much more we can do: stabilize infrastructure, increase participation, improve quality, increase reach, encourage innovation.

You can help to these goals in many ways. Below you have some selected ideas.

Where to start
Maybe at this point your proposal is just a vague idea and you want to get some feedback before investing much more time planning it? We know this feeling very well! Just send an email to wikitech-l (or qgil@undefinedwikimedia.org if you prefer) sharing what you have in mind. One short paragraph can be enough to get back to you and help you working in the right direction.

Learn and discuss
Obligatory reading:
 * Any potential contributor new to our community is encouraged to follow the Landing instructions.
 * How to become a MediaWiki hacker is a good place to start learning your skills and becoming a better candidate.
 * Lessons learned for mentorship programs.

To set up your MediaWiki developer environment, we recommend you start installing a local instance using mediawiki-vagrant. You can also have a fresh MediaWiki to test on a remote server. Just get developer access and request your own instance at Wikitech.

If you have general questions you can start asking at the |Discussion page. IRC channel is also a good place to find people and answers. We do our best connecting project proposals with Bugzilla reports and/or wiki pages. Other contributors may watch/subscribe to those pages and contribute ideas to them. If you can't find answers to your questions, ask first in those pages. If this doesn't work then go ahead and post your question to the wikitech-l mailing list.

Add your proposal

 * Use your user page to introduce yourself.
 * Draft your project in a separate page in main namespace, or as subpage of an existing project or extension your idea will integrate with. Try to pick a short, memorable and catchy title which communicates your core idea on how to tackle the issue/project you chose.
 * Use the template. For GSoC proposals, remember to add them to the proposals category and the table so that it's clear it's a proposal (not yet approved) and you're working on it.
 * The GSOC student guide is a good resource for anybody willing to write a good project proposal. And then there is a list of DOs and DON'Ts full of practical wisdom.

Featured project ideas
Below you can find a list of ideas that already have gone through a reality check and have mentors confirmed. You can find more suggestions in our list of Raw projects.

But before, let us talk about...

Your project
That's right! If you have a project in mind we want to hear about it. We can help you assessing its feasibility and we will do our best finding a mentor for it.

Here you have some guidelines for project ideas:


 * Opportunity: YES to projects responding to generic or specific needs. YES to provocative ideas. NO to trivial variations of existing features.
 * Community: YES to projects encouraging community involvement and maintenance. NO to projects done in a closet that won't survive without you.
 * Deployment: YES to projects that you can deploy. YES to projects where you are in sync with the maintainers. NO to projects depending on unconvinced maintainers.
 * MediaWiki != Wikipedia: YES to generic MediaWiki projects. YES to projects already backed by a Wikimedia community. NO to projects requiring Wikipedia to be convinced.
 * Free content: YES to use, remix and contribute Wikimedia content. YES to any content with free license. NO to proprietary content.
 * Free API: YES to the MediaWiki API. YES to any APIs powered with free software. NO to proprietary APIs.



Internationalization and localization
Internationalization (i18n) and localization (L10n) are part of our DNA. The Language team develops features and tools for a huge and diverse community, including 287 Wikipedia projects and 349 MediaWiki localization teams. This is not only about translating texts. Volunteer translators require very specialized tools to support different scripts, input methods, right-to-left languages, grammar...

Below you can find some ideas to help multilingualism and sharing of all the knowledge literally for everybody in their own language.

Extensive and robust localisation file format coverage
Translate extension supports multiple file formats. The formats have been developed "as needed" basis, and many formats are not yet supported or the support is incomplete. In this project the aim would be to make existing file formats (for example Android xml) more robust to meet the following properties: Example known bugs are 31331, 36584, 38479, 40712, 31300, 57964, 49412.
 * the code does not crash on unexpected input,
 * there is a validator for the file format,
 * the code can handle the full file format specification,
 * the code is secure (does not execute any code in the files nor have known exploits).

In addition new file formats can be implemented: in particular Apache Cocoon and AndroidXml string arrays have interest and patches to work on, but we'd also like TMX, for example. Adding new formats is a good chance to learn how to write parsers and generators with simple data but complicated file formats. For some formats, it might be possible to take advantage of existing PHP libraries for parsing and file generation. (More example formats other platforms support: OpenOffice.org SDF/GSI, Desktop, Joomla INI, Magento CSV, Maker Interchange Format (MIF), .plist, Qt Linguist (TS), Subtitle formats, Windows .rc, Windows resource (.resx), HTML/XHTML, Mac OS X strings, WordFast TXT, ical.)

This project paves the way for future improvements, like automatic file format detection, support for more software projects and extension of the ability to add files for translation by normal users via a web interface.


 * Skills: PHP, XML, aware how to write robust and secure PHP code
 * Mentors: Niklas Laxström, Siebrand Mazeland

One stop translation search


A Special:SearchTranslations page has been created for the Translate extension to allow searching for translations. However it has not been finished and it lacks important features: in particular, being able to search in source language, but show and edit messages in your translation language. The interface has some bugs with facet selection and direct editing of search results is not working properly. It is not possible to search by message key unless you know the special syntax, nor to reach a message in one click. Interface designs are available for this page.


 * Skills: Backend coding with PHP, frontend coding with jQuery, Solr/ElasticSearch/Lucene
 * Mentors: Niklas Laxström, Nik Everett (for the Elasticsearch part if needed)

Collaborative spelling dictionary building tool
There are extensive spelling dictionaries for the major languages of the world: English, Italian, French and some others. They help make Wikipedia articles in these languages more readable and professional and provide an opportunity for participation in improving spelling. Many other languages, however, don’t have spelling dictionaries. One possible way to build good spelling dictionaries would be to employ crowdsourcing, and Wikipedia editors can be a good source for this, but this approach will also require a robust system in which language experts will be able to manage the submissions: accept, reject, filter and build new versions of the spelling dictionary upon them. This can be done as a MediaWiki extension integrated with VisualEditor, and possibly use Wikidata as a backend.


 * Skills: PHP, Web frontend. Bonus: Familiarity with VisualEditor and Wikidata.
 * Mentors: Amir Aharoni, Kartik Mistry

Wikimedia Identities Editor
Mediawiki Community Metrics is a Wikimedia project which goal is to describe how the MediaWiki / Wikimedia tech community is doing.

Once the website with the metrics is reaching a first complete version, a web application to manage community identities is needed. A community member will access the web application and authenticate using OAuth or creating a new account. All the information about the member in Mediawiki Community Metrics will be presented, so the user can update her information, add new identities, the localization and so on.


 * Skills: Django or similar web framework to develop the application. OAuth and other authentication techs.
 * Mentors: Alvaro del Castillo, Daniel Izquierdo.

New media types supported in Commons
Wikimedia Commons a database of millions of freely usable media files to which anyone can contribute. The pictures, audio and video files you find in Wikipedia articles are hosted in Commons. Several free media types are already supported but there are more requested by the community, like e.g. X3D for representing 3D computer graphics or KML/KMZ for geographic annotation and visualization. Considerations need to be taken for each format, like security risks or fallback procedures for browsers not supporting these file types.


 * Skills: PHP at least. Good knowledge of the file type chosen will be more than helpful.
 * Mentors: Bryan Davis, ?.

Semantic MediaWiki
Semantic MediaWiki is a lot more than a MediaWiki extension: it is also a full-fledged framework, in conjunction with many spinoff extensions, and it has its own user and developer community. Semantic MediaWiki can turn a wiki into a powerful and flexible collaborative database. All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly.

There are more than 500 SMW-based sites, including wiki.creativecommons.org, docs.webplatform.org, wiki.mozilla.org, wiki.laptop.org and wikitech.wikimedia.org.

Multilingual Semantic MediaWiki
Semantic MediaWiki would benefit from being multilingual-capable out of the box. We could integrate it with the Translate extension. This can be done in some isolated steps, but there is a need to list all the things in need of translation and define appproach and priority for each of them. Some of the steps could be:


 * Fix the issues that prevent full localisation of Semantic Forms.
 * Enhance Special:CreateForm and friends (all the Special:Create* special pages by Semantic Forms) to create forms that are already i18ned with placeholders and message group for Translate extension.
 * Make it possible to define translation for properties and create a message group for Translate extension, similar to what CentralNotice does (sending strings for translation to Translate message groups).
 * There are lot of places where properties are displayed: many special pages, queries, property pages. Some thinking is required to find out a sensible way to handle translations on all these places.
 * Currently In most wikis, properties names are supposed to be hidden to the user, e.g. queries results are usually shown in infobox-like templates (whose labels could in theory be localised as all templates).

Translate would be fed with the strings in need of translation. Localised strings/messages would be displayed based on the interface language, that in core every user can set on Special:Preferences and with ULS is made way easier to pick for everyone including unregistered users.

For real field testing, WikiApiary could be used, or at worst translatewiki.net (quick deployments, little SMW content).


 * Skills: PHP and web frontend, has used Semantic MediaWiki and Semantic Forms is a plus.
 * Mentors: Niklas Laxström, Yaron Koren.

Visual translation: Integration of page translation with VisualEditor
The wiki page translation feature of the Translate extension does not currently work with VisualEditor due to the special tags it uses. More specifically, this is about editing the source pages that are used as the source for translations, not the translation process itself. The work can be divided into three steps:
 * 1) Migrate the special tag handling to a more standard way to handle tags in the parser. This need some changes to the PHP parser for it to be able to produce wanted output.
 * 2) Add support to Parsoid and VisualEditor so that editing page contents preserves the structures that page translation adds to keep track of the content.
 * 3) Add to VisualEditor some visual aid for marking the parts of the page that can be translated.

This is likely to be a difficult project due to complexities of wikitext parsing and intersecting multiple different products: Translate, MediaWiki core parser, Parsoid, VisualEditor.

 
 * Skills: PHP, JavaScript, wikitext parsing
 * Mentors: general mentors + Niklas Laxström

Wikimedia Performance portal
MediaWiki and the ecosystem of services that support it in Wikimedia's production environment emit lots of performance timing data, such as the time it took to process some request or the speed of a network link. Much of this data is aggregated in two log aggregation systems with graphing capabilities. But the data is not well curated, mixing important metrics with unimportant ones. The data needs a curator! We have http://performance.wikimedia.org/ provisioned, and we'd like that space to feature some key performance metrics about the Wikimedia cluster, perhaps accompanied by some glosses that help readers interpret the data. (See gdash.wikimedia.org for an approximate system.) Ori Livneh, Senior Performance Engineer at the Wikimedia Foundation, will act as mentor. He will be happy to provide an overview of the data that is available, the means of accessing it, and the tooling available for plotting it. This task is suitable for anyone with interest in data analysis and performance analysis. Some facility with a language with good data analysis libraries like Python or R is desirable but not required. 

Make Wiktionary definitions available via the dict protocol
The dict protocol (RFC 2229) is a widely used protocol for looking up definitions over the Internet. We'd like to make Wiktionary definitions available for users. Doing that using the dict protocol would help drive the use and usefulness of Wiktionary, as well.

Possible users:
 * Tablet readers often have dictionary lookup included.
 * Students writing papers would have access to a large corpus of words.
 * Mobile applications for Wiktionary would be less tied to MediaWiki itself.


 * Skills: ?
 * Mentors: Yurik Astrakhan + ?

MediaWiki development
If you're a programmer, we have lots of things for you to do. (To do: copy some relevant ideas from http://socialcoding4good.org/organizations/wikimedia )

Effective anti-spam measures
Use something like a minimal version of Extension:ConfirmAccount to require human approval of each account creation. That is the applicant fills in forms for user name, email and a brief note about who they are and why they want to edit the wiki. Also set the wiki so that the initial few edits also need approval. Then have it that any bureaucrat can approve the account creation, initial edits and remove the user id from moderation. Rob Kam (talk) 09:50, 1 December 2013 (UTC)


 * Requirements have to be clarified here: the proposed approach is much more complex than ConfirmAccount, not "minimal". Perhaps what you want is a sandbox feature? --Nemo 10:32, 1 December 2013 (UTC)


 * Sandbox feature looks good, but for all new accounts not just translators. Rob Kam (talk) 10:49, 1 December 2013 (UTC)

Parsoid
The Parsoid project is developing a wiki runtime which can translate back and forth between MediaWiki's wikitext syntax and an equivalent HTML / RDFa document model with better support for automated processing and visual editing. It powers the VisualEditor project, Flow and semantic HTML exports.

Parser migration tool
Periodically, we come across some bit of wikitext markup we'd like to deprecate. See Parsoid/limitations, Parsoid/Broken wikitext tar pit, and (historically) MNPP for examples. We'd like to have a real slick tool to enhance communication with WP editors about these issues:
 * It would display a list of wiki titles (filtered by wikipedia project) which contain deprecated wikitext. Each title would link to a page which would briefly describe the problem(s), general advice on how the wikitext should be rewritten, and (perhaps) some previously-corrected pages for editors to look at.
 * Ideally this would be integrated with a wiki workflow and/or contain "revision tested" information so that editors can 'claim' pages from the list to fix and don't step on each others work. Fixed/revised pages would be removed from the list until their new contents could be rechecked.
 * It should be as easy as possible for Parsoid developers to add new "bad" pattern tests to the tool. These would get added to the testing, with appropriate documentation of the problem, so that editors don't have to learn about a new tool/site for every broken pattern.
 * Some of these broken bits of wikitext might be able to be corrected by bot. The tool could still create a tasklist for the bot and collect and display the bots' fixes for editors to review.
 * The backend which looks for broken wikitext could be based on the existing round-trip test server. Instead of repeatedly collecting statistics on a subset of pages, however, it would work its way through the entire wikipedia project looking for broken wikitext (and preventing regressions).
 * Some cleverness might be helpful to properly attribute bad wikitext to a template rather than the page containing the template. This is probably optional; editors can figure out what's going on if they need to.


 * Skills: node.js, and probably MediaWiki bots and/or extensions as well. A candidate will ideally have some node.js experiences and some notions of web and UX design.  This task could be broken into parts, if a candidate wants to work only on the front-end or back-end portions of the tool.
 * Mentors: C. Scott Ananian, Subramanya Sastry

VisualEditor plugins
VisualEditor is a rich visual editor for all users of MediaWiki so they don't have to know wikitext or HTML to contribute well formatted content. It is our top priority and you can already test it on the English Wikipedia. While we focus on the core functionality, you could write a plugin to extend it, such as to insert or modify Wikidata content. There are also many possibilities to increase the types of content supported, including sheet music, poems, timelines…


 * Skills: HTML / JavaScript / jQuery development is required. A good grasp of UX / Web design will make a difference.
 * Mentors: James Forrester, Roan Kattouw, Trevor Parscal.

VisualEditor support for EasyTimeline
Also mentioned at.

Flow
Flow brings a modern discussion and collaboration system to MediaWiki. Flow will eventually replace the current Wikipedia talk page system and will provide features that are present on most modern websites, but which are not possible to implement in a page of wikitext. For example, Flow will enable automatic signing of posts, automatic threading, and per-thread notifications.

Improving the skinning experience
Research how to make the development of skins for MediaWiki easier. Many users complain about the lack of modern skins for MediaWiki and about having a hard time with skin development and maintenance. Often sys admins keep old versions of MediaWiki due to incompatibility of their skins, which introduces security issues and prevents them from using new features. However, little effort was done to research the exact problem points. The project could include improving skinning documentation,organizing training sprints/sessions, talking to users to identify problems, researching skinning practices in other open source platforms and suggesting an action plan to improve the skinning experience.

Maria Miteva proposed this project.

Extensions
Check Manual:Extensions and extension requests in Bugzilla.

An easy way to share wiki content on social media services
Wikipedia, as well as other wikis based on MediaWiki, provide an easy way to accumulate and document knowledge, but it is difficult to share it on social media. According to https://strategy.wikimedia.org/wiki/Product_Whitepaper 84% of Wikimedia users were Facebook users as well in 2010, with the portion incresing from previous years. The situation is probably similar with other social media sites. It only makes sense to have an effective "bridge" between MediaWiki and popular social media site. More details here: Product Whitepaper.

Some previous work you can use as a base, improve, or learn from:


 * Extension:Widgets
 * Extension:WidgetsFramework - experimental extension
 * Extension:AddThis
 * Extension:Facebook - just Facebook
 * Extension:WikiShare - unstable version, seems like it's not worked on any more

Extension:OEmbedProvider
Finish Extension:OEmbedProvider, as proposed here. See also Bug 43436 - Implement Twitter Cards

Leap Motion integration with MediaWiki
MediaWiki has a wide user base and a lot of users today prefer touch based interfaces. Gesture based interface are friendly and the latest trend. Leap Motion provides controllers that can recognize gestures. It can be integrated with MediaWiki products like Wikisource. As an example, this would make it more friendly for users to flip through pages in a book. Another advantage of using gesture recognition would be to include turning through multiple chapters or pages at a time by identifying the depth of user's finger's motion.

It would also be helpful for flipping through images in Wikimedia Commons.

(Project idea suggested by Aarti Dwivedi).

Work on RefToolbar
The en:Wikipedia:RefToolbar/2.0 extension is incredibly useful, especially for new editors but also for experienced editors (I use it every day, and I've got a few miles under my belt!). But it suffers from bugs and problems, and there are a lot of improvements that could be made. For instance: adding additional reference types, adding fields for multiple authors, tool-tip help guidance, etc. I also suspect it will need an upgrade to match Lua conversions of common cite templates. Also, I don't think this is in wide deployment on other wikis, so translation/deployment could be a project. Looking at the talk page, there are a couple people starting to work on this but serious development isn't happening (so I'm not sure who would mentor this) but the code was recently made accessible. At any rate, it is an extension that really needs some work and where improvements would have immediate benefit for many editors.

Project idea contributed by Phoebe (talk) 23:23, 22 March 2013 (UTC) [n.b.: I can't mentor on the tech side, but can give guidance on the ins and outs of various citation formats in the real world & how cite templates are used on WP].

Global, better URL to citation conversion functionality
Suppose, in Wikipedia, all that needed to be done to generate a perfect citation was to provide a URL? That would be a tremendous step toward getting a much higher percentage of text in Wikipedia articles to be supported by inline citations.

There are already expanders (for the English Wikipedia, at least) that will convert an ISBN, DOI, or PMID, supplied by an editor, into a full, correct citation (footnote). These are in the process of being incorporated into the reference dialog of the VisualEditor extension, making it almost trivial (two clicks, paste, two clicks) to insert a reference.

For web pages, however, the existing functionality seems to be limited to a Firefox add-on. Its limits, besides the obvious requirement to use that browser (and to install the add-on), include an inability to extract the author and date from even the most standard pages (e.g., New York Times), and the lack of integration with MediaWiki.

For a similar approach, using a different plug-in/program, see this Wikipedia page about Zotero.

A full URL-to-citation engine would use the existing Cite4Wiki (Firefox add-on) code, perhaps, plus (unless these exist elsewhere) source-specific parameter specifications. For example, the NYT uses "" for its author information; that format would be known by the engine (via a specifications database). Each Wikipedia community would be responsible for coding these (except for a small starter set, as examples), in the way that communities are responsible for TemplateData for the new VisualEditor extension.

(Project idea suggested by John Broughton.)

Education Program, outreach and projects
The Wikipedia Education Program helps professors and students contribute to Wikipedia as part of coursework. The current Education Program extension provides features for keeping track of the institutions, courses, professors, students and volunteers involved in this. However, the extension has several limitations and will be largely rewritten. Help is needed to design and build new software to support both the Education Program and other related activities, including topic-centric projects and edit-a-thons.

This project offers tons of opportunities to learn about different facets of software development. There's work to be done right away on UX, flushing out details of requirements, and architecture design. On this last point, a fun challenge we'll face is creating elegant code that interfaces with a not-so-elegant legacy system. Another challenge will be to create small deliverables that are immediately useful, that can replace parts of the current software incrementally, and that can become components of the larger system we're planning.

Student developers eager to dive into coding tasks can also take bugs on the current version of the software&mdash;much of which will remain in production for a while yet. In doing so, they'll practice their code-reading skills, and will get to deploy code to production quickly. :)


 * Skills: PHP, Javascript, CSS, HTML, UI design, usability testing, and object-oriented design
 * Mentors: Andrew Green, Sage Ross.

Support for vizGrimoireJS-lib widgets
vizGrimoireJS-lib is a JavaScript visualization library for data about software development, collected by MetricsGrimoire tools. It is used, for example, for the MediaWiki development dashboard.

The idea of this project is to build a module for MediaWiki so that vizGrimoireJS-lib widgets can be included in MediaWiki pages, with a special markup. All widgets get their information from JSON files, and accept several parameters to control visualizacion. Currently, vizGrimoireJS-lib widgets can be included in HTML pages with a simple HTML markup (sing HTML attributes to specify the parameters of the widget. This behavior will be translated into MediaWiki markup, so that the whole MediaWiki development dashboard could be inserted in MediaWiki pages.


 * Skills: PHP, JavaScript
 * Mentors: Alvaro del Castillo, Daniel Izquierdo, Jesus M. Gonzalez-Barahona.

Modernize Extension:EasyTimeline
EasyTimeline hasn't gotten a lot of maintenance in the last few years, in part perhaps because the perl script and the ploticus dependency make it harder for MW devs to tweak. Bringing the graphics generation "inside" also could enable fancier future things, such as in-browser visual editing of timelines if it can be translated into something manipulable on the web such as SVG. (comment 0. See more talk in bugzilla) Some currently "visible" problems include inflexible font configuration, non-Latin scripts, right-to-left text and accessibility (not sure whether this is actually fixable).


 * Skills: Perl (reading would be enough), PHP, knowledge of whatever backend finally chosen
 * Mentors: (choose from bug commenters there?)

Allowing 3rd party wiki editors to run more CSS features
The 3rd party CSS extension allows editors to style wiki pages just by editing them with CSS properties. It could be more powerful if we find a good balance between features and security. Currently this extension relies on basic blacklisting functionality in MediaWiki core to prevent cross-site scripting. It would be great if a proper CSS parser was integrated and a set of whitelists implemented.

Additionally, the current implementation uses data URIs and falls back to JavaScript when the browser doesn't support them. It would be a great improvement if the MediaWikiPerformAction (or similar) hook was used to serve the CSS content instead. This would allow the CSS to be more cleanly cached and reduce or eliminate the need for JavaScript and special CSS escaping.


 * Skills: Web Application Security, PHP, CSS, JavaScript.
 * Mentors: ?.

Wikimedia Commons / multimedia
Sébastien Santoro (Dereckson) can mentor these projects idea.

Support for text/syntax/markup driven or WYSIWYG editable charts, diagrams, graphs, flowcharts etc.
Resuscitate Extension:WikiTeX and fold Extension:WikiTex into it.

Provide a way to create interactive 2D/3D timelines and infographics à la Java applets, AJAX, Flash
We almost surely don't want to invent our own markup, but SVG probably doesn't suffice and we surely won't use any proprietary format. Ideally we would adopt some syntax/format/technology already adopted and supported by a lively community, preferably offering a certain amount of existing timelines/infographics and other resources which we would then be able to directly use on Wikimedia projects, save copyright incompatibilities. Perhaps http://timeline.knightlab.com/, used by Reasonator?

Accessibility for the colour-blind
Commons has a lot of graphs and charts used on Wikipedia and elsewhere, but few consider how they look with colour blindness, mostly because the creator/uploader has no idea. Accessibility lists some tools that can be used to automatically transform images into how they are seen by colour blind people. We could run such automated tools on all Commons graphs and charts and reporting the results, ideally after assessing automatically in some way that the resulting images are not discernible enough, lower than some score. The warnings can be relayed with some template on the file description or directly to the authors and can havhe a huge impact on the usefulness of Commons media.

Depending on skills and time constraint, the project taker would do 1, 1-2 or 1-3 of these three steps: 1) develop the code for such an automatic analysis based on free software, 2) identify what are the images to check on the whole Commons dataset and run the analysis on it producing raw results, 3) publish such results on Commons via bot in a way that authors/users can notice and act upon.

Category suggestions
Let's crush the categorisation backlog once and for all!

some categorization could be automated allready.

Searching for pictures based on meta-data is called "Concept Based Image Retrieval", searching based on the machine vision recognized content of the image is called "Content Based Image Retrieval".

What I understood of Lars' request, is an automated way of finding the "superfluous" concepts or meta-data for pictures based on their content. Of course recognizing an images content is very hard (and subjective), but I think it would be possible for many of these "superfluous" categories, such as "winter landscape", "summer beach" and perhaps also "red flowers" and "bicycle".

There exist today many open source "Content Based Image Retrieval" systems, that I understand basically works in the way that you give them a picture, and they find you the "matching" pictures accompanied with a score. Now suppose we show a picture with known content (pictures from Commons with good meta-data), then we could to a degree of trust find pictures with overlapping categories. I am not sure whether this kind of automated reverse meta-data labelling should be done for only one category per time, or if some kind of "category bundles" work better. Probably adjectives and items should be compounded (eg "red flowers").

Relevant articles and links from Wikipedia:
 * 1) Image_retrieval
 * 2) Content-based_image_retrieval
 * 3) List_of_CBIR_engines

Some demo links bawolff found:


 * http://demo-itec.uni-klu.ac.at/liredemo/ (Lire might even be integrated with CirrusSearch because it's based on Lucene)
 * http://image.mdx.ac.uk/time/demo.php
 * http://mi-file.isti.cnr.it:8765/CophirSearch/
 * http://orpheus.ee.duth.gr/anaktisi/ (not free)
 * https://youtube.com/2eaGwk4Xhks


 * Skills: image recognition/analysis, possibly Natural language processing; language depending on implementation, e.g. python for a PWB tool, JavaScript and PHP for a MediaWiki extension, JavaScript for a tool similar to the Wikidata Game.
 * Possible mentors: Kristian Kankainen (Keeleleek); WereSpielChequers?.
 * I like the idea of automating categorisation, but I think we are a long way from being able to do much of it. So this would be a big longterm project. One of my concerns is that we are a global site, and we are trying to collect the most diverse set of images that anyone has ever assembled. Image recognition is a good way of saying that we now have another twenty images of this person, but it could be confused when we get our first images of one of the fox subspecies that we don't yet have a picture of. Or rather it would struggle to differentiate the rare and the unique from their more common cousins. There are also some spooky implications for privacy re image recognition and our pictures of people, aside from the obvious things like identifying demonstrators in a crowd or linking a series of shots of one person in such a way as to identify that this photograph of a face belongs to the same person as this photo of pubic hair because the hand is identical; We have had some dodgy things happening on Wikipedia with people wanting to categorise people ethnically and I worry that someone might use a tool such as this to try and semi accurately categorise people as say Jewish. Another major route for improved categorisation is geodata, and I think this could be a less contentious route. Not that everything has geodata, but if things have it could be a neat way to categorise a lot of images, especially if we can get boundary data so we can categorise images as being shot from within a set of boundaries rather than centroid data with all its problems that the parts of one place maybe closer to the centre of an adjacent place than the centre of the area they belong to. WereSpielChequers (talk) 09:58, 20 June 2014 (UTC)

Removing inline CSS/JS from MediaWiki
One of the future security goals of MediaWiki is to implement Content Security Policy. This is an HTTP header that disallows inline JavaScript and CSS as well as scripts and styles from disallowed domains. One of the big steps to achieving this is to remove all inline CSS and JavaScript from MediaWiki HTML. Some of the places inline scripting/styling is used: Fixing all of these inline scripts and styles is too big a task for a single mentor program. However, working on one or two, and slowly chipping down on the inline JS and CSS can help to move closer toward the final goal. This project obviously requires, at the very least, basic HTML and JavaScript knowledge, but some parts are more difficult than others. For example, bullet points 2 and 3 require only basic MediaWiki knowledge, but bullet point 1 requires altering the Parser class, and thus mandates a deeper understanding of MediaWiki and how it parses wikitext.
 * Inline styling in wikitext is translated to inline styling in HTML
 * ResourceLoader is mostly good, but the loader script (at the top and bottom of page) is inline JavaScript
 * Data such as user preferences and ResourceLoader config variables is embedded into the HTML as inline JSON, when it should be in HTML attributes
 * Many extensions use inline styling rather than ResourceLoader modules

Merge proofread text back into Djvu files
Wikisource, the free library, has an enormous collection of Djvu files and proofread texts based on those scans. However, while the DjVu files contain a text layer, this text is the original computer generated (OCR) text and not the volunteer-proofread text. There is some previous work about merging the proofread text as a blob into pages, and also about finding similar words to be used as anchors for text re-mapping. The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.

Hopefully, it will be possible to reuse part of existing proofreading/OCR correction/OCR training software such as the OCR editor by the National Library of Finland.

See also m:Grants:IdeaLab/Djvu text layer editor.

Mentors and skills:
 * Project proposed by Micru. I have found an external mentor that could give a hand on Tesseract, now I'm looking for a mentor that would provide assistance on Mediawiki.
 * Aubrey can be a mentor providing assistance regarding Wikisource, and some past history of this issue. Not much, but glad to help if needed.
 * Rtdwivedi is willing to be a mentor.

Distributed cron replacement
A common requirement in infrastructure maintenance is the ability to execute tasks at scheduled times and intervals. On Unix systems (and, by extension, Linux) this is traditionally handled by a cron daemon. Traditional crons, however, run on a single server and are therefore unscalable and create single points of failure. While there are a few open source alternatives to cron that provide for distributed scheduling, they either depend on a specific "cloud" management system or on other complex external dependencies; or are not generally compatible with cron.

The Wikimedia Labs has a need for a scheduler that:
 * Is configurable by traditional crontabs;
 * Can run on more than one server, distributing execution between them; and
 * Guarantees that scheduled events execute as long as at least one server is operational.

The ideal distributed cron replacement would have as few external dependencies as possible.

&mdash; Coren (talk)/(enwp) 19:29, 23 November 2013 (UTC)

System documentation integrated in source code
It would be really nice if inline comments, README files, and special documentation files could exist in the source code but be exported into a formatted, navigable system (maybe wiki pages or maybe something else). It could be something like doxygen, except better and orientated to admins and not developers. Of course it should integrate with mediawiki.org and https://doc.wikimedia.org.

The idea would be that one could:
 * Keep documentation close to the code and thus far more up to date
 * Even enforce documentation updates to it with new commits sometimes
 * Reduce the tedium of making documentation by using minimal markup to specify tables, lists, hierarchy, and so on, and let a tool deal with generating the html (or wikitext). This could allow for a more consistent appearance to documentation.
 * When things are removed from the code (along with the docs in the repo), if mw.org pages are used, they can be tagged with warning box and be placed in maintenance category.

Proposed by Aaron Schulz.

Ranking articles by pageviews for wikiprojects and task forces in languages other than English
Currently we have an amazing tool which every month determine what pages are most viewed for a Wikiproject and then provides a sum of the pageviews for all articles within that project. An example of the output for WikiProject Medicine in English.

The problems is that this tool only exists in English and is running on toolserver rather than Wikimedia Labs. So while we know what people are looking at in English, and this helps editors determine what articles to work on, other languages do not have this ability.

Additionally we are do not know if the topics people look up in English are the same as those they look up in other languages. In the subject area of medicine this could be the basis of a great academic paper and I would be happy to share authorship with those who help to build these tools.

A couple of steps are needed to solve this problem:
 * 1) For each article within a Wikiproject in English, take the interlanguage links stored at wikidata, and tag the corresponding article in the target language
 * 2) Figure out how to get Mr. Z's tool to work in other languages . He supposedly is working on it and I am not entire clear if he is willing to have help. Another tool that could potentially be adapted to generate the data is already on Labs

James Heilman (talk) 21:13, 14 September 2013 (UTC)

Improving MediaWikiAnalysis
MediaWikiAnalysis is a tool to collect statistics from MediaWiki sites, via the MediaWiki API. It is a part of the MetricsGrimoire toolset, and it is currently used for getting information from the MediaWiki.org site, among others.

The stats currently collected by MediaWiki are only a part of what is feasible to collect, and the tool itself could be improved. Some possible directions:


 * 1) Explore in detail the MediaWiki API and extract as much information from it as possible.
 * 2) Improve efficiency and incremental retrieval of data
 * 3) Propose (and if possible, implement) changes to the MediaWiki API if needed, to support advanced collection of data.
 * 4) Use SQLAlchemy instead of MySQLdb for managing the MediaWikiAnalysis database.


 * Skills: Python, SQL, PHP (only in case of proposing changes to MediaWiki API)
 * Mentors: Alvaro del Castillo, Daniel Izquierdo, Jesus M. Gonzalez-Barahona

Beyond development
Featured projects that focus on technical activities other than software development.

Research & propose a catalog of extensions
Extensions on mediawiki.org are not very well organized and finding the right extension is often difficult. Listening community members you will hear about better management of extension pages with categorization, ratings on code quality, security, usefulness, ease of use, good visibility for good extensions, “Featured extensions”, better exposure and testing of version compatibility... This project is about doing actual research within our community and out there to come up with a proposal both agreed and feasible. A plan that a development team can just take to start the implementation.


 * Skills: research, negotiation, fluent English writing. Technical background and knowledge of MediaWiki features and web development features will get you sooner to the actual work.
 * Mentors: Yury Katkov + ?

Simultaneous Modification of Multiple Pages with Semantic Forms
Right now the editing of multiple pages with Semantic Forms is rather cumbersome with users having to edit every page separately, then sending it off and waiting for the server reply to then click their way to the edit form for the next page. The aim of this project is to facilitate the simultaneous editing of the data of multiple pages displayed in a table, ideally giving a spreadsheet-like experience.

As an additional goal there should be an autoedit-like functionality for multiple pages. Using the #autoedit parser function it is currently possible to create links that, when clicked on, create or edit one page automatically in the background, with a preloaded set of values. With the new function it would be possible to modify several pages at once.

Project goals:
 * display data of multiple pages in a tabular form with each line containing the data of one page and each cell containing an input for one data item
 * provide an optimized user interface for this form that allows for rapid navigation and editing with a special focus on keyboard navigation
 * optional: for the data items use the input widgets as specified in an applicable form definition
 * when submitted store the modified data using the job queue
 * provide a parser function that allows the automatic modification of multiple pages

This project involves challenges regarding working with the MediaWiki API and user rights management to protect the wiki from unauthorized mass-modification of pages.


 * Skills: PHP, JavaScript
 * Mentors: Stephan Gambke, Yaron Koren

Very raw projects

 * Taken from the former "Annoying large bugs" page.


 * Making our puppet servers HA and load balanced without having to change all of the security certificates


 * A proper extension (rather than gadget) to manage and run the Commons Picture of the Year contest
 * Runs in May of each year
 * see mailing list discussion for specifications


 * Making a ReCAPTCHA-like solution that helps with proofreading for Wikisource
 * Help the developers on GitHub


 * Make frequently updated special pages like Recent Changes update themselves automatically on the client


 * "An extension that manages all templates... It could be like a library of templates, only now they can be easily imported to other wikis."


 * Global user preferences
 * architecturally very important (if not critical) to a number of projects
 * As far as I remember, the backend for this is largely completed, it just needs a sensible UI
 * Andrew Garrett writes:
 * I tried to implement this when I completely refactored the preferences system in 2009. It was eventually reverted in r49932. The main blocker was basically considering a way to decide which preferences would have their values synchronised. A UI would need to be developed for that and you'd need some extensive consultation on that fact.
 * If you were to implement this, you could potentially use my original implementation as a guide, though it is reasonably "in the guts" of MediaWiki so you'd have to be reasonably confident "code diving" into unfamiliar software packages.


 * using onscreen keymaps from Narayam's code base to build a mobile-focused app where one could choose and load a keymap from. This would be a great app to have on the mobile app stores for Boot2Gecko and Android.


 * Moving categories (and updating members of it) members
 * Discussed endlessly; needs to be properly implemented; move support, etc. should already be there; just needs schema support and some software tweaks?


 * HTML e-mail support
 * Requires some design expertise, but it'd be nice to have MediaWiki e-mails stop looking as though they're from 1995, especially as they're much more visible nowadays with ENotif (email notifications) enabled on Wikimedia wikis
 * Some of this was done as part of Notifications.


 * Fix user renames to be less fragile and horrible
 * Lots of breakages from renames of users with a lot of edits; old accounts need to be fixed in a sensible way and new borkages need to be properly prevented


 * Let users rename themselves
 * Restrict to those with zero edits?
 * Or not?
 * Major community policy issues.


 * Group similar pages in watchlist
 * Possibly a gadget? Make it easier for editors to sort through their watchlists.
 * See en:User:Js/watchlist.js


 * Database layer should automagically add GROUP BY columns on backends that need them (PostgreSQL)
 * Will improve the MediaWiki installation and maintenance experience for all MediaWiki installations on PostgreSQL, SQL Server, Oracle, SQLite, or any other non-MySQL database.
 * Needs some database experience


 * Add user preference to deactivate/delete user account
 * might be possible to piggyback on the HideUser functionality that already exists in MediaWiki core


 * Add a read-only API for CentralNotice