Extension:Translate/Mass migration tools

This is a proposal for GSoC 2014

Identity
 * Name: Pratik Lahoti
 * Email: pr4tiklahoti@gmail.com
 * Project title: Tools for mass migration of legacy translated wiki content

Contact/working info
 * Timezone: UTC+5:30 (IST, India)
 * Typical working hours: 10:00 AM to 2:00 PM (IST) and 8:00 PM to 2:00 AM (IST); I can adjust and go beyond these hours if required
 * IRC or IM networks/handle(s): BPositive (freenode)

''I have a stable internet connection at home without any proxy issues, so connectivity won't be an issue at any point.''

Project Outline
The MediaWiki Translate extension has a page translation feature that allows structured translation of wiki pages. It makes translation easier by providing a user-friendly interface in which the text is split into translation units. Non-translatable content, such as images, is excluded from the translation process. Although the rich editor with its support for various languages makes the translator's job easy, a lot of effort is required to prepare a page for translation, i.e., the page in question first needs to be converted into a format that the Translate extension recognizes. Preparing the page means taking various kinds of markup into consideration, which becomes tedious when done manually. In addition, wikis have a lot of legacy content that still needs to be made translatable.

With this motivation, the project aims to facilitate this conversion and thereby save manual time and effort. The tool would make the page translatable and, once that is done, import the translations that existed before the page was made translatable into the Translate extension. The entire process of importing translations, which used to involve a significant amount of manual work, would thus be automated by this project.

Bug on Bugzilla
Bug #46645

Thread on Mailing List
http://lists.wikimedia.org/pipermail/mediawiki-i18n/2014-March/000820.html

Mentors
Niklas Laxström and Federico Leva are my mentors for this project.

The approach
The project aims at creating a tool which will essentially work as follows:

Step 1: Make the page translatable: This would involve -
 * 1) Getting the raw source text of the page in question. This can be done using the Parsing API
 * 2) Adding the &lt;languages /> tag at the top of the page
 * 3) Adding the &lt;translate>.....&lt;/translate> tags
 * 4) Converting the various elements into their translatable equivalents as per the markup examples. (In short, as I understand it, this step of the tool would turn the manual instructions into code.)
 * 5) Once all the necessary changes have been made, the page under question would be saved with the updated text.
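
The sub-steps above can be illustrated with a minimal Python sketch. It is not the proposed implementation: the function names are hypothetical, the language-bar tag is assumed to be `<languages />`, and the raw text is assumed to be requested through the API's `action=parse` module with `prop=wikitext`. Element conversion (sub-step 4) and saving (sub-step 5) are left out.

```python
import re
from urllib.parse import urlencode

# Assumption: the tag placed at the top of the page is <languages />.
LANGUAGES_RE = re.compile(r"<languages\s*/>\s*")


def wikitext_request_url(api_url, title):
    """Build a URL asking the parse API for the raw wikitext of a page.

    api_url is a hypothetical endpoint such as https://example.org/w/api.php.
    """
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def prepare_for_translation(wikitext):
    """Add the language bar and wrap the page body in <translate> tags.

    The function is idempotent: running it on an already prepared page
    returns the same text, so the tags are never added twice.
    """
    # Drop any existing language bar so that it is not wrapped by mistake.
    body = LANGUAGES_RE.sub("", wikitext).strip()
    if "<translate>" not in body:
        body = f"<translate>\n{body}\n</translate>"
    return f"<languages />\n{body}"
```

Calling `prepare_for_translation("Hello")` yields the language bar on the first line followed by the text wrapped in one pair of translate tags. A real tool would wrap finer-grained sections and skip non-translatable content.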

The page is now ready for translation. The user can now mark the page for translation, after which the Translate extension will perform its job of breaking it into translation units.

Step 2: Import the translations: This would involve -
 * 1) Getting the list of languages in which the translation exists for the page under question
 * 2) For each of the languages:
 * 3) Grab the latest version of the page before FuzzyBot's edit
 * 4) Copy the translations present in that version, unit by unit, to the Translate extension. The old text can be matched to units either by comparing lengths or by checking whether the strings contain links that point to the same page; either would indicate a match, and a series of checkboxes (which the user can uncheck) and/or matching/mapping slides can be provided to the user. Another way of doing this would be to translate the English text using a machine translation interface provided by the Content Translation extension and compare the result with the translated version we have. Performance, accuracy and usability will be the deciding factors.
 * 5) Create corresponding page of the format   and save it.
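
The length and link heuristics mentioned in sub-step 4 could be sketched as follows. This is only an illustration under stated assumptions: the function names, the equal 50/50 weighting and the link regular expression are all hypothetical choices, not part of the Translate extension.

```python
import re

# Captures the target of an internal link such as [[Target|label]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def link_targets(text):
    """Return the set of internal link targets found in a piece of wikitext."""
    return {target.strip() for target in LINK_RE.findall(text)}


def match_score(source_unit, candidate):
    """Score from 0.0 to 1.0 for how likely candidate translates source_unit.

    Combines length similarity with the overlap of link targets.
    The equal weighting is an arbitrary assumption for this sketch.
    """
    if not source_unit or not candidate:
        return 0.0
    shorter, longer = sorted((len(source_unit), len(candidate)))
    length_score = shorter / longer
    source_links = link_targets(source_unit)
    candidate_links = link_targets(candidate)
    if source_links or candidate_links:
        shared = source_links & candidate_links
        link_score = len(shared) / len(source_links | candidate_links)
        return 0.5 * length_score + 0.5 * link_score
    # No links on either side: fall back to length similarity alone.
    return length_score


def best_match(source_unit, candidates):
    """Pick the candidate paragraph with the highest score, or None."""
    return max(candidates, key=lambda c: match_score(source_unit, c), default=None)
```

The scores are meant to rank suggestions that the translation administrator then confirms or rejects, not to decide matches automatically.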

By the end of this step, the translations have been imported into the Translate Extension. Each of these sub-steps would involve considering various possibilities and corner-cases, which would be handled as the project progresses.

Deliverables

 * User of the system: Translation administrator
 * Name of the bot: MigrationBot

One of the possible ways of implementing this project would be to have a series of confirmation screens for the end user to validate what the 'MigrationBot' would be doing. If implemented this way, the deliverables would include:


 * 1) The MigrationBot: The bot would do the backend work, as mentioned above in 'The approach' section.
 * 2) A user interface asking for confirmation, highlighting the changes done in "Step 1 - Preparing the page for translation". (or, this can also be broken into several small dialogs showing each step performed by the bot)
 * 3) A user interface asking for confirmation for the imported translations in "Step 2".

This section would see additions/deletions depending on changes in user requirements.

Use Cases


This section would expand depending on the feedback on the proposal

Project Schedule
The schedule will be planned once the requirements and deliverables are frozen.

About me
I am Pratik Lahoti, from Pune, India. I am a final-year Information Technology student at the College of Engineering, Pune. I am known on all wikis as BPositive, and that's the attitude I carry with me! :) My journey on Wikipedia started as an editor. I was later selected as the Campus Ambassador for Wikipedia's India Education Program. I have also coordinated the WikiProject Indian Collaboration of the month, whereby I carried out collaborations with editors from all over India. Owing to my contributions, I was fortunate to be a featured Wikimedian for the month of April 2012 on Wikimedia India. Contributing to FOSS projects, attending seminars, and involving more and more people in the movement has always been my passion!

I have been involved with the Language Engineering Team at hackathons held at Gnunify and their enthusiasm has been infectious! Hence, I would like to work with them.

Participation
IRC has always been my first option whenever I am stuck at something. I never hesitate to ask questions, however silly they might be. I would be available on IRC under the nickname BPositive on channels such as #mediawiki, #mediawiki-i18n and #wikimedia-dev. I am also subscribed to mailing lists such as wikitech-l and Mediawiki-India, and I follow the discussions there. I would appreciate all discussions related to my project being carried out on the above-mentioned IRC channels and mailing lists.

I have a blog where I can update all the progress of my proposed project. I would also write monthly reports on MediaWiki itself to keep the community updated about the project.

Past Open Source Experience
I have been a promoter of FOSS in my college. I have conducted workshops for my juniors on subjects like "Introduction to FOSS", "Introduction to Vim", "Getting started with Git". I am also an active member of Google Developers Group Pune and have participated in a couple of Hackathons and also attended several workshops. I have been attending Gnunify for the past four consecutive years and have interacted with many people from the Open Source community. I have also participated in the Translation sprint held by the Language Engineering Team at Gnunify'13 and carried out translations for Marathi (mr) language.

Any other info
Project Experience
 * Solved some bugs related to the Translate extension
 * Participated in Google Cloud Developer Challenge 2013 and was one of the developers of Direct2Drive
 * Internship at Eaton Corporation - Developed a time tracking application for the employees of Eaton
 * "Personalized News Reader", a web application which uses Facebook page likes and Twitter followers to generate relevant news for the user. Also, I used Neo4j - a graph database for this project.
 * And many other pet projects