Extension:Translate/Mass migration tools

Project Outline
The MediaWiki Translate extension has a page translation feature that allows structured translation of wiki pages. It makes translation easier by providing a user-friendly interface in which the text is split into translation units. Non-translatable content (for example, images) is excluded from the translation process. Although the rich editor, which supports various languages, makes the translation itself easy, a lot of effort is required to prepare a page for translation, i.e., the page in question first needs to be converted into a format that the Translate extension recognizes. Preparing the page means taking various kinds of markup into consideration, which becomes tedious when done manually. In addition, wikis have a lot of legacy content that still needs to be prepared for translation.

With this motivation, the project aims to facilitate the conversion and thereby save manual time and effort. The tool would make the page translatable and, once that is done, import the translations that existed before the page was made translatable into the Translate extension. The entire process of importing translations, which involves a significant amount of manual effort, is thus automated by this project.

Mentor's summary

 * Copied from Nemo's IRC office hours/Office hours 2014-08-19 while a wrap-up report is pending.

Hello all! BPositive/Pratik Lahoti, mentored by me and Nikerabbit/Niklas Laxström, has worked on Translate extension's Mass migration tools. ur1.ca/i09ny The problem: each multilingual wiki, like Meta and its spin-off mediawiki.org, in the 10 years before Translate, accumulated thousands of translation pages and hundreds of thousands of strings which we need to manually migrate for Translate adoption.

The goal: make extremely hard work slightly less hard. The approach: a special page to import old translations, where copy-and-paste work is made less wrist-damaging by simple markup heuristics; and one to reduce manual basic preparations for new pages, with a set of regexes.

Hard work means few users, but important ones to support: they're happy to see this, but so far little usage other than myself. You can see it used e.g. at ur1.ca/i09li, or just try yourself on a Translate wiki, e.g. at https://www.mediawiki.org/wiki/Special:PageMigration / PagePreparation. The tool is currently restricted to translation administrators.

We tried hard to stick with MediaWiki best practices and have all work documented on bugzilla ur1.ca/i09m9, so you can easily see what the work is and was about. ur1.ca/i09mr June was the most intense month and saw PageMigration deployed, then BPositive started a full time job and we slowed down a lot. The basic pieces are in place, but there are some features missing and this means that on a certain amount of pages the feature will be less useful than we'd like it to be.

As of November 2014, we know of around ten users in mediawiki.org and Meta-Wiki, who used the tool to make over 4000 edits.

Bug on Bugzilla

 * General issue: Bug #46645
 * Tickets for the project: (by priority)
 * Help testing and give feedback
 * Use http://pagemigration.wmflabs.org
 * Report a bug:
 * Watch this page and comment on talk

Thread on Mailing List
Announcement on Wikitech-l

Participation
IRC has always been my first option whenever I am stuck on something. I never hesitate to ask questions, however silly they might be. I will be available on IRC under the nickname BPositive on channels such as #mediawiki, #mediawiki-i18n and #wikimedia-dev. I am also subscribed to several mailing lists, such as wikitech-l, mediawiki-i18n, translators-l and Mediawiki-India, and I would appreciate it if all discussions related to my project were carried out on the above-mentioned IRC channels and mailing lists.

I have a blog where I can update all the progress of my proposed project. I would also write weekly/monthly reports on MediaWiki itself to keep the community updated about the project.

The approach
The project aims at creating a tool which will essentially work as follows:

Step 1: Make the page translatable: This would involve -
 * 1) Getting the raw source text of the page in question. This can be done using the Parsing API
 * 2) Removing the   template, if present, and adding the  tag at the top of the page
 * 3) Adding the  tags
 * 4) Converting the various elements into their translatable equivalents as per the markup examples. (In short, as I understand it, this step of the tool would turn the manual instructions in that documentation into code.)
 * 5) Adding   for all the Categories
 * 6) Once all the necessary changes have been made, the page under question would be saved with the updated text, upon user confirmation.
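
Sub-steps 2 and 3 above can be sketched roughly as follows. This is only an illustrative sketch, not the tool's actual implementation: the tag names `<languages />` and `<translate>` are assumed to be the Translate extension's standard page-translation markup (they are stripped from the text above), and the function name, the `legacy_template` parameter and its default value `Languages` are hypothetical.

```python
import re

# Assumption: these are the Translate extension's page-translation tags;
# the list above has the tag names stripped out.
LANGUAGES_TAG = "<languages />"
OPEN_TAG = "<translate>"
CLOSE_TAG = "</translate>"


def prepare_for_translation(wikitext: str, legacy_template: str = "Languages") -> str:
    """Return wikitext wrapped in translation markup (simplified sketch).

    legacy_template is a hypothetical parameter naming the old
    language-navigation template that should be removed, if present.
    """
    # Leave pages that already carry translation markup untouched.
    if OPEN_TAG in wikitext:
        return wikitext

    # Sub-step 2: drop the legacy template, e.g. {{Languages}} or {{Languages|Page}}.
    pattern = r"\{\{\s*" + re.escape(legacy_template) + r"\s*(\|[^{}]*)?\}\}\n?"
    body = re.sub(pattern, "", wikitext, flags=re.IGNORECASE).strip()

    # Sub-steps 2 and 3: language bar on top, then the wrapped body.
    return f"{LANGUAGES_TAG}\n{OPEN_TAG}\n{body}\n{CLOSE_TAG}\n"
```

The early return keeps the function safe to run twice on the same page. Sub-steps 4 and 5 (per-element conversion and category handling) are deliberately left out, since they depend on the markup examples referred to above.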

The page is now ready for translation. The user can now mark the page for translation, after which the Translate extension will break it into translation units.

Step 2: Import the translations: This would involve
 * 1) Getting the list of languages in which the translation exists for the page under question
 * 2) For each of the languages:
 * 3) Grab the latest version of the page before FuzzyBot's edit
 * 4) Copy the translations present in that version, unit by unit, to the Translate extension. Matching units can be found either by comparing their lengths or by checking whether the strings contain links that point to the same page. Either would indicate a match, and a series of checkboxes and/or matching/mapping sliders can be provided to the user to confirm or correct it. Another way of doing this would be to translate the English text using a machine translation interface provided by the Content Translation extension and compare the result with the translated version we have. Performance, accuracy and usability will be the deciding factors. See also /Design/.
 * 5) Create corresponding page of the format   and save it.
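
Sub-step 3 above amounts to a search through the page history. The sketch below assumes the history has already been fetched as a list of revisions ordered oldest first, each with `revid` and `user` fields; that data shape and the function name are assumptions made for illustration. It also assumes that the revision of interest is the one immediately before FuzzyBot's first edit, since later revisions would already be produced by Translate.

```python
from typing import Dict, List, Optional


def revision_before_fuzzybot(
    revisions: List[Dict], bot_name: str = "FuzzyBot"
) -> Optional[Dict]:
    """Return the revision just before the bot's first edit.

    revisions must be ordered oldest first. Returns None when the bot
    never edited the page, or when its edit is the very first revision.
    """
    for index, revision in enumerate(revisions):
        if revision["user"] == bot_name:
            return revisions[index - 1] if index > 0 else None
    return None


history = [
    {"revid": 1, "user": "Alice"},
    {"revid": 2, "user": "Bob"},
    {"revid": 3, "user": "FuzzyBot"},
    {"revid": 4, "user": "Carol"},
]
old_translation = revision_before_fuzzybot(history)  # the revision by Bob
```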

By the end of this step, the translations have been imported into the Translate Extension.

Each of these sub-steps would involve considering various possibilities and corner-cases, which would be handled as the project progresses.
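
The length and link heuristics mentioned in sub-step 4 of Step 2 could look something like the following. This is a minimal sketch under stated assumptions: the equal weighting of the two signals, the 0.5 threshold, the positional pairing of units and all function names are hypothetical choices, not the algorithm the tool ended up using.

```python
import re
from typing import List, Set, Tuple

# Captures the target of a wikilink such as [[Help:Editing|label]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def link_targets(text: str) -> Set[str]:
    """Collect the page titles that the wikilinks in text point to."""
    return {target.strip() for target in LINK_RE.findall(text)}


def match_score(source: str, candidate: str) -> float:
    """Score in [0, 1] for how likely candidate translates source."""
    longer = max(len(source), len(candidate))
    length_ratio = min(len(source), len(candidate)) / longer if longer else 1.0

    source_links = link_targets(source)
    candidate_links = link_targets(candidate)
    if not source_links and not candidate_links:
        # No links on either side: only the length signal is available.
        return length_ratio

    overlap = len(source_links & candidate_links) / len(source_links | candidate_links)
    return 0.5 * length_ratio + 0.5 * overlap


def propose_matches(
    source_units: List[str], translated_blocks: List[str], threshold: float = 0.5
) -> List[Tuple[str, str, bool]]:
    """Pair units by position and flag which pairs look trustworthy."""
    return [
        (source, block, match_score(source, block) >= threshold)
        for source, block in zip(source_units, translated_blocks)
    ]
```

Pairs flagged `False` are the ones a translation administrator would need to review or realign by hand in the confirmation screen.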

Deliverables

 * Further refined at /Requirements/


 * User of the system: Translation administrator


 * 1) The Migration Tool: The tool would do the backend work, as mentioned above in 'The approach' section. This can be implemented -
 * 2) As a server-side PHP script: If implemented as a PHP script, the tool would be called 'MigrationBot', and all the edits would thus be saved under the bot's name. The script would be part of the Translate extension code base. This has the advantage of the best access to data, but the drawbacks of difficult deployment and delayed testing.
 * 3) As a JavaScript gadget: If developed as a JS gadget, the edits would be made under the user's account. This has the advantage of easy access to data. Deployment and testing would be faster compared to the first method.
 * 4) A user interface asking for confirmation, showing the changes done in "Step 1 - Preparing the page for translation". This can be accomplished by:
 * 5) Simply re-offering the editing window to the user. The user would review the changes done and save the text using the "Save Page" button
 * 6) Offering the editing window integrated with a syntax highlighting editor like CodeEditor. This would make the job of a TA easier by highlighting the changes made by the tool
 * 7) A series of confirmation dialogs showing each step performed by the tool. Though this would make it easier for the user to review the changes, it can get annoying at times. Plus, verifying each of the sub-tasks performed and then combining would make the otherwise simple task complicated
 * 8) A user interface asking for confirmation of the imported translations in "Step 2". The interface would aim at eliminating the tab switching and copy-pasting, and would therefore show the source language text and the target language text beside each other, with "Yes" and "No" buttons.
 * 9) Upon selecting "Yes", the imported translation would be saved by creating the corresponding page
 * 10) Upon selecting "No", the blocks on the right hand side would be allowed to split or drag and drop to match the corresponding blocks.

Use Cases
The following events would occur during migration of the page:
 * 1) The tool would be invoked in one of the following ways:
 * 2) Placing a link under the "Tools" section in the left hand side panel
 * 3) Providing a button in the editing window
 * 4) Confirmation screens for step 1 would be shown
 * 5) The user would confirm the changes made in the text and then mark the page for translation
 * 6) The Translate extension would split it into translatable units
 * 7) Upon confirming the split units by clicking the "Save page" button, one of the following will occur:
 * 8) The step 2 would be triggered and imports will commence
 * 9) A dialog box saying something like, "It seems this page already had some translations. Do you want to import them to this system now?" would be shown. If 'Yes', the imports will commence and if 'No', a link under Tools would be added for access in the future (or a message at the top of the page in).
 * 10) Confirmation screen(s) for step 2 would be shown

The same has been depicted in the sequence diagram below:

If time permits
If things go as planned and I still have some time, I would be completing one or more of the following task(s):
 * 1) Implementing an automatic feedback system for the second part of the project. A database would be created to store user feedback on how the tool should have matched each unit. This feedback would help in tweaking the detection algorithm for step 2.
 * 2) Allow the imported translations to be corrected in the confirmation screen itself by making them editable. This would help to complete the process in the confirmation screen itself rather than going to Special:Translate.

Project Schedule
Notes:
 * 1) During the Requirement Gathering Phase, I will keep working on the next task in parallel
 * 2) After July 7, I might have reduced working hours (30 hours per week), as I will be moving to a new city. I will be working after my office hours, but I will ensure that the planned tasks are completed on time. The project schedule has been planned accordingly.
 * 3) I plan to write the documentation and unit tests as the project goes along.

Identity
 * Name: Pratik Lahoti
 * Email: pr4tiklahoti@gmail.com
 * Project title: Tools for mass migration of legacy translated wiki content

Contact/working info
 * Timezone: UTC+5:30 (IST - India)
 * Typical working hours: 10:00 to 14:00 (IST) and 20:00 to 02:00 (IST) (however, can adjust and go beyond if required)
 * IRC or IM networks/handle(s): BPositive (freenode)

I have a stable internet connection at home without any proxy issues. So, connectivity won't be an issue at any point of time.

Mentors
Niklas Laxström and Federico Leva are my mentors for this project.

About me
I am Pratik Lahoti, from Pune, India. I am a final year Information Technology student from College of Engineering, Pune. I am known on all wikis as BPositive, and that's the attitude I carry with me! :) My journey on Wikipedia started as an editor. I was later selected as the Campus Ambassador for Wikipedia's India Education Program. I have also coordinated the WikiProject Indian Collaboration of the month whereby I carried out collaborations with editors from all over India. Owing to my contributions, I was fortunate to be a featured Wikimedian for the month of April 2012 on Wikimedia India. Contributing to FOSS projects, attending seminars, and involving more and more people in the movement has always been my passion!

I have met and worked with the WMF Language Engineering Team at hackathons held at Gnunify and their enthusiasm has been infectious! Hence, I would like to work with them.

Past Open Source Experience
I have been a promoter of FOSS in my college. I have conducted workshops for my juniors on subjects like "Introduction to FOSS", "Introduction to Vim", "Getting started with Git". I am also an active member of Google Developers Group Pune and have participated in a couple of hackathons and also attended several workshops. I have been attending Gnunify for the past four consecutive years and have interacted with many people from the Open Source community. I have also participated in the Translation sprint held by the WMF Language Engineering Team at Gnunify'13 and carried out translations for Marathi (mr) language on translatewiki.net.

Acknowledgment
I would like to thank my mentors, Niklas Laxström and Federico Leva, for their valuable assistance in drafting this proposal and for their suggestions along the way. I would also like to thank Sumana and Quim for helping me polish this proposal and ensuring that I complete the tasks on time. Finally, I would like to thank all the community members who provided help and feedback on the discussion page as well as on IRC.

Any other info
I am positive that the project will be completed by the end of the program, but if unavoidable circumstances prevent that, I will unconditionally keep working on it after the program ends and complete it. Once the project is over, I would like to take responsibility for the tool and maintain it, addressing bugs and any other concerns from the community.
 * Microtasks performed
 * Solved some bugs related to the Translate extension
 * Wrote a UserScript which returns the revision before FuzzyBot's edit on the page, if one exists. This would be the first step of the second part of the project: importing translations.
 * Manually migrated wiki pages (and will be doing more before project starts)
 * Project Experience
 * Participated in Google Cloud Developer Challenge 2013 and was one of the developers of Direct2Drive
 * Internship at Eaton Corporation - Developed a time tracking application for the employees of Eaton
 * Personalized News Reader, a web application which uses Facebook page likes and Twitter followers to generate relevant news for the user. Also, I used Neo4j - a graph database for this project.
 * Emergency plan
 * Post GSoC plans