Extension talk:Translate/Mass migration tools


 * The tool would be something independent. It won't be an extension.
 * Base suggests a button in the editing window that triggers the tool; the TA then reviews the changes there before saving.
 * Positioning of the languages template should take into consideration the case when the translatable page has some divs. The tag must be placed appropriately in this case for it to be rendered properly.
 * Base also raised concerns about the splitting of lists. Lists should be split on some parameter (like length). The case when the next list item continues a sentence from the previous one also needs some thought.
 * Pages may have been translated long ago (say, 5 years back). The English content might since have gained paragraphs that were never translated. This also needs to be taken into consideration.
 * The second step should have an interface of a diff style (which eliminates the tab switching)
 * Haitham's GrantsBot is a Python script which does something similar. It can be a good starting point for me.
 * The tool would be usable on all wikis outside Wikimedia that have the Translate extension. It would go with the next release, if everything works smoothly on MediaWiki and Meta.
 * Templates which have translations need to be wrapped in
 * The case of internal links to sections of the same page needs special handling. It should link to the language-independent anchor.
 * Kaganer suggests that all internal and external links should be normalized by converting them into "two-part" form. The link address should be wrapped in "tvar" markup.
 * "Special:MyLanguage/" prefix for internal links

BPositive (talk) 04:09, 16 March 2014 (UTC)
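
The template point in the notes above (detailed further down under "Some notes", where TNT is named as the wrapper) could be sketched roughly as follows. This is only an illustration in Python: `has_translations` is a hypothetical callback standing in for whatever lookup the tool would really do on the wiki, and the regular expression deliberately ignores parser functions and names containing a colon. It does not cover the TNTN case, which needs a different call shape.

```python
import re

# Matches the opening of a template call and captures its name.
# Names containing '#' or ':' (parser functions, namespaced calls) are skipped.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|#:<>\[\]]+?)\s*(?=\||\}\})")


def wrap_translated_templates(wikitext, has_translations):
    """Rewrite {{Name|...}} to {{TNT|Name|...}} when Name has translations.

    has_translations is a hypothetical callable(name) -> bool; it is an
    assumption of this sketch, not part of any existing API.
    """
    def repl(match):
        name = match.group(1)
        # Leave existing wrappers and untranslated templates untouched.
        if name.upper() in ("TNT", "TNTN") or not has_translations(name):
            return match.group(0)
        return "{{TNT|" + name

    return TEMPLATE_RE.sub(repl, wikitext)


if __name__ == "__main__":
    translated = {"Note"}
    page = "{{Note|text=Hi}} {{Draft}} {{TNT|Note}}"
    print(wrap_translated_templates(page, lambda n: n in translated))
```

Since the result is still reviewed by the TA before saving, a false negative here only means one more template to wrap by hand.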

Feedback
Ok, this looks good enough a draft to seek more comments, especially as you need to write the "use cases" section and that can only be done by hearing from users: so far you only had the point of view of one translation admin, me; I can "represent" most of the concerns from MediaWiki.org, Meta-Wiki and Commons translation admins but not everything. Stakeholders/venues to contact, where you can post an invite to come check your proposal, comment it and watchlist it: --Nemo 12:45, 2 March 2014 (UTC)
 * Project talk:Language policy, Project:Current issues
 * OSM
 * userbase.kde.org and other wikis of the family (whatever place they use to discuss)
 * m:Meta talk:Babylon
 * commons:Commons:Translators' noticeboard
 * wikidata:Wikidata:Translators' noticeboard
 * mediawiki-i18n, extension talk:Translate
 * Other wikis, among the 100+ using Translate, which have a lot of old translations

Feedback plus suggestions
Having read your proposal, I think it is worth being implemented.

Question: How to implement it?
 * As an extension inside MediaWiki?
 * Somewhat independent, such as with the Pywikibot framework?

A workflow question: Importing existing translations (i.e. step 2) likely often needs to be done by people who can read and understand these translations well enough. It may take a long time to find these experts. What happens meanwhile, so as to not hamper translating to other languages? Is it possible to have a consistent mix of unsplit translations while other language pages and the source page are split already?

Ideas and suggestions: --Purodha Blissenbach (talk) 11:37, 10 March 2014 (UTC)
 * The first-time splitting of a source page into translatable units is language-dependent. At least it depends on language types and writing systems. I would suggest creating some very basic code for it, and then implementing only English to begin with, and thus likely some other Latin-script-based European languages. English would be the most predominant use case anyway.
 * Splitting strategies vary on text types. Thus allow users to choose the best one. At least, I would suggest to have "by sentence", plus "by paragraph", plus, of course, what existing markup may suggest.
 * Hi, thank you very much for your feedback and suggestions.
 * It would be implemented as something independent, it won't be an Extension inside MediaWiki. I am fine with both PHP and Python, but the "Skills required" part of the project listed "PHP" as a required skill. Basically, it would be a bot asking for confirmation when needed.
 * I am not sure about the relevance of the splitting strategy here, as that is already the job of the Translate extension. The Translate extension does the job of splitting into units once the page is prepared and marked for translation, which is what the first part of this project is all about. Then step 2 would import the translations already present before FuzzyBot's edit (example). Right now, this importing job is tedious copy-paste work, and the person doing it need not know the various languages. Having the source text (English) and the translated page before FuzzyBot's edit is sufficient to do that. The second part is about automating this work. Hope this clears up the workflow.
 * BPositive (talk) 14:37, 10 March 2014 (UTC)
 * Yes, it does. Thank you! --Purodha Blissenbach (talk) 23:16, 10 March 2014 (UTC)
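
For illustration, Purodha's idea of selectable splitting strategies could look something like the sketch below. It is a minimal Python example under stated assumptions: the function names are made up for this page, and the sentence splitter is a naive rule for English-like Latin-script text only (it breaks after `.`, `!` or `?` when an uppercase letter follows), so abbreviations would trip it up.

```python
import re


def split_by_paragraph(text):
    """Treat every block separated by a blank line as one candidate unit."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_by_sentence(text):
    """Naive sentence splitter for Latin-script text (an assumption of this sketch)."""
    units = []
    for paragraph in split_by_paragraph(text):
        flat = " ".join(paragraph.split())  # collapse line breaks inside a paragraph
        units.extend(s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", flat) if s)
    return units


# The user picks a strategy by name; more strategies can be registered here.
STRATEGIES = {"paragraph": split_by_paragraph, "sentence": split_by_sentence}


def split_units(text, strategy="paragraph"):
    return STRATEGIES[strategy](text)


if __name__ == "__main__":
    sample = "First sentence. Second one!\n\nNew paragraph here."
    print(split_units(sample, "paragraph"))
    print(split_units(sample, "sentence"))
```

As BPositive notes in the reply, the Translate extension itself decides the final units once the page is marked, so a helper like this would only decide where the preparation step puts unit boundaries.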

Base's feedback
The idea is very nice, and any tool that would ease the marking work is much appreciated, as it takes a long time to tag even an easy-to-tag page properly. And thank you with all my heart for working on making TAs' lives easier. But unfortunately such a tool can probably only be something like a button in the editing window which does the preparation but leaves the final strokes to a TA. I just can't see how it could be otherwise. E.g. let's look at item 2 of the algorithm, putting the languages tag at the top: nice in most cases, but sometimes translatable pages have some fancy divs and we must put it inside that container to have the languages list rendered properly. Or how can it know whether a list should be split into one line per unit or not? E.g. if it's just very short lines and the list is unlikely to be updated often, then it's OK to keep it as a single unit. If the list items are long, then we'd better split them into many units (as e.g. I, as a translator, hate translating long lists in one unit). But there is also the case where the next list item continues a sentence from the previous one: in that case we should surely keep those items in one unit if we don't want to get a mess in translated pages (e.g. when one translator translates the first part and another the last part, and it results in a structure that is hard for a reader to understand). Step 2 seems to me even more desirable on the one hand, as it's probably the hardest part of a TA's life to move those old translations to their new place, especially when you deal with a long page with over 10 finished translations. But on the other hand I know that it's not an easy task even for me, a human being, to move that stuff. E.g. links are changed while tagging (tvar cover, Special:MyLanguage prefixing and so on) and at the same time old versions often point to pages with localized names (e.g. "Сторінка" but not "Page/uk", "Page" or "Special:MyLanguage/Page") which sometimes are redirects (when the target is marked for translation or on the way towards it) or are actually the main name. In the latter case I don't see how magic can possibly work. Also I often face pages where the translation was done in, let's say, 2011 or 2008. It has the same links, but the paragraphs are not always in the same sequence as they are now. Or there were 2 and now there is 1, and so on. While not understanding most languages, a human can still translate stuff via Google Translate or just understand the main idea from knowing languages close to the language of the page they are importing, but I doubt a machine can do it.

So, summarizing, IMHO it would be best if it were a JavaScript-based script triggered by an edit toolbar button, which would just put tags in the source, and then the TA would finish the work by reviewing it and fixing it where needed (should be easier than full manual tagging in most cases), or some PHP-based stuff which would do exactly the same on the server side. (I don't want to consider the variants of a tool on Labs or a bot, either a console or a desktop app, as they will not do for many, for several reasons IMHO.) The second step should be something that proposes content for each unit for you to approve. But you must see what surrounds the passage you are about to insert, to be sure that no parts of it need to be put into the unit as well. I imagine it may look something like a diff, where you can move the bounds of units in the part showing the old version and can modify the text taken from them in the other part.

I hope that that long text of mine was understandable in spite of my poor English and phrasing skills. --Base (talk) 16:02, 13 March 2014 (UTC)


 * Hey, first of all thanks for the amazing feedback and suggestions you have given. It is the suggestions of experienced users like you which will make this a very reliable tool, and I am happy that you put forth so many of them in one go! Well,
 * The tool can be a button inside the editing window or it can also be a link under the "Tools" section (this reduces one mouse click and improves usability, imagine doing it for 10 pages at a time). I will discuss this with my mentors.
 * Yes, once triggered, the TA would get a chance to review the changes made by the Bot. The changes made can be highlighted and different color codes can be used for different types of changes made.
 * Your concerns about the languages template and the list splitting are very useful. Thank you, I will make a note of them collectively.
 * The second step, as you rightly pointed out, would be something like a diff style. The aim would be to eliminate tab switching and copy-paste and combine that activity into a single interface. We would be looking to use machine translations to verify the correctness of the imported translations. In case it cannot be done, it would be left to the TA to decide.


 * Thank you again for taking out time and giving me these wonderful suggestions! :) BPositive (talk) 13:21, 14 March 2014 (UTC)
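
To make Base's list concerns concrete, here is a rough Python sketch of the grouping rule described in this thread: short lists stay in one unit, long items get their own unit, and an item that continues the previous item's sentence stays with it. The length threshold and the "continues a sentence" test (previous item has no closing punctuation and the next one starts in lowercase) are assumptions made for this example; the TA would still review the outcome.

```python
TERMINAL = (".", "!", "?", ":", ";")


def continues_sentence(prev, item):
    """Heuristic guess (an assumption): prev is unfinished and item starts lowercase."""
    return not prev.rstrip().endswith(TERMINAL) and item[:1].islower()


def group_list_items(items, max_short_len=40):
    """Return groups of list items; each group would become one translation unit."""
    items = [i.strip() for i in items if i.strip()]
    if not items:
        return []
    # A list made only of short lines is kept as a single unit.
    if all(len(i) <= max_short_len for i in items):
        return [items]
    groups = [[items[0]]]
    for item in items[1:]:
        if continues_sentence(groups[-1][-1], item):
            groups[-1].append(item)  # keep the split sentence together
        else:
            groups.append([item])    # long, independent item: own unit
    return groups


if __name__ == "__main__":
    print(group_list_items(["Apples", "Pears", "Plums"]))
    print(group_list_items([
        "This tool prepares a page for translation by adding the required tags,",
        "and then imports the old translations into the new units.",
        "A separate step lets the translation administrator review everything.",
    ]))
```

Whether a list is "unlikely to be updated often" cannot be read from the text itself, so that part of the decision stays with the reviewer.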

Similar efforts
Hi. Thanks for making me aware of this; it indeed has great potential if it succeeds. I've actually been working on a similar thing using GrantsBot. Please check out this sample page, m:User:Haithams/Grants:Learning patterns (look at the history), and let me know whether this is similar to what you are about to build. Cheers --Haithams (talk) 17:29, 13 March 2014 (UTC)
 * Hi, thanks for dropping by and having a look at the proposal. Yeah, that is pretty much what I will be developing over the summer. I also had a look at your Python script and it can be a good starting point for me :) Thanks! BPositive (talk) 13:03, 14 March 2014 (UTC)

Usable outside Wikimedia?
I'm a translation administrator on Wikidata and I received a talk page message inviting feedback. Since Wikidata is relatively young, it doesn't have much "legacy content". Therefore, I don't expect there will be much use on Wikidata, but Commons, Meta-Wiki, and the MediaWiki wiki (this wiki) would probably benefit greatly from this tool.

However, I have some questions: what exactly will the "tool" be? How will it be able to be installed? Will it be able to be used outside of Wikimedia on other wikis? Some other multilingual wikis might benefit from this as well. Thanks, The Anonymouse [talk] 18:08, 13 March 2014 (UTC)
 * Hi The Anonymouse! Thank you very much for taking time to have a look at the proposal. Though Wikidata is young and does not have much legacy content, you can always use the first part of the tool, i.e., preparing the page for translation. This is applicable to all the existing pages as well as the ones which will be newly created in the future :)
 * As far as the tool is concerned, it would be mostly a PHP script (that's how it has been planned as of now & we are still thinking of possible other options). What I have thought of is that it would be triggered via a link under the "Tools" section. And yes, it could be used on other wikis as well, probably all of them who use the Translate extension :) Thank you! BPositive (talk) 05:26, 14 March 2014 (UTC)

Some notes
Please note a few problem areas: --Kaganer (talk) 00:39, 14 March 2014 (UTC)
 * 1) Transcluded templates, which can also be translated.
 * For calls to translatable templates from translatable pages, TNT is currently used. With this new tool, it is necessary to check whether the called template has translations and, if so, to wrap it in TNT.
 * Experience has proven (with some critical cases solved on Meta-Wiki) that TNT is not universal: TNT does not work when a translated template includes other templates that are themselves translated. TNT recurses into itself, because it does not just resolve the name of the effective translated template to transclude but also performs the template expansion.
 * On Meta, we had to use TNTN, a variant of TNT which only returns the resolved template name without expanding it within TNT. Instead it is up to the caller to expand the template directly, completely outside TNT or TNTN; otherwise MediaWiki will complain about a template transclusion loop on TNT, which it does not support at all, even if the parameters are completely different.
 * Most pages were successfully converted with TNTN when TNT did not work. In some cases this complicates the syntax for using the translated templates, and an autodetection mechanism can be used, where the presence or absence of a dedicated "translated=yes" parameter triggers the use of TNTN on the main translatable template or on another "/layout" template performing the actual code, with its parameters containing the translated texts defined in "/langcode" translated templates. This has a caveat: the "/langcode" templates can then only be used directly in pages if we explicitly add "translated=yes", to avoid recursing in the TNT used in the caller context.
 * There are a few tricky cases in "small" utility templates containing a few translatable items where this is still not solved cleanly (notably Template:Main on Meta-Wiki, which does not work after trying various solutions to make it work compatibly with existing pages that used it when it was not yet translated).
 * Searching for a solution using various tricks still did not solve the issue when such template is used in main pages to translate. The fix is to avoid using these templates and integrate them in the content of the page to translate (this will add a few more translation units in them; but the translation memory will help filling them fast most of the time by a simple click in the Translation UI). Verdy p (talk) 20:44, 23 April 2014 (UTC)
 * 2) Internal links to sections of the same page.
 * These links are different in all translations, since they lead to localized section names. They should be detected correctly and replaced with links that lead to a language-independent anchor (added beforehand before the target section; placed in the page template, outside the translation sections).
 * 3) All internal and external links should be "normalized" by converting them to "two-part form", adding a pipe and link text if not present.
 * The address of the target page should be wrapped in "tvar" markup:  or
 * acronym - should be autogenerated from the link
 * link text - if not present in the source page, should repeat the link (without interwiki and protocol prefixes)
 * 4) If an internal link leads to a translatable page, the prefix "Special:MyLanguage/" should be added to it
 * Thanks for having a look at the proposal and for pointing out those problem areas. I will ensure that they are handled. I will document them all nicely in a separate page so that I don't miss out on anything. Cheers! BPositive (talk) 06:28, 14 March 2014 (UTC)
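
Kaganer's link rules above can be sketched in Python roughly as follows. Several things here are assumptions for the sake of the example: the `<tvar|name>...</>` form is the legacy tvar syntax (newer Translate versions also accept `<tvar name="...">...</tvar>`), the interwiki prefix list is a small illustrative sample, `is_translatable` is a hypothetical callback, and the variable name is simply the initials of the words in the link. Section-only links and file or category links are left alone.

```python
import re

# Assumption: illustrative prefixes only; a real wiki has its own interwiki table.
INTERWIKI_PREFIXES = ("m:", "meta:", "commons:", "wikidata:", "mw:")
SKIPPED_PREFIXES = ("#", "file:", "image:", "category:")

INTERNAL_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
EXTERNAL_RE = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)(?:\s+([^\]]+))?\]")


def make_namer():
    """Return a function producing a unique acronym-style tvar name per link."""
    used = {}

    def name_for(target):
        words = re.findall(r"[A-Za-z0-9]+", target)
        base = "".join(w[0].lower() for w in words) or "link"
        used[base] = used.get(base, 0) + 1
        return base if used[base] == 1 else f"{base}{used[base]}"

    return name_for


def strip_prefixes(target):
    """Default link text: the target without protocol or interwiki prefix."""
    text = re.sub(r"^https?://", "", target)
    for prefix in INTERWIKI_PREFIXES:
        if text.lower().startswith(prefix):
            return text[len(prefix):]
    return text


def normalize_links(wikitext, is_translatable=lambda title: False):
    """Convert links to two-part form with the target wrapped in tvar markup."""
    name_for = make_namer()

    def internal(match):
        raw = match.group(1).strip()
        if "<tvar" in match.group(0) or raw.lower().startswith(SKIPPED_PREFIXES):
            return match.group(0)  # already handled, or not a plain page link
        text = match.group(2) or strip_prefixes(raw)
        target = raw
        if not raw.lower().startswith(INTERWIKI_PREFIXES) and is_translatable(raw):
            target = "Special:MyLanguage/" + raw
        return f"[[<tvar|{name_for(raw)}>{target}</>|{text}]]"

    def external(match):
        url = match.group(1)
        text = match.group(2) or strip_prefixes(url)
        return f"[<tvar|{name_for(strip_prefixes(url))}>{url}</> {text}]"

    return EXTERNAL_RE.sub(external, INTERNAL_RE.sub(internal, wikitext))


if __name__ == "__main__":
    src = "See [[Help:Extension:Translate]] and [[m:Babylon|the portal]]."
    print(normalize_links(src, lambda t: t == "Help:Extension:Translate"))
```

The generated names are not pretty for long URLs, so a TA may well want to rename them during review.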

Cross-language plagiarism detection
Worth reading, should give some advice on how to match source paragraphs with their translations: http://wikilit.referata.com/wiki/Cross-language_plagiarism_detection (found in a category linked from wiki-research-l). --Nemo 15:26, 20 March 2014 (UTC)
 * Thanks for sharing, I will go through it. BPositive (talk) 15:39, 20 March 2014 (UTC)
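
As one possible starting point for matching source paragraphs with old translations, the Python sketch below pairs paragraphs by the tokens that usually survive translation: link targets, URLs, template names and numbers. This is an assumption of this example, not a method taken from the linked page, and it builds on Base's remark that old translations tend to keep the same links even when the paragraph order changed. It will miss paragraphs whose links point to localized page names, and paragraphs without any such tokens are left unmatched for the TA.

```python
import re

# Link targets, bare URLs, template names and numbers, in that priority order.
TOKEN_RE = re.compile(r"\[\[([^\]|#]+)|(https?://[^\s\]]+)|\{\{([^}|]+)|(\d+)")


def anchors(paragraph):
    """Collect the language-independent tokens found in a paragraph."""
    return {
        tok.strip()
        for groups in TOKEN_RE.findall(paragraph)
        for tok in groups
        if tok.strip()
    }


def jaccard(a, b):
    """Overlap of two token sets; 0.0 when either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_paragraphs(source_paras, translated_paras, threshold=0.5):
    """For each source paragraph, return the index of its best match or None."""
    result, taken = [], set()
    for src in source_paras:
        src_anchors = anchors(src)
        best, best_score = None, 0.0
        for idx, candidate in enumerate(translated_paras):
            if idx in taken:
                continue  # each old paragraph is proposed at most once
            score = jaccard(src_anchors, anchors(candidate))
            if score >= threshold and score > best_score:
                best, best_score = idx, score
        if best is not None:
            taken.add(best)
        result.append(best)
    return result


if __name__ == "__main__":
    source = ["See [[Help:Editing]] for 3 steps.", "Visit https://example.org now."]
    old = ["Jetzt https://example.org besuchen.", "Siehe [[Help:Editing]] in 3 Schritten."]
    print(match_paragraphs(source, old))
```

The matches are only proposals; in the diff-style interface discussed above, the TA would confirm or correct each one.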