Extension talk:Translate/Mass migration tools

Feedback
OK, this looks like a good enough draft to seek more comments, especially as you need to write the "use cases" section and that can only be done by hearing from users: so far you have only had the point of view of one translation admin, me; I can "represent" most of the concerns of MediaWiki.org, Meta-Wiki and Commons translation admins, but not everything. Stakeholders/venues to contact, where you can post an invitation to come check your proposal, comment on it and watchlist it: --Nemo 12:45, 2 March 2014 (UTC)
 * Project talk:Language policy, Project:Current issues
 * OSM
 * userbase.kde.org and other wikis of the family (whatever place they use to discuss)
 * m:Meta talk:Babylon
 * commons:Commons:Translators' noticeboard
 * wikidata:Wikidata:Translators' noticeboard
 * mediawiki-i18n, extension talk:Translate
 * Other wikis, among the 100+ using Translate, which have a lot of old translations

Feedback plus suggestions
Having read your proposal, I think it is worth implementing.

Question: How to implement it?
 * As an extension inside MediaWiki?
 * Somewhat independent, such as with the Pywikibot framework?

A workflow question: Importing existing translations (i.e. step 2) likely often needs to be done by people who can read and understand these translations well enough. It may take a long time to find these experts. What happens meanwhile, so as to not hamper translating to other languages? Is it possible to have a consistent mix of unsplit translations while other language pages and the source page are split already?

Ideas and suggestions: --Purodha Blissenbach (talk) 11:37, 10 March 2014 (UTC)
 * Splitting a source page into translatable units for the first time is language dependent. At least it depends on language types and writing systems. I would suggest creating some very basic code for it, and then only implementing English, and thus likely some other Latin-script-based European languages, to begin with. English would be the most predominant use case anyway.
 * Splitting strategies vary with text types, so allow users to choose the best one. At least, I would suggest having "by sentence", plus "by paragraph", plus, of course, what existing markup may suggest.
 * Hi, thank you very much for your feedback and suggestions.
 * It would be implemented as something independent; it won't be an extension inside MediaWiki. I am fine with both PHP and Python, but the "Skills required" part of the project listed "PHP" as a required skill. Basically, it would be a bot asking for confirmation when needed.
 * I am not sure about the relevance of the splitting strategy here, as that is already the job of the Translate extension. The Translate extension splits the page into units once the page is prepared and marked for translation, and that preparation is what the first part of this project is all about. Step 2 would then import the translations that were already present before FuzzyBot's edit (example). Right now, this importing job is tedious copy-paste work, and the person doing it need not know the various languages: having the source text (English) and the translated page as it was before FuzzyBot's edit is sufficient. The second part is about automating this work. I hope this clears up the workflow.
 * BPositive (talk) 14:37, 10 March 2014 (UTC)
 * Yes, it does. Thank you! --Purodha Blissenbach (talk) 23:16, 10 March 2014 (UTC)
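
The two splitting strategies suggested in this thread ("by sentence" and "by paragraph") could be sketched roughly as below. This is only an illustration under simple assumptions for English and similar Latin-script text: the function names and the `STRATEGIES` table are hypothetical, the sentence rule is deliberately naive (it breaks on abbreviations such as "e.g."), and it is not how the Translate extension itself splits pages into units.

```python
import re


def split_by_paragraph(text):
    """Split text into paragraphs at blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_by_sentence(text):
    """Split each paragraph into sentences.

    Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = []
    for paragraph in split_by_paragraph(text):
        parts = re.split(r"(?<=[.!?])\s+", paragraph)
        sentences.extend(s.strip() for s in parts if s.strip())
    return sentences


# Hypothetical lookup table so that a user can choose the strategy by name.
STRATEGIES = {
    "paragraph": split_by_paragraph,
    "sentence": split_by_sentence,
}

sample = "First sentence. Second one!\n\nNew paragraph here."
units = STRATEGIES["sentence"](sample)
```

A third strategy that follows existing markup (headings, list items) would need a wikitext-aware parser and is left out of this sketch.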

Base's feedback
The idea is very nice, and any tool that eases the marking work is much appreciated, as it takes a long time to tag even an easy-to-tag page properly. Thank you with all my heart for working on making TAs' lives easier. But unfortunately I think such a tool can only be something like a button in the editing window that does the preparation and leaves the final strokes to a TA. I just can't see how it could work any other way. E.g. let's look at item 2 in the algorithm, putting the languages tag at the top: nice in most cases, but sometimes translatable pages have some fancy divs and we must put the tag inside that container to have the languages list rendered properly. Or how can it know whether a list should be split into one line per unit or not? E.g. if the lines are very short and the list is unlikely to be updated often, then it's OK to keep it as a single unit. If the list items are long, then we'd better split them into many units (I myself, as a translator, hate translating long lists in one unit). But there are cases where the next list item continues a sentence from the previous one; then we should surely keep those items in one unit if we don't want a mess in the translated pages (e.g. when one translator translates the first part and another the last part, which yields a structure that is hard for a reader to understand).

Step 2 seems to me even more desirable on the one hand, as moving old translations to their new place is probably the hardest part of a TA's life, especially when you deal with a long page with over 10 finished translations. But on the other hand I know that it's not an easy task even for me, a human being, to move that stuff. E.g. links are changed while tagging (tvar wrapping, Special:MyLanguage prefixing and so on), while old versions often point to pages with localized names (e.g. "Сторінка" rather than "Page/uk", "Page" or "Special:MyLanguage/Page"), which are sometimes redirects (when the target is marked for translation or on its way to being so) and sometimes the actual main name. In the latter case I don't see how the magic can possibly work. Also, I often face pages where the translation was done in, let's say, 2011 or 2008. It has the same links, but the paragraphs are not always in the same order as they are now, or there were 2 and now there is 1, and so on. Even without understanding most languages, a human can still translate the text via Google Translate or get the main idea from knowing languages close to the one being imported, but I doubt a machine can do that.

So, summarizing, IMHO it would be best as a JavaScript-based script triggered by an edit toolbar button, which would just put the tags in the source, after which a TA would finish the work by reviewing it and fixing it where needed (which should be easier than fully manual tagging in most cases), or as some PHP-based code which would do exactly the same on the server side. (I don't want to consider a tool on Labs or a bot, whether a console or a desktop app, as it would not work for many people for several reasons, IMHO.) The second step should be something that proposes content for each unit for you to approve. But you must see what surrounds the passage you are about to insert, to be sure there are no parts of it that need to go into the unit as well. I imagine it might look something like a diff view, where you can move the unit boundaries in the old version on one side and modify the text taken from them on the other.

I hope that that long text of mine was understandable in spite of my poor English and phrasing skills. --Base (talk) 16:02, 13 March 2014 (UTC)

Similar efforts
Hi. Thanks for making me aware of this; it indeed has great potential if it succeeds. I've actually been working on a similar thing using GrantsBot. Please check out m:User:Haithams/Grants:Learning patterns (look at the history), this sample page, and let me know if this is similar to what you are about to build. Cheers --Haithams (talk) 17:29, 13 March 2014 (UTC)

Usable outside Wikimedia?
I'm a translation administrator on Wikidata and I received a talk page message inviting feedback. Since Wikidata is relatively young, it doesn't have much "legacy content". Therefore, I don't expect there will be much use on Wikidata, but Commons, Meta-Wiki, and the MediaWiki wiki (this wiki) would probably benefit greatly from this tool.

However, I have some questions: what exactly will the "tool" be? How will it be installed? Will it be usable outside of Wikimedia, on other wikis? Some other multilingual wikis might benefit from this as well. Thanks, The Anonymouse [talk] 18:08, 13 March 2014 (UTC)

Some notes
Please note a few problem areas: --Kaganer (talk) 00:39, 14 March 2014 (UTC)
 * 1) Transcluded templates, which can also be translated.
 * To call translatable templates from translatable pages, TNT is currently used. The new tool needs to check whether the called template has translations and, if so, wrap it in TNT.
 * 2) Internal links to sections of the same page.
 * These links differ in every translation, since they lead to localized section names. They should be detected correctly and replaced with links to a language-independent anchor (added beforehand in front of the target section; placed in the page template, outside the translation sections).
 * 3) All internal and external links should be normalised by converting them to the two-part form, adding a "pipe" and link text if not present.
 * The address of the target page should be wrapped in "tvar" markup:  or
 * acronym - should be autogenerated from the link
 * link text - if not present in the source page, should repeat the link (without interwiki and protocol prefixes)
 * 4) If an internal link leads to a translatable page, the prefix "Special:MyLanguage/" should be added to it.
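
The link rules in the notes above (two-part form, "tvar" wrapping, "Special:MyLanguage/" prefix) might be sketched as follows. This is a rough illustration, not the proposed tool: all function names are hypothetical, the old-style `<tvar|name>...</>` syntax is assumed, the acronym rule (first letter of each word in the target) is made up for the example, and the set of translatable pages is passed in by the caller rather than looked up on the wiki. Only plain internal links are handled; external links, interwiki prefixes, section anchors, and file or category links would need separate treatment.

```python
import re

# Matches [[Target]] and [[Target|text]]; nested brackets are not supported.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def make_acronym(target):
    """Derive a short variable name from a link target (illustrative rule only)."""
    words = re.findall(r"[A-Za-z0-9]+", target)
    return "".join(word[0].lower() for word in words) or "link"


def normalise_link(match, translatable_pages):
    """Rewrite one internal link into two-part form with a tvar-wrapped target."""
    target = match.group(1).strip()
    text = match.group(2)
    if text is None:
        # No link text in the source: repeat the link target as the text.
        text = target
    name = make_acronym(target)
    if target in translatable_pages:
        # Translatable target: send readers to their own language version.
        target = "Special:MyLanguage/" + target
    return "[[<tvar|{}>{}</>|{}]]".format(name, target, text)


def normalise_links(wikitext, translatable_pages=frozenset()):
    """Apply normalise_link to every plain internal link in the wikitext."""
    return LINK_RE.sub(lambda m: normalise_link(m, translatable_pages), wikitext)


source = "See [[Help:Contents]] and [[Project:About|about]]."
result = normalise_links(source, {"Help:Contents"})
```

Two different targets can produce the same acronym under this rule, so a real tool would have to make the variable names unique within each unit and still leave the result for a translation admin to review.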