Wikimedia Technical Conference/2018/Session notes/Improving the translation process

Description: Translation of content between language projects and localization of our products are important for helping new projects add content and for enabling collaboration between cultures. This session looks into the ways we accomplish this now and tries to identify our goals for the improvements we want to make to these processes.

= Questions discussed =
= Important decisions to make =

= Action items =

= New Questions =

= Detailed notes =
Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.


 * Background
 * 4 types of translation we are doing:
  * Localization - translating interface messages
   * translatewiki.net and the apps
   * Began as a pet project of Niklas in 2006; has high numbers of translators nowadays
   * Also used by Phabricator
   * Key to making our software available in all other languages
  * Page translation - a kind of localization for front-facing banners and technical documents; better for literal translations, with no nuance to the language
  * Machine translation for translating articles
   * A translation service doing machine translation of content using HTML
   * Using our own algorithms for this type
   * "Smart" translation layered on top of machine translation
 * Check-in of handout / background information
 * Question: Is the “page translation” mentioned in the handout the operation behind the “translate” button one can see on wiki pages? - Yes
 * Main questions of the session:
 * How can localization practices be made consistent and well-integrated with translatewiki and our other language infrastructure? How do we normalize the method of handling multilingual content (different projects vs. single project)? How do we handle variants in a consistent sustainable way?
 * This used to be a slow process: an interface message could take a week to appear on the wiki. Localization updates now happen each night, which refreshes the localization cache
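The nightly flow described above can be sketched minimally. This is a hypothetical data model for illustration, not MediaWiki's actual LocalisationUpdate code: upstream message updates from translatewiki are merged into a local cache so new translations appear without waiting for a full deployment.

```python
def refresh_localization_cache(cache, upstream):
    """Merge upstream message updates into the local cache.

    cache, upstream: {(lang, message_key): text}
    Returns the list of keys whose text changed or was added.
    """
    changed = []
    for key, text in upstream.items():
        if cache.get(key) != text:
            cache[key] = text          # overwrite stale or missing message
            changed.append(key)
    return changed

# Example: one updated message, one brand-new message
cache = {("fi", "search"): "Haku"}
upstream = {("fi", "search"): "Etsi", ("fi", "save"): "Tallenna"}
changed = refresh_localization_cache(cache, upstream)
```

The changed-key list is what would drive cache invalidation for the wikis that use those messages.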
 * Incident recently of malicious translation
 * Fundamental problem is that the localization process needs some core updates -> how can we make this a seamless process from the developer to the user on the wiki
 * Q from Corey for product people - should source code translations be managed on the front end?
 * Is there a reason in principle for it being part of the core process? Josh says there is not a lot of interest in that; he wants to provide guidance to translate it more effectively. Example of Kiwix as having expectations of support beyond money and expertise; not a desire to bring the entire system in-house
 * Security is a concern - this is done manually - perhaps the bots could be smarter to mitigate the security risk
 * Looking at opening up the translation process to have more micro-translations; is also a massive undertaking
 * Back to S - the code is out of date and from years ago
 * People don’t see it as a core or technical contribution, since translation is seen as a non-tech issue by the users
 * There are neither active admins nor a legal entity for translatewiki.net - personal liability is a concern
 * Subbu: There is a conflict between these two points - quickness and safety. Question about how much security against vandalism there is; there are basic protections against vandalism
 * Adam: Do we have adequate coverage across the languages we’re trying to reach? Do we have sufficient interface translations, proportionate to the number of users in each language?
 * Joaquim: Wikis fight vandalism by having fast edits; having a cache in the db where the live messages are kept on the fly
 * People treat it like wiki-text where you can mix things; putting in labels telling people not to do this
 * Page translation feature: it doesn’t work as it should, because you are inserting markers and that’s a horrible thing to work with; this is not prioritized at the moment. It is used for time-sensitive and important things, like fundraising banners.
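For context, the "markers" referred to here are the Translate extension's translation-unit markers: translatable text is wrapped in `<translate>` tags and each unit gets an id comment. A minimal example of what this markup looks like (structure per the Translate extension's conventions; the unit numbers are illustrative):

```
<translate>
== Fundraising == <!--T:1-->

<!--T:2-->
Please support this year's campaign by translating the banner text.
</translate>
```

Editors have to preserve the `<!--T:n-->` comments by hand when rearranging text, which is the "horrible thing to work with" noted above.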
 * Subbu: To what extent can content translation be used for these things? Can you integrate machine translation into this?
 * No freedom to skip sections of the page, all or nothing; big challenge is that the system has no way to identify these changes and keep them in sync
 * Clarification of Subbu’s q - can you use this as an API?
 * Josh: if both of these are using [?] markup, there are ways of knowing beyond just string-text comparison: this element has changed, therefore this translation needs to change
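The element-based idea above can be sketched as follows. Assuming (hypothetically) that source and translation are segmented into units with stable ids, and each translation records which source revision it was made from, staleness can be detected per unit instead of by diffing whole page text:

```python
def stale_units(source, translation):
    """Find translation units needing (re)translation.

    source: {unit_id: (revision, text)} for the source page.
    translation: {unit_id: source_revision_translated}
    Returns unit ids whose source changed since translation,
    plus units never translated at all.
    """
    stale = []
    for unit_id, (rev, _text) in source.items():
        if translation.get(unit_id) != rev:
            stale.append(unit_id)
    return stale

source = {"T:1": (5, "== Donate =="), "T:2": (7, "New paragraph.")}
translation = {"T:1": 5, "T:2": 6}  # T:2 was translated at an older revision
```

Only `T:2` would be flagged, so a translator sees exactly which unit drifted out of sync.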
 * Fundamental problem is the marking system
 * If we could fix this, things would pick up and more people would be invested in helping
 * J: we should get rid of this part of the translation system - having software enforce something that can be enforced by humans, especially given the different structure of paragraphs and sentences between languages
 * Action item: move toward plain strings instead of HTML
 * Action item: clean up the markup on import
 * Action item: back-end security scrubbing (JavaScript)
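The security-scrubbing action item could start as simply as a review flag on incoming translations. A naive sketch (illustrative only; real sanitization should use a proper HTML sanitizer rather than regexes) that flags translations carrying active content a vandal could smuggle in:

```python
import re

# Patterns for content that should never appear in a translated message
SUSPICIOUS = [
    re.compile(r"<\s*script", re.IGNORECASE),       # inline <script> tags
    re.compile(r"javascript\s*:", re.IGNORECASE),   # javascript: URLs
    re.compile(r"\bon\w+\s*=", re.IGNORECASE),      # onclick=, onerror=, ...
]

def flag_translation(text):
    """Return True if the translated message needs human review."""
    return any(p.search(text) for p in SUSPICIOUS)
```

A smarter bot, as suggested earlier in the session, would combine a filter like this with account heuristics before a translation reaches the localization cache.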
 * Josh: product managers don’t like this and it has failed to be prioritized at the executive level; we prioritize external impacts over internal. However, this is affecting participation, especially since movement strategy documents aren’t being translated effectively due to this broken tool
 * Machine translation is not “ours” but comes from 3rd-party groups; there are not many FOSS options
 * How sustainable is it to rely on proprietary services? How do we solve this problem?
 * Syncing is an issue that we don’t have a solution for at the moment; the supporting pieces don’t exist yet
 * There is also the issue of only being able to translate to a language where the page doesn’t exist yet; sometimes there is a mismatch where one language version is very full but another is a stub, and you cannot use the current tools to fix that
 * Relying on local maps to anchor these translations, which is not effective on a large scale
 * Discussion of article translation:
 * M: For keeping it in sync, how about translating the diffs of the revisions?
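M's suggestion can be sketched with a paragraph-level diff: instead of re-translating a whole article, diff two source revisions and surface only the changed or new paragraphs for (re)translation. This is a hypothetical helper using Python's `difflib`, not the Content Translation implementation:

```python
import difflib

def paragraphs_to_retranslate(old_rev, new_rev):
    """Return the paragraphs of new_rev that changed or were added."""
    old = old_rev.split("\n\n")
    new = new_rev.split("\n\n")
    matcher = difflib.SequenceMatcher(a=old, b=new)
    changed = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):   # deletions need no translation work
            changed.extend(new[j1:j2])
    return changed

old = "Intro paragraph.\n\nStable paragraph."
new = "Intro paragraph, updated.\n\nStable paragraph.\n\nBrand new paragraph."
```

Here only the updated intro and the new paragraph would be offered to translators; the stable paragraph's existing translation is left alone.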
 * Subbu: Is translation seen as a seeding tool OR a way of keeping wikis in sync? Keeping them in sync doesn’t seem sustainable.
 * How important is machine translation for us? Very important for the Arabic Wikipedia. Machine translation has issues and security risks, but it is a good starting point, and we should work to improve it and the sustainability of the MT engines. Idea: start utilizing our own tools like Wikidata, which can give us accurate equivalences, as opposed to services like Yandex, especially for non-European languages. Also look into the experience of other open-source translation interfaces. Move away from paragraphs to sentences, as this highlights gaps that people can fill and can help with synchronizing translations.
 * MT should support minority languages better. Wikidata can help translating terms
 * Joaquim: depending on the article, translations are bad enough that fixing the output takes longer than translating from scratch. Also: term-level translation in the editor. We should be aware of digital colonialism; if we rely too heavily on translation we may perpetuate it, with other wikis becoming “shallow copies” of dominant-culture wikis
 * Jon: Question of forking vs. keeping in sync. If you fork en masse, there are maintenance issues; but if you speed it up, there’s a lot of content in languages that don’t have the community to maintain it, so the updates would get lost. Policing and updating is a hard burden to place on communities. Also, if we don’t have a way to interoperate with these 3rd-party systems, we will get left behind
 * Josh: maps of content changes as opposed to trying to keep everything in sync or just allowing forking. “Flagging” of collections of articles across multi language as a possible solution.
 * Q: is any of this heuristics work exposed through an API, or is it internal? (It’s a service.) Is it done by hand? (Yes.) Don’t globalize templates but use a semantic directory -> we always have a machine map
 * Corey: with the translation, we can have a UI that shows changes in other languages, and mixing stuff like the scoring systems into that UI
 * Subbu: as a process question, there is an unresolved question of sync vs. fork; is there a question we can use to frame this?
 * S: This question needs clarification: is it only for new articles, or for syncing, or what?
 * Discussion of machine translation as a service:
 * Do we see a future where we are developing our own engine
 * Now is the time to start thinking about having our own engine; unsure of the time and effort but we have the elements to start, an interface to see this, and other engines to model off of
 * Don’t want to get stuck in a dependency on other engines
 * Maintenance is a big commitment as the languages evolve
 * Something to use in our negotiations with Google, providing them with a good text base and they pro
 * Google was saying, “whatever data Wikipedia has, Google has more”
 * Questions of how parallel things are as opposed
 * Why not consider improving an existing engine instead of building from scratch? We are doing this; Joaquim has an offer of head-hunting for this
 * Cheol: Apertium is a rule-based translator, which is not good for translating from English to Korean. A statistical translation model or deep learning would be better for that language pair. Do we have more data beyond parallel corpora, such as online translators’ activity logs? We can capture the translation process, e.g. online editing behavior such as cursor movement, substituting terms, or hesitating to edit. Could we do a better job than Google using these data, or collaborate with a machine translation service provider on research or development with these assets?
 * Corey: we are implicitly making the decision that we are not investing in a machine translation tool ourselves. Y/n? Either we’re getting rid of it or we’re working with an engine as part of a deal.
 * V: getting good language pairs from one engine is hard (Josh: especially when there isn’t a business incentive), better to use several. Build a plan for technology partnerships.
 * We built an API service that provides translation for devs: translating talk page messages, translating Commons, anything. What are more use cases? What are the high-priority cases?
 * The Android app wants to support caption translations in Commons and structured data
 * We do know which languages people have expressed an interest in; we can recommend that new users or new positions invest in this translation process and enrich Commons, as they might not know about the API
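For a client such as the Android app mentioned above, using the translation API starts with addressing the right route. The cxserver route shown here is an assumption for illustration; verify the exact path and parameters against the live cxserver documentation before relying on it:

```python
BASE = "https://cxserver.wikimedia.org"  # assumed public endpoint

def translate_url(source_lang, target_lang, provider="Apertium"):
    """Build the (assumed) cxserver machine-translation route.

    The caller would POST the HTML fragment to be translated to this
    URL; source_lang/target_lang are language codes, provider is the
    MT backend name.
    """
    return f"{BASE}/v2/translate/{source_lang}/{target_lang}/{provider}"

# e.g. a Spanish-to-Catalan request for a Commons caption
url = translate_url("es", "ca")
```

Keeping the provider as an explicit parameter matches the session's point that no single engine covers all language pairs well.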
 * Action item: readers infrastructure will sync with S on this
 * Action item: Term to term translation
 * Q from V: how do we prioritize language pairs? Do we have a sense of that roadmap?
 * S: we do know about highly active language pairs, and we track when people use the language switcher tool; we do have that data (e.g. Spanish and Catalan)
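The prioritization signal described above reduces to simple aggregation. A sketch assuming a hypothetical log format of (from-language, to-language) switcher events; the most-used pairs, such as Spanish and Catalan, rise to the top of the roadmap:

```python
from collections import Counter

def top_language_pairs(events, n=3):
    """Rank language pairs by switcher usage.

    events: iterable of (from_lang, to_lang) tuples, one per switch.
    Returns the n most common pairs with their counts.
    """
    return Counter(events).most_common(n)

# Hypothetical event stream from the language switcher
events = [("es", "ca"), ("es", "ca"), ("en", "hi"), ("es", "ca"), ("en", "hi")]
```

Real prioritization would weight this by reader population and existing translation coverage, but the usage counts are the data S refers to.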
 * V: a paper was published about this topic by Danny V; have we looked at this? Josh: this is the crazy solution, a Wikidata-esque solution
 * Subbu: language variants are part of a continuum; we should discuss this further