Wikimedia Technical Conference/2018/Session notes/Improving the translation process

Theme: Defining our products, users and use cases
Type: Evaluating Use Cases
Session Leader: Santhosh Thottingal
Facilitator: Leszek Manicki
Scribe: Irene

Description: Translation of content between language projects and localization of our products are important for helping new projects add content and for enabling collaboration between cultures. This session looks into the ways we accomplish this now and tries to identify our goals for the improvements we want to make to these processes.

Questions discussed

Question: How can localization practices be made consistent and well-integrated with translatewiki.net and our other language infrastructure? How do we normalize the method of handling multilingual content (different projects vs. a single project)? How do we handle language variants in a consistent, sustainable way?

Significance: This will identify the various translation workflows and the necessary elements in the architecture. translatewiki.net is a community-maintained project outside of Wikimedia infrastructure, which has many impacts (integration and security) that we should evaluate. Also, does Product have any problem with translatewiki.net as an outside tool (delay, security, etc.), as opposed to incorporating the tool into the wiki projects?

Answer: Product doesn't have any big issues here, but:

  • We should be able to give more guidance to translators.
  • Review is built into translatewiki.net to prevent vandalism.
  • Translations should be machine-reviewable for security, and should be cleaned up on import into the wikis (see the sketch after this list).
  • translatewiki.net currently accepts wikitext, but it could accept only plain strings to prevent code injection.
  • Do we have enough language coverage on translatewiki.net?
  • What is the sustainability of this independent project? It is potentially concerning that it isn't tied to a legal entity.
  • Generally, it would be good to update the Translate extension's page translation feature and consolidate it with Content Translation, but historically this has not been a high priority. It is potentially an equity issue, since the inability to use Translate effectively affects dissemination across languages (e.g. policies and announcements on Meta-Wiki).
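
A minimal sketch of the "clean up on import" idea above, assuming translations are restricted to plain strings. The function and rules are hypothetical, not the actual import pipeline; they only illustrate rejecting or stripping anything that looks like executable markup before a translated message reaches the wikis.

    import re

    SUSPICIOUS = re.compile(
        r"<\s*(script|style|iframe)\b"   # embedded script/markup
        r"|javascript\s*:"               # javascript: URLs
        r"|on\w+\s*=",                   # inline event handlers such as onclick=
        re.IGNORECASE,
    )

    def scrub_translation(text):
        """Return a plain-text version of a translated message, or raise if it
        appears to smuggle in executable code."""
        if SUSPICIOUS.search(text):
            raise ValueError("translation rejected: contains executable markup")
        # Strip any remaining HTML tags so only a plain string reaches the wiki.
        return re.sub(r"<[^>]+>", "", text)

    print(scrub_translation("Saluton, <b>mondo</b>!"))  # -> Saluton, mondo!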


Question: How do we improve translation and moving content across languages?

Significance: This will address the changes that need to be made to the translation workflow to enable better and faster translations.

Answer: There is a product decision to make: are we using Content Translation for forking content (one-time seeding), for regularly syncing translated content, or for some hybrid strategy?
  • Syncing is hard, but forking is problematic unless editors get UX signals that the source articles are changing.
  • Proposed idea: periodically translate source-article diffs to keep articles synced (see the sketch after this list).
  • Term-level translation (dictionary support) would be useful.
  • Not having global templates is a problem, but a semantic mapping of template parameters would also work.
  • Build a plan for technology partnerships with machine translation service providers. Multiple machine translation services are required to meet the needs of various language pairs.
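
A minimal sketch of the "translate the diffs" proposal above: compare the source-article revision that was last translated with the current revision and send only the changed paragraphs to machine translation. machine_translate is a placeholder for whichever MT backend is used; this illustrates the idea, it is not an existing feature.

    import difflib

    def changed_paragraphs(old_wikitext, new_wikitext):
        """Return the paragraphs that were added or modified between two revisions."""
        old = old_wikitext.split("\n\n")
        new = new_wikitext.split("\n\n")
        matcher = difflib.SequenceMatcher(a=old, b=new)
        changed = []
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "insert"):
                changed.extend(new[j1:j2])
        return changed

    def sync_translation(last_translated_rev, current_rev, machine_translate):
        """Translate only what changed, leaving already-translated paragraphs alone."""
        return [machine_translate(p) for p in changed_paragraphs(last_translated_rev, current_rev)]
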
Question: What are the use cases of machine translation in our current and future projects? Can the machine translation service built for the Content Translation project be used for talk pages, image caption translation on Commons, updating existing articles, etc.?

Significance: Currently most communication on the wikis happens within a single language because of how the projects are architected. If more collaboration across languages is desired, especially on non-language-specific projects, then we need to build tools that support communication, including machine-assisted translation of conversations.

Answer:
  • Reading Infrastructure has some immediate use cases; they will contact the Language team.
    • Translating captions in the mobile apps.
  • Not many people are aware of this API (see the sketch below).
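
A minimal sketch of calling the machine translation API behind Content Translation (cxserver) from outside the Content Translation UI, for example to translate a Commons caption. The endpoint path, provider name and response shape below are assumptions based on the public cxserver v2 API and should be checked against its current documentation before use.

    import requests

    CXSERVER = "https://cxserver.wikimedia.org/v2"

    def translate_html(html, source, target, provider="Apertium"):
        """Translate an HTML fragment between two languages via cxserver."""
        url = "{}/translate/{}/{}/{}".format(CXSERVER, source, target, provider)
        resp = requests.post(url, data={"html": html}, timeout=30)
        resp.raise_for_status()
        return resp.json()["contents"]  # assumed key for the translated HTML

    # Example: translate a caption from Spanish to Catalan.
    # print(translate_html("<p>Un gato en la ventana</p>", "es", "ca"))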

Important decisions to make

What are the most important decisions that need to be made regarding this topic?
1. Open question (strategy): Is Content Translation about forking or syncing content across wikis? (See the answer to the second question above.)
Why is this important?

Will the overuse of English as the source wiki cause digital colonisation (i.e. minimize the importance of the other language wikis)?

What is it blocking?

Many strategies look at translation as a method for filling content gaps, but without a clear strategy, development cannot proceed coherently.

Who is responsible?

Product, Technology

2. How comfortable are we with depending on third-party proprietary MT?

Decision: From the discussions, we're not building our own MT engine, so we will use proprietary engines, but not from a single provider, since many language pairs are not supported well enough by any single engine. Agreements need to be made carefully. We are open to helping open-source MT engines with our corpora and with grants.

Why is this important?

More and more content depends on the goodwill and shared purposes of corporations.

What is it blocking?

A single type of machine translation technology is still not good enough for multiple language pairs.

Who is responsible?

Technology and Product leadership

3. translatewiki.net is critical infrastructure for the WMF, but it is not part of our infrastructure. Its sustainability is a question.
Why is this important? What is it blocking? Who is responsible?

Platform

Action items

What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
1. There are use cases for the content translation service. Reading Infrastructure and Android will sync with Santhosh about using the API in an upcoming Android feature.
Why is this important?

The API is available and free to use for anybody.

What is it blocking?

There seems to be a lack of awareness of the API.

Who is responsible?

Reading infrastructure

2. Global templates need to be prioritized and productized. Alternatively, a semantic map of template parameters is required for translating content across languages (see the sketch at the end of this item).
Why is this important?

The lack of global templates prevents adapting content from one language to another. Many templates, such as infoboxes, hold very important data about an article.

What is it blocking?

Product prioritizing and roadmap definition

Who is responsible?

Core platform
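
A minimal sketch of the "semantic map of template parameters" alternative mentioned above. The template and parameter names are purely illustrative, not an actual cross-wiki mapping; the point is that a machine-readable map would let a translation tool rewrite a template call into the target wiki's own template vocabulary without global templates.

    # Illustrative mapping: source (lang, template) -> target templates and parameters.
    TEMPLATE_MAP = {
        ("en", "Infobox person"): {
            "target": {"fr": "Infobox Biographie", "es": "Ficha de persona"},
            "params": {
                "name":       {"fr": "nom",               "es": "nombre"},
                "birth_date": {"fr": "date de naissance", "es": "fecha de nacimiento"},
            },
        },
    }

    def adapt_template(source_lang, template, params, target_lang):
        """Map a template invocation onto the target wiki's equivalent template."""
        entry = TEMPLATE_MAP[(source_lang, template)]
        mapped = {
            entry["params"][k][target_lang]: v
            for k, v in params.items()
            if k in entry["params"] and target_lang in entry["params"][k]
        }
        return entry["target"][target_lang], mapped

    print(adapt_template("en", "Infobox person",
                         {"name": "Ada Lovelace", "birth_date": "1815-12-10"}, "fr"))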

3. The Translate extension needs technical maintenance addressing its technical and security issues (e.g. VisualEditor integration). translatewiki.net's position as a separate entity needs to be examined to see how to support it better.
Why is this important?

Page translation is used for very critical tasks inside the Foundation (such as strategy, fundraising, and policies). There is no reason to discard it.

What is it blocking?

Page translation not working with VisualEditor is a blocker for better editing.

Localization updates are affected by the technical debt.

Who is responsible?

Product (Language)

New Questions

What new questions did you uncover while discussing this topic?
Is it a better strategy for us to develop our own translation engines and software, or to continue working with third parties and trading our data for their engines? And how do we maintain sustainable practices if working with proprietary systems?
Why is this important?

If we need to develop our own tools, it is a long process and we need to start building them sooner rather than later. If we stay with third-party engines, we need to safeguard our interests so we are not left out in the cold.

What is it blocking? Who is responsible?

Detailed notes


  • Background
  • 4 types of translation we are doing -
    • localization, translating interface messages
    • translatewiki.net and apps
    • Started as a pet project of Niklas in 2006; it has high numbers of translators nowadays
    • Used by Phabricator
    • Key to making our software available in all other languages
    • Page translation - a kind of localization of front-facing banners, technical documents - better for literal translations, no nuance to the language
  • Machine translation for translating articles
  • Translation service: machine translation of content using HTML
    • Using our own algorithms for this type
    • "Smart" translation on top of the raw machine translation
  • Check-in of handout / background information
    • Question: Is the "page translation" mentioned in the handout the operation behind the "Translate" button one can see on wiki pages? Yes.
  • Main questions of the session:
    • How can localization practices be made consistent and well-integrated with translatewiki and our other language infrastructure? How do we normalize the method of handling multilingual content (different projects vs. single project)? How do we handle variants in a consistent sustainable way?
  • It used to be a slow process, as an interface message would take a week to appear on the wiki; localization updates happen each night, which updates the localization cache
  • There was a recent incident of a malicious translation
  • Fundamental problem is that the localization process needs some core updates -> how can we make this a seamless process from the developer to the user on the wiki
  • Q from Corey for product people - should source code translations be managed on the front end?
  • Is there a reason in principle for it being part of the core process? Josh says there is not a lot of interest in that; he wants to provide guidance to translate it more effectively. Example of Kiwix as having expectations of support beyond money and expertise; there is not a desire to bring the entire system in-house
  • Security is a concern - this is done manually - perhaps the bots could be smarter to mitigate the security risk
  • Looking at opening up the translation process to have more micro-translations; this is also a massive undertaking
  • Back to Santhosh: the code is out of date and from years ago
  • People don’t see it as a core or technical contribution, since translation is seen as a non-tech issue by the users
  • There are neither active admins nor a legal entity for translatewiki.net; personal liability is a concern
  • Subbu: there is a conflict between these two points, quickness and safety; question about how much security against vandalism there is; there are basic protections against vandalism
  • Adam: Do we have adequate coverage across the languages we're trying to reach? Do we have sufficient interface translations, proportionate to the number of users in each language?
  • Joaquim: Wikis fight vandalism by having fast edits; there is a cache in the database where the live messages are kept on the fly
  • People treat it like wikitext where you can mix things; we are putting in labels telling people not to do this
  • Page translation feature: it doesn’t work as it should, because you are inserting markers and that’s a horrible thing to work with; this is not prioritized at the moment. It is used for time-sensitive and important things, like fundraising banners.
  • Subbu: To what extent can content translation be used for these things? Can you integrate machine translation into this?
  • There is no freedom to skip sections of the page, it is all or nothing; the big challenge is that the system has no way to identify these changes and keep them in sync
  • Clarification of Subbu’s q - can you use this as an API?
  • Josh: if both of these are using [?] mark-up, there are ways of knowing beyond just string text comparison; this element has changed therefore this translation needs to change
  • Fundamental problem is the marking system
  • If we knew we could fix this, things would pick up and more people would be invested in helping
  • J: we should get rid of this part of the translation system, which uses software to enforce something that can be enforced by humans, especially given the different structure of paragraphs and sentences between languages
  • Action Item: move toward plain strings instead of html
  • Action Item: Clean Up the Mark-Up on Import
  • Action Item: Back-end security scrubbing (Javascript)
  • Josh: product managers don't like this, and it has failed to be prioritized at the executive level; we prioritize external impacts over internal ones. However, this affects participation, especially since movement strategy documents aren't being translated effectively because of this broken tool
  • Machine translation is not "ours" but comes from third-party groups; there are not many FOSS options
  • It isn't sustainable (or how sustainable is it?) if we are relying on proprietary services. How do we solve this problem?
  • Syncing is an issue that we don't have a solution for at the moment; it needs things that don't yet exist.
  • There is also the issue of only being able to translate to a language where the page doesn't exist; sometimes there is a mismatch where one language version is very full but another is a stub, and you cannot use the current tools to fix that
  • Relying on local maps to anchor these translations, which is not effective on a large scale
  • Discussion of article translation:
    • M: For keeping it in sync, how about translating the diffs of the revisions?
    • subbu: is translation seen as a seeding tool OR keeping wikis in sync? Keeping them in sync doesn’t seem sustainable.
    • How important is machine translation for us? It is very important for the Arabic Wikipedia; machine translation has issues and security risks, but it is a good starting point and we should work to improve it. How sustainable are the MT engines? Idea: start utilizing our own tools like Wikidata, which can give us accurate equivalences as opposed to services like Yandex, especially for non-European languages; also look into the experience of other open-source translation interfaces. Move away from paragraphs to sentences, as this highlights gaps that people can fill and can help with synchronizing translations.
    • MT should support minority languages better. Wikidata can help with translating terms (see the sketch at the end of these notes)
    • Joaquim: depending on the article, machine translations are bad, and it takes longer to fix them than to just translate from scratch. Also, term-level translation in the editor would help. We should be aware of digital colonialism; if we rely too much on translation, we can be perpetuating it, with other wikis becoming "shallow copies" of dominant-culture wikis
    • Jon: On the question of forking vs. keeping in sync: if you fork en masse, there are maintenance issues; but if you speed it up, there is a lot of content in a language that doesn't have the community to maintain it, so the updates would get lost. Policing and updating is a hard burden to place on communities. Also, if we don't have a way to interrelate with these third-party systems, we will get left behind
    • Josh: maps of content changes, as opposed to trying to keep everything in sync or just allowing forking. "Flagging" collections of articles across multiple languages as a possible solution.
    • Q: Is any of this heuristics work exposed through an API, or is it internal? (It is a service.) Is it by hand? (Yes.) Don't globalize templates but use a semantic directory, so we always have a machine-readable map
    • Corey: with the translation, we can have a UI that shows changes in other languages, and mixing stuff like the scoring systems into that UI
    • Subbu: as a process question, there is an unresolved q of sync vs fork; is there a question we can use to frame this?
    • S: This question needs clarification as to whether it's only for new articles, or for syncing, or what
  • Discussion of machine translation as a service:
    • Do we see a future where we are developing our own engine?
      • Now is the time to start thinking about having our own engine; unsure of the time and effort but we have the elements to start, an interface to see this, and other engines to model off of
    • Don’t want to get stuck in a dependency on other engines
    • Maintenance is a big commitment as the languages evolve
    • Something to use in our negotiations with Google: providing them with a good text base and they pro…
    • Google was saying, "whatever data Wikipedia has, Google has more"
    • Questions of how parallel things are as opposed
    • Why not consider improving an existing engine instead of building one from scratch? We are doing this; Joaquim has an offer of head-hunting for this
    • Cheol: Apertium is a rule-based translator, which is not good for translating from English to Korean; a statistical translation model or deep learning would be better for that language pair. Do we have more data beyond parallel corpora, such as online translators' activity logs? We can capture the translation process, such as online editing behaviour: cursor movement, substituting terms, or hesitating to edit. Can we do a better job than Google using these data, or could we collaborate with a machine translation service provider on research or development with these assets?
    • Corey: we are implicitly making the decision that we are not investing ourselves in a machine translation tool. Y/n? Either we’re getting rid of it or we’re working with an engine as part of a deal.
    • V: getting good language pairs from one engine is hard (Josh: especially when there isn’t a business incentive), better to use several. Build a plan for technology partnerships.
  • We built an API service that provides translation for developers: translating talk page messages, translating Commons content, anything. What are more use cases? Which are the high-priority cases?
    • The Android app wants to support caption translations on Commons and for structured data
    • We do know what languages people have expressed an interest in; we can recommend new users or new positions to invest in this translation process and enrich Commons, as they might not know about the API
    • Action item: Reading Infrastructure will sync with Santhosh on this
    • Action item: term-to-term translation
    • Q from V: how do we prioritize language pairs? Do we have a sense of that roadmap?
    • S: we do know about highly active language pairs, from tracking when people use the language switcher tool; we do have that data (e.g. Spanish and Catalan)
    • V: a paper was published about this topic by Danny V; have we looked at this? Josh: this is the crazy solution, a Wikidata-esque solution
    • Subbu: language variants are part of a continuum; we should discuss this further
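
A minimal sketch of the term-level ("dictionary") translation idea using Wikidata labels, as mentioned in the discussion above: look a term up on Wikidata and return its label in the target language. It uses the standard wbsearchentities and wbgetentities modules of the Wikidata API; disambiguation and error handling are left out.

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def translate_term(term, source_lang, target_lang):
        """Return the Wikidata label of `term` in `target_lang`, if one exists."""
        search = requests.get(API, params={
            "action": "wbsearchentities", "search": term,
            "language": source_lang, "format": "json",
        }, timeout=30).json()
        if not search.get("search"):
            return None
        qid = search["search"][0]["id"]  # top match; real use needs disambiguation
        entity = requests.get(API, params={
            "action": "wbgetentities", "ids": qid, "props": "labels",
            "languages": target_lang, "format": "json",
        }, timeout=30).json()
        labels = entity["entities"][qid].get("labels", {})
        return labels.get(target_lang, {}).get("value")

    # Example: print(translate_term("cat", "en", "ml"))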