Content translation/Product Definition/Round Trip

From mediawiki.org

This document tries to identify whether clean diffs are required when content goes through wikitext <-> HTML conversions. We list all known potential scenarios where these conversions happen as use cases. Note that not all of these use cases are within the scope of the development plan; they may or may not appear in a future product roadmap.

Consider the use case: As a Hindi Wikipedian, I want to translate a source article in enwiki to a target article in hiwiki.

Scenario 1: hiwiki article does not exist

  • User opens the English article, translates, publishes to target wiki.
  • Usual wiki editing features are used for further edits (CX not used for later versions).
  • Since CX created the target article, there is no earlier revision to compare against, so diffs are not a concern here.

Scenario 2: As Scenario 1, but the user stops before publishing, saving as draft in Draft namespace. User then comes back to continue translation using Translation Center.

  • User sees all of the previous translation, with no data lost.
  • User publishes like Scenario 1, or repeats Scenario 2 any number of times.
  • Diff between two versions in draft namespace is clean. For example, if the user added a single new paragraph, that alone is the diff.
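The "clean diff" requirement for drafts can be illustrated with a minimal sketch. It assumes that paragraphs are separated by blank lines and uses Python's standard `difflib`; the function name `paragraph_diff` and the sample drafts are hypothetical and are not part of CX.

```python
import difflib


def paragraph_diff(old_text, new_text):
    """Return only the added or removed paragraphs between two draft revisions."""
    # Assumption: a paragraph is any run of text separated by a blank line.
    old_paras = old_text.split("\n\n")
    new_paras = new_text.split("\n\n")
    changes = []
    for line in difflib.ndiff(old_paras, new_paras):
        # ndiff prefixes additions with "+ " and removals with "- ".
        if line.startswith(("+ ", "- ")):
            changes.append(line)
    return changes


draft_v1 = "First paragraph.\n\nSecond paragraph."
draft_v2 = draft_v1 + "\n\nThird paragraph."
changes = paragraph_diff(draft_v1, draft_v2)
print(changes)
```

If saving and reloading a draft altered unrelated paragraphs (for example through a lossy wikitext <-> HTML conversion), those paragraphs would also show up in `changes`, which is what a clean round trip is meant to avoid.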

Scenario 3: Scenario 1 or 2 has happened, and the target article has been published in hiwiki. Now a user wants to translate some updated content from the enwiki article. (For now, we assume the target article is unchanged since publication.)

  • The parallel translation text is reconstructed:
    • The previous source article snapshot is loaded from enwiki's history.
    • The unchanged target article is loaded from hiwiki.
    • The alignment mapping between source blocks and target blocks is reloaded from the translation memory (to avoid error-prone re-alignment).
  • Subsequent changes to the source article are merged automatically into the parallel text:
    • Changes are highlighted in the source text, for clarity.
    • Parallel changes are made to the target text automatically, with segments deleted, inserted as candidates or marked as "outdated" accordingly. Highlighting is applied for clarity.
    • The user "resolves" the translation by translating, modifying or undoing changes to text, then saves. (The user is free to change anything, just as in normal wiki editing.)
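The reconstruction step can be sketched as a simple join over a stored alignment. This is only an illustration under stated assumptions: the block identifiers, the `ParallelBlock` type, the `reconstruct_parallel_text` function and the shape of the alignment mapping are all hypothetical, and the bracketed target strings are placeholders rather than real translations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelBlock:
    source_id: str
    source_text: str
    target_text: Optional[str]  # None when the block was never translated


def reconstruct_parallel_text(source_blocks, target_blocks, alignment):
    """Pair each source block with its target block using a stored alignment.

    source_blocks: ordered list of (block_id, text) from the old source snapshot
    target_blocks: dict mapping target block_id -> text
    alignment:     dict mapping source block_id -> target block_id
    """
    parallel = []
    for source_id, source_text in source_blocks:
        target_id = alignment.get(source_id)
        # A missing alignment entry yields target_text=None.
        parallel.append(
            ParallelBlock(source_id, source_text, target_blocks.get(target_id))
        )
    return parallel


source_blocks = [("s1", "Intro"), ("s2", "History")]
target_blocks = {"t1": "[hi] Intro"}
alignment = {"s1": "t1"}
parallel = reconstruct_parallel_text(source_blocks, target_blocks, alignment)
for block in parallel:
    print(block)
```

Reusing the stored mapping, rather than recomputing it, reflects the point made above about avoiding error-prone re-alignment.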

Notice that applying changes as described above requires a change annotation system (backend and UI) that can record the following information in the target text:

  • Candidate deletions (where a source segment has been deleted)
  • Candidate insertions (where a source segment has been inserted)
  • Outdated segments (where a source segment has been modified): logically this is just a deletion followed by an insertion, with the old target sentence being a likely translation memory (TM) candidate, but it may look different in UI terms.

This is a little like "track changes" in a word processor (the user can accept or reject each chunk). It's also a little like a 3-way merge in a system like git. The main difference from both is that to actually apply an insertion, the user must translate it (or at least approve the candidate text).
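As a rough sketch of how the three annotation kinds could be derived, the following compares the old and new source segments with `difflib.SequenceMatcher` and projects the result onto the target. It assumes a one-to-one alignment between old source segments and old target segments; the function name `annotate_target`, the annotation labels and the sample segments are hypothetical.

```python
import difflib
from itertools import zip_longest


def annotate_target(old_source, new_source, old_target):
    """Return (kind, old_target_text, new_source_text) tuples for the target side.

    Assumption: old_source[i] was translated as old_target[i].
    """
    if len(old_source) != len(old_target):
        raise ValueError("old source and old target must be aligned one-to-one")
    matcher = difflib.SequenceMatcher(a=old_source, b=new_source, autojunk=False)
    annotated = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs = zip(old_target[i1:i2], new_source[j1:j2])
            annotated += [("unchanged", tgt, src) for tgt, src in pairs]
        else:
            # "delete", "insert" and "replace" are handled uniformly: pair up
            # what was removed with what was added, position by position.
            pairs = zip_longest(old_target[i1:i2], new_source[j1:j2])
            for tgt, src in pairs:
                if tgt is not None and src is not None:
                    annotated.append(("outdated", tgt, src))
                elif tgt is not None:
                    annotated.append(("candidate_deletion", tgt, None))
                else:
                    annotated.append(("candidate_insertion", None, src))
    return annotated


old_source = ["S1", "S2", "S3", "S4"]
old_target = ["[hi] S1", "[hi] S2", "[hi] S3", "[hi] S4"]
new_source = ["S1", "S2 (edited)", "S4", "S5"]
annotated = annotate_target(old_source, new_source, old_target)
kinds = [kind for kind, _, _ in annotated]
print(kinds)
```

Nothing is applied automatically here: each non-"unchanged" entry is only a candidate, matching the point that the user must translate or approve an insertion before it takes effect.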

Scenario 4: As Scenario 3, but the target article also contains some changes since publication.

  • The parallel translation text is reconstructed and changed as in the first two parts of Scenario 3 (but loading the old target article from hiwiki's history).
  • Changes on the target side since translation are merged into the parallel text.
    • Conflicts are marked as appropriate.
    • Further highlighting is applied for clarity.

Notice that the change annotation system requires two extra features over the previous scenario:

  • Local changes (where the target text has been modified since translation)
  • Conflicts (where source changes and target changes cannot both be applied)
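The two extra annotation kinds can be illustrated with a per-segment three-way comparison. This sketch makes a simplifying assumption that is not stated in the scenario: any segment changed on both sides is treated as a conflict. The function name `classify_segment` and the labels are hypothetical.

```python
def classify_segment(source_changed, base_target, current_target):
    """Classify one aligned segment in the style of a three-way merge.

    source_changed: True if the source segment changed since translation
    base_target:    target text as originally published by the translation
    current_target: target text as it stands now in the target wiki
    """
    target_changed = base_target != current_target
    if source_changed and target_changed:
        # Assumption: overlapping edits are always reported as conflicts.
        return "conflict"
    if target_changed:
        return "local_change"
    if source_changed:
        return "outdated"
    return "unchanged"


print(classify_segment(False, "[hi] S1", "[hi] S1"))
print(classify_segment(True, "[hi] S2", "[hi] S2"))
print(classify_segment(False, "[hi] S3", "[hi] S3, locally edited"))
print(classify_segment(True, "[hi] S4", "[hi] S4, locally edited"))
```

A real system might be able to reconcile some overlapping edits automatically; the strict rule above is only the simplest way to show where conflicts arise.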

Scenario 5: As Scenario 4, but the source text is editable as well.

  • The system of marking insertions, deletions, outdated segments, local changes and conflicts is applied to the source text as well as the target text.
  • The user can modify and publish the source article too (so the words 'source' and 'target' are less meaningful).

In data terms, this is not much harder than Scenario 4 (it is quite similar to doing Scenario 4 twice, once in each direction). However, there are interesting UI challenges.