Content translation/Product Definition/Abuse prevention

From mediawiki.org

With machine translation available, a translator may publish a machine-translated version without any manual editing or proofreading. Such an article is of low quality and is not desired. To address this issue, we can use three strategies:

  1. Educate. Convey that the focus is more on quality than on quantity.
  2. Warn. Detect patterns that may lead to low-quality content (unmodified automatic translation or pasted text).
  3. Inform the community. Allow other users to easily find potentially problematic content.

Translation Instructions

When the user opens the translation view, the tools column shows an instructions card conveying that the translator is not supposed to publish machine translation as is. Manual editing and proofreading are encouraged so that the article reads more naturally in the target language.

Abuse detection

Warning shown when machine translation is used for translating the article without any manual edits

As the translation progresses, an abuse detection algorithm keeps track of the amount of manual edits and unmodified machine translation. When the algorithm detects potential misuse of machine translation, a warning card appears in the tools column to inform the translator. The algorithm works as follows:

  1. A threshold is defined for unmodified machine-translated content. It is the percentage of the total translation that consists of unmodified machine-translated sections. Its value is 95%. The value is currently not configurable, but depending on testing and feedback it might become configurable in the future.
  2. The amount of translation is not counted as the number of sections translated. It is counted as the cumulative weight of the translated sections. The weight of a section is defined as the ratio of the number of characters in the plain-text version of the section to the total number of characters in the plain-text version of the source article. This means that a section title has less weight than a paragraph, and a small paragraph has less weight than a bigger one.
  3. Since the threshold is compared against the relative amount of machine translation, there is a practical issue: 100% machine translation is reached as soon as the user clicks on the first section. Theoretically, at this point, the whole translation is machine translation. But we don't show the machine translation misuse warning at this stage, even though the threshold is met. We also consider the number of sections translated. If 5 sections are translated with unmodified machine translation, we show the abuse warning. That means that if a translator keeps clicking the placeholders and fills more than 5 sections, the abuse warning will appear in the translation column. This causes a problem when the source article is very small, so that 5 sections make up most of the article content, or when the source article has fewer than 5 sections. To address this, we refine the 'more than 5 sections' rule as follows: 75% of the total translation or 5 sections, whichever comes first.
  4. The following changes to a section are considered manual edits, and a section with any of these changes is not counted as unmodified machine-translated content:
    1. Any edits in the section by keypresses
    2. Adapting a link - adding or removing a link
    3. Adapting a reference - removing a reference
    4. Format changes to the section - making selected text bold, italic or underlined, or turning it into a bulleted or numbered list
  5. For languages without machine translation, the source text is used as a fallback. The source text is also used as a fallback if machine translation is disabled. For abuse detection, unmodified source text is also treated as unmodified MT.
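
The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Content Translation code: the names `Section`, `section_weight` and `should_warn` are hypothetical, "5 sections" is read as "at least 5 sections of unmodified machine translation", and "75% of total translation" is read as "the translated sections cover at least 75% of the source article by weight". Sections filled with unmodified source text (the fallback case) are modelled the same way as unmodified MT.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the text above; the names are hypothetical.
MT_SHARE_THRESHOLD = 0.95     # share of unmodified MT in the translation
MIN_UNMODIFIED_SECTIONS = 5   # assumption: "5 sections" means at least 5
MIN_PROGRESS = 0.75           # assumption: 75% of the source article by weight


@dataclass
class Section:
    source_text: str               # plain-text version of the source section
    translated: bool = False       # True once the placeholder has been filled
    manually_edited: bool = False  # keypress, link/reference or format change


def section_weight(section: Section, source_chars: int) -> float:
    """Weight = characters in the section / characters in the source article."""
    return len(section.source_text) / source_chars


def should_warn(sections: List[Section]) -> bool:
    """Return True when the misuse warning would be shown under the rules above."""
    source_chars = sum(len(s.source_text) for s in sections)
    translated = [s for s in sections if s.translated]
    if source_chars == 0 or not translated:
        return False

    # Unmodified MT and unmodified fallback source text are treated alike.
    unmodified = [s for s in translated if not s.manually_edited]

    # Cumulative weight of everything translated so far.
    progress = sum(section_weight(s, source_chars) for s in translated)
    if progress == 0:
        return False  # only empty sections have been translated
    unmodified_weight = sum(section_weight(s, source_chars) for s in unmodified)

    # Share of the translation that is still unmodified MT.
    mt_share = unmodified_weight / progress

    # The threshold alone is not enough: it is already met after the first click.
    enough_content = (
        len(unmodified) >= MIN_UNMODIFIED_SECTIONS or progress >= MIN_PROGRESS
    )
    return mt_share >= MT_SHARE_THRESHOLD and enough_content
```

For example, in an article of ten equally long sections, filling one section with unmodified MT gives an MT share of 100% but no warning; filling five does trigger it. In a two-section article where the first section holds 80% of the text, filling only that section already triggers the warning through the 75% rule.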