Content translation/Developers/Progress calculation

From mediawiki.org

Translation Progress Algorithm

The algorithm is described here as three versions, 1, 2 and 3, implemented in that order. Each version improves on the previous one.

Version 1

The source article is divided into sections. Examples of sections are headers, tables, paragraphs, and images. They are the logical and structural units of the whole article.

In simple terms, a translation is 100% complete when some translated content exists for every section. Within a section, we do not consider whether every sentence has been translated, because translation often involves summarizing paragraphs or whole sections.

For example, if a source paragraph has 10 sentences and the user summarizes them in 5 sentences, we consider the translation of that paragraph done, even though only 50% of the sentences were translated.

So, if the source article has 10 paragraphs, 5 images, 3 headings, and 2 tables, we have 20 units of translation. If the translator translates 3 paragraphs, 2 headings, 1 table, and 3 images, 9 units of translation are done. The progress is 9/20 = 45%.
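The unit counting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `progress_v1` and the dictionary layout are hypothetical and are not taken from the Content Translation code base. The example uses the counts from the paragraph above (20 source units; 3 paragraphs, 2 headings, 1 table, and 3 images translated).

```python
def progress_v1(source_units, translated_units):
    """Return the percentage of source units that have some translation.

    Both arguments map a section type (hypothetical labels such as
    "paragraph") to a count of sections of that type.
    """
    total = sum(source_units.values())
    if total == 0:
        return 0.0
    # Cap each type at its source count so stray entries cannot exceed 100%.
    done = sum(
        min(translated_units.get(kind, 0), count)
        for kind, count in source_units.items()
    )
    return 100.0 * done / total


source = {"paragraph": 10, "image": 5, "heading": 3, "table": 2}
translated = {"paragraph": 3, "heading": 2, "table": 1, "image": 3}
print(progress_v1(source, translated))  # 45.0
```

Every section counts as exactly one unit here, whatever its length.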

Version 2

In version 1, translating a header counted the same as translating a long paragraph. This version improves on that.

Instead of counting every section as 1 unit, regardless of whether it is a header or a paragraph, we assign each section a weight that depends on its size.

By size, we mean the amount of data to be translated. Technically, this is the string length of the plain-text version of the section.

A header with string length 10 has weight 10. A paragraph with string length 100 has weight 100. The total number of units to translate is the sum of all these weights, which is roughly the total number of characters in the plain-text version of the source article.

When a translator translates a paragraph with weight 100, the translation progresses by 100 units. Note that we don't care whether the translation is 10 characters, 1 sentence, or all sentences.
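A minimal Python sketch of the weighted calculation follows. It assumes each section is available as a pair of plain-text strings (source, translation); the name `progress_v2` and this data layout are hypothetical, chosen only for illustration.

```python
def progress_v2(sections):
    """Return progress as a percentage, weighting sections by source length.

    `sections` is a list of (source_text, translation_text) pairs.
    A section counts as translated if its translation is non-empty,
    regardless of how short that translation is.
    """
    total = sum(len(src) for src, _ in sections)
    if total == 0:
        return 0.0
    done = sum(len(src) for src, tgt in sections if tgt.strip())
    return 100.0 * done / total


sections = [
    ("H" * 10, "T" * 8),    # header, weight 10, translated
    ("P" * 100, "T" * 10),  # paragraph, weight 100, translated as a short summary
    ("P" * 90, ""),         # paragraph, weight 90, not translated
]
print(progress_v2(sections))  # 55.0
```

The total weight is 200 and the translated weight is 110, so the progress is 55%. Under version 1 the same article would score 2 of 3 units.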

Version 3

In the previous versions, we gave full freedom to summarize a paragraph: a 1000-character paragraph counted as translated even if the translation was 10 characters long. In this version we set a threshold for summarization. If the length of the translation is below 5% of the string length of the source, we don't count that paragraph as translated. For example, if the translation has 40 characters and the source has 1000 characters, the paragraph is not counted as translated. Note that this compares string lengths across two languages, and possibly across two scripts.
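The threshold check can be sketched in Python as follows. The names `is_translated`, `progress_v3`, and `THRESHOLD_PERCENT` are hypothetical, and the sketch assumes that a translation of exactly 5% of the source length is accepted, since the text only excludes translations below 5%. The comparison uses integers to avoid floating-point rounding at the boundary.

```python
THRESHOLD_PERCENT = 5  # translations shorter than this share of the source do not count


def is_translated(source_text, translation_text, threshold=THRESHOLD_PERCENT):
    """Return True if the translation is at least `threshold` percent of the source length."""
    if not source_text:
        return False
    return len(translation_text) * 100 >= threshold * len(source_text)


def progress_v3(sections, threshold=THRESHOLD_PERCENT):
    """Length-weighted progress that ignores translations below the threshold."""
    total = sum(len(src) for src, _ in sections)
    if total == 0:
        return 0.0
    done = sum(len(src) for src, tgt in sections if is_translated(src, tgt, threshold))
    return 100.0 * done / total


sections = [
    ("S" * 1000, "T" * 40),  # 4% of the source length: not counted
    ("S" * 1000, "T" * 50),  # 5% of the source length: counted
]
print(progress_v3(sections))  # 50.0
```

Because `len` counts characters rather than words or sentences, the ratio is only a rough measure when the source and target languages use different scripts.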