Content translation/Developers/Segmentation

Background information
Sentence segmentation is the splitting of text into individual sentences, titles, captions etc. It is weakly language-specific; a single straightforward mechanism works reasonably well for many languages, but can be improved on with language-specific rules. For ContentTranslation we care most about source language segmentation.

Unicode TR29 gives a default algorithm for plain text: http://www.unicode.org/reports/tr29/#Sentence_Boundaries. For many languages including English, this works well but not perfectly (‘“Hello!” Mr Smith said’ versus ‘“Goodbye!” Mr Smith left.’). CLDR 1.4 is starting to collect language-specific segmentation rules: http://cldr.unicode.org/index/downloads/cldr-1-4/.

Further issues arise when segmenting HTML and other rich text. In particular, text features such as italics can span segments and start/end in the middle of a segment. This means that HTML’s tree structure interacts badly with segmentation: it can be hard to demarcate and manipulate a segment, as it does not exist as an HTML node.

Functional requirements for segmentation

 * Handle plain text (generic algorithm)
 * Handle plain text (language-specific extensions)
 * Handle markup that:
   * can appear in/across segments
   * causes a segment boundary
   * wraps segments
   * is not part of any segment

Segmented structure:

TODO: Inter-segment space (between the first two segments above) may need localizing
 * In Chinese, it would need to become a full-width space
 * In other languages, space may need to be removed (or added).

Segments

 * 1) The participants were
 * 2) “in fear of their lives. Nobody knew what was happening”, even though the researchers were present.
 * 3) Nobody understood

Note: to render a segment individually, the tags need rebalancing (bold above)

Non-segmented HTML with placeholders:

Translated segments:


 * Roedd y cyfranogwyr yn “ofni am eu bywydau.”
 * Er fod yr ymchwilwyr yn bresennol, “Ni wyr neb beth a ddigwyddai”.
 * Doedd neb yn deall.

Note: in this case, the italic tags in the second translated segment are no longer at the start.

Reconstructed document:

The translation has two ... sections, whereas the source only had one. The quotation has been split in half, for grammatical/stylistic reasons.

Segment plain text
This is the approach used in most texts that discuss the linguistic aspects of segmentation, e.g. TR29. It is not sufficient to define rich text segmentation fully.
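
As a rough illustration of plain-text segmentation, the sketch below uses the standard `Intl.Segmenter` API, which recent JavaScript engines typically implement on top of the Unicode (UAX #29) rules and CLDR data. The function name `splitPlainText` is made up for this example and is not taken from the ContentTranslation code.

```javascript
// Split plain text into sentence segments using the engine's built-in
// segmenter. Assumes a runtime that provides Intl.Segmenter.
function splitPlainText(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  // Each item yielded by segment() has a `segment` string; the pieces
  // include trailing whitespace, so joining them restores the input.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const parts = splitPlainText('Hello there. How are you?', 'en');
console.log(parts);
```

Because the pieces concatenate back to the original string, segment offsets can be recovered by summing lengths.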

Segment an HTML string
This approach involves augmenting plaintext segmentation rules (e.g. TR29) to cope with HTML tags. This is probably not a sensible approach in the long run: a pre-existing HTML parser should be used.

Segment a DOM tree
Apply plaintext segmentation rules (e.g. TR29) within text nodes (counting multiple consecutive text nodes as one). Apply specific rules for each HTML element.

This is the most obvious approach, but working with the segmented structure is challenging: segment start/end pairs may not be balanced at a tree level, and so some segments cannot form branches. Therefore, you cannot wrap each segment in a ..., say, without rewriting the document with re-balanced tags.
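
The rebalancing problem can be sketched as follows, assuming the document has already been flattened into a well-nested list of open, close and text tokens. The token shape and the function name `renderSegment` are hypothetical, chosen only for illustration.

```javascript
// Render tokens[start..end) as a standalone, balanced HTML fragment.
// Tags that are open when the segment starts are re-opened at the front;
// tags still open when it ends are closed at the back.
function renderSegment(tokens, start, end) {
  const open = [];
  const apply = (tok) => {
    if (tok.type === 'open') open.push(tok.name);
    else if (tok.type === 'close') open.pop();
  };
  // Work out which tags are open at the start of the segment.
  tokens.slice(0, start).forEach(apply);
  let html = open.map((name) => `<${name}>`).join('');
  for (const tok of tokens.slice(start, end)) {
    apply(tok);
    if (tok.type === 'text') html += tok.text;
    else if (tok.type === 'open') html += `<${tok.name}>`;
    else html += `</${tok.name}>`;
  }
  // Close whatever is still open, innermost first.
  html += open.slice().reverse().map((name) => `</${name}>`).join('');
  return html;
}

const tokens = [
  { type: 'text', text: 'The participants were ' },
  { type: 'open', name: 'b' },
  { type: 'text', text: 'in fear. ' },
  { type: 'text', text: 'Nobody knew' },
  { type: 'close', name: 'b' },
  { type: 'text', text: ' what happened.' }
];
console.log(renderSegment(tokens, 0, 3)); // The participants were <b>in fear. </b>
console.log(renderSegment(tokens, 3, 6)); // <b>Nobody knew</b> what happened.
```

The bold element here spans two segments, so each rendered segment gets its own copy of the tag.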

Segment from a SAX event stream; output HTML fragments
Apply non-validating SAX parsing to the HTML fragment and attach listeners to the interesting contexts, i.e.:
 * 1) text
 * 2) paragraphs, headers
 * 3) links
 * 4) (possibly) alt text or title attributes for images, videos etc.

Within the text handling, build on algorithms like TR29 to do language-specific segmentation.

SAX Based Segmenter: https://gerrit.wikimedia.org/r/#/c/117406/

The HTML content (which need not be valid or complete) is passed to a SAX parser. The parser used is the one from the node-sax Node module, subclassed to meet the needs of CX.

CXSegmenter uses CXParser. CXParser inherits from SAXParser.

The SAX parser emits events as and when it sees start tags, close tags and text. The received content is reconstructed to form the segmented content, which contains segment marks. Segment marks are simply nodes with a class and a segmentId. We are interested in sentences and links, so every sentence and link carries this information: sentences are surrounded by a span, and links have the class and segmentId on the a tag itself. The following example illustrates this.

Identifying links is easy: we just need to listen for “opentag” events with the tag “a”. Identifying sentences, however, requires language-specific algorithms.
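
A minimal sketch of this event-driven marking is shown below. It does not use node-sax: `parseEvents` is a small stand-in tokenizer, and the class names `cx-segment` and `cx-link`, the `data-segmentid` attribute and the function names are all assumptions made for illustration. The sentence rule is deliberately naive, and sentences that span several text nodes are not joined.

```javascript
// Stand-in for a SAX-style parser (NOT node-sax): emits opentag, closetag
// and text events for a simple HTML fragment.
function parseEvents(html, handlers) {
  const tokenRe = /<\/([a-zA-Z][\w-]*)\s*>|<([a-zA-Z][\w-]*)([^>]*)>|([^<]+)/g;
  let m;
  while ((m = tokenRe.exec(html)) !== null) {
    if (m[1] !== undefined) handlers.closetag(m[1].toLowerCase());
    else if (m[2] !== undefined) handlers.opentag(m[2].toLowerCase(), m[3]);
    else handlers.text(m[4]);
  }
}

// Rebuild the fragment, adding segment marks to sentences and links.
function markSegments(html) {
  let out = '';
  let nextId = 0;
  let linkDepth = 0;
  parseEvents(html, {
    opentag(name, attrs) {
      if (name === 'a') {
        // Links carry the class and id on the a tag itself.
        linkDepth++;
        out += `<a class="cx-link" data-segmentid="${nextId++}"${attrs}>`;
      } else {
        out += `<${name}${attrs}>`;
      }
    },
    closetag(name) {
      if (name === 'a' && linkDepth > 0) linkDepth--;
      out += `</${name}>`;
    },
    text(text) {
      // Text inside a link is left alone; the link is already a segment.
      if (linkDepth > 0) {
        out += text;
        return;
      }
      // Naive rule: a sentence runs up to ., ! or ? (or the end of the node).
      out += text.replace(/\S[^.!?]*(?:[.!?]+|$)/g, (sentence) =>
        `<span class="cx-segment" data-segmentid="${nextId++}">${sentence}</span>`);
    }
  });
  return out;
}

console.log(markSegments('<p>Hello world. How are you?</p>'));
```

For the input above, each of the two sentences ends up in its own span, with ids 0 and 1, and the space between them stays outside the spans.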

In English, sentence boundaries are usually marked by punctuation such as a period, question mark or exclamation mark. A header (h1, h2 etc.) also counts as a sentence in the sense of a segment, and a paragraph is a collection of sentences. Punctuation does not always mark the end of a sentence: abbreviations are a special case. For example, “Dr. D. John” should not be split at the periods, and the same applies to numbers such as 1000.000. The rules can be made as sophisticated as the required accuracy demands, but fully accurate segmentation is nearly impossible in practice, for several reasons:


 * 1) The content may not follow the conventions of punctuation. The editor may well not have used punctuation properly. For example, a period followed by a space is the correct style in English, but we cannot always expect it.
 * 2) A fully accurate sentence segmentation system would have to consider quotations inside sentences, e.g. He said, “I will be back soon. Please wait”. This is a complex case, because the quotation contains full sentences. In practice a link can also contain multiple sentences, and italics can span from the middle of one sentence to the middle of the next.
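
The abbreviation and number cases described above can be sketched with a simple rule-based splitter. This is only an illustration under stated assumptions: the abbreviation list is a tiny hypothetical sample, the name `splitSentences` is invented here, and treating any single letter before a period as an initial will misfire on sentences that really end in a one-letter word.

```javascript
// Hypothetical abbreviation list; a real one would be much longer.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'ms', 'prof', 'st']);

function splitSentences(text) {
  const sentences = [];
  let start = 0;
  // Candidate boundary: ., ! or ? followed by whitespace. A period inside
  // a number such as 1000.000 is not followed by whitespace, so it is
  // never a candidate.
  const boundaryRe = /[.!?]+\s+/g;
  let m;
  while ((m = boundaryRe.exec(text)) !== null) {
    const before = text.slice(start, m.index);
    const lastWord = before.split(/\s+/).pop().toLowerCase();
    // Skip a single period after a known abbreviation or an initial ("D.").
    const isAbbreviation = ABBREVIATIONS.has(lastWord) || /^[a-z]$/.test(lastWord);
    if (m[0].trim() === '.' && isAbbreviation) continue;
    const end = m.index + m[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  // Whatever is left after the last boundary is the final sentence.
  if (start < text.length) sentences.push(text.slice(start).trim());
  return sentences;
}

console.log(splitSentences('Dr. D. John arrived. He paid 1000.000 rupees! Was it enough?'));
// [ 'Dr. D. John arrived.', 'He paid 1000.000 rupees!', 'Was it enough?' ]
```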

For all these reasons, the CX functionality should be fault-tolerant with respect to sentence segmentation. Sentence segmentation is used for the following purposes:


 * 1) To split the whole article content in a semantic way, so that the parts can be passed to external service providers
 * 2) To support translation memories, which often work with full sentences
 * 3) To create source-translation pairs for machine learning. We need to know how the user translated sentence A to translation A so that we can better serve another user.
 * 4) In the UX, to keep source-translation pairs visually in sync

In all of the above cases it is acceptable to have wrong segments. They do not break any functionality, but they reduce the quality of the experience to varying degrees.

Sentence segmentation rules for languages
The theoretical way to do this is to implement TR29 of the Unicode standard. At the same time, TR29 recommends tailoring the algorithm to meet practical purposes.

It is also possible to write language-specific rules ourselves, just to meet our needs. What is attempted in https://gerrit.wikimedia.org/r/#/c/117406/ is a segmentation implementation written by ourselves. In testing, it was found to meet the requirements.

The segmentation rules for other languages can either fall back to English or have their own subclasses that override the text splitting logic.

https://gerrit.wikimedia.org/r/#/c/117406/ contains CXParserHi, which inherits from CXParserEn and overrides the ontext method. Hindi sentences are terminated by the Devanagari danda sign (।).

Hindi Segmentation - Regular expression: text.replace( /([a-zA-Zअ-ह]*)([।|!|?][\s])/g, textSplit );
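
The fallback-and-override pattern can be sketched as below. The class names `EnglishTextSplitter` and `HindiTextSplitter` and the method `splitText` are hypothetical stand-ins, not the classes from the Gerrit change, and the regular expressions are simplified for illustration.

```javascript
// Collect regex matches as trimmed, non-empty sentences.
function collectSentences(text, re) {
  return (text.match(re) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

class EnglishTextSplitter {
  // Default rule: a sentence ends at a run of ., ! or ?
  splitText(text) {
    return collectSentences(text, /[^.!?]+[.!?]*\s*/g);
  }
}

class HindiTextSplitter extends EnglishTextSplitter {
  // Override: sentences end at the Devanagari danda (U+0964), ! or ?
  // Inside [...] a '|' would be a literal pipe, so the terminators are
  // listed without separators.
  splitText(text) {
    return collectSentences(text, /[^\u0964!?]+[\u0964!?]*\s*/g);
  }
}

const hindi = new HindiTextSplitter();
console.log(hindi.splitText('यह पहला वाक्य है। यह दूसरा वाक्य है।'));
```

A language without its own subclass would simply use the English splitter, which is the fallback behaviour described above.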

Segment a linear model

Use a tree transformation to change the tags that can occur within segments into character annotations, to make a more linear representation (like VisualEditor does: http://www.mediawiki.org/wiki/VisualEditor/API/Data_Model/Surface ). Denote segments by ranges in the linear model.

This option is more appealing if we can re-use VisualEditor’s code to transform the HTML to/from the linear model. The resulting linear model text sequences have a structure similar to the following:

['H', 'e', 'l', 'l', 'o', ' ', ['w', 'bold'], ['o', 'bold'], ['r', 'bold'], ['l', 'bold'], ['d', 'bold'], '.']

This is similar to a plain text character array, and so is readily amenable to TR29 etc. The main disadvantage is that JavaScript's regular expression engine cannot be used on the array (though it can be used on any character individually, e.g. to test whether it is in the Malayalam block [\u0d00-\u0d7f]).
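
One possible workaround is to project the linear model onto a plain string, run the regular expression there, and map the match offsets back to array indices. The sketch below assumes each array item holds exactly one UTF-16 code unit, so that string offsets and array indices coincide. The function names and the naive sentence rule are illustrative only and are not VisualEditor APIs.

```javascript
// An item is either a bare character or [character, annotation, ...].
function charOf(item) {
  return Array.isArray(item) ? item[0] : item;
}

// Return segments as { start, end } ranges (end exclusive) into the
// linear model, using a naive sentence rule on the projected text.
function segmentRanges(linear) {
  const text = linear.map(charOf).join('');
  const ranges = [];
  const sentenceRe = /\S[^.!?]*(?:[.!?]+|$)/g;
  let m;
  while ((m = sentenceRe.exec(text)) !== null) {
    ranges.push({ start: m.index, end: m.index + m[0].length });
  }
  return ranges;
}

const linear = [
  'H', 'e', 'l', 'l', 'o', ' ',
  ['w', 'bold'], ['o', 'bold'], ['r', 'bold'], ['l', 'bold'], ['d', 'bold'], '.',
  ' ', 'B', 'y', 'e', '.'
];
console.log(segmentRanges(linear));
// [ { start: 0, end: 12 }, { start: 13, end: 17 } ]
```

Slicing the original array with a range keeps the annotations attached to each character, so a segment can be extracted without rebalancing tags.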