Content translation/Product Definition/LinearDoc

In HTML, inline text annotations are represented in a tree structure. This structure is inconvenient for certain string operations, including taking substrings, reordering and performing plaintext-based searches and pattern matches. These operations are important for Content Translation, particularly when performing sentence segmentation and machine translation. LinearDoc is an alternative representation, where the tree structure has been flattened. A whole block of inline marked-up text is stored as a single array of text chunks. Each chunk contains some plaintext, and an array of the inline HTML tags which apply to it. It is then easy to take substrings (by slicing the array), and to perform searches and pattern matches (by matching on the plaintext, then applying the resulting search indexes to the marked-up text).

= Representation of rich text in HTML =

Here is an example Wikipedia-like chunk of rich text:

&lt;a href="Conway"&gt;Conway&lt;/a&gt; stated that young &lt;a href="children"&gt;children&lt;/a&gt; &lt;i&gt;“understand &lt;a href="Object_permanence"&gt;object permanence&lt;/a&gt;. &lt;a href="Concealment"&gt;Concealed&lt;/a&gt; &lt;a href="Object"&gt;objects&lt;/a&gt; feature in their awareness.”&lt;/i&gt;&lt;span typeof="mw:Extension/ref"&gt;&lt;a href="#ref-5"&gt;[5]&lt;/a&gt;&lt;/span&gt; &lt;b&gt;(&lt;a href="Nielsen"&gt;Nielsen&lt;/a&gt; equivalence).&lt;/b&gt;

A simplified representation of the DOM tree would be something like this:

 elementNode 'a'
     textNode 'Conway'
 textNode ' stated that young '
 elementNode 'a'
     textNode 'children'
 textNode ' '
 elementNode 'i'
     textNode '“understand '
     elementNode 'a'
         textNode 'object permanence'
     textNode '. '
     elementNode 'a'
         textNode 'Concealed'
     textNode ' '
     elementNode 'a'
         textNode 'objects'
     textNode ' feature in their awareness.”'
 elementNode 'span'
     elementNode 'a'
         textNode '[5]'
 textNode ' '
 elementNode 'b'
     textNode '('
     elementNode 'a'
         textNode 'Nielsen'
     textNode ' equivalence).'

There are a number of operations which are challenging to perform on either the serialised HTML or on the DOM tree, including:


 * Performing plaintext-based searches and pattern matches
 * Taking substrings
 * Reordering text

= Problems =

== Detecting sentence boundaries ==
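The example text above contains three sentences:

 * Conway stated that young children “understand object permanence.
 * Concealed objects feature in their awareness.”[5]
 * (Nielsen equivalence).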

Unfortunately, it would be difficult to implement a sentence boundary detection algorithm that worked either on the HTML representation or the DOM tree:


 * For the HTML representation, complications include inline tag text (such as &lt;span...&gt; and &lt;i&gt;) and semi-irrelevant text (like the [5] marking a reference).
 * For the DOM representation, regex-like pattern matching across the text of several nodes is difficult.

Since the rules for sentence boundary detection are different for each language, hundreds of pieces of complex code would be needed to do this correctly.

== Manipulating sentences ==
In many sentences, like the first two in the example above, the start and end offset nodes are not siblings. This can happen in two ways:
 * "... Foo &lt;i&gt;bar.  Baz...&lt;/i&gt;" (One end of the sentence is contained in an inline node split by the sentence).
 * "... Foo &lt;/i&gt; bar &lt;i&gt;baz. Foo..." (Both ends of the sentence are contained in an inline node split by the sentence).

To manipulate the sentences individually (e.g. to highlight one sentence, or to find all links in the current sentence), we would like to wrap each sentence in a span tag, or consider it as an independent HTML fragment, or equivalent. This is impossible without complex tag rebalancing (for the HTML representation) or involved tree surgery (for the DOM tree).

== Translating text ==
For the same reasons, it is non-trivial to reorder words and phrases in an inline HTML tree whilst preserving annotation equivalence and structural validity. Machine translation typically changes the word order, so this is a problem.

= The LinearDoc representation =

The inline HTML shown above can be transformed into the following data structure:

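 [
     ['Conway', ['&lt;a href="Conway"&gt;']],
     [' stated that young ', []],
     ['children', ['&lt;a href="children"&gt;']],
     [' ', []],
     ['“understand ', ['&lt;i&gt;']],
     ['object permanence', ['&lt;i&gt;', '&lt;a href="Object_permanence"&gt;']],
     ['. ', ['&lt;i&gt;']],
     ['Concealed', ['&lt;i&gt;', '&lt;a href="Concealment"&gt;']],
     [' ', ['&lt;i&gt;']],
     ['objects', ['&lt;i&gt;', '&lt;a href="Object"&gt;']],
     [' feature in their awareness.”', ['&lt;i&gt;']],
     ['', [], '&lt;span typeof="mw:Extension/ref"&gt;&lt;a href="#ref-5"&gt;[5]&lt;/a&gt;&lt;/span&gt; '],
     [' (', ['&lt;b&gt;']],
     ['Nielsen', ['&lt;b&gt;', '&lt;a href="Nielsen"&gt;']],
     [' equivalence).', ['&lt;b&gt;']]
 ]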
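The flattening step can be sketched in a few lines of Python. This is only an illustration, not the actual LinearDoc implementation: the names <tt>Lineariser</tt> and <tt>linearise</tt> are hypothetical, the markup is assumed to be balanced inline HTML with no void elements, and the special handling of the reference span as an empty-string chunk is not attempted (its contents would simply become ordinary chunks).

```python
from html.parser import HTMLParser


class Lineariser(HTMLParser):
    """Flatten inline HTML into a list of [plaintext, open_tags] chunks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # start tags currently open, outermost first
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written
        self.open_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Assumes balanced markup: the innermost open tag is the one closing
        self.open_tags.pop()

    def handle_data(self, data):
        # Copy the stack so later pushes and pops do not alter this chunk
        self.chunks.append([data, list(self.open_tags)])


def linearise(html):
    parser = Lineariser()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return parser.chunks


html_fragment = (
    '<a href="Conway">Conway</a> stated that young '
    '<i>“understand <a href="Object_permanence">object permanence</a>.</i>'
)
doc = linearise(html_fragment)
```

Each text run is emitted together with a copy of the tag stack at that point, which is why the <tt>&lt;i&gt;</tt> tag ends up repeated on every chunk inside the italic range.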

Unlike the HTML representation, there is no tree structure, only a flat array of text chunks. Each text chunk contains a piece of plaintext, together with the full list of tags that apply to it. This means that a range of italics, say, is not 'opened' and 'closed': instead the <tt>&lt;i&gt;</tt> tag is repeated next to each piece of plaintext to which it applies. The reference span (the span with <tt>typeof="mw:Extension/ref"</tt>) does not appear in the plaintext at all; rather it is an extra annotation that applies to an empty string. It is clear that the HTML representation can be recovered from the LinearDoc representation (with a few minor limitations that can be overcome).
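One possible way to recover HTML from the flat structure is sketched below. The function names are hypothetical, and treating an optional third chunk element as HTML to be emitted verbatim is an assumption made for this sketch. It also shows one of the limitations: adjacent chunks with identical tags are merged into a single element, so two neighbouring links to the same target would come back as one link.

```python
import re
from html import escape


def tag_name(start_tag):
    """Extract the element name from a start tag such as '<a href="x">'."""
    return re.match(r'<\s*([A-Za-z][A-Za-z0-9]*)', start_tag).group(1)


def to_html(chunks):
    out = []
    open_tags = []  # start tags currently open, outermost first
    for chunk in chunks:
        text, tags = chunk[0], chunk[1]
        # Assumption: an optional third element holds HTML to emit verbatim
        inline = chunk[2] if len(chunk) > 2 else ''
        # Length of the shared prefix of the open tags and this chunk's tags
        common = 0
        while (common < len(open_tags) and common < len(tags)
               and open_tags[common] == tags[common]):
            common += 1
        # Close tags that no longer apply, innermost first
        while len(open_tags) > common:
            out.append('</%s>' % tag_name(open_tags.pop()))
        # Open the tags that newly apply
        for tag in tags[common:]:
            out.append(tag)
            open_tags.append(tag)
        out.append(escape(text, quote=False))
        out.append(inline)
    # Close whatever is still open at the end of the block
    while open_tags:
        out.append('</%s>' % tag_name(open_tags.pop()))
    return ''.join(out)


doc = [
    ['Conway', ['<a href="Conway">']],
    [' stated that young ', []],
    ['“understand ', ['<i>']],
    ['object permanence', ['<i>', '<a href="Object_permanence">']],
    ['. ', ['<i>']],
]
html_out = to_html(doc)
```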

An important property of the LinearDoc is that any slice of the array structure is also a valid LinearDoc. This makes it easy to take substrings. (An individual chunk of text in the array can be split easily, by turning it into two consecutive chunks with the same annotations).
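Taking a substring by plaintext offsets could look like the following sketch. The helper name <tt>slice_doc</tt> is hypothetical, and only two-element chunks are handled; chunks with empty plaintext (such as the reference chunk) are dropped by this simple version.

```python
def slice_doc(chunks, start, end):
    """Return the chunks covering plaintext offsets [start, end)."""
    result = []
    pos = 0  # plaintext offset at which the current chunk begins
    for text, tags in chunks:
        chunk_start, chunk_end = pos, pos + len(text)
        pos = chunk_end
        lo = max(start, chunk_start)
        hi = min(end, chunk_end)
        if lo < hi:
            # A partially covered chunk is split; the piece keeps the same tags
            result.append([text[lo - chunk_start:hi - chunk_start], list(tags)])
    return result


doc = [
    ['Conway', ['<a href="Conway">']],
    [' stated that young ', []],
    ['children', ['<a href="children">']],
]
part = slice_doc(doc, 3, 13)
```

Here <tt>part</tt> covers the plaintext "way stated", and the fragment "way" still carries the link annotation of the chunk it was cut from.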

Also, concatenating two LinearDocs results in a valid LinearDoc. This, together with the substring property, means that text can be reordered easily whilst preserving annotations.
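As a small illustration, reordering reduces to ordinary list slicing and concatenation. The three-chunk document below is made-up example data, not taken from the text above.

```python
def plaintext(chunks):
    """Join the plaintext of all chunks, ignoring annotations."""
    return ''.join(text for text, _tags in chunks)


# Hypothetical example data: a bold adjective followed by a linked noun
doc = [
    ['big', ['<b>']],
    [' ', []],
    ['house', ['<a href="House">']],
]

# Move the linked noun in front of the bold adjective, as a translation
# into a noun-adjective language might require.
reordered = doc[2:3] + doc[1:2] + doc[0:1]
```

Each word keeps its own annotation after the move, so no tag rebalancing is needed.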

Finally, annotations are stored separately from the flow of plaintext. Therefore, plaintext offsets are trivially easy to map onto the LinearDoc. This means matching and searching can be applied to the plaintext, and the resulting offsets can then be applied to the annotated version.
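The following sketch shows a pattern match on the plaintext whose offsets are then mapped back onto the annotated chunks. The boundary rule (a full stop followed by whitespace) is deliberately naive and is not a real segmentation algorithm; the helper names are hypothetical.

```python
import re


def plaintext(chunks):
    """Join the plaintext of all chunks, ignoring annotations."""
    return ''.join(text for text, _tags in chunks)


def tags_at(chunks, offset):
    """Return the tags applying to the character at a plaintext offset."""
    pos = 0
    for text, tags in chunks:
        if pos <= offset < pos + len(text):
            return tags
        pos += len(text)
    raise IndexError(offset)


doc = [
    ['“understand ', ['<i>']],
    ['object permanence', ['<i>', '<a href="Object_permanence">']],
    ['. ', ['<i>']],
    ['Concealed', ['<i>', '<a href="Concealment">']],
]

# Naive, illustrative boundary rule: a full stop followed by whitespace
match = re.search(r'\.\s', plaintext(doc))
boundary = match.end()  # plaintext offset where the next sentence starts
next_sentence_tags = tags_at(doc, boundary)
```

The regular expression never sees any markup, yet the resulting offset identifies exactly which chunk, and therefore which annotations, the next sentence begins with.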

= Similar data structures =

LinearDoc was inspired by the document model of VisualEditor, which uses a similar linear structure to allow diff storage and text splicing.

The GramTrans MT system uses a similar approach called "style tags".