Content translation/Product Definition/LinearDoc

In HTML, inline text annotations are represented in a tree structure. This structure is inconvenient for certain string operations, including taking substrings, reordering and performing plaintext-based searches and pattern matches. These operations are important for Content Translation, particularly when performing sentence segmentation and machine translation. LinearDoc is an alternative representation, where the tree structure has been flattened. A whole block of inline marked-up text is stored as a single array of text chunks. Each chunk contains some plaintext, and an array of the inline HTML tags which apply to it. It is then easy to take substrings (by slicing the array), and to perform searches and pattern matches (by matching on the plaintext, then applying the resulting search indexes to the marked-up text).

= Problems with inline HTML tree manipulation =

Consider the following paragraph of inline HTML:

This paragraph contains three sentences:

Unfortunately, the structure makes it difficult to detect sentence boundaries. The need to deal with the presence of inline elements and 'irrelevant' text (like the [5] marking a reference) greatly increases the code complexity required to implement the usual algorithm for English (which essentially looks for sentence-end punctuation followed by whitespace and a capital letter). Since the rules for sentence boundaries are different for each language, this would potentially mean writing hundreds of pieces of complex code.

Also, the first two sentences contain unbalanced HTML (the sentence starts are at a different nesting level from the sentence ends). Therefore it is not easy to manipulate the sentences individually, e.g. a sentence cannot necessarily be wrapped in a span tag. It is possible but not trivial to re-balance the HTML so that each sentence can be treated as a unit.

Finally, translating text typically changes the word order. It is non-trivial to reorder words in an inline HTML tree whilst preserving annotation equivalence and structural validity.

= The LinearDoc representation =

The inline HTML shown above can be transformed into the following data structure:

[ ['Conway', ['']], [' stated that young ', []], ['children', [' ']], [' ', []], ['“understand ', ['']], ['object permanence', ['', '']], ['. ', ['']], ['Concealed', ['', '']], [' ', ['']], ['objects', ['', '']], [' feature in their awareness.”', ['']], ['', [], '[5] '], [' (', ['']], ['Nielsen', ['', '']], [' equivalence).', ['<b>']] ]

Unlike the HTML representation, there is no tree structure, only a flat array of text chunks. Each text chunk contains a piece of plaintext, together with the full list of tags that apply to it. This means that a range of italics, say, is not 'opened' and 'closed': instead the <tt>&lt;i&gt;</tt> tag is repeated next to each piece of plaintext to which it applies. The reference span (the span with <tt>typeof="mw:Extension/ref"</tt>) does not appear in the plaintext at all; rather it is an extra annotation that applies to an empty string. It is clear that the HTML representation can be recovered from the LinearDoc representation (with a few minor limitations that can be overcome).
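As a rough illustration of how the HTML might be recovered, the sketch below walks the flat array while keeping a stack of currently open tags. It is only a sketch under stated assumptions: the function name <tt>linear_doc_to_html</tt> is hypothetical, chunks are simplified to <tt>(text, tags)</tt> pairs, and tags are reduced to bare element names with no attributes.

```python
from html import escape


def linear_doc_to_html(chunks):
    """Rebuild nested inline HTML from a flat list of (text, tags) chunks.

    Hypothetical helper: tags are bare element names, outermost first.
    """
    open_tags = []  # tags currently open, outermost first
    out = []
    for text, tags in chunks:
        # Count how many leading tags are shared with the previous chunk.
        common = 0
        while (common < len(open_tags) and common < len(tags)
               and open_tags[common] == tags[common]):
            common += 1
        # Close tags that no longer apply, innermost first.
        for name in reversed(open_tags[common:]):
            out.append(f"</{name}>")
        # Open the tags that are new for this chunk.
        for name in tags[common:]:
            out.append(f"<{name}>")
        open_tags = list(tags)
        out.append(escape(text, quote=False))
    # Close whatever is still open at the end of the block.
    for name in reversed(open_tags):
        out.append(f"</{name}>")
    return "".join(out)


doc = [
    ("Concealed", ["i", "b"]),
    (" ", ["i"]),
    ("objects", ["i", "a"]),
    (" feature.", []),
]
html = linear_doc_to_html(doc)
print(html)  # <i><b>Concealed</b> <a>objects</a></i> feature.
```

One limitation is visible even in this sketch: two adjacent elements with identical tags are merged into a single element, because the flat form does not record where one element ended and the next began.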

An important property of the LinearDoc is that any slice of the array structure is also a valid LinearDoc. This makes it easy to take substrings. (An individual chunk of text in the array can be split easily, by turning it into two consecutive chunks with the same annotations.)
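The substring property could be sketched as follows. The helper name <tt>slice_doc</tt> and the <tt>(text, tags)</tt> pair layout are assumptions made for illustration, not the names used by Content Translation. Chunks that straddle a slice boundary are split, and each part keeps the same annotations.

```python
def slice_doc(chunks, start, end):
    """Return the chunks covering plaintext offsets [start, end).

    Hypothetical helper: a chunk that straddles a boundary is split,
    and the kept part carries the same tag list as the original.
    """
    result = []
    pos = 0  # plaintext offset of the current chunk's first character
    for text, tags in chunks:
        chunk_start, chunk_end = pos, pos + len(text)
        pos = chunk_end
        lo = max(start, chunk_start)
        hi = min(end, chunk_end)
        if lo < hi:  # the chunk overlaps the requested range
            result.append((text[lo - chunk_start:hi - chunk_start], tags))
    return result


doc = [("Conway", ["a"]), (" stated that ", []), ("children", ["b"])]
part = slice_doc(doc, 3, 13)
print(part)  # [('way', ['a']), (' stated', [])]
```

Note that this simplified version drops chunks whose text is empty, so an annotation attached to an empty string (such as the reference above) would need separate handling.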

Also, concatenating two LinearDocs results in a valid LinearDoc. This, together with the substring property, means that text can be reordered easily whilst preserving annotations.
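A minimal sketch of reordering, assuming chunks are simplified to <tt>(text, tags)</tt> pairs and that the pieces to move happen to fall on chunk boundaries (otherwise a chunk would first be split as described above). The helper <tt>plaintext</tt> is a hypothetical name.

```python
def plaintext(chunks):
    """Join the text of every chunk, ignoring annotations."""
    return "".join(text for text, _tags in chunks)


doc = [("red", ["b"]), (" ", []), ("house", ["i"])]

# Swap the two words by concatenating slices in a new order.
# Each word carries its own tag list, so no re-balancing is needed.
reordered = doc[2:3] + doc[1:2] + doc[0:1]

print(plaintext(reordered))  # house red
print(reordered)             # [('house', ['i']), (' ', []), ('red', ['b'])]
```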

Finally, annotations are stored separately from the flow of plaintext. Therefore, plaintext offsets are trivially easy to map onto the LinearDoc. This means matching and searching can be applied to the plaintext, and the resulting offsets can then be applied to the annotated version.
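To illustrate, the sketch below runs a naive English sentence-boundary pattern (sentence-end punctuation, whitespace, then a capital letter) over the plaintext and applies the resulting offsets to the annotated chunks. The function names, the <tt>(text, tags)</tt> pair layout and the regular expression are all assumptions chosen for the example; real segmentation rules are language-specific and more involved.

```python
import re


def plaintext(chunks):
    """Join the text of every chunk, ignoring annotations."""
    return "".join(text for text, _tags in chunks)


def slice_doc(chunks, start, end):
    """Return the chunks covering plaintext offsets [start, end)."""
    result, pos = [], 0
    for text, tags in chunks:
        chunk_start, chunk_end = pos, pos + len(text)
        pos = chunk_end
        lo, hi = max(start, chunk_start), min(end, chunk_end)
        if lo < hi:
            result.append((text[lo - chunk_start:hi - chunk_start], tags))
    return result


def split_sentences(chunks):
    """Split a block into sentences using offsets found in the plaintext."""
    text = plaintext(chunks)
    # Naive rule: '.', '!' or '?', then whitespace, then a capital letter.
    starts = [m.end() for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text)]
    bounds = [0] + starts + [len(text)]
    return [slice_doc(chunks, a, b) for a, b in zip(bounds, bounds[1:])]


doc = [
    ("Young ", []),
    ("children understand. Concealed", ["i"]),  # boundary falls mid-chunk
    (" objects persist.", []),
]
sentences = split_sentences(doc)
for sentence in sentences:
    print(sentence)
# [('Young ', []), ('children understand. ', ['i'])]
# [('Concealed', ['i']), (' objects persist.', [])]
```

The pattern never sees any markup, so the same matching code works whether or not the sentence boundary falls inside an annotated range.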