User:Nikerabbit/Translate v2

From mediawiki.org

Problem statement

The <translate> tags that the Translate extension uses to mark the translatable parts of wiki pages cannot easily be supported by Parsoid (or by the PHP parser) with their current semantics. First-class support in Parsoid is a requirement for implementing a better user interface in VisualEditor.

Issue details

The issues with the <translate> tags bear some resemblance to those with templates: because templates can be mixed freely, for example by using multiple templates to construct a table, they cannot be fully isolated from the parsing context. In a similar manner, one can use the tags to split a list into multiple units, each containing multiple list items.

Besides these similarities, there are other issues, such as how start-of-line context is handled, as documented in T137751. All three examples below produce a valid heading, but Parsoid cannot know this without re-implementing the specifics of how the tags are currently parsed.

 <translate>== A == <!--T:1--></translate>
 <translate>
 == A == <!--T:2-->
 </translate>
 == <translate><!--T:3--> A</translate> ==

For background, Translate uses the ParserBeforeStrip hook to massage the wikitext: it removes the tags, so the real (PHP) parser never sees them. This was done because the PHP parser, at the time at least, did not provide enough flexibility to handle the markup gracefully without breaking features like the table of contents. Parsoid, by contrast, does see the tags.
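To illustrate why all three heading examples end up as valid headings once the tags are removed before parsing, here is a minimal Python sketch. It is not the extension's actual PHP code: the function names, the regular expressions, and the simplified heading rule are assumptions made for illustration only.

```python
import re

# Hypothetical stand-ins for the pre-parse cleanup: the tags and the
# unit markers are simply deleted from the wikitext.
TAG_RE = re.compile(r"</?translate>")
MARKER_RE = re.compile(r"<!--T:[^>]*-->")

# Simplified heading rule (assumption): a whole line of the form
# "== text ==", with optional trailing whitespace.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def strip_translate_markup(wikitext):
    """Remove <translate> tags and <!--T:n--> markers, leaving plain wikitext."""
    text = TAG_RE.sub("", wikitext)
    return MARKER_RE.sub("", text)


def headings(wikitext):
    """Return the heading texts found after the translate markup is stripped."""
    found = []
    for line in strip_translate_markup(wikitext).splitlines():
        match = HEADING_RE.match(line)
        if match:
            found.append(match.group(2))
    return found


EXAMPLES = [
    "<translate>== A == <!--T:1--></translate>",
    "<translate>\n== A == <!--T:2-->\n</translate>",
    "== <translate><!--T:3--> A</translate> ==",
]

for example in EXAMPLES:
    print(headings(example))
```

A parser that sees the tags, as Parsoid does, cannot reach the same result by looking at line starts alone, because in the first two examples the line that carries the heading begins with, or is surrounded by, the tag rather than with "==".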

Possible evolution paths

There are two directions available for evolving the Translate extension, both of which are based on attaching translation semantics to DOM nodes instead of arbitrary wikitext strings.

  1. No explicit markup: There will be no explicit markup in the source text. Instead, HTML editors like VE will provide tooling to (a) annotate nodes that need to be translated and (b) translate nodes that have been annotated.
  2. Explicit markup with DOM semantics: There will be explicit markup in the source text, but unlike the current translate tag, the contents of the new tag will be treated as well-formed DOM structures. In this proposal, translation markup should be added around DOM nodes such as lists, tables, list items, table rows, table cells, paragraphs, headings, and plain text, rather than at arbitrary places in the wikitext.

Design notes

There are three aspects to the Translate extension: (a) the markup used to identify translation units, (b) the representation of translation units in the backend, and (c) the translation UI for doing the actual translation.

If possible, the extension should ultimately treat these three aspects orthogonally. For example, while there might be translation markup around an entire list, the extension might decide to treat individual items of the list as atomic translation units and present them to the translation UI as individual items. This lowers the markup burden on the user. Alternatively, independent of how the translation units are represented in the backend, a VisualEditor-like UI in a desktop view might present the entire list or table for translation while accepting piecemeal translations and saving them in the backend.
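As a rough illustration of decoupling markup from translation units, the following Python sketch takes the wikitext of a bullet list that would sit inside a single pair of tags and splits it into one unit per top-level item. The function name and the splitting rule (nested items stay with their parent item) are assumptions for illustration, not a description of how the extension works.

```python
def list_items_as_units(wikitext_list):
    """Split a wikitext bullet list into one translation unit per top-level item.

    Assumption of this sketch: nested items ("**", "*#", ...) and other
    continuation lines belong to the preceding top-level item.
    Lines before the first top-level item are ignored.
    """
    units = []
    for line in wikitext_list.splitlines():
        is_top_level = line.startswith("*") and not line.startswith(("**", "*#", "*:", "*;"))
        if is_top_level:
            units.append(line)          # a new top-level item starts a new unit
        elif units and line.strip():
            units[-1] += "\n" + line    # nested or continuation line joins its parent
    return units


marked_list = "* Apples\n** Green apples\n* Pears"
for unit in list_items_as_units(marked_list):
    print(repr(unit))
```

With a rule of this kind, the page author marks the list once, while translators still see items as separate units.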

Immediate short-term proposal

For the immediate short term, it makes sense to pick a reasonably scoped project that lets us prototype an alternative version of the Translate extension and proceed along the directions laid out above.

The proposal is to create a new tag (or the same tag with a version attribute, say v="2") with different semantics. The content of the new tag will be parsed as a well-balanced DOM structure. By introducing a new tag, we are not forced to break existing pages (at least until we completely remove the current tag). Instead, we can migrate gradually to the new tag, using automated scripts and/or linters where appropriate to ease the migration, while also allowing us to improve the code in breaking ways before it affects too many pages.

Because this change is made purely for technical reasons (to make parsing easier and to enable a better future), it could temporarily make things worse, for example by requiring a lot more markup in the wikitext. For this reason, we propose to consider adding some temporary shortcuts, such as allowing a long list to be made translatable just by wrapping tags around the whole list (as opposed to marking each list item individually).

It would also be a good idea, while not strictly necessary, to replace the very peculiar <tvar>...</> with something more regular.

So, this immediate proposal is not about removing <translate> tags from wikitext. That is an independent question, although likely easier to implement with the new semantics.

Additionally, the Translate extension currently identifies translation units via ids embedded in comments. It might be better to add these translation ids as attributes of the translate tag instead.
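A migration script for this could look roughly like the Python sketch below. The attribute name `id` and the function name are hypothetical, since the proposal does not fix a syntax. The sketch only converts blocks that contain exactly one unit marker and leaves every other block untouched; whitespace cleanup around the removed marker is deliberately left out.

```python
import re

BLOCK_RE = re.compile(r"<translate>(.*?)</translate>", re.DOTALL)
MARKER_RE = re.compile(r"<!--T:(\w+)-->")


def markers_to_attributes(wikitext):
    """Move a single <!--T:n--> marker into a hypothetical id="n" attribute."""

    def convert(match):
        body = match.group(1)
        ids = MARKER_RE.findall(body)
        if len(ids) != 1:
            # No marker or several units in one block: out of scope here.
            return match.group(0)
        cleaned = MARKER_RE.sub("", body)
        return '<translate id="{}">{}</translate>'.format(ids[0], cleaned)

    return BLOCK_RE.sub(convert, wikitext)


print(markers_to_attributes("<translate>== A == <!--T:1--></translate>"))
```

Blocks that hold several units would first need to be split into one tag per unit, which ties in with the open question about the exact semantics of the new tag.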

Open questions

  • Name of the new tag: tra, trans, i18n, _, translate-v2, trsltbl, t10e... or <translate v="2">
  • Syntax for tvar?
  • The specific semantics of the new tag, and how that could be implemented and enforced in Translate/Parsoid/PHP parser.

Timeline

2018-01-16: Issues were discussed by Nikerabbit and SSastry, and the initial proposal was created.