Topic on Talk:Content translation/Specification

Nemo bis (talkcontribs)

I came to this project today after reading a statement that surprised me: «copies the HTML and saves it to the user page. It must be changed so that the edited HTML would be converted to maintainable wiki text before saving». So you plan to translate HTML rather than wikitext? I had never heard of this, and I couldn't find this crucial architectural decision mentioned anywhere; there is only an incidental mention in a subpage, «ability of MT providers to take html and give the translations back without losing these annotations».

If you're translating HTML, why? Because of VisualEditor, I suppose? Are the annotations mentioned above the ones that Parsoid uses?

Does this mean that you're not going to translate wikitext at all, and that you'll just drop all templates, infoboxes and so on? Has anyone ever tried to reverse-engineer HTML to guess the local wikitext for it? Isn't that like building thousands of machine translation language pairs, one for every dialect of wikitext (each wiki has its own)?

The most surprising thing (for me) in the Google Translator Toolkit m:machine translation experiments on Wikipedia was that they reported being able, after a while, to automatically translate even templates (names + parameters) and categories. If true, this was fantastic, because translating templates is very time-consuming and boring (when translating articles manually, I usually just drop templates). Do you think that was/is possible? Will/would this be possible with the machine translation service of choice or with translation memory?

Once again, if machine translation is not a big component of this project, forgive me for the silly question, but please update Thread:Talk:Content translation/Machine translation.

Santhosh.thottingal (talkcontribs)

Nemo, the CX front end and the CX server work on the HTML version of the content. Once the user requests an article with CX, we use Parsoid to fetch it in HTML format. From there onwards, all processing is done on HTML (segmentation, for example), and the translation tools data we gather on the server is also for this HTML. The HTML is presented to the user as the source article. We use a contenteditable element as the translation editor. When the user saves the translation, we again use Parsoid to convert the HTML back to wikitext to save it in MW. At a later point we will enhance our basic contenteditable with an editor like VE, but that is not on the immediate roadmap.
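
To make the segmentation step mentioned above more concrete, here is a minimal Python sketch, not the actual CX code. It assumes the article HTML has already been fetched, extracts the text of each `<p>` element, and splits it into sentences with a deliberately naive rule. The names `ParagraphText` and `segment`, the `p0-s0` id scheme, and the sample HTML are all hypothetical, and inline markup is dropped for brevity, which a real segmenter working on HTML would have to preserve.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the plain text of each <p> element, ignoring inline markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def segment(html):
    """Return (segment_id, sentence) pairs for the paragraphs in `html`."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    segments = []
    for p_index, text in enumerate(parser.paragraphs):
        # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
        for s_index, sentence in enumerate(sentences):
            segments.append((f"p{p_index}-s{s_index}", sentence))
    return segments


# Hand-written sample, standing in for HTML fetched through Parsoid.
sample = (
    "<p>Kochi is a <b>port city</b>. It lies in Kerala.</p>"
    "<p>It has a harbour.</p>"
)
for seg_id, sentence in segment(sample):
    print(seg_id, sentence)
```

Giving each segment a stable id is one possible way to attach translation tool data (MT suggestions, translation memory matches) to the matching piece of source HTML.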

About templates: Parsoid does not drop them while converting to HTML. Parsoid also keeps enough annotation on this HTML to indicate which template produced which HTML, etc. The CX front end will show infoboxes, references, galleries, etc., but editing them will not be allowed in the initial versions of CX; they will be read-only blocks. We will start with editing of references, allowing users to keep/change the link and keep/translate the reference text. (An experiment: https://gerrit.wikimedia.org/r/#/c/126234/)
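
As a rough illustration of how such annotations could be used to find the blocks to show as read-only, here is a small Python sketch. It assumes, based on the Parsoid DOM specification rather than anything stated in this thread, that template output carries `typeof="mw:Transclusion"` and a `data-mw` JSON attribute naming the template. The sample HTML is a hand-written approximation of Parsoid output, and `TransclusionFinder` and `find_templates` are hypothetical names, not part of CX.

```python
import json
from html.parser import HTMLParser


class TransclusionFinder(HTMLParser):
    """Collect template names from elements annotated as transclusions."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        types = (attributes.get("typeof") or "").split()
        if "mw:Transclusion" not in types:
            return
        data_mw = json.loads(attributes.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Parts can also be plain strings, so only look at template entries.
            if isinstance(part, dict) and "template" in part:
                self.templates.append(part["template"]["target"]["wt"].strip())


def find_templates(html):
    """Return the names of the templates that produced parts of `html`."""
    finder = TransclusionFinder()
    finder.feed(html)
    finder.close()
    return finder.templates


# Assumed shape of the annotation; real Parsoid output carries more attributes.
data_mw = json.dumps({
    "parts": [{
        "template": {
            "target": {"wt": "Infobox settlement",
                       "href": "./Template:Infobox_settlement"},
            "params": {"name": {"wt": "Kochi"}},
            "i": 0,
        }
    }]
})
sample = (
    "<p>Intro text.</p>"
    f"<table typeof=\"mw:Transclusion\" about=\"#mwt1\" data-mw='{data_mw}'>"
    "<tr><td>Kochi</td></tr></table>"
)
print(find_templates(sample))
```

A front end could treat every element found this way as a read-only block and leave the rest of the HTML editable.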

The primary focus of CX will remain bootstrapping new articles using translation. For the first versions of CX, we don't plan to duplicate complex/advanced editing on this screen. For later versions, we will try to reuse editing features from VE, keeping the project's primary focus on translation tools and not reinventing wiki editing.

As for whether MT is a big component of CX: MT will be one of the translation tools for CX. https://commons.wikimedia.org/w/index.php?title=File%3AContent-translation-designs.pdf should give more context on how we plan the translation tools. Thanks

Nemo bis (talkcontribs)

I still have no idea how the templates can work on the target wiki.

Santhosh.thottingal (talkcontribs)