Talk:Content translation/Specification

Nemo bis (talkcontribs)

I'm not sure how to contact the development team... I'll try this talk page this time.

I've noticed there is a Yandex backend patch in the works. Is there a bugzilla report about it? I don't know if it's just for testing or what, but as you know the Wikimedia projects are very sensitive to privacy and free software, so it's important that before entering any use this is announced widely, with rationale etc. People will have questions, like: is this the first piece of proprietary software ever entering use in the Wikimedia projects land?

An important page to remember is the official directory of the software in use by Wikimedia organisations, m:FLOSS-Exchange. It must be kept up to date.

Pginer-WMF (talkcontribs)

Thanks for raising those points. We are aware of the community concerns about the use of external proprietary services.

We have shown that Open Source translation engines are our big priority by integrating Apertium already. Unfortunately, Apertium supports about 36 language pairs, while other services such as those provided by Microsoft or Yandex support about 1936 language pairs (all possible combinations of 44 languages).
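As a quick check of the arithmetic in the paragraph above: 1936 equals 44 × 44, which counts every ordered source/target pair, including each language paired with itself. Leaving those out gives 44 × 43 = 1892. The sketch below only illustrates the counting; the language codes are placeholders, not the actual list supported by any service.

```python
from itertools import permutations, product

# Placeholder codes standing in for the 44 languages mentioned above.
languages = [f"lang{i}" for i in range(44)]

# Ordered (source, target) pairs, including a language paired with itself.
ordered_with_self = len(list(product(languages, repeat=2)))

# Ordered (source, target) pairs where source and target differ.
ordered_distinct = len(list(permutations(languages, 2)))

print(ordered_with_self)  # 1936
print(ordered_distinct)   # 1892
```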

Increasing the language coverage is important to help users to expand the sum of all human knowledge using Content Translation. The current patchset is intended to test the use of multiple services and it is using Yandex just as a convenient example. This does not mean that integrating that specific service is in our immediate plans.

It is worth noting that in any case we are not including closed source software in our codebase. The interaction with external services is based on sending the existing content of an article (not including information about the user or any other user input). This is not very different from other examples, such as location templates that link to Google Maps in addition to OpenStreetMap. In any case, we are in conversations with the Legal department to make sure there are no problems in that regard.

Beyond technical and legal issues, we won’t be deciding in the name of the communities. We think it is our duty to prepare software that can help users of as many languages as possible by accessing as many services as we can (and we welcome the community to integrate all kinds of services in our platform), but each community will have its say and will be able to configure which services are made available. Moreover, individual users will be able to select their preferred translation service from those offered (and we’ll keep any Open Source option available as the default).

Nemo bis (talkcontribs)

This answer was completely oblivious to my question. I see Yandex is now proceeding, and I haven't seen any update sent to the main community venues, nor an update of the appropriate documentation pages.

No, a Commons template is not like MediaWiki software. No, hyperlinking is not the same as embedding a service over the network.

Runab WMF (talkcontribs)

Thanks for bringing this up Nemo. It is understandable that there will be concerns like yours when we explore services such as Yandex. As Pau had earlier explained, these are technical explorations which are important to expand the scope of what the tool can provide and benefit more users in the process. This is currently restricted to the beta server environment for further testing. The Language Engineering team is by no means an expert in matters of legal nuances related to MediaWiki software and policies about third party services that affect our community. We have been interacting heavily with the departments within WMF (Legal, Community Advocacy and others) to make sure no terms are violated at any step. However, in view of these concerns if you recommend that there is a need for an in-depth dialogue with you (and others who will be relevant to the conversation) we will be happy to take that up with our Legal team.

Reply to "Yandex backend"
Gryllida (talkcontribs)

meta:Grants:IdeaLab/External Translate is an idea with roughly the same goals. It has not been taken further than a WMF Labs tool, nor was that ever planned; although it's in the grants section, it was only placed there because IdeaLab is a good place to share ideas. Please feel free to take any of its design ideas or code for this project.

Reply to "Parallel work ..."
Nemo bis (talkcontribs)

I've checked all subpages but I don't think there is any answer to this question: how are you going to include a machine translation service in this product? There are only some incidental mentions of this, one is «Example task: Contact yandex and get the translation for one or more segments-depending on provider capacity». So there will be server-side requests to proprietary services, with a fee paid by the Wikimedia Foundation? I don't understand, if this wasn't possible for Translate on Meta-Wiki etc. how/why is it going to be possible here?

If machine translation is not a big component of this project forgive me for the silly question, but update Thread:Talk:Content translation/Machine translation.

Santhosh.thottingal (talkcontribs)
Nemo bis (talkcontribs)

Nice that there are entities helping us but after reading the blog post I have no idea whatsoever of what the collaboration consists of concretely. :)

Reply to "What sort of machine translation"
Nemo bis (talkcontribs)

I came to this project today after reading a statement that was very surprising to me: «copies the HTML and saves it to the user page. It must be changed so that the edited HTML would be converted to maintainable wiki text before saving». So you plan to translate HTML rather than wikitext? I had never heard of this and I couldn't find a mention of this crucial architectural decision anywhere; only an incidental mention in a subpage, «ability of MT providers to take html and give the translations back without losing these annotations».

If you're translating HTML, why? Because of VisualEditor I suppose? Are the annotations above those that parsoid uses?

Does this mean that you're not going to translate wikitext at all and that you'll just drop all templates, infoboxes and so on? Did anyone before ever try to reverse engineer HTML to guess the local wikitext for it? Isn't it like building thousands of machine translation language pairs, one for every dialect of wikitext (each wiki has its own)?

The most surprising thing (for me) in the Google Translator Toolkit m:machine translation experiments on Wikipedia was that they reported they were able to automatically translate even templates (names + parameters) and categories after a while. If true, this was fantastic because translating templates is very time-consuming and boring (when translating articles manually, I usually just drop templates). Do you think that was/is possible? Will/would this be possible with the machine translation service of choice or with translation memory?

Once again, if machine translation is not a big component of this project forgive me for the silly question, but update Thread:Talk:Content translation/Machine translation.

Santhosh.thottingal (talkcontribs)

Nemo, the CX front end and the CX server work on the HTML version of the content. Once the user requests an article with CX, we use Parsoid to fetch it in HTML format. From there onwards, all processing is done on HTML (segmentation, for example), and the translation tools data we gather at the server is also for this HTML. The HTML is presented to the user as the source article. We use a contenteditable element as the translation editor. When the user saves the translation, we again use Parsoid to convert the HTML back to wikitext to save it in MW. At a later point, we will enhance our basic contenteditable with an editor like VE, but that is not on the immediate roadmap.
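To illustrate what processing on HTML can look like, here is a minimal sketch of a segmentation step that splits paragraph text into sentence segments with identifiers. It is not the CX server's code: the sample markup, the `ParagraphCollector` and `segment` names, the `p0-s0` identifier scheme and the naive sentence-splitting rule are all assumptions made for the example, and inline markup is flattened rather than preserved.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the plain text of each <p> element (inline markup is flattened)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def segment(html):
    """Return (segment_id, sentence) tuples for every sentence in every paragraph."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    segments = []
    for p_index, text in enumerate(collector.paragraphs):
        # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
        for s_index, sentence in enumerate(re.split(r"(?<=[.!?])\s+", text)):
            if sentence:
                segments.append((f"p{p_index}-s{s_index}", sentence))
    return segments


# Hand-written sample standing in for HTML fetched from Parsoid.
SAMPLE_HTML = (
    "<p>Parsoid returns <b>HTML</b>. CX segments it.</p>"
    "<p>Saving converts it back.</p>"
)

segments = segment(SAMPLE_HTML)
for segment_id, sentence in segments:
    print(segment_id, sentence)
```

Each segment could then be sent to a translation tool on its own and matched back to its place in the source by identifier; a real implementation would also need to keep the inline annotations that this sketch discards.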

About templates: Parsoid does not drop templates while converting to HTML. Parsoid also keeps enough annotation on this HTML to indicate which template produced it, etc. The CX front end will show infoboxes, references, galleries etc., but editing them will not be allowed in initial versions of CX; they will be read-only blocks. We will start with editing of references, allowing users to keep or change the link and keep or translate the reference text. (An experiment: https://gerrit.wikimedia.org/r/#/c/126234/)
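As a rough illustration of how template output could be recognised and treated as read-only, the sketch below looks for elements that Parsoid marks with `typeof="mw:Transclusion"` and records their `about` identifiers. The sample markup is hand-written and simplified, and the `TransclusionFinder` class is a hypothetical helper for this example, not part of CX or Parsoid.

```python
from html.parser import HTMLParser


class TransclusionFinder(HTMLParser):
    """Record the `about` id of every element annotated as template output."""

    def __init__(self):
        super().__init__()
        self.read_only_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # `typeof` may hold several space-separated values.
        types = (attributes.get("typeof") or "").split()
        if "mw:Transclusion" in types:
            self.read_only_ids.append(attributes.get("about"))


# Simplified, hand-written stand-in for Parsoid output containing an infobox.
SAMPLE_HTML = (
    "<p>Intro text.</p>"
    '<table typeof="mw:Transclusion" about="#mwt1">'
    "<tr><td>Infobox content</td></tr>"
    "</table>"
)

finder = TransclusionFinder()
finder.feed(SAMPLE_HTML)
finder.close()
print(finder.read_only_ids)
```

An editor built on this idea could render the listed blocks but refuse edits inside them, leaving only ordinary paragraphs open for translation.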

The primary focus of CX will remain bootstrapping new articles using translation. For the first versions of CX, we don't plan to duplicate complex or advanced editing on this screen. For later versions, we will try to reuse editing features from VE, keeping the project's primary focus on translation tools and not reinventing wiki editing.

On whether MT is a big component of CX or not: MT will be one of the translation tools for CX. https://commons.wikimedia.org/w/index.php?title=File%3AContent-translation-designs.pdf should give more context on how we plan the translation tools. Thanks

Nemo bis (talkcontribs)

I still have no idea how the templates can work on the target wiki.

Santhosh.thottingal (talkcontribs)
Reply to "Translating HTML?"
Nemo bis (talkcontribs)

From the page I don't understand: is machine translation a big part of this project, or just something that happens to be integrated in it like many other translation aids in the Translate extension, so that we need to keep an eye on it ("Warn automatic translation abusers" is the only mention of specific machine translation features)? If it's somehow important, how does it relate to previous experiences, for which we're compiling a list of links at m:Machine translation#See also?

Siebrand (talkcontribs)

We think that having ways to bootstrap a translation is a big deal. It saves translators a lot of time. Can you be more precise about what you mean by "how does it relate to previous experiences"? We want to make it very visible what was translated, by whom, and what percentage of the published text is machine translation.

Nemo bis (talkcontribs)

By "how does it relate" I mean: have you considered any of those past projects? If so, is this project comparable to them, and in what ways is it similar or different?

Reply to "Machine translation"