Talk:MinT

From mediawiki.org

Welcome to the MinT page

You can use this page to start a discussion with others about how to improve MinT. Thank you! UOzurumba (WMF) (talk) 21:16, 25 September 2023 (UTC)

Dead link for 'The IndicTrans2 project'

The "IndicTrans2 project" link leads to https://ai4bharat.iitm.ac.in/indic-trans2, which returns "Oops! That page can’t be found."
Can someone fix the link, please? Thanks. -- Christian 🇫🇷 FR (talk) 16:42, 28 September 2023 (UTC)

Hello Christian 🇫🇷 FR,
I checked the link and it seems to be working fine now. Thank you! UOzurumba (WMF) (talk) 19:54, 4 October 2023 (UTC)
It works for me now too. Done; we can leave the link unchanged. Thanks for the acknowledgement. Christian 🇫🇷 FR (talk) 06:25, 5 October 2023 (UTC)

MinT should match the translation database we have contributed to, but does not

Please de-deploy MinT for en-ja translation, or give me a "stop" button so I can work without it. I need an option to turn MinT off; where can I do that? Are we sure the problem stems from MinT being a combination of, or collaboration between, two systems? Are there any language pairs for which it outputs acceptable translations? When are we going to import the translation database from the previous system? That thesaurus is very precious, as the circle of translators has spent so many hours building it.

Again, I am talking about the en-ja language pair, and it is not practical to keep using MinT for it, whether the subject is technical or not (details below). CX2 has been disabled for ja users; for vocabulary, however, it worked much better.

At the moment, as long as MinT is not turned off for en-ja translation of Tech News:

  • I have to open past issues and copy and paste the correct expressions;
  • Tech News has many set phrases and expressions specific to it, such as recurring update notices;
  • Wiki markup is not only ignored but replaced with wrong characters: bold ''' becomes plain quotation marks ". Why is such a primitive error present?
  • I do not feel confident, as so many sentences need to be copied and pasted manually from past issues, which does not seem in line with our attribution policies to me as a Wikipedian.

If we need to invest in and train the new MT system and its dictionary, is that paid for out of the translators' pockets? Do we use MinT with bitterness on our tongues until it becomes usable?

I appreciate how an MT system takes care of matching against the translation database, which is exactly why a low-quality system should be turned off for certain language pairs, AFAIK. In my personal experience, I *need* to disregard MinT's suggestions approx. 85% of the time, for the following reasons:

  • 40% because it does not parse grammar correctly and inserts symbols I need to delete manually;
  • 35% because its dictionary does not match Wikimedia-specific terminology, on which translators had trained the previous system;
  • 20% because it replaces wiki markup wrongly; as above, for bold text, ''' needs to stay as is, but MinT replaces it with plain quotation marks ";
  • 5% because I can't trust its dictionary: for the country name Belarus, MinT outputs Belgium. /: What kind of bug can induce such a primitive error?

MinT is below my expectations as an en-ja translator. Too bad I will not enjoy MT assistance any more, whereas the old system pampered its users, me included, by saving almost 40% of working time.

FYI, my use cases:

  1. With the design of Tech News, translating from scratch is wasteful: recurring information should keep the same sentence format, which reassures readers of the /ja pages that translators understand what we are doing.
  2. On ESEAP issues, the original en text is often itself a translation from the poster's native language, which means much guesswork is involved in supplying a secondary translation; looking into Wikidata often helps me match unusual terminology to organization names or wiki teams.

Crossing my fingers that other language pairs are not affected this badly. Cheers, --Omtecho Omotecho (talk) 06:17, 30 September 2023 (UTC)

Thanks for the feedback, @Omotecho.
MinT is a new initiative still in active development. It is not replacing any previous system: the suggestions from Translation Memory or other services like Apertium are still shown when they are available. Translation memory suggestions (previous translations by editors of similar messages) are given priority and shown above the machine translation ones (in this example, MinT suggestions are shown at the bottom of the list).
MinT uses different machine learning models to produce the translations. I'll provide more detail on some of the types of issues you are experiencing:
  • Translation models support plain-text translation, and we are building support for more complex formats such as HTML and Wikitext on top of them. For example, improvements to support Wikitext are captured in this ticket. The issues with Wikitext can result in both (a) markup not showing correctly in the result and (b) content being wrongly translated because markup gets in the way (e.g., resulting in a sentence being cut in half and each half translated independently, which leads to wrong translations). As Wikitext support improves, these issues should reduce significantly.
  • For machine learning models the quality of the translation depends on the amount and quality of the training data. By providing more examples of good translations, the models can be improved. Currently, translating Wikipedia articles with Content Translation or contributing to Tatoeba are two easy ways to generate more quality data to improve the models. We also plan to integrate localization data from the Translate extension (more details in this ticket). In addition, contributing more Wikipedia-specific data will result in translations that align better with the community expectations.
As I mentioned, MinT is in active development and has room for improvement, but to polish a system that supports over 200 languages, it is very useful to expose it to the communities in ways that let them help make it better.
Thanks! Pginer-WMF (talk) 08:37, 5 October 2023 (UTC)
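A minimal sketch of one common way to keep markup such as bold ''' away from a plain-text translation model, as discussed in this thread: swap the markup for placeholders before translating and restore it afterwards. This is not MinT's implementation; the pattern, the placeholder format and the `fake_translate` stand-in are all hypothetical, and a real model may alter placeholders unless it is trained or configured to preserve them.

```python
import re

# Hypothetical pattern: bold/italic apostrophes, link brackets, simple templates.
MARKUP = re.compile(r"'{2,5}|\[\[|\]\]|\{\{[^{}]*\}\}")

def protect_markup(text):
    """Replace each piece of wiki markup with a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__M{len(saved) - 1}__"

    return MARKUP.sub(stash, text), saved

def restore_markup(text, saved):
    """Put the original markup back in place of each placeholder."""
    return re.sub(r"__M(\d+)__", lambda m: saved[int(m.group(1))], text)

def fake_translate(text):
    """Stand-in for a plain-text MT model (assumption, not a real API)."""
    return text.replace("Hello", "Bonjour")

masked, saved = protect_markup("'''Hello''' [[world]]")
translated = restore_markup(fake_translate(masked), saved)
print(translated)
```

The point of the design is that the model never sees the apostrophes, so it has no chance to turn them into quotation marks.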
@Pginer-WMF, hello. Sharing free knowledge is the initiative we both share, which is why I am disturbed by MinT. I worked as a dictionary/translation database editor at a private MT developer more than 20 years ago and have been reading papers on MT development ever since, so I can't agree with you.
As for content translation, mind you, I tried to support it and fill in samples before MinT came into view. However, the ja-speaking community did not find it necessary.[1] And cleaning up the mess that careless users brought in with CX2 weighs heavily on that same community.
When an MT engine is unsuited in its very concept, we can't train it or make it usable, and MinT falls into that category. As you are aware, a ticket has been filed as T348361 on this matter. On Meta, Lemonaka and I have filed an RfC to stop MinT for en-ja translation.
I wish you or somebody from your team would join the discussion at AAMT, aka the Asia-Pacific Association for Machine Translation, as a specialist with reliable data. I believe the WMF tech team is much more experienced in dealing with good-faith users, those eager to contribute to the largest digital encyclopedia and to the scientific catalogue of species; many of us are not experts in every field, but we want to share human knowledge. That makes WMF tech teams unique in the field of MT, AFAIK, compared to the major PC manufacturers and software giants that target commercial users.
FYI, commercial MT systems have not expanded their market for the en/ja language pair over the past 20 years, compared to other pairs. Some claim theirs is the best, but only in the very narrow range of topics they specialize in. A number of off-wiki users support a particular application, but that is no proof that any one of those MT apps is superior to the others.
In any case, users will copy and paste from other web MTs and produce fake translations like this one.[2] Then, looking back at MinT, why do we keep an inferior system that is not on par even with that low quality? What do we gain if we limit the discussion to the en-ja language pair?
Kindly, --Omotecho Omotecho (talk) 16:07, 7 October 2023 (UTC)

translate.wmcloud.org inaccessible

The test instance mentioned on this page seems to be inaccessible. It returns a generic Wikimedia Cloud Services "cannot be reached" error page. Chlod (talk) 04:03, 12 October 2023 (UTC)

Maybe the connection was busy? It works now; I tried it with the en-ja language pair. Omotecho (talk) 14:56, 12 October 2023 (UTC)
Looks like it works now. Must have been a hiccup. Thanks for checking! Chlod (talk) 02:30, 21 October 2023 (UTC)

Excerpts from the NLLB-200 model card

NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain specific texts, such as medical domain or legal domain. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, therefore translating longer sequences might result in quality degradation.

—NLLB Team "No Language Left Behind : Scaling Human-Centered Machine Translation" page 183

Meanwhile, MinT is clearly using the NLLB-200 model:

Translated in 2.07 seconds by nllb200-600M model

—Footnote notice on MinT, after the translation is complete.


Rtnf (talk) 14:29, 24 October 2023 (UTC)
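To illustrate what the 512-token limit quoted above means for longer text, here is a minimal sketch that groups sentences into chunks under a token budget before each chunk is sent to a model. It is an assumption-laden example, not how MinT handles long input: `split_into_chunks` is a hypothetical helper, and the default counter uses whitespace-separated words, which undercounts compared with the subword tokens a real NLLB tokenizer produces.

```python
import re

def split_into_chunks(text, max_tokens=512, count_tokens=None):
    """Group whole sentences into chunks of at most max_tokens tokens.

    count_tokens is a hypothetical hook; by default it counts
    whitespace-separated words, a rough proxy for model tokens.
    A single sentence longer than the budget is kept whole.
    """
    count_tokens = count_tokens or (lambda s: len(s.split()))
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

print(split_into_chunks("One two. Three four. Five six.", max_tokens=4))
```

Splitting on sentence boundaries avoids cutting a sentence in half, but each chunk is still translated without the context of its neighbours, which is in line with the model card's caution about document translation.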

Is *new* translation memory still being created?

With MinT, are new translation "memories" being created? That is, if I am translating a longer project (in my case, a whole course, made up of multiple pages and videos [as subtitles]), would MinT help encourage consistency in terminology based on the way certain terms were translated earlier? Asaf (WMF) (talk) 16:54, 14 November 2023 (UTC)

Hello Asaf (WMF),
Currently, there is no translation memory. However, this feature request is known and captured in this ticket: https://phabricator.wikimedia.org/T96165. UOzurumba (WMF) (talk) 04:51, 20 November 2023 (UTC)

Needs improvement for Kanuri language

I tried MinT to translate some articles into Kanuri, but most of the translations are not correct! I hope it will be improved for a better experience. MohammedBama123 (talk) 12:31, 21 November 2023 (UTC)

Yes, the Kanuri language needs improvement; there are a lot of errors. Umargana1 (talk) 20:55, 10 February 2024 (UTC)
Thank you, @MohammedBama123 and @Umargana1, for your feedback, and I apologize for the late reply. So, would you say that the machine translation model in MinT is not a good aid at all for translation? Maybe you can give an idea of how bad it is on a scale of 1 to 10. UOzurumba (WMF) (talk) 17:17, 23 May 2024 (UTC)
Hi @UOzurumba (WMF), I would rate it 5/10 because it just keeps repeating words that it didn't understand. MohammedBama123 (talk) 17:52, 23 May 2024 (UTC)
Thank you, @MohammedBama123, for your reply. I have noted the repetition of words; with usage, the quality of the machine translation will improve.
UOzurumba (WMF) (talk) 15:40, 26 May 2024 (UTC)
Alright, thanks for the update. MohammedBama123 (talk) 16:23, 27 May 2024 (UTC)

Tamil Wikipedia Content Translation

I have checked Tamil Wikipedia Content Translation, taking a sample from the en.wiki page Supreme Court of India.

The source was:

The Supreme Court of India is the supreme judicial authority and the highest court of the Republic of India. It is the final court of appeal for all civil and criminal cases in India. It also has the power of judicial review. The Supreme Court, which consists of the Chief Justice of India and a maximum of fellow 33 judges, has extensive powers in the form of original, appellate and advisory jurisdictions.

Result from IndicTrans2 machine translation model:

இந்திய உச்ச நீதிமன்றம் என்பது இந்திய குடியரசின் உச்ச நீதித்துறை அதிகாரம் மற்றும் மிக உயர்ந்த நீதிமன்றமாகும். இது இந்தியாவில் உள்ள அனைத்து சிவில் மற்றும் கிரிமினல் வழக்குகளுக்கான இறுதி மேல்முறையீட்டு நீதிமன்றமாகும். நீதித்துறை மறுஆய்வு செய்யும் அதிகாரமும் இதற்கு உள்ளது. இந்திய தலைமை நீதிபதி மற்றும் அதிகபட்சம் 33 சக நீதிபதிகளைக் கொண்ட உச்ச நீதிமன்றம், அசல், மேல்முறையீட்டு மற்றும் ஆலோசனை அதிகார வரம்புகள் வடிவில் விரிவான அதிகாரங்களைக் கொண்டுள்ளது.

Result from Google translation:

இந்திய உச்ச நீதிமன்றம் என்பது இந்தியக் குடியரசின் உச்ச நீதிமன்ற அதிகாரம் மற்றும் உச்ச நீதிமன்றமாகும். இது இந்தியாவில் உள்ள அனைத்து சிவில் மற்றும் கிரிமினல் வழக்குகளுக்கான இறுதி மேல்முறையீட்டு நீதிமன்றமாகும். நீதித்துறை மறுஆய்வு செய்யும் அதிகாரமும் இதற்கு உண்டு. இந்தியாவின் தலைமை நீதிபதி மற்றும் அதிகபட்சமாக சக 33 நீதிபதிகளைக் கொண்ட உச்ச நீதிமன்றம், அசல், மேல்முறையீட்டு மற்றும் ஆலோசனை அதிகார வரம்புகள் வடிவில் விரிவான அதிகாரங்களைக் கொண்டுள்ளது.

Unfortunately, both look artificial rather than natural; both also use transliterations instead of proper Tamil words. AntanO (talk) 07:24, 18 January 2024 (UTC)

Add Dobrujan Tatar

Hi there,

I have a request to add Dobrujan Tatar. It is regarded as a "form" of Crimean Tatar, similar to the relationship between Tajik and Persian. For the alphabet and grammar, this book can be helpful, and many translations can be found in Dobrujan Tatar–Romanian dictionaries, a Dobrujan Tatar–Latin dictionary (ornithology) and a Dobrujan Tatar–Latin dictionary (botany). Some examples of translations can be found in translated books:

Zolgoyo (talk) 15:07, 2 February 2024 (UTC)