Wikimedia Language engineering/OpusMT Outreach

From mediawiki.org

Overview

The University of Helsinki's OPUS project focuses on creating tools and resources for open-source machine translation services. The project collaborates with the Wikimedia Foundation to build models for languages with limited resources, using data from various sources. Wikimedia's MinT translation service relies on the OPUS model and is available in more than 200 languages. The ongoing collaboration aims to develop models for low-resource languages from parallel corpora.

In the future, this can also help expand Wikimedia’s MinT (machine translation service) to smaller language communities. This proposal involves an outreach plan to:

  • Gather parallel corpus information from smaller language communities that do not have a Wikipedia yet and may or may not be in Incubator.
  • Obtain feedback on current machine translation quality.
  • Promote ideas to language communities to contribute to the Opus-MT models.

Outreach plan

1. Parallel corpus

Resources for less widely spoken languages (e.g., Cornish) may be limited compared to those for major languages. In some cases, machine translation may not be supported at all, and contributors may not be familiar with the term "parallel corpus".

To find parallel corpora for a low-resource language, reaching out for information in the following Wikimedia venues could be helpful:

  • Mailing lists and Telegram channels to connect with people in language communities and beyond

The basic format for a parallel corpus is two text files with the same number of lines, where each line in the source-language file corresponds to the same line in the target-language file (e.g., en-hi_sample.html). It may also be useful to consider the OLDI standard. Language communities can search for corpora in the following places:

  • Academic databases & repositories
    • ELRA (European Language Resources Association)
    • CLARIN
    • Other linguistic archives
  • Language preservation and promotion organizations that might have compiled language resources or have information on where to find them.
  • Corpus linguistics platforms, such as Sketch Engine, might have resources for less common languages.
  • Online communities of enthusiasts for the specific language. Members may be aware of available language resources, including parallel corpora.
  • Universities and research institutions that focus on linguistics, natural language processing (NLP), or machine translation. They might have compiled parallel corpora for various languages.
  • Local or government language revitalization or preservation initiatives.
  • Contact linguists, researchers or language experts (e.g., through NLP conferences). Check publications and research papers related to specific language research. Authors may have shared language resources as part of their research.
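
Once a candidate corpus has been found, the two-file layout described before this list can be checked mechanically. The following is a minimal sketch in Python, assuming plain UTF-8 text files; the function names and the file names corpus.en and corpus.hi are hypothetical and are not part of any OPUS or MinT tooling.

```python
import os
import tempfile


def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel_corpus(source_path, target_path):
    """Return (source, target) sentence pairs from two line-aligned files.

    Raises ValueError if the files do not have the same number of lines,
    since line N of one file must be the translation of line N of the other.
    """
    source_lines = read_lines(source_path)
    target_lines = read_lines(target_path)
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"Line counts differ: {len(source_lines)} source lines vs "
            f"{len(target_lines)} target lines"
        )
    return list(zip(source_lines, target_lines))


if __name__ == "__main__":
    # Build a tiny English-Hindi example in a temporary directory.
    with tempfile.TemporaryDirectory() as folder:
        source_path = os.path.join(folder, "corpus.en")  # hypothetical name
        target_path = os.path.join(folder, "corpus.hi")  # hypothetical name
        with open(source_path, "w", encoding="utf-8") as handle:
            handle.write("Hello.\nThank you.\n")
        with open(target_path, "w", encoding="utf-8") as handle:
            handle.write("नमस्ते।\nधन्यवाद।\n")
        for source, target in read_parallel_corpus(source_path, target_path):
            print(f"{source}\t{target}")
```

A check like this only confirms that the line counts match. It cannot tell whether each pair is actually a correct translation, so a speaker of the language would still need to review a sample.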

[Stretch] Engage with open-source translation communities to find out more about the development of parallel corpora for low-resource languages:

  • KDE localization - maintains parallel corpora for various languages. OPUS uses the KDE corpus.
  • OpenSubtitles project - provides parallel corpora of movie and TV show subtitles. OPUS uses the OpenSubtitles corpus.
  • GNOME translation project - also engages in translation efforts, fostering the creation of parallel corpora for different languages. OPUS uses the GNOME corpus.
  • Paracrawl - corpus for European languages. OPUS uses the Paracrawl corpus.

Documentation

Parallel corpora for low-resource languages can be documented on MinT’s subpage “OpusMT Outreach”.

2. Machine translation quality feedback

Communities can be invited to create test cases for future models by defining a set of example translations that can later be compared against the output of new translation models. These can be based on good examples (similar to the parallel corpora examples) or focused on capturing issues of the current model (more closely related to the quality feedback below).
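
To illustrate how such test cases might be used, the sketch below scores a model's output against community-provided reference translations. It is only an assumption about how a comparison could work: the test cases, the two stand-in "models" (plain dictionary lookups) and the character-level similarity from Python's difflib are all hypothetical. A real evaluation would more likely rely on an established metric such as BLEU or chrF.

```python
from difflib import SequenceMatcher

# Hypothetical community test cases: (source sentence, reference translation).
TEST_CASES = [
    ("Hello.", "नमस्ते।"),
    ("Thank you.", "धन्यवाद।"),
]


def similarity(candidate, reference):
    """Character-level similarity in [0, 1]; a crude stand-in for a real MT metric."""
    return SequenceMatcher(None, candidate, reference).ratio()


def evaluate(translate, test_cases):
    """Average similarity between a model's output and the reference translations."""
    scores = [similarity(translate(source), reference) for source, reference in test_cases]
    return sum(scores) / len(scores)


# Stand-ins for an older and a newer model. A real setup would call the
# translation service here instead of looking up a dictionary.
OLD_OUTPUT = {"Hello.": "नमस्ते।"}
NEW_OUTPUT = {"Hello.": "नमस्ते।", "Thank you.": "धन्यवाद।"}


def old_model(text):
    """Return the older model's translation, or an empty string if it has none."""
    return OLD_OUTPUT.get(text, "")


def new_model(text):
    """Return the newer model's translation, or an empty string if it has none."""
    return NEW_OUTPUT.get(text, "")


if __name__ == "__main__":
    print("old model:", evaluate(old_model, TEST_CASES))
    print("new model:", evaluate(new_model, TEST_CASES))
```

Keeping the test cases as simple source and reference pairs means they can be stored in the same two-file format as a parallel corpus and reused whenever a new model is released.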

Communities can be invited to try out the MinT test instance at translate.wmcloud.org and give initial feedback on the quality of machine translation for a particular language:

  • Are there examples that identify patterns of consistent issues that users commonly find in machine translations? For example:
    • Is the translated text primarily grammatically correct and natural in the target language?
    • Does the translation take into account the context and cultural nuances of the original text?
    • Is the machine translation accurate in conveying specialized terminology in domain-specific content, or are there instances of mistranslations?

Documentation

Feedback on machine translation quality can be collected on Help_talk:Content_translation.

3. Contribution ideas

Share ideas with language communities on contributing to the open corpus.

For direct contribution, contributing to Tatoeba or translating Wikipedia articles with Content Translation appear to be two of the simplest ways to generate data that will be incorporated into OpusMT.

[Stretch] How can we stimulate engagement around this and monitor its effects?

Timeline

April–June 2024

  • Reach out to five language communities that:
    • Do not have a Wikipedia yet
    • May or may not be in Incubator
    • Lack quality in their machine translation services
  • Gather information on parallel corpora by discussing ideas on potential sources with the language community (see "Parallel Corpus" section above for guidance).
  • Assess the quality of machine translation for the languages of these communities.
  • Promote contribution ideas to the community.

July–September 2024

  • Recruit five new language communities and repeat the process.

Potential stakeholders