Wikimedia Developer Summit/2018/Next Steps for Languages and Cross Project Collaboration

Dev Summit '18 https://etherpad.wikimedia.org/p/devsummit18

Next Steps for Languages and Cross Project Collaboration

DevSummit event

  • Day & Time: Monday, 2:30 pm – 3:30 pm AND 4:20 pm – 4:50 pm
  • Room: Kinzie
  • Facilitator: Birgit
  • Notetaker(s): Leila, Niklas, Anne, Benoît, TPT

Session Notes:

  • Introduction by Santhosh:
    • Session is divided into two parts: languages and cross-project collaboration. What is WMF doing for languages, and what should we do in the future? The goal is to gather ideas for WMF's work on languages in the upcoming years. There used to be a Language team; it went away six months ago, and it is coming back. What are the language-related projects?
      • In past years the Language team worked on the Content Translation project, with the goal of growing wikis in smaller languages.
      • There are many other core components the language team is responsible for:
        • Translate extension, hosted on translatewiki.net: MediaWiki extensions (and more) are translated into all languages.
        • MediaWiki core language infrastructure: handles showing translated messages, plural and gender support, number formatting, and more (see the sketch after this introduction).
        • Input method library, language selection library, and a MediaWiki-independent internationalization library.
    • Here we want to discuss the future. Language is a core component we want to focus on. We want to grow wikis beyond English.
      • Many position papers focused on translation: not only translation of content, but translation in other contexts too. We need to discuss what we can do with it.
      • How we can grow and develop our language infrastructure.
    • Please tell me about the challenges and opportunities you see.
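
To make the core language infrastructure concrete, here is a minimal sketch (not from the session; it assumes Python with the requests library) that asks the public MediaWiki Action API to resolve the real core message "nmembers" in Finnish. The PLURAL handling and per-language number formatting it demonstrates are among the responsibilities Santhosh lists above; the message key and arguments are just examples.

    import requests

    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "meta": "allmessages",
        "ammessages": "nmembers",   # English source: "$1 {{PLURAL:$1|member|members}}"
        "amargs": "5",
        "amenableparser": 1,        # expand {{PLURAL:...}} with the given arguments
        "amlang": "fi",             # ask for the Finnish localization
        "format": "json",
    }).json()

    # the parsed, localized message, e.g. "5 jäsentä"
    print(resp["query"]["allmessages"][0]["*"])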

Discussion:

  • Challenges
    • David: For Wikipedia the greatest challenge is that the reader is faced with the choice between a more comprehensive article in English and a shorter one in another language; basically a bias toward the dominant language.
    • Benoit: How can machine translation help people preserve cultural context? Cultural adaptation.
    • Lucie: We need to understand what the actual needs are, because otherwise we will start imposing biases.
    • Kaldari: most of the small languages active on Wikisource don't have access to OCR. I've been having a lot of conversations with Google on this front.
  • Opportunities (and challenges mixed)
    • Mingli: Machine translation has the opportunity to change the shape of knowledge exchange. We should set up a team to follow industry trends, maybe a small team focused on machine learning. We provide machine translation solutions between Japanese, Chinese, and English; from our experience, even the open-source models are good enough.
    • CScott: For this example, suppose I'm a native speaker of Quechua, a smaller indigenous language of the South American Andes. The language at school is Spanish. I look in my native language for something, and it doesn't have an article on the topic. What I'd like Wikipedia to do is expose the information from other languages to me, in my native language or in another language I know. (Changing wikis is a pain; keep me in the same place and show me what you have.)
      • Direct reply from Lydia: that's almost possible with the Article Placeholder, based on Wikidata items.
    • Moriel: There is an opportunity to gather documentation that helps multilingual editors learn what the norms on each wiki are, so someone like me who can translate or create content can at least get started. A tool that makes better suggestions could be another solution; for example, the suggester could capture some of the norms of the wiki as well, not only the language. With machine translation we have something to give to the world: we have a lot of data, conclusions, and tools to share. Once we expose this, other people will use our work more, and that can help us improve this kind of work.
    • Lucie: We've been working on generating (Article Placeholder) text from Wikidata items; see the placeholder sketch after this discussion. We have a lot of information in Arabic, for example. It is very interesting to see that there is quite a bit of overlap between languages.
    • Kaldari: one of the things we overlook is the opportunities that already exist in our projects but go unused. Wikisource, especially in smaller languages, can be very interesting: they have started bringing in a lot of scanned books and references. Wiktionary is another great opportunity. I know it's very much in the future, but we should expand our imagination beyond what is immediately available to us.
      • ??
      • Benoit: It's important that we think about machine translation and community culture. Some communities will refuse machine translation.
      • Lydia: Wiktionary is an amazing asset, and that's part of the reason we're bringing it into Wikidata now. That's the question we should ask ourselves: do we develop technology for the languages for which no one does machine translation, or shall we wait for someone else to do that?
    • CScott:??
    • Anne: Content and internet traffic in local languages is growing all over the web; Hindi, for example, is roughly doubling year over year. There are more potential community members in these languages. We have a lot of opportunities to partner with companies in languages where the web presence is smaller. We should be hiring local language speakers.
    • Leila: 1. Do we have the fonts for all languages to be able to render them, for example Persian? How do we make sure we have good fonts? It is a challenge we have to come back to. 2. Machine translation: it is very close to the research that I do. One of the challenges is that, for example for Korean, we do not have good enough machine translation. The Wikipedia brand is such that we cannot afford content that is not quasi-perfect; the brand is a challenge here. I've been hearing for 15 years that machine translation is getting better, and I'm not sure that it is. For easy/common phrases it works OK, but for other things it's still really bad. How are you going to fix that? I am doubtful that by 2030 we are going to have [good enough] machine translation.
    • Volker: Fonts. Over the last two years the design team had a strong focus on bringing the user interface to more languages and backgrounds. This is a bit of an opposite position to CScott's: we want users to have the experience in their own culture and language. That is something that requires a lot of resources (for example, someone like Moriel who can help us with a language such as Hebrew).
      • CScott: This does not oppose my position. I meant we shouldn't be sending people to other places.
      • Volker: Ok. Then the point I want to emphasize is that we have developed the projects based on a very western culture and set of languages. For example, we haven't thought about how to accommodate top-down scripts.
    • Niklas: This may have come up already, but I want to make it more explicit: nothing that exists works across languages that well. Challenges: the sheer number of languages. Some languages get technologies and others don't, which will amplify the problems we have now. This is reflected, for example, in machine translation being available only for certain language pairs.
    • Victor: To Leila's point, Google hosts open fonts for web use through Google Fonts. Fonts for languages like Chinese, Japanese, and Korean can be very large, sometimes hundreds of MB. There is a new effort by Google to use machine learning to slice large font files into smaller ones to improve page load speed (see the subsetting sketch after this discussion).
    • Santhosh: Fonts have two components. The tofu issue (boxes shown for missing glyphs) is less of a thing now, but there are also other challenges with different fonts: for some scripts you need more width or more height. Brand identity that speaks through typographic layout is important.
      • C Scott: the other issue with fonts (e.g. in PDFs) is that our platform doesn't label the language consistently. Our software could do a much better job of identifying which language a specific UI element is in.
    • Benoit: 1/ People on multilingual wikis can't work together if they don't speak the same language; ways to communicate beyond language are something to consider. 2/ If we start providing machine-translated articles, what will the identity of the source be? Will there just be one Wikipedia, like there is one Wikidata right now?
      • Santhosh: there are two types. We present machine translation to the user, which they can modify and make more correct. There are also manual translations.
      • Benoit: People feel they are no longer able to contribute because someone on a different wiki is doing the job (creating infoboxes, etc.).
    • Tpt: Wikisource is a great source for data in small languages for training OCR and machine translation.
    • Mingli: We have classic historical documents shared by humans across all languages, works of philosophy for example. How about we set up a parallel corpus for this kind of work on Wikisource?
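
To make Lucie's Article Placeholder point concrete, here is a minimal sketch (an illustration, not project code; it assumes Python with requests and uses item Q42 as a stand-in) that pulls a label and description from the public Wikidata API, the kind of structured data a placeholder page can render in the reader's language:

    import requests

    entity = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities",
        "ids": "Q42",                    # Douglas Adams, a stand-in item
        "props": "labels|descriptions",
        "languages": "ar",               # the reader's language
        "format": "json",
    }).json()["entities"]["Q42"]

    label = entity["labels"]["ar"]["value"]
    description = entity["descriptions"]["ar"]["value"]
    print(f"{label}: {description}")     # a one-line placeholder stub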
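And for the font-slicing idea Victor mentions, a minimal sketch of subsetting with the open-source fontTools library (the font file name and sample text are illustrative; Google's production pipeline is more sophisticated):

    from fontTools.ttLib import TTFont
    from fontTools.subset import Subsetter

    font = TTFont("NotoSansSC-Regular.otf")   # a large CJK font file
    subsetter = Subsetter()
    subsetter.populate(text="维基百科，自由的百科全书")  # only the code points a page uses
    subsetter.subset(font)
    font.save("NotoSansSC-subset.otf")        # a small slice instead of many MB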


Daniel [not present]: When you say "machine translation", I hear "Google". Are we ready to fully depend on Google or other big corporations? Is it feasible to build FLOSS machine translation? How hard/costly would that be? Could we do it? I personally think that it is NOT feasible for the WMF to build decent machine translation over the next ten years, even if we started to focus on it right now.

Tasks / actions

  • Santhosh: ... [tell me your ideas?]
  • C Scott:
    • Better tagging of languages in the code base
    • One backend database. In the current state, I can't do UI experiments because the wikis are in different databases.
  • Lucie: Involve more communities in technical solutions
  • Anne: hire translators.
  • Ryan: Make OCR data sets for some small languages that don't have OCR available (e.g. for Tesseract; see the OCR sketch after this list)
    • Santhosh: we don't invest in this
    • Others: But we could.
    • Lucie: we could collaborate with researchers who are working in this space already
  • Tpt: We need to think about the sustainability of collaborations; researchers are probably here for 1-2 years. For OCR, even if we don't build our own system, we have one of the biggest volunteer communities, and we could use Wikisource to test machine conversion of scans into text to be proofread.
    • C Scott: if we do initial work, we can turn over to community.
  • C Scott: A translation suggestion bot that notices when you edit a passage in one language that has parallel text on another wiki, and suggests the same edit to an editor on that wiki.
  • Santhosh: Make Wikipedia beautiful (or at least functional) in all scripts: typography and layout per language.
    • Volker: Typography is one of the points of...
    • C Scott: What does it take to have one wiki where people can work in different languages?
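
As a starting point for Ryan's OCR action item, here is a minimal sketch (assumptions: the Tesseract binary, the relevant traineddata pack, and the pytesseract wrapper are installed; the file name and language code are illustrative) of running OCR on a Wikisource page scan:

    from PIL import Image
    import pytesseract

    scan = Image.open("page_scan.png")    # a Wikisource page scan exported as an image
    text = pytesseract.image_to_string(scan, lang="san")  # needs the Sanskrit pack
    print(text)                           # raw OCR output, ready for proofreading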



Cross project collaboration

  • Lydia: we need to support small and medium-sized wikis better than we do now. Take the example of Wikidata serving everyone: "oh no, there is this problem in Wikidata." There is the same problem with Commons. What can we do to make that better?
    • Benoît: Reply from some community members: let's just kill Wikidata.
    • Lydia: Wikidata has to be a central piece of that.
    • Benoît: If we manage to have discussions between wikis... People don't understand what other people [are doing]. There is an elitism between languages and wikis. E.g. there is no link to some Indian languages from the enwiki main page because they are "not quality enough". The biggest change needed is social.
  • Mingli: Question: how to push data back to Wikidata?
  • Ryan: Editing English Wikipedia is very different. Can we improve the social climate?(?)
  • Lydia: Wikidata is designed to avoid fights. For example, it is easy to state different points of view.
  • Ryan: [??] have more actionable social tools, like blocking
  • Benoît: Short or long descriptions? Who is right?
  • Matt: [We need to address]? English Wikipedia's specific needs concerning software requests, [and we don't really care about the others]. Most people in the world don't speak English. We should not be enwiki-centric and cancel projects because they are not going to be used on enwiki.
    • Lydia: It is not the case that I need English Wikipedia to care about Wikidata. But still, it would be a shame if Wikidata did not benefit more people.
  • Lucie: Wikidata is really focused on Wikipedia. We miss cross-wiki and cross-lingual projects because of that.
  • Mingli: [...] Edit Wikidata instantly. If Wikipedia had a convenient API to edit an item, then [...]?
  • Matt: Wiktionary and Commons are in the process of being connected to Wikidata. It would be nice to add Wikispecies.
    • Tpt: I don't know if Wikispecies is using Wikidata yet.
    • Lydia: Some outreach, not successful. Workshop later this year(?).
  • Tpt: Unable to create global (cross-wiki) templates and gadgets. +100
  • Matt: We are going to discuss global templates and gadgets tomorrow in session Growing the MediaWiki Technical Community
    • Lydia: does someone know what the status of these is?
    • crowd: We have wanted them for decades
  • Birgit: anything that could be action items?
  • Benoît: structured talk pages, cross-wiki talk pages that allow translations; get rid of those blank talk pages
  • Lucie: It would be good to find a way for different communities to interact on Wikidata: if you don't speak English, you cannot be part of the core community.
    • Lydia: What else could we do for Wikidata and Commons besides the language switching we have?
    • Benoît: Using automatic translation and pre-defined i18n-ed messages would be a way to ease communication. I am not a scientist, but someone could do research about that. Simple words may help. [Lucie: or use emojis!]
  • SJ: This is the essential project: How to talk to one another across projects & languages. A single [project] that needs a collaboration of devs in translation (automated and hand-confirmed), database (MW structure), and interface (showing/filtering many sources of chat).
  • Lydia: Does anyone have ideas to solve things like German Wikipedia not wanting to use Wikidata because Russian Wikipedia "is bad"?
    • SJ: ... Let people filter by source... if not ru:wp, there will always be sources which different communities consider more or less reliable
    • Lydia: ... That's not how this was supposed to work...
  • Tpt: ...
  • Matt: Alternatively we could tell these people they are wrong :) /s?
  • Lucie: ... reference?
  • Lydia: ...
  • Tpt: Concerning automated translations, we need them to be good enough to avoid misunderstandings.
  • Ryan: Build a few UI components that allow editing Wikidata from Wikipedia. Have an easier way to add references. Infoboxes whose edits write through to Wikidata. (?) (See the API sketch at the end of this section.)
  • Matt: The quality of machine translation varies massively. How do you automatically measure the quality? Currently Content Translation is pretty strict about warning you about unedited machine translation (see the similarity sketch at the end of this section).
    • Lydia: Facebook asks "did you understand this?" Translation rating.
  • Matt: on CX: the case of people who make poor translations. Allow only trusted users?
  • Mingli: I can do a demo about quality of machine translations my startup has developed.
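
For Ryan's point about editing Wikidata from Wikipedia, here is a minimal sketch of the underlying Action API write such a UI component could make (login is omitted; the item, property, and value are illustrative):

    import requests

    API = "https://www.wikidata.org/w/api.php"
    session = requests.Session()          # login step omitted for brevity

    # any write needs a CSRF token
    token = session.get(API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    # add a population statement, as an infobox edit widget might do behind the scenes
    session.post(API, data={
        "action": "wbcreateclaim",
        "entity": "Q64",                  # Berlin
        "property": "P1082",              # population
        "snaktype": "value",
        "value": '{"amount":"+3769000","unit":"1"}',   # quantity datavalue as JSON
        "token": token,
        "format": "json",
    })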
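And for Matt's question about measuring machine translation quality automatically, one simple signal (a sketch of the general idea, not Content Translation's actual implementation) is how much of the saved text is unchanged from the raw MT suggestion:

    from difflib import SequenceMatcher

    def unmodified_ratio(mt_output: str, saved_text: str) -> float:
        """How closely the saved text still matches the raw MT suggestion."""
        return SequenceMatcher(None, mt_output, saved_text).ratio()

    mt = "The city lies on the banks of the river."
    saved = "The city lies on the banks of the river."   # published without editing
    if unmodified_ratio(mt, saved) > 0.95:               # threshold is an arbitrary example
        print("Warning: this looks like mostly unedited machine translation.")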