Wikimedia Language engineering/Pune LanguageSummit November 2013/Event Report

The Fall 2013 edition of the Open Source Language Summit was held in Pune, India on 18-19 November 2013. The event was organized by the Wikimedia Foundation’s Language Engineering team along with Red Hat, at the Red Hat engineering center.

Participation

Participants in the work sprints at the 2 day summit included the Wikimedia Language Engineering, VisualEditor and Mobile teams; language technology team members from Red Hat, Google, Microsoft Research, Adobe and Mifos; open source developers from communities including Swathanthra Malayalam Computing, Ankur India, IndLinux, Fedora and Debian; Wikipedians from various Indic language communities; and Google Summer of Code students.

Sessions

During the 2 days of the event, collaborative work-sessions were conducted on improving cross-platform language support, desktop and web fonts, input methods, on-screen keyboards, content translation, and language aids such as dictionaries and glossaries. Methods and tools for testing internationalized web applications were also discussed, and extensive hands-on sessions were held to extend the FUEL terminology word-lists.

Session Details : Fonts

Sessions held included:

Indic Font Specification
Autonym Font
Packaging Fonts
Updating Lohit2 fonts to conform with the new OpenType spec for Indic scripts
Q & A with Behdad Esfahbod: State of the Union on HarfBuzz

Several work-sessions focused on improving coverage of available fonts across desktop, web and mobile platforms. The latest freely licensed Aksharyogini font for Devanagari was presented and technical improvements were discussed by participating font experts. Santhosh Thottingal presented the recently released Autonym font that was created by the Wikimedia Language Engineering team to simplify display of language names on Wikimedia websites. During the sessions, webfonts not available in Linux distributions like Fedora and Debian were identified and submitted for packaging [1], [2], [3], [4], [5]. This would significantly improve native font support and complement webfonts for multilingual web content on Wikipedia pages. Kartik Mistry, Pravin Satpute, Vasudev Kamath and other participants will be following up on the bugs filed during the sessions.

At the Language Summit held earlier this year, Santhosh Thottingal and Pravin Satpute had initiated a project to document the technical specifications of fonts for India’s language scripts. The project, named Fontbook, is based on the OpenType font specification. The specification consists of sections common to all the scripts as well as sections specific to each script. These sections were expanded, and recent recommendations from organisations like W3C and TDIL were discussed for inclusion. The project has been moved to a public repository on GitHub, and participants from more Indian languages are being invited to contribute. Over the next few months, the specification will be extended to at least 8 Indian languages.

Pravin Satpute and Kartik Mistry led the work-sessions on applying the technical specifications of the Lohit font family to other Indian language fonts such as Samyak. The Lohit font family, used as the primary font for a large number of Indian language scripts in Fedora and Red Hat Enterprise Linux, has been significantly tweaked over the years to render the complex Indian scripts seamlessly across platforms. During the Language Summit, Pravin Satpute and Sneha Kore presented their work on the next version of the Lohit font family, which is intended to comply with the latest OpenType 1.6 specification and with the HarfBuzz-ng rendering engine. This will make the modifications applied to previous versions redundant. It is expected that this effort will complement the extended specification to be accomplished through the Fontbook project.

Behdad Esfahbod, lead developer of the HarfBuzz project, joined in remotely and presented on the current level of font rendering and support on Chrome, FirefoxOS and Android. Esfahbod envisions better support for webfonts and discussed cross-platform testing practices for Indian scripts, especially with a large volume of content such as Wikipedia pages.

Session Details : Input Methods & Onscreen Keyboards

Session List:

Input Methods on VisualEditor (includes jQuery.ime integration)
Keyboard layout Images for documentation of input methods
Onscreen Keyboards
ibus-typing-booster - predictive text typing system

Work-sessions on input methods focused on aspects such as onscreen keyboards, predictive typing, and input method help. A separate session focused on improving input on the VisualEditor for non-Latin scripts, drawing on feedback from the implementation so far.

Interaction Designer Pau Giner and Google Summer of Code student Praveen Singh hosted a 6-8-5 ideation session to gather ideas for making on-screen keyboards more useful. Suggestions from the audience included a transliteration typing tool and the addition of context menus to enhance spell checkers and dictionaries. Earlier, Praveen Singh showcased on-screen keyboards for the jquery.ime library, which were kicked off as a Google Summer of Code project earlier this year, mentored by Santhosh Thottingal. Ideas gathered during this ideation session are expected to provide new feature guidance for the jquery.ime on-screen keyboards currently in development.

Parag Nemade led the session on creating images of input method layouts using a script written in Python. The script currently works only for layouts that follow a one-to-one character mapping of keys. During the session, input methods in the ibus and jquery.ime libraries that currently lack layout images and can be mapped through the script were identified, and their images were created. Discussions during the retrospective revealed that current users do not have a quick way to provide feedback about their experience while using input methods. Over the next few weeks, more help images will be created and the Python script will be modified to extend this functionality to input methods that do not use 1:1 key mappings.
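
To make the one-to-one constraint concrete, here is a minimal sketch of how a layout image could be generated from such a mapping. It is not the script used at the summit: the function name `layout_to_svg`, the `QWERTY_ROWS` table and the `SAMPLE_MAPPING` data are all assumptions made for illustration, and the sample mapping is not taken from any actual ibus or jquery.ime layout definition.

```python
from xml.sax.saxutils import escape

# Letter rows of a US QWERTY keyboard (digits and punctuation omitted for brevity).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Hypothetical sample mapping for illustration only; not an actual
# ibus or jquery.ime layout definition.
SAMPLE_MAPPING = {"k": "क", "c": "म", "v": "न", "j": "र", "l": "त"}


def layout_to_svg(mapping, key_size=40, gap=4):
    """Render a one-to-one key mapping as an SVG keyboard image (returned as a string)."""
    step = key_size + gap
    width = len(QWERTY_ROWS[0]) * step + gap
    height = len(QWERTY_ROWS) * step + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for row_index, row in enumerate(QWERTY_ROWS):
        # Stagger each row by half a key, roughly as on a physical keyboard.
        x_offset = gap + row_index * (step // 2)
        y = gap + row_index * step
        for col_index, key in enumerate(row):
            x = x_offset + col_index * step
            # Key outline.
            parts.append(
                f'<rect x="{x}" y="{y}" width="{key_size}" height="{key_size}" '
                'fill="white" stroke="black"/>'
            )
            # Small label in the corner: the physical key.
            parts.append(
                f'<text x="{x + 4}" y="{y + 12}" font-size="10">{escape(key)}</text>'
            )
            # Large centred label: the character this key produces, if mapped.
            produced = mapping.get(key, "")
            if produced:
                parts.append(
                    f'<text x="{x + key_size // 2}" y="{y + key_size - 8}" '
                    f'font-size="18" text-anchor="middle">{escape(produced)}</text>'
                )
    parts.append("</svg>")
    return "\n".join(parts)
```

Writing the string returned by `layout_to_svg(SAMPLE_MAPPING)` to a file such as `layout.svg` gives an image that can be opened in a browser. Because each key carries exactly one output character, a plain dictionary is enough here; layouts where a sequence of keys produces one character would need a richer data structure.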

Anish Patil, from the Red Hat internationalization (i18n) team, showcased the indic-typing-booster, a predictive typing method developed and maintained by the i18n team. Several bugs were identified during the tool walkthrough. Anish also walked through the web-word-edit project, through which a list of words can be curated and validated for use in systems that rely heavily on suggestions from large word lists. The indic-typing-booster currently uses the word-lists from the Hunspell dictionaries, and web-word-edit is an effort to improve the typing predictions. The project is available at http://webwordedit-wwe.rhcloud.com/ and is open for participation.
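
The dependence of predictions on the quality of the word list can be illustrated with a small sketch of prefix-based suggestion. This is not the indic-typing-booster implementation: the `PrefixPredictor` class and the `SAMPLE_COUNTS` data, including the usage counts, are assumptions made up for illustration.

```python
from bisect import bisect_left

# Hypothetical word list with made-up usage counts, for illustration only.
SAMPLE_COUNTS = {"नमस्ते": 50, "नमक": 20, "नमन": 5, "भारत": 40}


class PrefixPredictor:
    """Suggest completions for a typed prefix from a curated word list."""

    def __init__(self, word_counts):
        self._counts = dict(word_counts)
        # Sorting once lets every lookup jump straight to the matching range.
        self._words = sorted(self._counts)

    def suggest(self, prefix, limit=3):
        if not prefix:
            return []
        # All words sharing a prefix are contiguous in sorted order.
        start = bisect_left(self._words, prefix)
        matches = []
        for word in self._words[start:]:
            if not word.startswith(prefix):
                break
            matches.append(word)
        # Most frequently used words first; ties broken alphabetically.
        matches.sort(key=lambda word: (-self._counts[word], word))
        return matches[:limit]
```

With the sample data, `PrefixPredictor(SAMPLE_COUNTS).suggest("नम")` offers the three words beginning with that prefix, ordered by count. A misspelt or invalid entry in the word list would be suggested just as readily as a correct one, which is why a curated and validated list matters for this kind of system.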

The session led by David Chan and Santhosh Thottingal on enabling more input methods on the VisualEditor was one of the highlights of the first day. The session brought together engineers from the Wikimedia Language Engineering and VisualEditor teams. Indian language Wikipedians, who have been providing significant feedback since the VisualEditor was enabled on Wikimedia websites, also contributed. David demonstrated the event logger system built for capturing IME input events. Available at http://tinyurl.com/imelog, it is being used as an automated IME testing framework to build a library of similar events across IMEs, operating systems and languages. Santhosh stepped through several complexities of handling input to support the VisualEditor’s inherent need to provide non-native support for the special handling of language content blocks within the contentEditable surface, which are tough use cases. He also walked through how jQuery.ime can support VisualEditor’s needs, as it does not operate differently for each script, operating system or browser.

This was followed by a brainstorming and ideation discussion that explored the possibilities of using onscreen keyboards, predictive typing, handwriting recognition, dictionary-based auto-completion and special support features for Indic scripts on VE. Issues surrounding the use of Indic languages, such as bilingual use, were also highlighted. David made a call for more participation through the IME log form to collect special use cases, and will also gather more insight by learning from the ibus system with Parag Nemade’s help.

Session Details : Language Support and Testing

Session List:

Cross project coverage for basic language support components
Testing Internationalized Apps

Wikimedia Foundation’s Google Summer of Code student Harsh Kothari, his mentor Runa Bhattacharjee, and Sucheta Ghoshal demonstrated the Language Coverage Matrix Dashboard (LCMD). The project, initiated during the last Language Summit in February 2013, aims to display the status of language support on Wikimedia projects through data and visualizations. The team highlighted special cases in the Wikimedia language projects that deviate from the standard language support in other environments, such as desktops that use locale definitions. Pravin Satpute presented the desktop i18n support requirements that are followed for enabling support for languages and locales on Fedora. Both the LCMD development team and the Fedora team identified their next steps during the session. The LCMD team will extend the database to better support queries, introduce reports, and work on a maintenance plan and roadmap. The Fedora team, which currently provides resources for fewer languages than the Wikimedia projects, will identify the gap using the LCMD data and assess the resources that can be leveraged to enhance support on desktops.

Later in the day, Amir Aharoni led the session on testing internationalized applications. He demonstrated the current workflow that the Wikimedia Language Engineering team uses to prepare tests, both automated and manual, for language features in development and for periodic releases such as the monthly release of the MediaWiki Language Extension Bundle (MLEB). He also ran a few tests locally on the Universal Language Selector (ULS). David Chan of the VisualEditor team presented the IME event-capturing logger, with which he intends to build an event library for an automated IME testing framework. More languages are intended to be supported, with short and long strings, to provide a wider coverage of events. Next steps for the Language Engineering team include stabilizing the current set of tests being automated and adding more tests.

Session Details : Content Translation

Sessions Included:

Leveraging content translation platforms for Indic languages
Content Translation UI prototype testing session
FUEL Sessions
Fedora SIG - UI Source Message Contextualization

Dr. Kalika Bali from Microsoft Research Labs presented on leveraging content translation platforms for Indian languages. Microsoft Translator currently offers language pairs for 41 languages. She highlighted that for Indic languages, research and training of machine translation (MT) is fairly recent, and these languages are not handled well in the Bing and Google Translate apps. MT for Indic languages, as well as for long-tail languages, could be improved significantly by using web-scale content, that is, data in multiple languages, for training and for improving accuracy.

Pau Giner led a testing session to gather feedback about the content translation user interface prototype. The prototype is anticipated to be supported as a beta feature to give multilingual users a glimpse of the actual user interface. In this session, he conducted the tests with 3 participants from different language communities. The users’ responses and actions were noted by Jared Zimmerman.

Several sessions across the 2 days were dedicated to enhancing the FUEL word-lists. Three new modules (color, number and datetime) were recently added for evaluation. The terminology words in these 3 lists were evaluated by Red Hat Language team members and Siebrand Mazeland. It was concluded that the list of colors will be divided into 2 groups, basic and extended, drawn up from the W3C list for CSS v3 [6][7][8]. The datetime module is already covered through CLDR and the numbers module is partly out of scope. The discussions on both these modules will continue on the FUEL project mailing lists. In another session, Rajesh Ranjan presented the Translation Quality Assessment Matrix, which lists several error points in translated content that can be checked by an editor.

Shankar Prasad led a session on the Fedora Special Interest Group (SIG) for message contextualization. The group aims to work closely with the upstream developers who initially author user interface messages, and to find ways of adding context to those messages in order to improve translations of applications in Fedora. This feature is already present in translatewiki.net (message documentation). Siebrand Mazeland shared his insights and experience from translatewiki.net with the Fedora team.

Session Details: Mobile

Session List:

Rendering of fonts on mobile
Mobile Input Methods - The next generation

On day 2, several aspects of language support on the mobile platform were showcased. Juliusz Gonera and Ryan Kaldari from the Wikimedia Mobile team led discussions on the current state of font rendering on mobile apps. They mentioned several cases where CSS creates rendering problems; for instance, text orientation for Japanese may be affected. Webfonts currently have an initial implementation for mobile, but may be heavy on bandwidth.

David Chan walked through LiteratIM, a predictive keyboard he developed for the Android platform. This input method uses a bilingual dictionary for its predictions and provides suggestions for 2 languages based on the word typed in one language. Currently, it supports Welsh-English pairings.
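
As a rough illustration of predicting in two languages from one typed word, the sketch below pairs each completion in the typed language with its dictionary equivalent. It is not based on the LiteratIM source: the `bilingual_suggestions` function and the small `WELSH_ENGLISH` dictionary are assumptions chosen for this example.

```python
# Hypothetical miniature Welsh-English dictionary, for illustration only.
WELSH_ENGLISH = {
    "cath": "cat",
    "ci": "dog",
    "coffi": "coffee",
    "croeso": "welcome",
    "diolch": "thanks",
    "llyfr": "book",
}


def bilingual_suggestions(prefix, dictionary, limit=3):
    """Return (completion, translation) pairs for words starting with the prefix."""
    if not prefix:
        return []
    pairs = []
    # Scan entries alphabetically so the suggestions come out in a stable order.
    for source_word in sorted(dictionary):
        if source_word.startswith(prefix):
            pairs.append((source_word, dictionary[source_word]))
    return pairs[:limit]
```

For example, typing `cr` against the sample dictionary yields the pair `("croeso", "welcome")`, so the keyboard could offer either word. A linear scan is adequate for a sketch; a dictionary of realistic size would call for an indexed structure such as a sorted list or a trie.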

Hari Prasad Nadig walked through the WikiTrack app, which can be used to track Wikipedia edits and improve interactions between editors across various Indian languages (Kannada, Tamil, Malayalam, Sanskrit). Useful discussions also surfaced around user experience in multilingual apps, as well as technical issues such as font support in native Android or iOS apps.

Round up

Over 60 developers from several organisations and developer community groups participated in November’s Language Summit, working together on collaborative projects across teams from the Wikimedia Foundation, Red Hat, Fedora, Debian, Google, Mozilla, Microsoft Research, Adobe and Indic language computing groups such as Ankur, FUEL, IndLinux and SMC. Our thanks to Red Hat for generously hosting the Summit at their Pune engineering center. We would also like to thank shdlr.com for providing us access to host the session schedule.