Wikimedia Language engineering/Pune LanguageSummit November 2013/Event Notes
- 1 Open Source Language Summit - November 2013
- 1.1 Day 1
- 1.1.1 Session: Input Methods on VisualEditor (includes jQuery.ime integration)
- 1.1.2 Session: Cross project coverage for basic language support components
- 1.1.3 Session Name: FUEL Sessions
- 1.1.4 Session Name: Keyboard layout Images for documentation of input methods
- 1.1.5 Session Name: Leveraging content translation platforms for Indic languages
- 1.1.6 Session Name: Updating Lohit2 fonts to conform with the new Open Type spec for Indic scripts
- 1.1.7 Session Name: Packaging fonts
- 1.1.8 Session: Identify and document the sources of free licensed bilingual dictionaries
- 1.1.9 Session: Q & A with Behdad Esfahbod: State of the Union: Harfbuzz - Font rendering for Chrome, Android
- 1.2 Day 2
- 1.2.1 Session Name: Indic Font Specification
- 1.2.2 Session Name: Autonym Font
- 1.2.3 Session Name: Content Translation UI prototype testing session
- 1.2.4 Session Name: Onscreen Keyboards
- 1.2.5 Session Name: FUEL Sessions - Demo: Translation Quality Assessment Matrix
- 1.2.6 Session Name: Rendering of fonts on mobile apps
- 1.2.7 Session Name: "Lohit-ising" Open type fonts
- 1.2.8 Session Name: ibus-typing-booster - predictive text typing system
- 1.2.9 Session Name: Fedora SIG - UI Source Message Contextualization
Open Source Language Summit - November 2013
- Schedule: http://open-source-language-summit-2013.shdlr.com/grid
- Twitter: Hashtag #languagesummitpune
- IRC: #mediawiki-i18n on FreeNode
Session: Input Methods on VisualEditor (includes jQuery.ime integration)
- David Chan led the session and started with round-table introductions from everyone
- Santhosh introduced jQuery.IME and explained what it is for, why it was built
- David outlined how bug-filing helps: the importance of very specific version numbers, the exact keystrokes used to fire the IME, and expected versus observed behaviours. He also described the problems facing comprehensive IME support
- David demonstrated the EventLogger system capturing IME input event streams, giving a detailed run-through of several IMEs and the events that they can create
- David showed the draft automated IME testing framework he has built for VisualEditor and explained his intention to build a library of as many languages, IMEs and OSes as possible to test them.
- Santhosh discussed how jQuery.IME can help simplify the needs in VisualEditor because it doesn't operate in a different way in each script/browser/OS
- Santhosh demonstrated problems such as counts that disagree with one another (e.g. cursor positions vs. key strokes vs. Unicode code points vs. backspace positions)
- Santhosh returned to the reasons why IME difficulties are an issue for VisualEditor, due to the need to do non-native programmatic management of the contentEditable surface to support generated content blocks like images or templates
- Pau asked about the relative value of on-screen keyboards, predictive type, spell-checking, hand-writing recognition etc.
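The conflicting counts Santhosh described can be seen with a single Devanagari syllable. The sketch below is illustrative only: the sample syllable, the function name `count_units` and the transliteration keystrokes ("strI") are assumptions, not taken from the session.

```python
def count_units(text, keystrokes):
    """Return the different 'lengths' one piece of typed text can have."""
    return {
        "keystrokes": len(keystrokes),                      # keys the user pressed
        "code_points": len(text),                           # Unicode code points
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # offsets as seen by JavaScript/DOM
        "utf8_bytes": len(text.encode("utf-8")),            # bytes when stored or sent
    }

# SA + VIRAMA + TA + VIRAMA + RA + VOWEL SIGN II, displayed as one syllable.
syllable = "\u0938\u094d\u0924\u094d\u0930\u0940"

# "strI" is a hypothetical transliteration sequence; real IMEs differ.
print(count_units(syllable, "strI"))
```

The user sees one syllable, yet every layer reports a different number, which is why cursor and backspace handling need care.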
General discussion about possibilities and requirements for Indic scripts
- Particular requests for VisualEditor
- Support for native IMEs – especially for users with Windows as their OS
- In-built IME in VE (e.g. expectations of auto-convert on space/save)
- Auto-completion based on dictionaries
Volunteer language experts for Indic languages
- Samyak Bhuta, for Gujarati, samyak.bhuta @ gmail dot com
- Vijay, for Marathi, Hindi, Sanskrit, Nepali, Ahirani, mahitgar at yahoo dot co dot in
- Good mix of participants (technical and non-technical Wikipedians, OSS contributors)
- Brainstormed about handling complexities of input tools for Indic languages, trapping keystrokes, event ordering, DOM model, event logger tool
- Log submission now available, please contribute! URL: http://tinyurl.com/imelogform
- URL: https://bit.ly/ve-eventlogger, https://bit.ly/ve-imefeedback
- Submissions for Indic language IMEs are especially welcome
- OSKs vs Latin keyboards advantages/disadvantages
- Learnt a lot about Indic languages: bilingual usage, code switching, switching across languages, issues around IME usage
- Santhosh - identifying the problem definitions, patches in progress
- Abhijit - highlighted cross-browser, cross-platform differences; working with original core developers
- David - in developing for ibus - why are event sequences different? May not be possible for languages (HPN)
- OSK - T9 input optimized for mobile usage - standardized (Hari)
Session: Cross project coverage for basic language support components
- Showcasing the Language Coverage Dashboard
- Desktop Support Requirements: (Pravin Satpute talking about the Fedora world)
- Character Encoding
- Shaping Engines
- Input Methods
- OS Level Support
- Locale Definition (CLDR)
- Minimum Criteria for Language Support
- If an ISO code does not exist, the language cannot be used on the desktop
- Desktop Enhancements
- Plan to check the language coverage in WMF projects for standardized, ISO-recognised languages and assess coverage for desktop language support
- Overview of what the GSoC team developed, features developed and demoed, plans for future visualizations and features
- Fedora desktop support features as use case for LCMD (ISO-less languages are not handled for desktop)
- Extending for Fedora desktop
- Suggestion from Hari Nadig - data from LCMD can be used through a MediaWiki extension for Indic language wiki projects to show some stats for those projects (which is being developed)
- Will evaluate for Fedora desktop and implement (next step)
- Is there an option to contribute instead of forking CLDR?
- CLDR - contributing to it instead of forking - experts should review
Session Name: FUEL Sessions
FUEL color module:
FUEL date and time module:
FUEL number module:
During today's Language Summit (18 November 2013), we discussed the existing FUEL-colors module. It was observed that the current one is not definitive enough, and we came up with the following points:
- we will follow the list of colors given in http://www.w3.org/TR/css3-color/
- we will be creating two modules: fuel-colors-basic and fuel-colors-extended.
- fuel-colors-basic: http://www.w3.org/TR/css3-color/#html4
- fuel-colors-extended: http://www.w3.org/TR/css3-color/#svg-color
- This is just a proposal. If you have any issues or suggestions, let us discuss them here.
- We will probably close this discussion by 30 November, and of course we can extend this date if the discussion is prolonged.
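As a rough sketch of what the proposed fuel-colors-basic module could cover, the 16 basic color keywords from the HTML4 section of the CSS3 Color specification can be listed and checked against a translation. The names `BASIC_COLORS` and `untranslated`, and the two sample Hindi entries, are assumptions for illustration and are not part of FUEL.

```python
# The 16 basic color keywords (HTML4 set) from the CSS3 Color specification.
BASIC_COLORS = {
    "black": "#000000", "silver": "#C0C0C0", "gray": "#808080", "white": "#FFFFFF",
    "maroon": "#800000", "red": "#FF0000", "purple": "#800080", "fuchsia": "#FF00FF",
    "green": "#008000", "lime": "#00FF00", "olive": "#808000", "yellow": "#FFFF00",
    "navy": "#000080", "blue": "#0000FF", "teal": "#008080", "aqua": "#00FFFF",
}

def untranslated(translations, source=BASIC_COLORS):
    """Return the color keywords that have no localized name yet."""
    return sorted(name for name in source if not translations.get(name))

# Illustrative sample only; a real module would hold the full reviewed list.
sample_hindi = {"red": "लाल", "black": "काला"}
print(untranslated(sample_hindi))
```

Keeping the keyword as the key and localizing only the display name matches the point that color names should be localized rather than re-invented.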
FUEL - Translation Quality Assessment Matrix:
- Translation Quality Assessment Matrix (TQAM) is the first matrix under an open license for assessing translation quality.
- The participants broadly agreed that it is helpful for a translation team, community, translator or editor.
- Got some suggestions related to UI of the TQAM
Retrospective: (Siebrand) (on all 3 sessions on FUEL through the day)
- What is FUEL - Siebrand provided a short blurb on what FUEL is
- Objective to make localizations more consistent
- There are 3 collections currently and 3 in progress (color, number, date/time)
- Colors discussion:
- 250 colors taken from Wikipedia categories were reviewed, (xkcd ref: 15 instead of 14 standards - yet another standard?)
- Instead of trying to create a new collection, reusing is better - looked at W3C CSS standard - 131 colors
- Cultural bias in defining collection colors? Do we need to remove this cultural bias?
- Or should the standard be changed?
- Name of the color should be localized, not re-invented
- DTTM discussion:
- Interesting discussion, CLDR has a few flaws - have to pay money to vote on what goes into the collection
- Paid members choose from contributions
- FUEL strategy is to create a new standard; inconclusive discussions
- Few options on the table:
- Fork CLDR
- Work with CLDR and find ways to collaborate
- Create a competing standard
- Will be discussed on mailing list - progress - as to what to do next and how; Siebrand will send this email
- Number discussion:
- List was created with 1-100, ordinals
- Part of this collection was out of scope for FUEL
- Translators localizing numbers may not be useful
- Ordinals could be used as adjectives, so it is not as easy as it looks
- Rajesh will send an email to the FUEL mailing list and then decide how to fulfill those functional requirements
Session Name: Keyboard layout Images for documentation of input methods
- Latest inscript2 keymap images are captured and saved at 
- Languages that Need Help Documents: 
- Image Generation script: Python script that takes keymap filename input and shows mappings in UI.
- This works only for 1:1 mapping keymaps
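The 1:1 restriction can be illustrated with a small sketch. It is not the script discussed in the session: the one-pair-per-line input format, the function names and the sample entries are all assumptions made for illustration, and the output is a text grid rather than an image.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def parse_keymap(lines):
    """Parse 'key output' pairs (hypothetical format), allowing only 1:1 mappings."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, output = line.split(None, 1)
        if len(key) != 1:
            # Multi-key sequences cannot be drawn on a single key cap.
            raise ValueError(f"not a 1:1 mapping: {key!r}")
        mapping[key] = output
    return mapping

def render_rows(mapping):
    """Return one text line per keyboard row, using '.' for unmapped keys."""
    return [" ".join(f"{k}:{mapping.get(k, '.')}" for k in row) for row in QWERTY_ROWS]

# Sample entries for illustration only.
sample = ["# key output", "k क", "h प", "l त"]
for row in render_rows(parse_keymap(sample)):
    print(row)
```

A keymap that maps key sequences, rather than single keys, has no single key cap to label, which is why such layouts need a different kind of documentation.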
Retrospective notes: (Parag)
- Documentation done but to be uploaded
- TWN - feedback can be provided through Ask a Question?
- WMF - wikis - there is a huge problem directing these user questions, no centralized system to process comments from users (Siebrand)
- We do not have a specific feedback method at the moment other than the talk page. (Pau)
Session Name: Leveraging content translation platforms for Indic languages
- Microsoft Research
- Translation platform demo
- Discussion on various content translation components in MSFT and Google
- Web based data is key to training MT engines
Session Name: Updating Lohit2 fonts to conform with the new Open Type spec for Indic scripts
- Presentation on the idea behind Lohit2 (http://pravin-s.blogspot.in/2013/08/project-creating-standard-and-reusable.html)
- In-depth discussion of the Adobe glyph naming guidelines and problems
- Demonstration of the Kannada work done by Aravinda (https://aravindavk.in/blog/improving-kannada-fonts/)
- Sneha presented on Process followed for Lohit2 Devanagari, Gujarati
- Santhosh presented on the GSoC automated testing project.
- Session on Lohit 2 improvements
- Adobe glyph list - clarifying doubts
- Aravinda - talked about Kannada block - script specializations
- Sneha - Walkthrough of development process for Lohit
- Santhosh - walked through automated testing process
Session Name: Packaging fonts
- Fonts available in Debian
- Fonts available in Fedora
- Package as many fonts as possible in Debian, Fedora and other distributions so that they are not loaded as 'webfonts' (61 fonts in the repository) when a user is accessing Wikipedia pages.
- Compare fonts in ULS, Debian and Fedora (see links above).
- Package missing fonts for Debian/Fedora.
- Write automated 'New upstream' check for ULS.
- Update to new upstreams: https://gerrit.wikimedia.org/r/#/c/96008/
- Fedora bugs filed:
- https://bugzilla.redhat.com/show_bug.cgi?id=1031587 (tharlon-fonts)
- https://bugzilla.redhat.com/show_bug.cgi?id=1031588 (phetsarath-fonts)
- https://bugzilla.redhat.com/show_bug.cgi?id=1031603 (tuladha-jejeg-fonts)
- https://bugzilla.redhat.com/show_bug.cgi?id=1031569 (cdac-sakal-marathi-fonts)
- Debian bugs filed:
- Packaging consistencies across Fedora, Debian, Wikimedia
- 61 fonts in ULS repo - checked in Fedora or Debian - if missing adding these fonts to Fedora and Debian - Vasudev
- Aksharyogini and Sakal Marathi and Meera Tamil are getting added
- Defining mechanisms to maintain fonts: how can we automate the process? (Kartik will work on this)
- Fedora and Debian have mechanisms to check automatically
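The comparison of ULS fonts against distribution packages is essentially a set difference. The sketch below assumes the font and package names have already been collected into plain lists; the function names and the name-matching heuristic (dropping the `fonts-` prefix used by Debian and the `-fonts` suffix used by Fedora) are assumptions, and real package names will not always match so neatly.

```python
import re

def normalize(name):
    """Reduce a font or package name to a comparable key (heuristic)."""
    name = name.lower()
    name = re.sub(r"^fonts-|-fonts$", "", name)  # strip distro naming affixes
    return re.sub(r"[^a-z0-9]", "", name)        # ignore spaces and hyphens

def missing_from(uls_fonts, distro_packages):
    """Return ULS fonts that have no matching package in the distribution."""
    packaged = {normalize(pkg) for pkg in distro_packages}
    return sorted(font for font in uls_fonts if normalize(font) not in packaged)

# Illustrative input only, not the real repository contents.
print(missing_from(["Tharlon", "Sakal Marathi"], ["tharlon-fonts"]))
```

Fonts reported as missing still need a manual check, since a package may exist under a vendor-prefixed name.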
Session: Identify and document the sources of free licensed bilingual dictionaries
- Mediawiki Page
- Are there any free licensed bilingual dictionaries?
- freedict: uses a client/server model (the 'dictd' protocol).
- Among Indic languages, freedict is only available for Hindi.
- Artha: http://artha.sourceforge.net/
- No 'well defined' Wiktionary API: it will take many months to have one with Wikidata.
- Write or use API where it can be available.
- GujaratiLexicon.com API: Kartik/Samyak to work.
- Created a useful document on mediawiki.org
Session: Q & A with Behdad Esfahbod: State of the Union: Harfbuzz - Font rendering for Chrome, Android
- Santhosh: How to do testing better for Indic scripts
- Windows - testing is key - Behdad tests all bug fixes on Windows
- Open Type support on iOS
- Apple has full Open Type support now (Google is more competitive with Microsoft than Apple is, since it wants feature compatibility)
- Firefox has been shipping with Harfbuzz on every platform
- Jonathan Kew is working on testing infrastructure for Windows
- Webfonts - my vision for the web - a font file should run on every web browser
- Mobile web fonts - Noto was designed to support this use case
- Open Type Spec for Google Noto fonts
Session Name: Indic Font Specification
Notes Github repo
- Slow progress so far due to being an open call for participation
- Script ownership needed
- Add documentation to GitHub stating that you don't need to know LaTeX to contribute; doc formats work, wikis don't work for illustrations
- Regular process sync-up needed
- A new mailing list will help
- Aravinda VK
- Hari Nadig
- Ravi Pande
- In the 'General Section' add:
- Jargon - Conjunct formation rules
- Controversies on ligature usage in each language
- Serif vs sans-serif (varying stem width vs equal stem width)
- Alternate styles are being examined
- Italics are alien to Indic scripts
- Other sections needed:
- Glossaries of terms from each language (e.g. virama, halant, pulli...)
- Next steps:
- Collaborative efforts with IITB, NID etc
- Devanagari - TDIL recommendations 
- Reference docs:
- Can all contributors and prospective contributors be invited to join the GitHub repo?
Session Name: Autonym Font
- Github: https://github.com/santhoshtr/AutonymFont
- Santhosh walked through the Autonym font, which is used for rendering language names
- 460 characters currently
- Andrew Cunningham - contributed patches
- List of languages is from CLDR
- Wikipedia supports 287 languages; 300+ for other wiki projects
- Use case does not need punctuation (fallback to system font)
- Scaled for consistent height and width
- Tests need to be completed for autonym font
- Legacy systems cannot handle hinting (e.g. Windows XP)
- Full parentheses - different code point
- Source code available on github
- Issues open on autonym font
- Cross-browser, cross-platform testing
- Maintaining size optimizations by not defining rendering rules
- Varying stem width - serif
- Uniform stem width - sans-serif
- Monospace - does not exist in Indic languages
Session Name: Content Translation UI prototype testing session
- Notetaker: Jared Zimmerman
- It's a tedious process to translate articles from English to Gujarati. I use Google Translate as a means of getting the gist but have to translate manually
- I open two windows and manually translate between the two
- I use Google Translator Toolkit. They have a link to pull in the Wikipedia article automatically, with a split-screen interface and translation suggestions. It's better than nothing, but could be better. One thing that is nice about it is that you can collaborate with other users at the same time. (Pau) Do you use that?(/Pau) Yes, I have.
- "For new (non-English-speaking) users of Wikipedia, translating should be one of the easiest tasks you can do"
- (?) if linked article does not exist, would it be better to red link vs linking to original language article or wikidata item(?)
- P1 - Entry-point discover
- translation entry-point isn't obvious. (I (P1, Nayan) was searching at top right corner for "translate" button/link )
- would copy the name of the article to google the title to find a version in his language, and if it didn't exist, start translating it
- (?) more obvious translation entry-point
- "honestly I've never noticed the language list"
- Amir: It's anecdotal, but lots of people say this
- If I intend to translate it into my language I'll first go search for that version in my language before trying to create it
- The call to action was clear; the user's inclination was to translate from scratch rather than use English as a starting point: "It's easier to translate from scratch, since the language is so different, it reads so differently in Kannada"
- Amir: The prototype is built only for Dutch (Nederlands), so it's less discoverable for people who don't speak Dutch.
- Amir: The box that opens when the red interlanguage link is clicked mixes an English message with the Dutch autonym ("This page is not translated to Nederlands"). We may consider doing this whole box in the target language. Jared: or both languages?
P2 - Translation Dashboard
- understands the buckets of in progress, completed, etc
- (?) why show the same language as both an origin and a target language
- unsure why he would change the title of the article on the creation screen
- Hard for user to understand exactly what's going on with translation variations (because prototype is in Dutch)
- user understands general principle that there are options that the system is showing him
- Why do some words have translation and some have "word information" I want both types of information for all the things that I would click on
- when interacting with interwiki links, unclear what "Paste source link" is
- (?) perhaps expand interwiki link action rather than hiding them in a dropdown
P1: When you select words on the translated text it doesn't highlight the text in the original text?
P3 - Translation workflow
- (?) "Add translation" should be in destination language? or both
- Amir, technical comment: The prototype is probably made for Chrome. The "Add translation" button is displayed incorrectly in Firefox. However, Chrome has issues with rendering Bengali correctly.
- is this an automatic translation? where is it from
- What is the source of the automatic translation
- Pau enables input methods
- user adjusts auto-translated text
- Pau: Can you tell what part you've manually translated?
- Once user has scrolled down it wasn't immediately obvious what percentage of the article that she'd translated/manually translated
- Subtle progress bar was not noticed, and the color difference between auto and manual was not noticed
- (?) showing the yellow warning box might be a little annoying for users who want to start with all auto-translated text and then rewrite from that point
- (?) maybe show little flyouts when the translation percentage changes with the number
- P1 : I'll probably only translate articles that I have some familiarity with
- P1 : I saw the progress bar but it's not that noticeable, perhaps have it be the full screen width
- Groups Brainstorm
- Would want to see side by side translation from different services
- show all available services in right sidebar, including the one that is already displayed as the proposed translation (with the default highlighted)
- replace digits (numbers)
- seems like this might be an issue with the translation provider?
- user will likely have to translate these manually for now (since they aren't automatically translated by service)
- Amir: the left and right column don't synchronize properly(?)
- for interwiki links in target language (redlink/remove/source lang/wikidata) consensus seems to be between remove or red
- special syntax for redlinks to wikidata (greylinks?)
- Use untranslated links from articles that I've translated as suggested articles that I can translate, since they are already likely to be in my interest area.
- Collaboration : we created a "translation drive" with the articles that we were currently translating with Google Translator Toolkit; people would claim individual articles to contribute to.
- there were few (no) instances of people actually collaborating on a single article together.
- "I don't know if my translation is good enough to be published"
- Will this interface be used for translatewiki.net or vice versa? (No, not right now; this is optimized for long-form content with links, not short interface strings like those on translatewiki.net)
- Provide corrected information back to translation services as a means of convincing them to provide translations to us
Session Name: Onscreen Keyboards
- Pau and Praveen are having a 6-8-5 session, sketching ideas for keyboards, auto-completion, spelling etc.
- Nayan showed a sketch for a transliteration typing tool. Hari Prasad Nadig showed an existing implementation of a similar idea on Mac.
- Amir: a context menu to share additions to the spelling checker to a network dictionary (Wiktionary, Wikidata, OmegaWiki, something else, whatever). Crowdsource dictionary building. Comments:
- How to know which language? (A: by the lang attribute)
- How to check if it's correct? (A: A maintainer is needed.)
- Suggestion: Automatically add to the local dictionary.
Session Name: FUEL Sessions - Demo: Translation Quality Assessment Matrix
Session Name: Rendering of fonts on mobile apps
- There are many more mobile browsers than desktop browsers.
- CSS creates problems.
- Japanese may become vertical without a reason when it's supposed to be horizontal.
- Works fairly well on desktop.
- Has an initial implementation for mobile.
- Uses the HTML lang attribute.
- Webfonts may be very heavy on the bandwidth.
- Identifying fonts on the client:
- Render a name and measure the resulting size: if the text is rendered as tofu (missing-glyph boxes), it will have the same, expected size. If the size is different, then the font works.
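The measuring idea can be sketched independently of any browser. In the sketch below, `measure` is a hypothetical callback that returns the rendered `(width, height)` of a text for a given font stack; in a real client this would be done by measuring a DOM element. The function name and the fallback font are assumptions as well.

```python
def font_available(measure, text, font, fallback="monospace"):
    """Guess whether `font` is usable by comparing rendered sizes.

    If the candidate stack renders at exactly the size of the fallback
    alone, the font was most likely not applied to the text.
    """
    baseline = measure(text, [fallback])
    candidate = measure(text, [font, fallback])
    return candidate != baseline

# Fake measurer for demonstration: only "Lohit" changes the rendered width.
def fake_measure(text, font_stack):
    width_per_char = 10 if font_stack[0] == "Lohit" else 8
    return (width_per_char * len(text), 12)

print(font_available(fake_measure, "ab", "Lohit"))
print(font_available(fake_measure, "ab", "NotInstalled"))
```

The check can give a false negative when a font happens to have the same metrics as the fallback, so measuring more than one sample string is a reasonable precaution.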
Session Name: "Lohit-ising" Open type fonts
- Standardise the Indic fonts according to Lohit2.
- Example fonts: Samyak, Sakal Marathi.
- Follow AGL, Unicode specification.
- Standardise the glyph names. Discussions: Ravi, Pravin, Aravinda.
- Query the lohit-dev mailing list where glyph names differ.
- Pravin/Sneha: No awareness around AGL.
- Kartik: Kalapi has unused/extra glyphs which can be standardised according to Lohit2 Gujarati glyphs.
- Standards vs typography discussion.
- TDIL Standard Devanagari Script Behaviour document can be used as reference.
- Ravi: AGL is a specification; it is not a standard yet.
- Recap of steps by Pravin.
- Testing is ongoing for Beta Lohit fonts.
- Gujarati Lohit 2 is in Alpha stage, but can be used as an example.
- Pravin to blog about the 'steps' for Lohit-ising OT fonts in detail, although Sneha/Pravin's blog post contains the background and needed information.
Session Name: ibus-typing-booster - predictive text typing system
Notetaker: Pravin Satpute
- Anish and Pravin presented the idea behind ibus-typing-booster, its features, and how it will become more helpful over time.
- During testing we got around 4 bugs from the audience.
Session Name: Fedora SIG - UI Source Message Contextualization
- The session was about the need for contextualization in the source strings of applications, to ensure correct translation and that the message is conveyed correctly.
- It was presented by Shankar Prasad, who showed situations where context is genuinely needed in the source strings and also how context is added to the source code.
- Siebrand gave some useful advice on how to write context.
- The group formed for source string contextualization is named SSCG (Source String Contextualizing Group)
- Fedora SIG Page