Content translation/Product Definition/Dictionaries

Aim: Provide a reliable dictionary backend and API for CX.

Freely licensed multilingual dictionaries are not widely available. Dictionaries based on the DICT protocol are packaged in GNU/Linux distributions, but their quality varies a lot. Many websites let users look up the meaning of words, but they are designed with human users in mind and do not provide software-consumable (structured) data. Wiktionary has a lot of dictionary content, but its data is not well structured. Considering these real-world problems, the dictionary backend of CX is flexible enough to support multiple backend providers while it tries to expose a general dictionary lookup REST API.
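
The "multiple providers behind one lookup API" idea could look roughly like the sketch below. This is only an illustration: the provider signature, the result fields (`source`, `translations`, `provider`) and the sample entries are assumptions made for this example, not the actual CX server code or data.

```python
from typing import Callable, Optional

# Hypothetical provider signature: (word, source_lang, target_lang) -> result or None.
Provider = Callable[[str, str, str], Optional[dict]]


def make_static_provider(name: str, entries: dict) -> Provider:
    """Build a toy provider backed by an in-memory table keyed by (word, src, tgt)."""
    def provider(word: str, source: str, target: str) -> Optional[dict]:
        translations = entries.get((word.lower(), source, target))
        if translations is None:
            return None
        return {"source": word, "translations": translations, "provider": name}
    return provider


def lookup(word: str, source: str, target: str, providers: list) -> Optional[dict]:
    """Ask each provider in order and return the first answer that is not None."""
    for provider in providers:
        result = provider(word, source, target)
        if result is not None:
            return result
    return None


# Illustrative entries only; not taken from any real dictionary file.
dictd_like = make_static_provider("dictd", {("egg", "en", "es"): ["huevo"]})
json_like = make_static_provider("json", {("pen", "en", "es"): ["bolígrafo", "pluma"]})

print(lookup("pen", "en", "es", [dictd_like, json_like]))
```

The order of the provider list decides which backend wins for a language pair, so handpicked dictionaries can simply be placed first.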

Pros

 * 1) Widely accepted dictionary protocol with many desktop and web clients. The default dictionary clients in GNOME, KDE and macOS support this protocol.
 * 2) Readily available packaged dictionaries in Debian.
 * 3) Dict servers do fast lookups on the available dictionaries, and clients do not incur any performance overhead. See the performance testing results.
 * 4) Supports the following search strategies:
   * exact      Match headwords exactly
   * prefix     Match prefixes
   * nprefix    Match prefixes (skip, count)
   * substring  Match substring occurring anywhere in a headword
   * suffix     Match suffixes
   * re         POSIX 1003.2 (modern) regular expressions
   * regexp     Old (basic) regular expressions
   * soundex    Match using SOUNDEX algorithm
   * lev        Match headwords within Levenshtein distance one
   * word       Match separate words within headwords
   * first      Match the first word within headwords
   * last       Match the last word within headwords
   * Note: this is arguably a Con: the performance and security implications of a set of rarely used search strategies all need investigating.
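
To make a few of these strategies concrete, here is a small sketch of how exact, prefix, suffix, substring, lev and word matching behave on a list of headwords. It is not how dictd implements them; the headword list is made up, and treating "lev" as "distance at most one" (so exact matches are included) is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Each strategy is a predicate over (query, headword).
STRATEGIES = {
    "exact": lambda q, h: h == q,
    "prefix": lambda q, h: h.startswith(q),
    "suffix": lambda q, h: h.endswith(q),
    "substring": lambda q, h: q in h,
    "lev": lambda q, h: levenshtein(q, h) <= 1,  # assumption: distance 0 also matches
    "word": lambda q, h: q in h.split(),
}


def match(strategy: str, query: str, headwords: list) -> list:
    """Return the headwords that satisfy the named strategy for the query."""
    test = STRATEGIES[strategy]
    return [h for h in headwords if test(query, h)]


# Made-up headwords for illustration.
HEADWORDS = ["pen", "pencil", "open", "pet", "fountain pen"]

for name in STRATEGIES:
    print(name, match(name, "pen", HEADWORDS))
```

Every strategy here scans the whole headword list, which hints at why the cost of rarely used strategies is worth investigating.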

Cons

 * 1) Available dictionaries vary a lot in quality, so we might need to handpick dictionaries. This can be addressed by using alternative dictionary providers, depending on what is available for each language pair.
 * 2) The protocol is optimised around a human searching for a single dictionary word/phrase, with a response that is parsed by the human's eyeballs
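
The second point can be seen in the shape of a DICT `DEFINE` reply: status lines (150, 151, 250) are machine-readable, but the definition body is free text terminated by a line containing only ".". The sketch below parses a hand-written reply in that shape; the sample text is illustrative rather than captured from a real server, and the function name and result fields are assumptions for this example.

```python
import shlex


def parse_define_response(raw: str) -> list:
    """Split a DICT DEFINE reply into one record per 151 block."""
    definitions = []
    current = None
    for line in raw.splitlines():
        if current is None:
            if line.startswith("151 "):
                # Status line layout: 151 word database "description"
                word, database, description = shlex.split(line[4:])[:3]
                current = {"word": word, "database": database,
                           "description": description, "lines": []}
            # Other status lines (150, 250, ...) carry no definition text.
        elif line == ".":
            # A lone "." ends the text block of the current definition.
            current["text"] = "\n".join(current.pop("lines"))
            definitions.append(current)
            current = None
        else:
            # Text lines starting with "." are sent with the dot doubled.
            if line.startswith(".."):
                line = line[1:]
            current["lines"].append(line)
    return definitions


# Hand-written sample reply, for illustration only.
SAMPLE = (
    '150 1 definitions retrieved\r\n'
    '151 "pen" wn "WordNet (r) 3.0 (2006)"\r\n'
    'pen\r\n'
    '    n 1: a writing implement with a point from which ink flows\r\n'
    '.\r\n'
    '250 ok\r\n'
)

for definition in parse_define_response(SAMPLE):
    print(definition["database"], repr(definition["text"]))
```

Even after this step the `text` field is still an unstructured blob, so each dictionary needs its own further parsing before it yields word correspondences.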

Performance, availability, load testing
Simulation: 100 concurrent users hitting the REST API (https://gerrit.wikimedia.org/r/#/c/134074/) for 2 minutes, with a random delay of up to 2 seconds between requests.

 $ siege -d2 -c100 -t 2m http://localhost:8000/dictionary/pen/en/en
 ** SIEGE 3.0.5
 ** Preparing 100 concurrent users for battle.
 The server is now under siege...
 Lifting the server siege... done.

 Transactions:                  11884 hits
 Availability:                 100.00 %
 Elapsed time:                 119.50 secs
 Data transferred:              56.92 MB
 Response time:                  0.00 secs
 Transaction rate:              99.45 trans/sec
 Throughput:                     0.48 MB/sec
 Concurrency:                    0.09
 Successful transactions:       11884
 Failed transactions:               0
 Longest transaction:            0.06
 Shortest transaction:           0.00
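
As a quick sanity check, the reported rates can be recomputed from the raw counts in the siege output. The average response size is not reported by siege; it is derived here, assuming 1 MB = 1024 KB.

```python
# Figures copied from the siege report above.
transactions = 11884
elapsed_s = 119.50
data_mb = 56.92

transaction_rate = transactions / elapsed_s       # transactions per second
throughput = data_mb / elapsed_s                  # MB per second
avg_response_kb = data_mb * 1024 / transactions   # derived, assumes 1 MB = 1024 KB

print(f"Transaction rate: {transaction_rate:.2f} trans/sec")
print(f"Throughput:       {throughput:.2f} MB/sec")
print(f"Average response: {avg_response_kb:.1f} KB")
```

With about 100 users each waiting on average one second between requests, a rate close to 100 transactions per second means the test was limited by the configured delay rather than by the server.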

JSON Dictionary file
Convert the dictionary sources to JSON format (offline) and write code that does lookups on the JSON.
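
A minimal sketch of that idea follows. The input format (tab-separated `headword<TAB>translation` lines), the function names and the sample entries are all assumptions made for illustration; real dictionary sources will need their own extraction code.

```python
import json
from collections import defaultdict


def build_index(lines) -> dict:
    """Offline step: turn 'headword<TAB>translation' lines into a lookup table."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        headword, _, translation = line.partition("\t")
        if translation:
            index[headword.lower()].append(translation)
    return dict(index)


# Made-up source lines standing in for a real dictionary export.
SOURCE = ["# en-es sample", "egg\thuevo", "pen\tbolígrafo", "pen\tpluma"]

# Offline: serialise once. At runtime: load the JSON and keep it in memory.
serialized = json.dumps(build_index(SOURCE), ensure_ascii=False, sort_keys=True)
dictionary = json.loads(serialized)


def lookup(word: str) -> list:
    """Runtime step: a plain key lookup on the loaded JSON object."""
    return dictionary.get(word.lower(), [])


print(lookup("Pen"))
```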

Pros

 * Immediate:
   * Deployment is simpler (no separate dictd service)
   * Coding is simpler (no need for robust dictd client code)
   * Runtime is simpler (no possibility of dictd protocol/state issues at runtime)
 * Future:
   * Not restricted to dictd dictionaries (can use exports from terminology resources in TBX format etc).
   * Can search every word in a paragraph at once (e.g. to highlight matches in the source text)
   * Can do things the dictd protocol doesn't support well (e.g. better word morphology support in searches)
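
The "search every word in a paragraph at once" point could be sketched as follows, assuming the dictionary is already loaded as a plain mapping. The tokenisation by `\w+` and the sample entries are simplifications for illustration.

```python
import re

# Made-up entries standing in for a loaded JSON dictionary.
DICTIONARY = {"egg": ["huevo"], "pen": ["bolígrafo", "pluma"]}


def find_matches(paragraph: str, dictionary: dict) -> list:
    """Return (start, end, word, translations) for every word found in the dictionary."""
    matches = []
    for token in re.finditer(r"\w+", paragraph):
        key = token.group(0).lower()
        if key in dictionary:
            matches.append((token.start(), token.end(), token.group(0), dictionary[key]))
    return matches


for start, end, word, translations in find_matches("The pen is next to the Egg.", DICTIONARY):
    print(start, end, word, translations)
```

The character offsets are what a client would need in order to highlight the matches in the source text.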

Cons

 * Immediate:
   * Need to extract the data (but we need the code to do this in any case).
   * Each dictionary needs data mining separately to get good word correspondences
   * The extraction process is quite simple (a few lines of code)
   * The size is quite small (~240K of uncompressed JSON for 6000 word pairs)
   * The data quality will vary in many ways (number of headwords, subject, richness of information etc)
   * Caution is required before assuming information will be useful to the user (does a translator really need to know what simple words like "you" mean?)
 * Future:
   * Large memory consumption for the in-memory representation of the dictionary. The English-English Webster dictionary is 39 MB uncompressed. Node.js is not recommended for memory- or CPU-intensive operations because it is single-threaded, so a heavy lookup can block request handling for anywhere from a few milliseconds to seconds, depending on the volume of data. If performance becomes a problem we can optimise with any HTTP-based approach (even non-Node if we like).
   * Need to re-implement search strategies. Any lookup that is not a native JSON key lookup adds time to the response. But in many cases we will need better search than the dictd files support, especially when the source language is not English: e.g. better word conjugation support.
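
As a rough idea of what a re-implemented search with some morphology support might involve, the sketch below falls back from the exact form to naively suffix-stripped forms. The suffix list is a deliberately crude English-only assumption; real conjugation support would need language-specific analysis, and the function names and entries are hypothetical.

```python
# Crude, English-only suffix list used purely for illustration.
SUFFIXES = ("ing", "ed", "es", "s")

# Made-up entries standing in for a loaded JSON dictionary.
DICTIONARY = {"pen": ["bolígrafo", "pluma"], "walk": ["caminar"]}


def candidates(word: str):
    """Yield the word itself, then forms with a known suffix stripped."""
    word = word.lower()
    yield word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            yield word[: -len(suffix)]


def lookup_with_fallback(word: str, dictionary: dict):
    """Return (matched_form, translations) for the first candidate found, else None."""
    for form in candidates(word):
        if form in dictionary:
            return form, dictionary[form]
    return None


print(lookup_with_fallback("Pens", DICTIONARY))
print(lookup_with_fallback("walking", DICTIONARY))
```

Each extra candidate is one more key lookup, which is the added response time the point above refers to.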

API
API URL: dictionary/word/sourceLanguage/targetLanguage
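
A client could assemble a request URL from this pattern as sketched below. The helper name is hypothetical, the host is the one used in this page's example, and percent-encoding each path segment is an assumption about how words containing spaces or non-ASCII characters should be sent.

```python
from urllib.parse import quote

# Host taken from this page's example URL.
BASE_URL = "http://cxserver.wmflabs.org"


def dictionary_url(word: str, source: str, target: str, base: str = BASE_URL) -> str:
    """Build dictionary/word/sourceLanguage/targetLanguage with encoded path segments."""
    parts = [quote(part, safe="") for part in (word, source, target)]
    return f"{base}/dictionary/" + "/".join(parts)


print(dictionary_url("egg", "en", "es"))
print(dictionary_url("ice cream", "en", "es"))
```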

Example http://cxserver.wmflabs.org/dictionary/egg/en/es

If the backend cannot provide structured information like this, output will be

Sister projects
Look up a word via Wiktionary or Wikipedia interwiki links. For instance, names of people and places are not always present in a classic dictionary. Should Content Translation be able to query the interwiki API of these two projects (or the Wikidata sister links) to assist with the transliteration routine?
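
If such a lookup were added, one option might be the MediaWiki action API's `langlinks` property. The sketch below only builds the request URL and reads a title out of a reply; the helper names are hypothetical and the sample reply is hand-written in the shape of the default JSON format, not recorded from a live wiki.

```python
from urllib.parse import urlencode


def langlinks_url(title: str, source: str, target: str) -> str:
    """Build an action API query asking for the target-language link of a page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target,
        "format": "json",
    }
    return f"https://{source}.wikipedia.org/w/api.php?" + urlencode(params)


def extract_langlink(response: dict, target: str):
    """Return the linked title in the target language, or None if absent."""
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            if link.get("lang") == target:
                return link.get("*")
    return None


# Hand-written sample reply; the page id "1" is a placeholder.
SAMPLE_RESPONSE = {
    "query": {"pages": {"1": {
        "title": "Nelson Mandela",
        "langlinks": [{"lang": "ru", "*": "Мандела, Нельсон"}],
    }}}
}

print(langlinks_url("Nelson Mandela", "en", "ru"))
print(extract_langlink(SAMPLE_RESPONSE, "ru"))
```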