Content translation/Product Definition/Dictionaries

Aim: Provide a reliable dictionary back end and API for CX

Approach 1 - Use a dictd server

Pros

 * Widely accepted dictionary protocol, with lots of desktop and web clients; the default dictionary clients in GNOME, KDE and macOS support this protocol.
   * But the protocol is optimised around a human searching for a single dictionary word/phrase, with a response that is parsed by the human's eyeballs. That is not a good match for CX, where we want to search for multiple words and parse the response by machine. There are both unnecessary features and missing features from a CX point of view.


 * Readily available packaged dictionaries in Debian.
   * But the record format varies enormously -- we will struggle to write "general" code that presents useful data across a number of different dictionaries.


 * Dict servers do fast lookups on the available dictionaries, and clients have no performance overhead -- see the performance testing results below.
 * Supports the following search strategies:
   * exact      Match headwords exactly
   * prefix     Match prefixes
   * nprefix    Match prefixes (skip, count)
   * substring  Match substring occurring anywhere in a headword
   * suffix     Match suffixes
   * re         POSIX 1003.2 (modern) regular expressions
   * regexp     Old (basic) regular expressions
   * soundex    Match using SOUNDEX algorithm
   * lev        Match headwords within Levenshtein distance one
   * word       Match separate words within headwords
   * first      Match the first word within headwords
   * last       Match the last word within headwords
   * This is arguably a con: the performance and security implications of a set of rarely-used search strategies all need investigating.
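For reference, these strategies are invoked through the protocol's MATCH command. A minimal sketch of building such commands from Node.js -- the quoting rules follow RFC 2229, but the helper function and the database name "wn" are illustrative, not part of any existing client:

```javascript
// Sketch only: builds RFC 2229 MATCH command lines for a dictd server.
// Transport (TCP), response parsing and the "wn" database are assumptions.

// Subset of strategies advertised by a typical dictd server.
const STRATEGIES = ['exact', 'prefix', 'substring', 'suffix', 're', 'lev', 'soundex'];

// Per RFC 2229, words containing spaces or quotes are sent as quoted strings.
function buildMatch(database, strategy, word) {
  if (!STRATEGIES.includes(strategy)) {
    throw new Error('Unknown strategy: ' + strategy);
  }
  const quoted = '"' + word.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
  return `MATCH ${database} ${strategy} ${quoted}\r\n`;
}

console.log(buildMatch('wn', 'prefix', 'pen'));
// MATCH wn prefix "pen"
```

Each lookup of a new word needs its own MATCH round trip, which is the single-word orientation complained about above.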

Cons

 * Available dictionaries vary a lot in quality; we might need to handpick dictionaries.
   * To be solved by using alternate dictionary providers, depending on availability for a language pair.
 * The Node.js client for dictd needs to be well written; the existing client, dict.json, is not that good.
   * https://gerrit.wikimedia.org/r/#/c/134074/ is to be improved further: work with the author of https://github.com/ptrm/dict.json and publish the module as a public Node.js module.
 * It is an extra burden to deploy and maintain unnecessary TCP services.
   * Since dictd comes as a Debian package, deploying it with Puppet is very easy. The default configuration is enough if dictd resides on the same server as cxserver.
 * If anything goes wrong, some unlucky operations/security people are left trying to understand the little-used RFC 2229 TCP protocol, and an even less widely used JavaScript TCP client, in the middle of the night.
   * Ops do not debug the code, and are not expected to understand the algorithm or dictd. The worst-case situation is that dictionary support via dictd is unavailable for a few hours.
 * The protocol (written in 1997) is not optimal for CX:
   * Optimised around low memory usage
   * Searches for a single word/phrase (not ideal for searching a whole paragraph's worth of phrases)
   * Weak morphological search features (a problem if the source language has complex morphology)
   * On the other hand, the lack of competing standards in this area suggests people found little problem with the existing protocol. Other standards are XDXF (https://en.wikipedia.org/wiki/XDXF) and StarDict (https://en.wikipedia.org/wiki/StarDict).

Performance, availability, load testing
Simulation: 100 concurrent users hitting the REST API (https://gerrit.wikimedia.org/r/#/c/134074/) for 2 minutes, with 2 seconds between requests.

$ siege -d2 -c100 -t 2m http://localhost:8000/dictionary/pen/en/en
** SIEGE 3.0.5
** Preparing 100 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:                  11884 hits
Availability:                 100.00 %
Elapsed time:                 119.50 secs
Data transferred:              56.92 MB
Response time:                  0.00 secs
Transaction rate:              99.45 trans/sec
Throughput:                     0.48 MB/sec
Concurrency:                    0.09
Successful transactions:       11884
Failed transactions:               0
Longest transaction:            0.06
Shortest transaction:           0.00

Approach 2 - Look up in JSON
Convert the dictionary sources to JSON format (offline) and write code that does lookups on the JSON. Provide an HTTP wrapper to allow querying the JSON.

Pros

 * Immediate:
   * Deployment is simpler (no separate dictd service)
   * Coding is simpler (no need for robust dictd client code)
   * Runtime is simpler (no possibility of dictd protocol/state issues at runtime)
 * Future:
   * Not restricted to dictd dictionaries (can use exports from terminology resources in TBX format etc.)
   * Can search every word in a paragraph at once (e.g. to highlight matches in the source text)
   * Can do things the dictd protocol doesn't support well (e.g. better word morphology support in searches)
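The whole-paragraph search could look like the following sketch, which tokenises a paragraph and looks up every distinct word in one pass -- something the dictd protocol would need one MATCH round trip per word for. The sample dictionary object is illustrative:

```javascript
// Stand-in for the converted JSON dictionary.
const dictionary = {
  pen: ['An instrument for writing with ink.'],
  paper: ['Material made of cellulose pulp.']
};

// Tokenise a paragraph and return only the words that have entries
// (e.g. for highlighting matches in the source text).
function searchParagraph(text) {
  const words = new Set(text.toLowerCase().match(/[a-z]+/g) || []);
  const hits = {};
  for (const word of words) {
    if (dictionary[word]) {
      hits[word] = dictionary[word];
    }
  }
  return hits;
}

console.log(Object.keys(searchParagraph('She put the pen on the paper.')));
// [ 'pen', 'paper' ]
```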

Cons

 * Immediate:
   * Need to extract the data (but we need the code to do this in any case)
 * Future:
   * Big memory consumption for the in-memory representation of a dictionary: the English-English Webster dictionary is 39 MB uncompressed. Node.js is not recommended for memory- or CPU-intensive operations, because it is single-threaded and can block requests for a few milliseconds to seconds, depending on the volume of data we have.
     * But if performance becomes a problem we can optimise with any HTTP-based approach (even non-Node if we like).
   * Need to re-implement search strategies. Anything beyond a native JSON key lookup adds time to the response.
     * But in many cases we will need better search than the dictd files support, especially when the source language is not English: e.g. better word conjugation support.
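Some of the dictd strategies listed earlier are cheap to re-implement over a JSON headword list. A sketch of "prefix" and "lev" (Levenshtein distance one) over an illustrative headword array:

```javascript
// Illustrative headword list extracted from the JSON dictionary.
const headwords = ['pen', 'pens', 'pencil', 'open', 'ten'];

// dictd "prefix" strategy: headwords starting with the query.
function prefixMatch(query) {
  return headwords.filter(w => w.startsWith(query));
}

// True when a and b differ by at most one edit
// (insertion, deletion or substitution).
function withinLevOne(a, b) {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  let i = 0, j = 0, edits = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { i++; j++; continue; }
    if (++edits > 1) return false;
    if (a.length > b.length) i++;        // skip a deleted character in a
    else if (a.length < b.length) j++;   // skip an inserted character in b
    else { i++; j++; }                   // substitution
  }
  return edits + (a.length - i) + (b.length - j) <= 1;
}

// dictd "lev" strategy: headwords within Levenshtein distance one.
function levMatch(query) {
  return headwords.filter(w => withinLevOne(query, w));
}
```

Strategies like soundex or full regular expressions would cost more to re-implement, which is part of the trade-off above.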