Content translation/Product Definition/Dictionaries

Aim: Provide a reliable dictionary back end and API for CX

Pros

 * 1) Widely accepted dictionary protocol with many desktop and web clients. The default dictionary clients in GNOME, KDE and Mac OS support this protocol.
 * 2) Packaged dictionaries are readily available in Debian.
 * 3) Dict servers do fast lookups on the available dictionaries, and clients have no performance overhead. See the performance testing results.
 * 4) Supports the following search strategies:
   * exact      Match headwords exactly
   * prefix     Match prefixes
   * nprefix    Match prefixes (skip, count)
   * substring  Match substring occurring anywhere in a headword
   * suffix     Match suffixes
   * re         POSIX 1003.2 (modern) regular expressions
   * regexp     Old (basic) regular expressions
   * soundex    Match using SOUNDEX algorithm
   * lev        Match headwords within Levenshtein distance one
   * word       Match separate words within headwords
   * first      Match the first word within headwords
   * last       Match the last word within headwords
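
For illustration, a client selects one of these strategies per request through the MATCH command of the DICT protocol (RFC 2229). The Python sketch below only formats a command line and parses the body of a reply; it does not open a connection. The function names and the sample database names (gcide, wn) are assumptions made for this example and are not part of cxserver or dict.json.

```python
def build_match_command(word, strategy="prefix", database="*"):
    """Format an RFC 2229 MATCH command line (nothing is sent)."""
    # Quote the word so a multi-word headword stays a single protocol token.
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'MATCH {database} {strategy} "{escaped}"\r\n'


def parse_match_lines(lines):
    """Parse the text block that follows a 152 status line.

    Each line has the form: database "headword". A lone dot ends the block.
    """
    matches = []
    for line in lines:
        if line == ".":
            break
        database, _, headword = line.partition(" ")
        matches.append((database, headword.strip('"')))
    return matches


# Hypothetical exchange: database "*" asks the server to search all dictionaries.
print(repr(build_match_command("pen")))
print(parse_match_lines(['gcide "Pen"', 'wn "pen name"', "."]))
```

Keeping the formatting and parsing separate from the socket handling makes both parts easy to test without a running dictd.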

Cons

 * 1) Available dictionaries vary a lot in quality, so we might need to handpick dictionaries.
   * To be solved by using alternate dictionary providers, depending on availability for each language pair.
 * 2) The Node.js client for dictd needs to be well written. The existing client, dict.json, is not that good.
   * https://gerrit.wikimedia.org/r/#/c/134074/ is to be improved further: work with the author of https://github.com/ptrm/dict.json and publish the module as a public Node.js module.
 * 3) It is an extra burden to deploy and maintain unnecessary TCP services.
   * Since dictd is available as a Debian package, deploying it with Puppet is very easy. The default configuration is enough if dictd resides on the same server as cxserver.
 * 4) If anything goes wrong, some unlucky operations/security people are left trying to understand the little-used RFC 2229 TCP protocol, and an even less widely used JavaScript TCP client, in the middle of the night.
   * Ops is not expected to debug the code or to understand the algorithm or dictd. In the worst case, dictionary support through dictd is unavailable for a few hours.
 * 5) The protocol was written in 1997.
   * The lack of competing standards in this area also indicates that people found little problem with the existing protocol.
   * Other standards are https://en.wikipedia.org/wiki/XDXF and https://en.wikipedia.org/wiki/StarDict For StarDict

Performance, availability, load testing
Simulation: 100 concurrent users hitting the REST API (https://gerrit.wikimedia.org/r/#/c/134074/) for 2 minutes, each waiting a random delay of up to 2 seconds between requests (siege -d2).

    $ siege -d2 -c100 -t 2m http://localhost:8000/dictionary/pen/en/en
    ** SIEGE 3.0.5
    ** Preparing 100 concurrent users for battle.
    The server is now under siege...
    Lifting the server siege... done.

    Transactions:                  11884 hits
    Availability:                 100.00 %
    Elapsed time:                 119.50 secs
    Data transferred:              56.92 MB
    Response time:                  0.00 secs
    Transaction rate:              99.45 trans/sec
    Throughput:                     0.48 MB/sec
    Concurrency:                    0.09
    Successful transactions:       11884
    Failed transactions:               0
    Longest transaction:            0.06
    Shortest transaction:           0.00
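
The summary figures can be cross-checked with simple arithmetic. The Python sketch below recomputes the transaction rate and throughput from the reported totals; the variable names are chosen for this example only.

```python
# Totals copied from the siege report above.
hits = 11884
elapsed_s = 119.50
data_mb = 56.92

rate = hits / elapsed_s           # transactions per second
throughput = data_mb / elapsed_s  # megabytes per second

print(f"{rate:.2f} trans/sec")      # 99.45 trans/sec
print(f"{throughput:.2f} MB/sec")   # 0.48 MB/sec
```

Both values agree with the report, so the rate and throughput lines are derived directly from the hit count, data volume and elapsed time.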

Approach 2 - Look up in JSON
Convert the dictionary sources to JSON format (offline) and write code that looks up entries in the JSON. Provide an HTTP wrapper to allow querying the JSON.

Pros

 * 1) No dependency on an external server. It can be a self-contained module.

Cons

 * 1) High memory consumption for the in-memory representation of the dictionary. The English-English Webster dictionary is 39 MB uncompressed. Node.js is not recommended for memory- or CPU-intensive operations because it is single-threaded, so a heavy lookup can block other requests for anywhere from a few milliseconds to seconds, depending on the volume of data.
 * 2) Search strategies need to be reimplemented. Any lookup that is not a native JSON key lookup adds time to the response.
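
To make the second point concrete, here is a minimal Python sketch of what reimplementing a few strategies over an in-memory JSON dictionary could look like. It is an illustration only, not the proposed module: the sample entries, the `lookup` function and the set of supported strategies are all assumptions made for this example.

```python
import json

# Hypothetical miniature dictionary; a real file would hold the full word list.
SAMPLE_JSON = """
{
  "pen": "An instrument for writing with ink.",
  "pencil": "An instrument for writing with graphite.",
  "open": "Not closed."
}
"""


def lookup(entries, word, strategy="exact"):
    """Return the sorted headwords that match `word` under `strategy`."""
    if strategy == "exact":
        # Native key lookup: constant time on average.
        return [word] if word in entries else []
    predicates = {
        "prefix": lambda headword: headword.startswith(word),
        "suffix": lambda headword: headword.endswith(word),
        "substring": lambda headword: word in headword,
    }
    if strategy not in predicates:
        raise ValueError(f"unsupported strategy: {strategy}")
    matches = predicates[strategy]
    # Every other strategy scans all headwords, so cost grows with dictionary size.
    return sorted(headword for headword in entries if matches(headword))


entries = json.loads(SAMPLE_JSON)
print(lookup(entries, "pen", "exact"))      # ['pen']
print(lookup(entries, "pen", "prefix"))     # ['pen', 'pencil']
print(lookup(entries, "pen", "suffix"))     # ['open', 'pen']
print(lookup(entries, "pen", "substring"))  # ['open', 'pen', 'pencil']
```

Only the exact strategy maps onto a native key lookup. The others scan every headword unless extra indexes are built, which is where the additional response time and memory would come from.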