User talk:Amgine/Dump processing/test xml.php

Todos: --Nemo 12:39, 30 March 2014 (UTC)
 * After  add an else which adds the word to a $blacklist, print the array separately in a single txt file.
 * Check the word is single script? (Actually, each dictionary as a whole should be single script.)
 * Check the string is not confusable.
 * Moved your Spoofchecker::isSuspicious call up to the ns=0 check so that it covers both add2Dictionary calls.
 * Does the Spoofchecker cover the single-script and confusable checks? The problem is that many terms are normalized in other languages, and some languages have multiple writing systems, e.g. Japanese has 4 including Romaji. - Amgine (talk) 00:47, 2 April 2014 (UTC)
 * I'm not sure about the Spoofchecker; if the docs are correct, no it doesn't, but that and is a bit confusing. Currently I'm not even sure the setChecks call is working; I'll need to check what the actual effects are. We may add WHOLE_SCRIPT_CONFUSABLE if it doesn't remove too much stuff. --Nemo 07:20, 2 April 2014 (UTC)
 * From a quick glance at the results (new by me vs. old by Amgine), it seems it removed almost all non-Latin characters, which is good for Vietnamese (per Mxn) and OK for Serbo-Croatian (consistency makes at least one part happy) but nonsense for Russian. Will need to play a bit more with the options. --Nemo 09:47, 3 April 2014 (UTC)
 * Actually, no. Serbo-Croatian on en.WT includes Bosnian, Serbian, and Croatian, all of which are written in Gaj's Latin alphabet among other writing systems. - Amgine (talk) 15:01, 3 April 2014 (UTC)
 * Based on what I can find, here are some per-language rules we should create to reduce possible complaints:
 * bs - latinica script (Gaj's), optionally include the Serbian Cyrillic alphabet
 * hr - latinica only (Gaj's)
 * sh - Cyrillic script (Serbian) & latinica (Gaj's)
 * sr - Cyrillic script (Serbian), optionally include latinica (Gaj's)
 * - Amgine (talk) 15:38, 3 April 2014 (UTC)
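A rough sketch of how the per-language rules above could be combined with the blacklist todo. It is written in Python for illustration only and is not the script's actual code. Assumptions: the names `ALLOWED_SCRIPTS`, `scripts_of`, `is_acceptable`, `split_words` and `write_blacklist` are all hypothetical; the "optional" scripts are treated as allowed; and a character's script is approximated from the first word of its Unicode character name, which is a cruder test than the Spoofchecker's.

```python
import unicodedata

# Per-language script rules from the list above. Treating the "optional"
# scripts as allowed is an assumption; remove them for a stricter check.
ALLOWED_SCRIPTS = {
    "bs": {"LATIN", "CYRILLIC"},
    "hr": {"LATIN"},
    "sh": {"LATIN", "CYRILLIC"},
    "sr": {"CYRILLIC", "LATIN"},
}


def scripts_of(word):
    """Approximate the set of scripts in `word` from Unicode character names."""
    found = set()
    for ch in word:
        if not ch.isalpha():
            continue  # ignore digits, hyphens, apostrophes, combining marks
        name = unicodedata.name(ch, "")
        found.add(name.split(" ", 1)[0])  # e.g. "LATIN" or "CYRILLIC"
    return found


def is_acceptable(word, lang):
    """True if `word` uses exactly one script and `lang` allows that script."""
    scripts = scripts_of(word)
    # Languages without a rule get an empty allow-list, so every word fails.
    return len(scripts) == 1 and scripts <= ALLOWED_SCRIPTS.get(lang, set())


def split_words(words, lang):
    """Return (dictionary, blacklist): accepted words and rejected words."""
    dictionary, blacklist = [], []
    for word in words:
        (dictionary if is_acceptable(word, lang) else blacklist).append(word)
    return dictionary, blacklist


def write_blacklist(blacklist, path):
    """Write the rejected words to a single text file, one word per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(blacklist))
```

With these rules, `split_words(["čaj", "чај"], "hr")` keeps the Latin spelling and sends the Cyrillic one to the blacklist, while a word mixing a Cyrillic а into Latin letters is rejected for every language. This checks only the single-script rule; it does nothing about confusables within one script.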