User talk:Amgine/Dump processing/test xml.php

Todos:

After the if( !preg_match check, add an else branch that adds the word to a $blacklist, then print that array separately to a single txt file.
Check the word is single script? (Actually, each dictionary as a whole should be single script.) [1]
Check the string is not confusable.[2]

--Nemo 12:39, 30 March 2014 (UTC)
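A minimal sketch of the first todo, in Python rather than the PHP of the script itself. The rejection pattern, the function names and the file name are all hypothetical stand-ins, since the actual preg_match pattern is not quoted on this page. Words that do not match the pattern go to the dictionary, everything else lands in the blacklist, which is then written to one text file.

```python
import re

# Hypothetical stand-in for the script's preg_match pattern: it flags
# words containing digits, underscores, colons or slashes.
REJECT_PATTERN = re.compile(r"[\d_:/]")


def partition_words(words, pattern=REJECT_PATTERN):
    """Split words into (accepted, blacklist), mirroring
    `if (!preg_match(...)) { add } else { blacklist }`."""
    accepted, blacklist = [], []
    for word in words:
        if not pattern.search(word):
            accepted.append(word)
        else:
            blacklist.append(word)
    return accepted, blacklist


def write_blacklist(blacklist, path="blacklist.txt"):
    """Write the rejected words to a single text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in blacklist:
            handle.write(word + "\n")
```

Keeping the rejected words in their own file makes it easy to eyeball what the pattern throws away before trusting it.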

Moved your spoofchecker::isSuspicious up to the ns=0 check so it covers both add2Dictionary calls.
Does the spoofchecker cover the single-script and confusable checks? The problem is that many terms are normalized in other languages, and some languages have multiple writing systems, e.g. Japanese has 4 including Romaji. - Amgine (talk) 00:47, 2 April 2014 (UTC)

I'm not sure about the spoofchecker; if the docs are correct, no, it doesn't, but that "and" is a bit confusing.[3] Currently I'm not even sure the setChecks call is working; I'll need to check what the actual effects are. We may add WHOLE_SCRIPT_CONFUSABLE if it doesn't remove too much stuff. --Nemo 07:20, 2 April 2014 (UTC)

From a quick glance at the results (new by me vs. old by Amgine), it seems it removed almost all non-Latin characters, which is good for Vietnamese (per Mxn) and OK for Serbo-Croatian (consistency makes at least one part happy) but nonsense for Russian. Will need to play a bit more with the options. --Nemo 09:47, 3 April 2014 (UTC)

Actually, no. Serbo-Croatian on en.WT includes Bosnian, Serbian, and Croatian, all of which are written in w:Gaj's Latin alphabet among other writing systems. - Amgine (talk) 15:01, 3 April 2014 (UTC)

Based on what I can find, here are some per-language rules we should create to reduce possible complaints:

bs - latinica script (Gaj's), optionally include w:Serbian Cyrillic alphabet
hr - latinica only (Gaj's)
sh - Cyrillic script (Serbian) & latinica (Gaj's)
sr - Cyrillic script (Serbian), optionally include latinica (Gaj's)

- Amgine (talk) 15:38, 3 April 2014 (UTC)
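The rules above could be expressed as a small lookup table. This is only a sketch in Python, not the PHP script: the names SCRIPT_RULES, scripts_of and allowed are hypothetical, the optional scripts for bs and sr are left out, and the script of a letter is approximated by the first word of its Unicode character name, which is cruder than a real script property. It also assumes each word must itself be single script, as in the earlier todo.

```python
import unicodedata

# Allowed scripts per language code, following the list above.
# The "optionally include" scripts for bs and sr are omitted here.
SCRIPT_RULES = {
    "bs": {"LATIN"},
    "hr": {"LATIN"},
    "sh": {"CYRILLIC", "LATIN"},
    "sr": {"CYRILLIC"},
}


def scripts_of(word):
    """Approximate the set of scripts used by the letters of a word.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", ...) stands in for the script property.
    """
    scripts = set()
    for char in word:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split(" ", 1)[0])
    return scripts


def allowed(word, lang, rules=SCRIPT_RULES):
    """True if the word uses exactly one script and the language permits it."""
    found = scripts_of(word)
    return len(found) == 1 and found <= rules[lang]
```

With this table, allowed("kuća", "hr") holds while allowed("кућа", "hr") does not, and a mixed-script string is rejected even for sh, where both scripts are permitted.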