User talk:Amgine/Dump processing/test xml.php

From mediawiki.org

Todos:

  • After the if( !preg_match check, add an else branch that adds the word to a $blacklist array; print that array separately to a single txt file.
  • Check that the word is single-script? (Actually, each dictionary as a whole should be single-script.) [1]
  • Check the string is not confusable.[2]

--Nemo 12:39, 30 March 2014 (UTC)
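
A minimal sketch of the first todo, written in Python rather than the script's PHP so it can be run standalone. The names REJECT_PATTERN, sort_words and write_blacklist, the pattern itself, and the file name blacklist.txt are all hypothetical; the actual preg_match condition in test xml.php is not reproduced here.

```python
import re
from pathlib import Path

# Hypothetical rejection pattern: a word is rejected if it contains a digit,
# whitespace, underscore, colon or slash. The real script's pattern may differ.
REJECT_PATTERN = re.compile(r"[\d\s_:/]")


def sort_words(words):
    """Split words into (accepted, blacklist) instead of silently dropping rejects."""
    accepted, blacklist = [], []
    for word in words:
        if not REJECT_PATTERN.search(word):
            accepted.append(word)
        else:
            # The "else" branch from the todo: remember what was rejected.
            blacklist.append(word)
    return accepted, blacklist


def write_blacklist(blacklist, path="blacklist.txt"):
    """Write the rejected words to a single text file, one word per line."""
    Path(path).write_text("".join(w + "\n" for w in blacklist), encoding="utf-8")
```

Keeping the rejects in their own list makes it possible to review what the filter throws away, which is the point of printing them to a separate file.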

  • Moved your spoofchecker::isSuspicious up to the ns=0 check so it covers both add2Dictionary calls.
  • Does the spoofchecker cover both the single-script and the confusable checks? The problem is that many terms are normalized in other languages, and some languages have multiple writing systems, e.g. Japanese has 4 including Romaji. - Amgine (talk) 00:47, 2 April 2014 (UTC)
I'm not sure about the spoofchecker; if the docs are correct, no, it doesn't, but that "and" is a bit confusing.[3] Currently I'm not even sure the setChecks call is working; I'll need to check what the actual effects are. We may add WHOLE_SCRIPT_CONFUSABLE if it doesn't remove too much stuff. --Nemo 07:20, 2 April 2014 (UTC)
From a quick glance at the results (new by me vs. old by Amgine), it seems it removed almost all non-Latin characters, which is good for Vietnamese (per Mxn) and OK for Serbo-Croatian (consistency makes at least one part happy) but nonsense for Russian. Will need to play a bit more with the options. --Nemo 09:47, 3 April 2014 (UTC)
Actually, no. Serbo-Croatian on en.WT includes Bosnian, Serbian, and Croatian, all of which are written in w:Gaj's Latin alphabet among other writing systems. - Amgine (talk) 15:01, 3 April 2014 (UTC)
Based on what I can find, here are some per-language rules we should create to reduce possible complaints:
  • bs - latinica script (Gaj's), optionally include w:Serbian Cyrillic alphabet
  • hr - latinica only (Gaj's)
  • sh - Cyrillic script (Serbian) & latinica (Gaj's)
  • sr - Cyrillic script (Serbian), optionally include latinica (Gaj's)
- Amgine (talk) 15:38, 3 April 2014 (UTC)
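
The per-language rules above could be expressed as a lookup table, sketched here in Python rather than the script's PHP. Everything named below (SCRIPT_RULES, scripts_of, is_allowed, the include_optional flag) is hypothetical. The script of a letter is approximated by the first word of its Unicode character name, which is enough to tell Latin from Cyrillic but is not a substitute for Spoofchecker or the real Unicode Script property. Requiring each word to use exactly one script is an assumption taken from the single-script todo.

```python
import unicodedata

# Scripts accepted per language code, following the rules listed above.
# "optional" scripts are only accepted when the caller opts in.
SCRIPT_RULES = {
    "bs": {"default": {"LATIN"}, "optional": {"CYRILLIC"}},
    "hr": {"default": {"LATIN"}, "optional": set()},
    "sh": {"default": {"CYRILLIC", "LATIN"}, "optional": set()},
    "sr": {"default": {"CYRILLIC"}, "optional": {"LATIN"}},
}


def scripts_of(word):
    """Approximate the set of scripts used by the letters in word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # First token of the character name, e.g. "LATIN" or "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_allowed(word, lang, include_optional=False):
    """True if word uses exactly one script and that script is allowed for lang."""
    rule = SCRIPT_RULES[lang]
    allowed = set(rule["default"])
    if include_optional:
        allowed |= rule["optional"]
    used = scripts_of(word)
    return len(used) == 1 and used <= allowed
```

For example, is_allowed("човек", "hr") rejects a Cyrillic spelling for Croatian, while is_allowed("čovek", "sr", include_optional=True) accepts a latinica spelling for Serbian only when the optional script is enabled.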