User:TJones (WMF)/Notes/Language Detection Evaluation/Corpus Info

Language Identification Corpus Information
 %lang   %total  lang
 77.3%   41.3%   English
  5.5%    3.0%   Spanish
  2.6%    1.4%   Chinese
  2.5%    1.3%   Portuguese
  1.3%    0.7%   Arabic
  1.3%    0.7%   French
  1.2%    0.6%   Tagalog
  1.0%    0.6%   German
  0.8%    0.4%   Malay
  0.6%    0.3%   Russian
  0.6%    0.3%   Turkish
  0.5%    0.3%   Indonesian
  0.5%    0.3%   Persian
  0.5%    0.3%   Swahili
  0.4%    0.2%   Korean
  0.3%    0.1%   Bengali
  0.3%    0.1%   Bulgarian
  0.3%    0.1%   Hindi
  0.3%    0.1%   Italian
  0.3%    0.1%   Norwegian
  0.1%    0.1%   Croatian
  0.1%    0.1%   Dutch
  0.1%    0.1%   Estonian
  0.1%    0.1%   Finnish
  0.1%    0.1%   Greek
  0.1%    0.1%   Hmong
  0.1%    0.1%   Japanese
  0.1%    0.1%   Kannada
  0.1%    0.1%   Latin
  0.1%    0.1%   Polish
  0.1%    0.1%   Serbian
  0.1%    0.1%   Somali
  0.1%    0.1%   Swedish
  0.1%    0.1%   Tamil
  0.1%    0.1%   Thai
  0.1%    0.1%   Uzbek
 * 1452 zero-result queries
 * 775 (53.4%) are tagged as being in some language
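
The two percentage columns appear to be tied together by the two counts listed above. The sketch below assumes that %lang is a language's share of the 775 language-tagged queries and that %total is its share of all 1452 zero-result queries. The page does not state this outright, but the assumption reproduces the English row. The function name `lang_share_to_total_share` is made up for illustration.

```python
TOTAL_QUERIES = 1452   # zero-result queries in the corpus (from the list above)
TAGGED_QUERIES = 775   # queries tagged as being in some language

def lang_share_to_total_share(pct_lang: float) -> float:
    """Convert a %lang value to the approximate %total value.

    Assumes %lang is relative to TAGGED_QUERIES and %total is relative
    to TOTAL_QUERIES; this is an interpretation, not stated in the source.
    """
    return pct_lang * TAGGED_QUERIES / TOTAL_QUERIES

tagged_fraction = 100 * TAGGED_QUERIES / TOTAL_QUERIES
print(f"{tagged_fraction:.1f}%")                    # 53.4%
print(f"{lang_share_to_total_share(77.3):.1f}%")    # 41.3% (English)
```

Because the published percentages are rounded to one decimal place, values converted this way can differ from the table by 0.1 for the smaller languages.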

Tokens per Query
 number of tokens (total)
 469    1 tokens
 364    2 tokens
 213    3 tokens
 127    4 tokens
 86     5 tokens
 58     6 tokens
 40     7 tokens
 19     8 tokens
 23     9 tokens
 11     10 tokens
 9      11 tokens
 5      12 tokens
 2      13 tokens
 4      14 tokens
 3      15 tokens
 1      16 tokens
 4      17 tokens
 1      18 tokens
 1      19 tokens
 1      21 tokens
 1      23 tokens
 2      28 tokens
 2      30 tokens
 2      31 tokens
 1      33 tokens
 1      34 tokens
 1      61 tokens
 1      84 tokens

 number of tokens (lang)
 160    1 tokens
 152    2 tokens
 141    3 tokens
 91     4 tokens
 63     5 tokens
 49     6 tokens
 35     7 tokens
 18     8 tokens
 22     9 tokens
 10     10 tokens
 9      11 tokens
 3      12 tokens
 2      13 tokens
 4      14 tokens
 3      15 tokens
 1      16 tokens
 3      17 tokens
 1      18 tokens
 1      21 tokens
 1      23 tokens
 2      28 tokens
 2      30 tokens
 1      31 tokens
 1      34 tokens
 1      61 tokens

 number of tokens (non-lang)
 309    1 tokens
 212    2 tokens
 72     3 tokens
 36     4 tokens
 23     5 tokens
 9      6 tokens
 5      7 tokens
 1      8 tokens
 1      9 tokens
 1      10 tokens
 2      12 tokens
 1      17 tokens
 1      19 tokens
 1      31 tokens
 1      33 tokens
 1      84 tokens
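
A distribution like this is easier to compare across corpora once it is reduced to a few summary figures. The following is a minimal sketch that copies the "total" counts from above into a dictionary and derives the query count, the median query length, and the share of short queries. The names `TOTAL_HISTOGRAM`, `median_tokens`, and `share_at_most` are hypothetical helpers and not part of any tool used to produce these notes.

```python
# Query counts keyed by token count, copied from the "total" distribution above.
TOTAL_HISTOGRAM = {
    1: 469, 2: 364, 3: 213, 4: 127, 5: 86, 6: 58, 7: 40, 8: 19, 9: 23,
    10: 11, 11: 9, 12: 5, 13: 2, 14: 4, 15: 3, 16: 1, 17: 4, 18: 1,
    19: 1, 21: 1, 23: 1, 28: 2, 30: 2, 31: 2, 33: 1, 34: 1, 61: 1, 84: 1,
}

def total_queries(hist):
    """Number of queries represented by the histogram."""
    return sum(hist.values())

def median_tokens(hist):
    """Lower median: smallest token count whose cumulative count reaches half."""
    half = (total_queries(hist) + 1) // 2
    running = 0
    for tokens in sorted(hist):
        running += hist[tokens]
        if running >= half:
            return tokens
    raise ValueError("empty histogram")

def share_at_most(hist, max_tokens):
    """Fraction of queries with at most max_tokens tokens."""
    short = sum(count for tokens, count in hist.items() if tokens <= max_tokens)
    return short / total_queries(hist)

print(total_queries(TOTAL_HISTOGRAM))              # 1452
print(median_tokens(TOTAL_HISTOGRAM))              # 2
print(f"{share_at_most(TOTAL_HISTOGRAM, 3):.1%}")  # 72.0%
```

The same helpers can be applied to the lang and non-lang distributions by swapping in their counts.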