User:TJones (WMF)/Notes/Language Detection Evaluation/Corpus Info

From mediawiki.org

Language Identification Corpus Information

  • 1452 zero-result queries
  • 775 of them (53.4%) are tagged as being in some language
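Breakdown by language (%lang = share of the 775 language-tagged queries, %total = share of all 1452 queries):
%lang   %total  language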
 77.3%   41.3%  English
  5.5%    3.0%  Spanish
  2.6%    1.4%  Chinese
  2.5%    1.3%  Portuguese
  1.3%    0.7%  Arabic
  1.3%    0.7%  French
  1.2%    0.6%  Tagalog
  1.0%    0.6%  German
  0.8%    0.4%  Malay
  0.6%    0.3%  Russian
  0.6%    0.3%  Turkish
  0.5%    0.3%  Indonesian
  0.5%    0.3%  Persian
  0.5%    0.3%  Swahili
  0.4%    0.2%  Korean
  0.3%    0.1%  Bengali
  0.3%    0.1%  Bulgarian
  0.3%    0.1%  Hindi
  0.3%    0.1%  Italian
  0.3%    0.1%  Norwegian
  0.1%    0.1%  Croatian
  0.1%    0.1%  Dutch
  0.1%    0.1%  Estonian
  0.1%    0.1%  Finnish
  0.1%    0.1%  Greek
  0.1%    0.1%  Hmong
  0.1%    0.1%  Japanese
  0.1%    0.1%  Kannada
  0.1%    0.1%  Latin
  0.1%    0.1%  Polish
  0.1%    0.1%  Serbian
  0.1%    0.1%  Somali
  0.1%    0.1%  Swedish
  0.1%    0.1%  Tamil
  0.1%    0.1%  Thai
  0.1%    0.1%  Uzbek
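
The two percentage columns are the same per-language count divided by two different denominators. A minimal Python sketch of that arithmetic follows. The helper name `shares` is hypothetical, and the English count of 599 is an assumption inferred from the rounded percentages, not a figure given in the table.

```python
TOTAL_QUERIES = 1452   # zero-result queries in the corpus
TAGGED_QUERIES = 775   # queries tagged as being in some language


def shares(count, tagged=TAGGED_QUERIES, total=TOTAL_QUERIES):
    """Return (%lang, %total) for a language that has `count` tagged queries."""
    pct_lang = round(100 * count / tagged, 1)    # share of tagged queries
    pct_total = round(100 * count / total, 1)    # share of all queries
    return pct_lang, pct_total


# 599 is an inferred count (assumption), not a number stated in the source.
english = shares(599)
print(english)

# The tagged share reported in the summary comes from the same calculation.
tagged_share = round(100 * TAGGED_QUERIES / TOTAL_QUERIES, 1)
print(tagged_share)
```

Because the table only reports rounded values, converting %lang to %total by scaling the rounded number can be off by 0.1; working from counts avoids that.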

Tokens per Query

Number of queries by token count (all 1452 queries)
469     1 token
364     2 tokens
213     3 tokens
127     4 tokens
86      5 tokens
58      6 tokens
40      7 tokens
19      8 tokens
23      9 tokens
11      10 tokens
9       11 tokens
5       12 tokens
2       13 tokens
4       14 tokens
3       15 tokens
1       16 tokens
4       17 tokens
1       18 tokens
1       19 tokens
1       21 tokens
1       23 tokens
2       28 tokens
2       30 tokens
2       31 tokens
1       33 tokens
1       34 tokens
1       61 tokens
1       84 tokens
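
The histogram above can be summarized with a mean and a median. The following is a rough sketch, assuming the rows are read as "number of queries, then tokens per query"; the names `TOTAL_HIST`, `value_at_rank`, and `summarize` are hypothetical.

```python
# {tokens per query: number of queries}, copied from the table for all queries.
TOTAL_HIST = {
    1: 469, 2: 364, 3: 213, 4: 127, 5: 86, 6: 58, 7: 40, 8: 19, 9: 23,
    10: 11, 11: 9, 12: 5, 13: 2, 14: 4, 15: 3, 16: 1, 17: 4, 18: 1, 19: 1,
    21: 1, 23: 1, 28: 2, 30: 2, 31: 2, 33: 1, 34: 1, 61: 1, 84: 1,
}


def value_at_rank(hist, rank):
    """Return the token count of the query at 1-based `rank` in sorted order."""
    seen = 0
    for tokens in sorted(hist):
        seen += hist[tokens]
        if seen >= rank:
            return tokens
    raise ValueError("rank exceeds number of queries")


def summarize(hist):
    """Return (number of queries, mean tokens, median tokens)."""
    n = sum(hist.values())
    mean = sum(tokens * count for tokens, count in hist.items()) / n
    # For an even n the median averages the two middle queries.
    low = value_at_rank(hist, (n + 1) // 2)
    high = value_at_rank(hist, n // 2 + 1)
    return n, mean, (low + high) / 2


n, mean, median = summarize(TOTAL_HIST)
print(f"{n} queries, mean {mean:.2f} tokens, median {median:g} tokens")
```

The long tail (a few queries with 30 or more tokens) pulls the mean above the median, which is why both are worth reporting.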

Number of queries by token count (queries tagged with a language)
160     1 token
152     2 tokens
141     3 tokens
91      4 tokens
63      5 tokens
49      6 tokens
35      7 tokens
18      8 tokens
22      9 tokens
10      10 tokens
9       11 tokens
3       12 tokens
2       13 tokens
4       14 tokens
3       15 tokens
1       16 tokens
3       17 tokens
1       18 tokens
1       21 tokens
1       23 tokens
2       28 tokens
2       30 tokens
1       31 tokens
1       34 tokens
1       61 tokens

Number of queries by token count (queries not tagged with a language)
309     1 token
212     2 tokens
72      3 tokens
36      4 tokens
23      5 tokens
9       6 tokens
5       7 tokens
1       8 tokens
1       9 tokens
1       10 tokens
2       12 tokens
1       17 tokens
1       19 tokens
1       31 tokens
1       33 tokens
1       84 tokens
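
One way to compare the language-tagged and untagged histograms is the share of very short queries in each. The sketch below uses the rows exactly as listed; the names `LANG_HIST`, `NON_LANG_HIST`, and `share_at_most` are hypothetical. As listed, the language-tagged rows sum to 776, one more than the 775 reported in the summary, so the resulting percentages should be read as approximate.

```python
# {tokens per query: number of queries}, copied from the two tables above.
LANG_HIST = {
    1: 160, 2: 152, 3: 141, 4: 91, 5: 63, 6: 49, 7: 35, 8: 18, 9: 22,
    10: 10, 11: 9, 12: 3, 13: 2, 14: 4, 15: 3, 16: 1, 17: 3, 18: 1,
    21: 1, 23: 1, 28: 2, 30: 2, 31: 1, 34: 1, 61: 1,
}
NON_LANG_HIST = {
    1: 309, 2: 212, 3: 72, 4: 36, 5: 23, 6: 9, 7: 5, 8: 1, 9: 1,
    10: 1, 12: 2, 17: 1, 19: 1, 31: 1, 33: 1, 84: 1,
}


def share_at_most(hist, max_tokens):
    """Fraction of queries in `hist` that have at most `max_tokens` tokens."""
    n = sum(hist.values())
    short = sum(count for tokens, count in hist.items() if tokens <= max_tokens)
    return short / n


for name, hist in (("lang", LANG_HIST), ("non-lang", NON_LANG_HIST)):
    print(f"{name}: {share_at_most(hist, 2):.1%} of queries have at most 2 tokens")
```

Untagged queries skew much shorter than tagged ones, which fits the intuition that one- and two-token queries are often names or identifiers with no clear language.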