User:TJones (WMF)/Notes/Spaceless Writing Systems and Wiki-Projects

November 2016 — See TJones_(WMF)/Notes for other projects. (T149717)

Background
Because of concerns related to changes in queries associated with the switch from tf-idf to BM25 in Elasticsearch, we need to identify languages that don't use spaces between words in their writing systems.

I used the table of Wikimedia projects as the starting point for my analysis. Since every language has a Wikipedia, I reviewed primarily Wikipedia projects. When the Wikipedia project revealed something interesting, I occasionally looked at other projects (e.g., Javanese Wiktionary). For the most part, I did not investigate languages with no active projects (e.g., where there was a Wikipedia, but it has since been closed).

Spaceless Writing Systems
These languages/projects primarily use writing systems that do not use spaces between words. Spaces may be used between phrases or sentences, or not at all. The Nuosu/Yi (ii) Wikipedia has only a handful of articles and is crossed out in the table, but it doesn't have the official "This wiki has been closed" banner that others do. I believe it has been closed, but I'm not sure. Nuosu/Yi also doesn't use spaces.
 * Languages: Tibetan, Dzongkha, Gan, Japanese, Khmer, Lao, Burmese, Thai, Wu, Chinese, Classical Chinese, Cantonese
 * Codes: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical, zh-yue
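As a minimal sketch of why these projects need special handling, the snippet below stores the codes from the list above in a set and shows that splitting on whitespace yields a single token for a Thai phrase. The names SPACELESS_WIKI_CODES, is_spaceless_wiki, and whitespace_tokens are hypothetical helpers for illustration, not part of any existing search configuration.

```python
# Codes copied from the list above. The set and helper names are
# hypothetical, chosen only for this illustration.
SPACELESS_WIKI_CODES = frozenset({
    "bo", "dz", "gan", "ja", "km", "lo", "my", "th",
    "wuu", "zh", "zh-classical", "zh-yue",
})


def is_spaceless_wiki(lang_code: str) -> bool:
    """Return True if the project code is in the spaceless list above."""
    return lang_code.strip().lower() in SPACELESS_WIKI_CODES


def whitespace_tokens(text: str) -> list[str]:
    """Split text on whitespace only, with no language-specific analysis."""
    return text.split()


# A Thai phrase written without spaces stays one token,
# while its English equivalent splits into two.
thai_tokens = whitespace_tokens("ภาษาไทย")
english_tokens = whitespace_tokens("Thai language")
```

Here thai_tokens has one element and english_tokens has two, which is the kind of difference that can change how queries behave when no word segmentation is applied.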

Mixed Space/Spaceless Projects
These languages use a mix of writing systems across their projects, some with spaces between words and some without. Explanatory notes are below. The Buginese script (also called the Lontara alphabet) doesn't have spaces. Both Lontara and Latin are used on the Buginese Wikipedia, though Latin seems to predominate.
 * Languages: Buginese, Min Dong, Cree, Hakka, Javanese, Min Nan
 * Codes: bug, cdo, cr, hak, jv, zh-min-nan

The Cree Wikipedia uses a mix of Latin script and Cree syllabics for Cree, with at least some entries in English.

The Javanese script doesn't use spaces, but the Javanese Wikipedia uses the Javanese Latin script. The Javanese Wiktionary, however, has entries in both Latin and Javanese scripts.

The English Wikipedia page on the Chinese Wikipedia notes that three of the Chinese languages use both Chinese and Romanized writing systems. From my own investigation:
 * The Min Dong Wikipedia is written largely with Foochow Romanized (a Latin script), but a number of pages are in Chinese.
 * The Hakka Wikipedia is written largely with Pha̍k-fa-sṳ (a Latin script), but a number of pages are in Chinese.
 * The Min Nan Wikipedia is written largely with Pe̍h-ōe-jī (a Latin script), but with a small number of pages in Chinese.
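One rough way to check how much of a page on these mixed wikis is in Chinese characters rather than a Latin script is to measure the share of letters that fall in the Han ideograph blocks. The sketch below is an assumption-laden illustration: han_fraction and is_han are hypothetical names, and only the two most common ideograph blocks (U+4E00 to U+9FFF and U+3400 to U+4DBF) are checked, so rarer characters would be missed.

```python
def is_han(ch: str) -> bool:
    """True for characters in the two most common CJK ideograph blocks.

    Assumption: only CJK Unified Ideographs and Extension A are checked.
    """
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def han_fraction(text: str) -> float:
    """Fraction of letters in text that are Han ideographs (0.0 if no letters)."""
    # str.isalpha() skips digits, punctuation, spaces, and combining marks,
    # so Latin diacritics written as combining characters are not counted.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_han(ch)) / len(letters)


latin_share = han_fraction("Pe̍h-ōe-jī")   # Romanized text
chinese_share = han_fraction("白話字")       # Chinese characters
```

A page-level threshold on this fraction would be a judgment call; the point is only that the two kinds of page are easy to tell apart by script.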

Other Languages With Long, Highly-Inflected Words
The lists above don't address languages that are polysynthetic or agglutinative, where, in extreme cases, whole sentences can be packed into one word. An exhaustive list would require a more thorough investigation than just looking at orthography.

Without linguistic analysis (e.g., stemming, though these languages need much more complex analysis than that), these languages can be as difficult to search as spaceless languages.

As an example, Nahuatl is polysynthetic, and the word Nimitztētlamaquiltīz means "I shall make somebody give something to you." According to the gloss on enwiki, maki means "give", and seems to be realized as -maqui- in the word. Searching for "give" in Nahuatl would be... challenging, to say the least.
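To make the difficulty concrete, here is a small sketch comparing whole-token matching with plain substring matching on the Nahuatl word above. The variable names are hypothetical, and substring matching is used only to show where the morpheme sits inside the word; it is not a proposal for how search should work, and real support would need morphological analysis.

```python
# The word and morpheme come from the example above.
text = "Nimitztētlamaquiltīz"
query = "maqui"

# Whitespace tokenization plus lowercasing: the whole sentence-word
# becomes a single token.
tokens = text.lower().split()

# Whole-token matching, roughly what an unanalyzed index would offer.
token_match = query in tokens

# Substring matching finds the morpheme, but would also produce many
# false positives on a real corpus.
substring_match = any(query in token for token in tokens)
```

With this input, token_match is False and substring_match is True: the morpheme is present, but it is invisible to matching that treats the space-delimited word as the smallest unit.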

The languages listed below have Wikipedias and other projects, and are listed in English Wikipedia as being polysynthetic or agglutinative. This list is representative, but not exhaustive (i.e., these are generally languages with words that were long enough that I had to look them up to figure out what was going on).
 * Languages: Cherokee, Cheyenne, Cree, Inupiak, Inuktitut, Karakalpak, Greenlandic, Luganda, Malayalam, Nahuatl, Zulu
 * Codes: chr, chy, cr, ik, iu, kaa, kl, lg, ml, nah, zu