Topic on Project:Support desk

Recommended character set and collation?

2003:C2:3F04:7000:1972:2771:5008:44D (talkcontribs)

Howdy,


we have been running a wiki for more than 15 years now. Currently we are using MW 1.35 LTS, MySQL 8.0, PHP 7.2. Unfortunately the database tables and columns use an odd mixture of character sets (some latin1, some utf8) as well as various collations (latin1_bin, utf8_bin, utf8_general_ci, even some latin1_swedish_ci ... for whatever reason). This chaos happened long before I joined the wiki; simply nobody took care of it.
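To get an overview of the mess, a query like the following lists every column's charset and collation (a sketch, run against our own schema):

mysql> select table_name, column_name, character_set_name, collation_name
    ->   from information_schema.columns
    ->  where table_schema = database()
    ->    and character_set_name is not null
    ->  order by table_name, column_name;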


To clean up the mess, I am considering exporting all data and re-importing it with a consistent character set and collation. I believe I know how to do it, but I am wondering which character set and collation to use. What are the recommended defaults?
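For reference, the route I have in mind is roughly the following (a sketch, not a tested recipe; database and file names are placeholders). Dumping with --default-character-set=binary should avoid any transcoding on the way out:

$ mysqldump --default-character-set=binary --single-transaction wikidb > dump.sql
  (adjust the CHARSET/COLLATE clauses in dump.sql)
$ mysql wikidb_new < dump.sql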


For test purposes, I created a new empty wiki using install.php, but I don't understand the result: the tables are created with character set 'binary' (except searchindex, which is 'utf8'), and the columns' collation is NULL. Could this be the default?
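For completeness, this is how I inspected the table-level setting (schema name 'testwiki' is a placeholder):

mysql> select table_name, table_collation
    ->   from information_schema.tables
    ->  where table_schema = 'testwiki';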

Bawolff (talkcontribs)

Generally yes: using the binary character set with a NULL collation is the default. Some people alternatively use utf8mb4 as the charset, and a Unicode collation like utf8mb4_0900_as_cs (or a regional variant).
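E.g. a table created with that alternative would look like this (a sketch; table and column names are made up, and MySQL 8.0 also ships regional variants such as utf8mb4_de_pb_0900_as_cs):

mysql> create table demo (title varchar(255) not null)
    ->   engine=InnoDB default charset=utf8mb4 collate=utf8mb4_0900_as_cs;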


You might be able to use ALTER statements to change this without reimporting, but you have to be careful that none of the current data gets converted, and I'm not sure how ALTER handles that. Whatever you do, keep backups and have test pages with accents and emojis in their titles to ensure they don't break.
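Something like this makes a quick smoke test after each step (a sketch; it assumes a test page with an emoji in its title exists, and a utf8mb4 client connection):

mysql> select page_id, page_title from page
    ->  where page_namespace = 0 and page_title like '%😀%';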

2003:C2:3F04:7000:653A:4544:CCD8:41F4 (talkcontribs)

Luckily we do have a complete test wiki. I will have to perform some more tests. Meanwhile thanks for your answers!

2003:C2:3F04:7000:BD3F:81F4:1009:92AD (talkcontribs)

Difficult task. Here are some intermediate results.


First I tried ALTER TABLE to binary. Maybe this would work with the MediaWiki application, but there is one reason why I did not continue to explore it: binary means we can no longer read the contents of text columns in the MySQL shell. Example:

mysql> alter table category convert to character set binary;

mysql> select * from category limit 1;
+--------+----------------------+-----------+-------------+-----------+
| cat_id | cat_title            | cat_pages | cat_subcats | cat_files |
+--------+----------------------+-----------+-------------+-----------+
| 151434 | 0x4C697374656E       |       599 |           3 |         0 |
+--------+----------------------+-----------+-------------+-----------+
1 row in set (0.00 sec)


The result is not practical for me, so I stopped here.
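(To be fair, the values can still be decoded on the fly with CONVERT(), but that is too clumsy for everyday use. A sketch:)

mysql> select cat_id, convert(cat_title using utf8mb4) as cat_title
    ->   from category limit 1;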

The next idea was ALTER TABLE to utf8mb4. This might work with MediaWiki (I don't know), but once more it did not solve my original problem within the MySQL shell: special characters, e.g. German umlauts, appear as cryptic glyphs in the shell, before the ALTER TABLE as well as after.

mysql> alter table category convert to character set utf8mb4;

mysql> select * from category where cat_id=151642;
+--------+-----------+-----------+-------------+-----------+
| cat_id | cat_title | cat_pages | cat_subcats | cat_files |
+--------+-----------+-----------+-------------+-----------+
| 151642 | Städte   |      2505 |           0 |         0 |
+--------+-----------+-----------+-------------+-----------+
1 row in set (0.00 sec)

mysql> select * from category where cat_title='Städte';
Empty set (0.00 sec)

mysql> select * from category where cat_title like 'St%dte';
+--------+-----------+-----------+-------------+-----------+
| cat_id | cat_title | cat_pages | cat_subcats | cat_files |
+--------+-----------+-----------+-------------+-----------+
| 151642 | Städte   |      2505 |           0 |         0 |
+--------+-----------+-----------+-------------+-----------+
1 row in set (0.00 sec)


To solve the special characters issue, it seems to be necessary to first convert to binary and then in a second step to utf8mb4.

mysql> alter table category convert to character set binary;
mysql> alter table category convert to character set utf8mb4;
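Why the detour through binary works can be seen on a single literal: 0xC3A4 is the UTF-8 byte sequence for 'ä'. Converting straight from latin1 re-encodes the two latin1 characters Ã and ¤ (double encoding), while the step through binary lets utf8mb4 reinterpret the original bytes (a sketch):

mysql> select convert(convert(0xC3A4 using latin1) using utf8mb4);  -- 'ä' (double-encoded)
mysql> select convert(0xC3A4 using utf8mb4);                        -- 'ä' (bytes reinterpreted)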

This is only half the battle, though. The conversion to binary changes the columns' data types to varbinary, so the second ALTER only sets the table's defaults. It is necessary to restore the columns:

mysql> alter table category modify cat_title varchar(255) character set utf8mb4 not null;

Now everything is fine within the MySQL shell:

mysql> select * from category where cat_title='Städte';
+--------+-----------+-----------+-------------+-----------+
| cat_id | cat_title | cat_pages | cat_subcats | cat_files |
+--------+-----------+-----------+-------------+-----------+
| 151642 | Städte    |      2505 |           0 |         0 |
+--------+-----------+-----------+-------------+-----------+
1 row in set (0.00 sec)


The double conversion is not my invention; I found it on the web:

//nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/

Other users seem to have the same problem. It is a devil of a job to manually convert each single table and column, especially because we would have to restore every single column attribute, which is very error-prone. Therefore the script this guy developed might be useful (//github.com/nicjansma/mysql-convert-latin1-to-utf8).
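By the way, such ALTER statements can also be generated from information_schema instead of typing them by hand (a sketch; the generated statements ignore defaults, comments and indexes, so review them before running):

mysql> select concat('alter table `', table_name, '` modify `', column_name, '` ',
    ->               column_type, ' character set utf8mb4',
    ->               if(is_nullable = 'NO', ' not null', ''), ';') as stmt
    ->   from information_schema.columns
    ->  where table_schema = database()
    ->    and character_set_name = 'latin1';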

Currently I am running some tests with this script and am still having issues. There are a few duplicate keys which I need to eliminate before doing the conversion; not a big issue. Some of the columns seem to be left unchanged, for whatever reason. The drama goes on ...

Jonathan3 (talkcontribs)

Thanks for reporting back! I'm reading with interest.

2003:C2:3F04:7000:E9C9:486C:B6FB:98EA (talkcontribs)

Conversion of the test wiki succeeded! We had some duplicate keys in the page and pagelinks tables which I had to clean up before I could finish the task (think of page names like "Cäsar" and "CÄSAR"). Another issue was the searchindex table. My solution: drop the table before running the conversion script; after the conversion, import an empty searchindex table, modify its charset/collation and recreate the index with rebuildtextindex.php.
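In case someone needs to hunt such duplicates: they can be listed before the conversion with something like this (a sketch; utf8mb4_general_ci stands in for whichever case-insensitive collation causes the collisions):

mysql> select convert(page_title using utf8mb4) collate utf8mb4_general_ci as title,
    ->        count(*) as n
    ->   from page
    ->  where page_namespace = 0
    ->  group by title
    -> having n > 1;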


Now we are testing, testing, testing.


All tables and all columns with old character sets and collations have been changed. Great script. Now there is only one small thing left: newer tables, e.g. actor, use only binary types (even though $wgDBTableOptions = ENGINE=InnoDB, DEFAULT CHARSET=utf8 ...!). I'm turning to other things now and may get back to the binary columns another time.
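If I ever pick this up again, the remaining binary columns can be listed like this (a sketch):

mysql> select table_name, column_name, column_type
    ->   from information_schema.columns
    ->  where table_schema = database()
    ->    and data_type in ('binary', 'varbinary', 'tinyblob', 'blob', 'mediumblob', 'longblob');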

2003:C2:3F04:7000:20F1:2D93:3B5D:D6BD (talkcontribs)

One last hint for whoever may be interested in the results. The above-mentioned script changed the old_text column of the text table from mediumtext latin1 to mediumtext utf8mb4, just as intended. This has one little disadvantage: the search/replace feature can then no longer handle REGEXP. The result is always "Error 3995: Character set [...] cannot be used in conjunction with 'binary' in call to regexp_like".


Solution: ALTER TABLE `ppdb`.`text` MODIFY `old_text` mediumblob NOT NULL;

2003:C2:3F30:A000:55E0:6D6D:CE8:3CD5 (talkcontribs)

And one more hint: if someone decides to follow the above-mentioned procedure, you should take a look at Topic:Wxuxqpbsg2o7c1l3.

Valerio Bozzolan (talkcontribs)

Should this information (preferred charset = binary / utf8mb4_0900_as_cs) be more visible on the MediaWiki installation page?

2003:C2:3F21:FD00:682B:745E:5DA7:60C2 (talkcontribs)

No. This is a very individual solution for our wiki; I cannot recommend it for general usage.

E.g. we have now upgraded to MediaWiki 1.39, and the update.php script that we are forced to run (otherwise the installation is not complete) modifies lots of database columns back to varbinary. It shouldn't do that, but it does.

See Topic:Y0o7zgslstk0stfi
