I'm tying myself up in knots trying to get UTF-8, PHP and MySQL working together while upgrading an existing latin1 database to UTF-8.
Original database settings:
Database: latin1_swedish_ci
Tables and fields: latin1_swedish_ci
New settings:
Database: utf8_unicode_ci
Tables and fields: utf8_unicode_ci
My PHP code, regardless of the database settings, has always set the server headers and HTML META tag to UTF-8:
@header('Content-type: text/html; charset=utf-8');
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
In both cases I had and still have names displayed as:
ÄŽuroviÄová, NataÅ¡a
Böhm, Steffan
Böhme, Gernot
Böttcher, Niels
Balázs, Bela
Chateau, Nöel
Clair, René
Dahan, Kévin
Fernández-Vara, Clara
Gärdenfors, Dan
Güttler, Christian
Gröhn, Matti
Grönlund, Bo
Hörberg, Ulf
Jørgensen, Kristine
Järvinen, Aki
Kücklich, Julian
Keller, Damián
Kindström, Mattias
Lévy, Pierre
Laliberté, Martin
Mäyrä, Frans
Penz, François
Röber, Niklas
Sánchez, Jamie
Sunnanå, Lise
Théberge, Paul
Västfjäll, Daniel
Weske, Jörg
Zagal, José P.
Zagal, José Pablo
These were originally stored via PHP/form entry in the latin1 database and, because of the server header and meta tags above, I had no problems displaying the UTF-8 characters above correctly in the web browser.
However, because the MySQL character sets/collation were not utf8, the names above were not correctly ordered -- this was particularly noticeable with any name starting with a non-ASCII character, which was displayed right at the start of any list. Hence the wish to move to a UTF-8 database (although I understand there are still problems with REGEXP and UTF-8).
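A minimal sketch of why that round trip can work, and of what can go wrong once the connection starts converting. It assumes the stored bytes are really UTF-8 sitting in latin1 columns, that MySQL's latin1 behaves like Windows-1252, and that the mbstring extension is available; the variable names are illustrative only:

```php
<?php
// "Böhm" as the UTF-8 bytes the form would have submitted (ö = C3 B6).
$submitted = "B\xC3\xB6hm";

// If those bytes are later treated as Windows-1252 text and converted to
// UTF-8, each byte of "ö" becomes its own character: "Ã" and "¶".
$doubleEncoded = mb_convert_encoding($submitted, 'UTF-8', 'Windows-1252');
echo bin2hex($doubleEncoded), "\n"; // 42c383c2b6686d

// Converting back the other way recovers the original bytes.
$recovered = mb_convert_encoding($doubleEncoded, 'Windows-1252', 'UTF-8');
var_dump($recovered === $submitted); // bool(true)
```

Text of the "BÃ¶hm" kind in the browser would be consistent with this sort of double conversion, though it does not prove it.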
In addition to the changes noted above for the database, tables and fields, my PHP scripts, following database connection, now send the following commands to the MySQL server:
SET NAMES utf8
SET CHARACTER SET utf8
I have verified (I believe) that the names above (including the first one in the list which seems problematic -- see below) in the database do contain utf8 characters with the following code:
public function detectUtf8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}
??urovi?ová, Nataša
Böhm, Steffan
Böhme, Gernot
Böttcher, Niels
Balázs, Bela
Chateau, Nöel
Clair, René
Dahan, Kévin
Fernández-Vara, Clara
Gärdenfors, Dan
Güttler, Christian
Gröhn, Matti
Grönlund, Bo
Hörberg, Ulf
Jørgensen, Kristine
Järvinen, Aki
Kücklich, Julian
Keller, Damián
Kindström, Mattias
Lévy, Pierre
Laliberté, Martin
Mäyrä, Frans
Penz, François
Röber, Niklas
Sánchez, Jamie
Sunnanå, Lise
Théberge, Paul
Västfjäll, Daniel
Weske, Jörg
Zagal, José P.
Zagal, José Pablo
Almost there, but the first name is still not correct -- the surname should be Ďurovičová.
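One possible explanation for why only this name breaks, sketched under several assumptions: that the data comes back double-encoded, that MySQL's latin1 is really Windows-1252 (as the MySQL manual describes it), and that the list above was produced with utf8_decode(), which only knows ISO-8859-1. mb_convert_encoding() stands in for utf8_decode() here, and the variable names are illustrative:

```php
<?php
// "Ď" as UTF-8 bytes. Its second byte, 8E, is "Ž" in Windows-1252,
// a character that ISO-8859-1 does not have.
$original = "\xC4\x8E";

// Double-encode it the way a Windows-1252 -> UTF-8 conversion would:
// C4 becomes "Ä" (C3 84) and 8E becomes "Ž" (C5 BD).
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo bin2hex($doubleEncoded), "\n"; // c384c5bd

// Undoing it through ISO-8859-1 loses "Ž", which becomes '?' (3F),
// leaving a broken byte followed by a question mark.
$viaLatin1 = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
echo bin2hex($viaLatin1), "\n"; // c43f

// Undoing it through Windows-1252 recovers the original two bytes.
$viaCp1252 = mb_convert_encoding($doubleEncoded, 'Windows-1252', 'UTF-8');
echo bin2hex($viaCp1252), "\n"; // c48e
```

Characters such as ö or é only involve bytes in the A0-FF range, where ISO-8859-1 and Windows-1252 agree, which may be why the other names survive.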
If the PHP-originated commands sent to the MySQL server are:
SET NAMES latin1
SET CHARACTER SET latin1
and the PHP scripts have no utf8_decode() on the MySQL server output, then all the names are correctly displayed (albeit the order is wrong -- see below). I haven't yet tried storing PHP/Form input in the database with the new settings (I'm just reading back to the browser) so combining latin1 and utf8 may present problems.
It is worth noting that, although MySQL sorting with the new UTF-8 collation has changed the order somewhat from the latin1 collation, the names are still not correctly ordered. For example, the first one in the list above comes at the end of 'A' and the 'B-umlaut...' names appear before 'Ba...'.
Any help would be most appreciated.