PHP, MySQL, UTF-8

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

PHP, MySQL, UTF-8

Post by sirfragalot »

Hi all,

I'm tying myself up in knots trying to get UTF-8, PHP and MySQL working fine while attempting to upgrade an existing latin1 database to utf-8.

Original database settings:
Database: latin1_swedish_ci
Tables and fields: latin1_swedish_ci

New settings:
Database: utf8_unicode_ci
Tables and fields: utf8_unicode_ci

My PHP code, regardless of the database settings has always set the server headers and HTML META tag to utf8:
@header('Content-type: text/html; charset=utf-8');
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In both cases I had and still have names displayed as:
Ďurovičová, Nataša
Böhm, Steffan
Böhme, Gernot
Böttcher, Niels
Balázs, Bela
Chateau, Nöel
Clair, René
Dahan, Kévin
Fernández-Vara, Clara
Gärdenfors, Dan
Güttler, Christian
Gröhn, Matti
Grönlund, Bo
Hörberg, Ulf
Jørgensen, Kristine
Järvinen, Aki
Kücklich, Julian
Keller, Damián
Kindström, Mattias
Lévy, Pierre
Laliberté, Martin
Mäyrä, Frans
Penz, François
Röber, Niklas
Sánchez, Jamie
Sunnanå, Lise
Théberge, Paul
Västfjäll, Daniel
Weske, Jörg
Zagal, José P.
Zagal, José Pablo

There were stored via PHP/form entry originally in the latin1 database and, because of the server header and meta tags above, I had no problems displaying the UTF-8 characters above correctly in the web browser.

However, becase the MySQL character sets/collation were not utf8, the names above were not correctly ordered -- particularly noticeable with any non-ASCII character at the start of the name leading the name to be displayed right at the start of any list. Hence the wish to move to a UTF-8 database (although I understand there are still problems with REGEXP and UTF-8).

In addition to the changes noted above for the database, tables and fields, my PHP scripts, following database connection, now send the following commands to the MySQL server:
SET NAMES utf8
SET CHARACTER SET utf8

I have verified (I believe) that the names above (including the first one in the list which seems problematic -- see below) in the database do contain utf8 characters with the following code:

Code: Select all

   public function detectUtf8($string)
    {
            return preg_match('%(?:
            [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
            )+%xs', $string);
    }
Without using utf8_decode() on the name from the database, names are displayed in the web browser as in the list above, using utf8_decode(), names are displayed as below:

??urovi?ová, Nataša
Böhm, Steffan
Böhme, Gernot
Böttcher, Niels
Balázs, Bela
Chateau, Nöel
Clair, René
Dahan, Kévin
Fernández-Vara, Clara
Gärdenfors, Dan
Güttler, Christian
Gröhn, Matti
Grönlund, Bo
Hörberg, Ulf
Jørgensen, Kristine
Järvinen, Aki
Kücklich, Julian
Keller, Damián
Kindström, Mattias
Lévy, Pierre
Laliberté, Martin
Mäyrä, Frans
Penz, François
Röber, Niklas
Sánchez, Jamie
Sunnanå, Lise
Théberge, Paul
Västfjäll, Daniel
Weske, Jörg
Zagal, José P.
Zagal, José Pablo

Almost there but the first name is still not correct -- the surname should be ?urovi?ová.

If the PHP-originated commands sent to the MySQL server are:
SET NAMES latin1
SET CHARACTER SET latin1

and the PHP scripts have no utf8_decode() on the MySQL server output, then all the names are correctly displayed (albeit the order is wrong -- see below). I haven't yet tried storing PHP/Form input in the database with the new settings (I'm just reading back to the browser) so combining latin1 and utf8 may present problems.

It is worth noting that, although MySQL sorting with the new UTF-8 collation has changed the order somewhat from latin1 collation, the names are still not correctly ordered. For example, the first one in the list above comes at the end of 'A' and the 'B-umlaut...' names appear before 'Ba...'

Any help would be most appreciated.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

To narrow it down to a small test case, can you do the following check on the field with "Böhm, Steffan", replacing @a with the actual field name. This is the output you should get:

Code: Select all

 
mysql> SELECT @a, charset(@a), hex(@a), char_length(@a), length(@a)\G
*************************** 1. row ***************************
             @a: Böhm, Steffan
    charset(@a): utf8
        hex(@a): 42C3B6686D2C205374656666616E
char_length(@a): 13
     length(@a): 14
 
It looks like somewhere in the conversion, utf8 data was processed as latin1. This fits with the behavior you describe, including the disappearance of "?", which isn't in latin1.
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

Hi Dml,

Thanks for taking the time to respond.

I have managed to get a little bit further. By reading each Latin1 field entry, using utf8_decode(), then re-inserting the value into the field AFTER changing the charset/collation of the table to utf8/utf8_unicode_ci, I can now get the sorting of names with MySQL correct. Furthermore, most characters now display correctly on the web browser with the exception of the "?" and other characters which seem to be close equivalents to single or double quotes (the database is full of entries copied and pasted and entered via the web form from all sorts of programs). This method introduces a further problem in that wherever the mysql insert comes across such a problem character, the field is truncated at that point.

This is all for an OS PHP script I write so needs a solution that will handle a range of latin1 databases around the world. I need to be able to write a generalized PHP script to upgrade these existing databases when people upgrade the PHP script -- I cannot come up with a solution that just handles the problem characters in my database.

I discovered that re-inserting the database field values after converting the character set and collation of tables was the solution by editing my latin1 SQL dump's charsets. e.g.:

Code: Select all

DROP TABLE IF EXISTS `WKX_creator`;
CREATE TABLE `WKX_creator` (
  `id` int(11) NOT NULL auto_increment,
  `surname` varchar(255) default NULL,
  `firstname` varchar(255) default NULL,
  `initials` varchar(255) default NULL,
  `prefix` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=631 CHARSET=utf8;
 
INSERT INTO `WKX_creator` VALUES (1,'Aarseth','Espen',NULL,NULL),(2,'Smedstad','Solveig Marie',NULL,NULL),(3,'Sunnanå','Lise',NULL,NULL),  etc. etc. etc.
Importing this edited dump into the database led to absolutely flawless results in my PHP script -- correct MySQL ordering and correct display of all characters including the 'problem' characters above. However, as mentioned above, I cannot expect all users of my software to manually upgrade their database and have been trying to recreate the effects of the edited SQL dump in a PHP script with the limited success described above.

Anyway, with your test (on the database that has had its tables converted to utf8 and into which I have dumped the utf8-edited SQL dump described above -- i.e. it all works brilliantly in my software but I can't use this method as explained above), I get the following (hex(Böhm) is different to yours because I only have the surname in the field):

creatorSurname Böhm ?urovi?ová
charset(`creatorSurname`) utf8 utf8
hex(`creatorSurname`) 42C3B6686D C48E75726F7669C48D6F76C3A1
char_length(`creatorSurname`) 4 10
length(`creatorSurname`) 5 13

Using the unedited Latin1 SQL dump and changing the collation to utf8 but without re-inserting database fields (the browser displays all UTF8 characters incorrectly as in the first list in my first post), I get the following:

creatorSurname Böhm Ďurovičová
charset(`creatorSurname`) utf8 utf8
hex(`creatorSurname`) 42C383C2B6686D C384C5BD75726F7669C384C28D6F76C383C2A1
char_length(`creatorSurname`) 5 13
length(`creatorSurname`) 7 19

Finally, using the unedited Latin1 SQL dump, changing the collation to UTF8 and reinserting all fields with the utf8_decode() value, I get the following:

creatorSurname Böhm
charset(`creatorSurname`) utf8 utf8
hex(`creatorSurname`) 42C3B6686D
char_length(`creatorSurname`) 4 0
length(`creatorSurname`) 5 0

Note that the second surname did not get written at all or at least is blank (other elements for that row were written and display correctly such as the firstname Nataša). This is the insertion truncation at problem characters I mentioned above.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

It looks like the database dump is in utf8 but it's getting read in as latin1. This should be fixable by inserting "set names utf8" on top of the dump file. A better solution is not to let the characters get outside the database at all - use ALTER TABLE CONVERT TO CHARACTER SET.

The utf8_decode method is likely to lead to more problems, as it converts to iso-8859-1, which isn't the same as Mysql's latin1 - the differences include the quotation mark variants you mention.

I'm puzzled by how the "Ď" character got into your original latin1 data, as it doesn't seem to be included in that character set.
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

The answer about the "Ď" may be because I've always had my PHP script's headers set to UTF-8 even when storing form data in a latin1 database.

However, with that database dump file (which does have 'set names utf8' at the top) everything works perfectly including the mysterious "Ď". However, I _cannot_ use the dump file method -- I have to find a solution in PHP that will handle whatever database is thrown at it.

I'm getting closer to a solution in the PHP script.

The sequence I have so far is:

SET NAMES <whatever the original charset> // probably latin1
read data from table // NB no utf8 encoding/decoding
SET NAMES utf8
ALTER TABLE <character set + collation to utf8>
write data to table
.
.
.
repeat for next table.

I no longer get field truncation with problem characters but still have the "Ď" showing up as '?'.

I've just been looking through this:
http://codex.wordpress.org/Converting_D ... nd_Plugins

which describes a similar problem and proposes a solution. I'll give that a try and report back here.

I know it is possible to get it all working because the utf8-edited SQL dump works perfectly in all respects.
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

Well, the advice given at the link I posted above seems to work even with my problem characters.

However....

I've noticed that I have one field (of type TEXT) that has the following in the original latin1 database:

"Williams equates the sudden success of 1929 talkies (after 3 decades of recorded sound syncronisation attempts) to the contemporary decline in the public\'s taste for melodrama. Sound brought about the decline of German Expressionism.

Also, some discussion of linguistic hegemony (English) brought about by sound cinema. One of the reasons, apparently, why the advent of sound did not change the Hollywood model of cinema that much but did dramatically change European cinema. (Taken up in the next essay by Nataša &#270;urovi&#269;ová.)"

Somehow unicode(?) seems to have got in there (source can be copy 'n' pasted from anywhere).

After conversion to utf8 as in the link I used (from TEXT to BLOB and back to TEXT again), that particular field (and I haven't noticed any others truncating) truncated on the "š" producing:

"Williams equates the sudden success of 1929 talkies (after 3 decades of recorded sound syncronisation attempts) to the contemporary decline in the public\'s taste for melodrama. Sound brought about the decline of German Expressionism.

Also, some discussion of linguistic hegemony (English) brought about by sound cinema. One of the reasons, apparently, why the advent of sound did not change the Hollywood model of cinema that much but did dramatically change European cinema. (Taken up in the next essay by Nata"

Any ideas?
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

So if I understand correctly, the data was encoded as utf8-on-top-of-latin1, ie what happens when mysql is set up with its latin1 defaults, but the web headers are set up with utf8. This can indeed be fixed by dumping to binary and reading in as utf8, as you've done.

This technique should have worked for "Nataša" as well, assuming that everything was in utf8-on-top-of-latin1. Can you do a hex(substring(...)) on the field, with the substring() params set up to extract the "Nataša" part? You should expect to see the following, with the problem character encoded as c5a1:

Code: Select all

 
mysql> SELECT @a, hex(@a)\G
*************************** 1. row ***************************
     @a: Nataša
hex(@a): 4E617461C5A161
1 row IN SET (0.00 sec)
If the š is encoded as something other than c5a1, for example if it's encoded as 9A as it is in cp1252, then that would have made the utf8 processor stop at that point.
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

Hi dml,

Thanks very much for your patience with this.

Your understanding is correct. utf8 encoded on top of latin with web headers set to utf8 has been how I've supported a UTF8-like environment in pre-v4.2.1 MySQL servers.

Using PHPMyAdmin so I can see the web browser display immediately:

In my latin1 database, the author name shows as (i.e. it was stored like this as explained above but my PHP scripts displayed it correctly):
surname Ďurovičová
firstname Nataša

In the converted utf8 it now shows correctly as:
creatorSurname Ďurovičová
creatorFirstname Nataša

In the latin1 database, the block of text shows as (double quotes mine):
"Williams equates the sudden success of 1929 talkies (after 3 decades of recorded sound syncronisation attempts) to the contemporary decline in the public\'s taste for melodrama. Sound brought about the decline of German Expressionism.

Also, some discussion of linguistic hegemony (English) brought about by sound cinema. One of the reasons, apparently, why the advent of sound did not change the Hollywood model of cinema that much but did dramatically change European cinema. (Taken up in the next essay by Nataša &#270;urovi&#269;ová.)"

and in the converted utf8 database it shows as:
"Williams equates the sudden success of 1929 talkies (after 3 decades of recorded sound syncronisation attempts) to the contemporary decline in the public\'s taste for melodrama. Sound brought about the decline of German Expressionism.

Also, some discussion of linguistic hegemony (English) brought about by sound cinema. One of the reasons, apparently, why the advent of sound did not change the Hollywood model of cinema that much but did dramatically change European cinema. (Taken up in the next essay by Nata"

Ineterestingly, I used PHPMyAdmin to manually complete the steps for utf8 conversion (TEXT to BLOB etc.) that I have in my PHP script and ended up with the same truncation following 'Nata' (so presumably my PHP script is not at fault).

Using this on the latin1 database field:
SELECT substring(`text`, locate('Nat', `text`), 6), hex(substring(`text`, locate('Nat', `text`), 6)) FROM `wkx_resource_note` WHERE `id` = '110'

I get:
substring(`text`, locate('Nat', `text`), 6) Nataša
hex(substring(`text`, locate('Nat', `text`), 6)) 4E6174619A61

So, it does seem to be cp1252 -- like I said, form input can be copied and pasted from anywhere.

Is there anyway around this?
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

It's a tricky one. What you have to hope for is that there's a pattern, for example the author name field is always utf8 and the description field is always latin1. If some of the descriptions are in latin1 and others are in utf8, then it's a lot harder to automate the conversion, and it has to be based on some heuristic for guessing the encoding based on expected patterns of characters.

Just to get an idea if the "Nataša" problem is turning up for other rows, you can do a search for fields for which the conversion didn't work: convert(convert(a using binary) using utf8) replicates dumping to the blob and reading back into utf8, and then you compare the binary representation of the original field.

Code: Select all

 
SELECT a FROM t WHERE
hex(a)<>
hex(convert(convert(a USING BINARY) USING utf8))
 
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

I can't hope for any pattern. Form data can come from anywhere and I have to be able to cope with this upgrade from a PHP script that will handle not just my database but many others. I need the generic solution.

I'll try your convert() test tomorrow. It might be a way to test for corrupted fields and to alert my PHP script users at upgrade or even, this would be good, to pass that section of the field unconverted during the UTF-8 upgrade.
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

OK - done the convert test on the latin1 database:

Code: Select all

SELECT `resourcenoteText` 
FROM `wkx_resource_note` 
WHERE hex( `resourcenoteText` ) <> hex( convert( convert( `resourcenoteText` 
USING BINARY ) 
USING utf8 ) )
Surprisingly, I get an empty result set (`resourcenoteText` is the field which has the truncated 'Nata' problem). Just to confirm, the problem field is still there ending in "...Nataša &#270;urovi&#269;ová.)".

I was hoping I might be able to identify fields with potential problems, store them, convert all fields/tables to utf8 then update those stored fields with the original values. Obviously not.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

Yep, tried it myself and it doesn't work. So much for that. Here's another angle: the "Nataša" field breaks the rules for a valid sequence of utf8 bytes, and that invalidity can be checked for automatically in PHP or using a mysql stored function (assuming you're using mysql5). Of course, there are caveats - just because it's not invalid by this test doesn't mean it doesn't have other errors. Here's a rough example of a stored function that finds invalid sequences - if you do select where is_invalid_utf8(field) it might catch "Nataša" and other problem entries.

Code: Select all

 
DROP FUNCTION IF EXISTS is_invalid_utf8;
delimiter $$
CREATE FUNCTION is_invalid_utf8(s BLOB)
RETURNS int
DETERMINISTIC
  BEGIN
    DECLARE i INT DEFAULT 1;
    DECLARE c INT;
    DECLARE num_expected_highs INT DEFAULT 0;
    WHILE (i <= length(s)) DO
      SET c = ord(mid(s, i, 1));
      IF num_expected_highs > 0 THEN
        IF (c & 192)<>128 THEN  -- should start with 10...
          RETURN 1;
        END IF;
        SET num_expected_highs = num_expected_highs - 1;
      ELSE
        IF (c & 224) = 192 THEN -- starts with 110...
          SET num_expected_highs = 1;
        ELSEIF (c & 240) = 224 THEN  -- starts with 1110...
          SET num_expected_highs = 2;
        ELSEIF (c & 248) = 240 THEN  -- starts with 11110...
          SET num_expected_highs = 3;
        ELSEIF (c & 128 = 128) THEN -- starts with 1...
          RETURN 1;
        END IF;
      END IF;
      SET i = (i+1);
    END WHILE;
    RETURN num_expected_highs;
  END
$$
delimiter ;
 
SELECT 
  is_invalid_utf8("foo")=0, -- valid
  is_invalid_utf8(0x91)<>0,  -- invalid: latin1 š
  is_invalid_utf8(0xc581)=0, -- valid: utf8 š
  is_invalid_utf8(0xc5)<>0, -- invalid
  is_invalid_utf8(0x81)<>0 -- invalid
\G
 
 
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

Hi dml,

Well, that certainly finds the offending field. So, I'm now experimenting using that stored procedure in a PHP script to find offending fields, store the original data, convert tables to UTF8 then update the offending field with the old data. Only problem is, I'm having difficulty translating that stored procedure into a PHP string -- I keep getting MySQL syntax errors. Does the stored procedure require the newlines as you've given them?
sirfragalot
Forum Newbie
Posts: 14
Joined: Wed Jan 12, 2005 1:42 pm

Re: PHP, MySQL, UTF-8

Post by sirfragalot »

Yep, that works. I know have the complete text back in the UTF-8 database without truncation. Still battling over getting PHP to write the stored procedure.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: PHP, MySQL, UTF-8

Post by dml »

The stored procedure syntax is very fragile alright. I'm not sure about the difference newlines make... can you try it with a multiline string?

Code: Select all

 
execute_sql("
 CREATE FUNCTION xis_invalid_utf8(s BLOB)
 RETURNS int
 DETERMINISTIC
   BEGIN
     ...
   END
");
 
Post Reply