
Character Encoding

Posted: Tue Nov 17, 2009 5:09 am
by jayshields
Hi,

I've searched around quite a bit and can't seem to find a way to retrieve my Chinese characters from a Unicode MySQL database table and display them properly.

I'm pretty sure it's PHP related because using the command line I can pull out stuff like this:

Code: Select all

+-------------------------+
| chinese                 |
+-------------------------+
| µ£¬??ô?ó??¬ì?Üä?¡öµíê
        |
| µôè????ÇÖ?úí?????î
          |
| ?????î
                  |
| ?Ç¥?¢??Ä??Ç¥
                |
| ?íî?ïò????áÿ
              |
| ?ü?µôç?¿ê?êåµî絿Ö
          |
| ?ªé??ò?ûï?ºï
              |
| ?à??¿ê?¡öµíê???µò?
          |
| ?à??¿ê?äí?¡öµíê???µò?
        |
| ??ò????òÅ?íî                |
| ?òÅ?íî
                  |
| ?Ŭ?«Ç                    |
+-------------------------+
12 rows in set (0.00 sec)
Which is probably the correct data but my Windows command line can't actually display it.

phpMyAdmin gives me the correct data when I "SELECT *" from the query window, so I guess my server is set up right (in terms of PHP config).

However, when I try to get the data in PHP on my own, all I can display is question marks, for example:

Code: Select all

<?php
 
header('Content-Type: text/html; charset=utf-8');
 
mysql_connect('localhost', 'root', 'xxx') or die(mysql_error());
mysql_select_db('blah') or die(mysql_error());
 
$result = mysql_query('SELECT * FROM `table`') or die(mysql_error());
 
while($row = mysql_fetch_assoc($result))
{
    print_r($row);
}
 
?>
Produces something like:

Code: Select all

Array ( [order] => 15 [key] => whatever [english] => Blah blah blah blah blah blah blah blah. [chinese] => ????????????,????????? [simp_chinese] => ????????????,?????????))
The odd � (white question mark on a black background) is littered throughout the normal question marks too.

The same happens if I use the CodeIgniter database library (with which I have set

Code: Select all

$db['default']['char_set'] = "utf8";
$db['default']['dbcollat'] = "utf8_unicode_ci";
in the database config).

Some extra details:
- I'm using

Code: Select all

header('Content-Type: text/html; charset=utf-8');
to make sure UTF-8 encoding is used (and Firefox tells me that it is receiving UTF-8 encoded data).
- My MySQL TEXT field is using the "utf8_unicode_ci" collation, on an InnoDB engine table.
- The table with the TEXT field is actually set to latin1_swedish_ci collation (I assume this is overridden by per-field settings - likewise with database-wide collation).
- Firebug tells me the Content-Type of the HTTP response is "text/html; charset=utf-8".
- I'm using this HTML header

Code: Select all

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
with this tag inside my <head>

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
.
- As you can see, I'm not serving the document as XML (as XHTML can be); this shouldn't make a difference (?).
- I can put one of my Chinese strings directly into my PHP file as a literal and echo it out fine (meaning that it can't really be anything to do with my HTML/HTTP response?).

I'm basically trying to match what phpMyAdmin is getting in the HTTP response, but don't know what this is

Code: Select all

Vary:   Accept-Encoding
or if it could make a difference (I doubt it).

I'm stumped; as far as I can tell I've covered everything that is suggested anywhere I read about this type of stuff.

Re: Character Encoding

Posted: Tue Nov 17, 2009 5:11 am
by papa
Have you tried with different charsets?

http://a4esl.org/c/charset.html

Re: Character Encoding

Posted: Tue Nov 17, 2009 5:24 am
by iankent
I had a similar problem yesterday with a phpBB script I was writing - it turned out to be this simple:

Code: Select all

 
mysql_query("SET NAMES utf8;");   
 
Not sure if it's the same problem, but the effects you describe are the same ones I had!

hth
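
A rough way to see why SET NAMES makes the difference, assuming the connection was defaulting to latin1: the server converts the stored UTF-8 text to the connection charset before sending it, and any character with no latin1 equivalent comes back as a literal "?". The sketch below only simulates that conversion with mbstring; simulateLatin1Connection() is a made-up name for illustration, not a MySQL or PHP API.

```php
<?php
// Simulation only: mb_convert_encoding() stands in for the conversion the
// MySQL server is assumed to do when the connection charset is latin1.
function simulateLatin1Connection(string $utf8Text): string
{
    // Make mbstring's default substitute character ('?') explicit.
    mb_substitute_character(0x3F);
    return mb_convert_encoding($utf8Text, 'ISO-8859-1', 'UTF-8');
}

// U+4E2D U+6587 (two Chinese characters) followed by U+00E9, written as
// UTF-8 byte escapes so the example does not depend on this file's encoding.
$stored   = "\xE4\xB8\xAD\xE6\x96\x87\xC3\xA9";
$received = simulateLatin1Connection($stored);

echo bin2hex($received), "\n";   // 3f3fe9
```

Each 3f is a plain question mark. The lone e9 byte is not valid UTF-8 on its own, which would be consistent with the odd "white question mark on a black background" glyph mixed in with the normal ones.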

Re: Character Encoding

Posted: Tue Nov 17, 2009 5:30 am
by jayshields
iankent wrote:

Code: Select all

 
mysql_query("SET NAMES utf8;");   
 
Brilliant. That's that sorted. Thanks.
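
For anyone landing here later, this is roughly where the call goes in the script from the first post. It is only a sketch: it swaps in the mysqli extension (the mysql_* functions used above were removed in PHP 7) and uses set_charset(), which has the same effect as SET NAMES and also tells the client library which charset to use for escaping. Host, credentials and names are the placeholders from the first post.

```php
<?php
header('Content-Type: text/html; charset=utf-8');

// Make mysqli throw exceptions on failure instead of returning false.
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

// Placeholder host, user, password and database from the first post.
$db = new mysqli('localhost', 'root', 'xxx', 'blah');

// Set the connection charset right after connecting and before any query.
$db->set_charset('utf8');

$result = $db->query('SELECT * FROM `table`');
while ($row = $result->fetch_assoc()) {
    print_r($row);
}
```

Note that MySQL's "utf8" stores at most three bytes per character; text outside the Basic Multilingual Plane would need "utf8mb4" on both the column and the connection.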

Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?

Re: Character Encoding

Posted: Tue Nov 17, 2009 5:40 am
by iankent
jayshields wrote: Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?
Very good question lol, I honestly have no idea - such an important part and I only found out from some helpful guy on the phpBB forums... I'll definitely be remembering it in future though :P

Re: Character Encoding

Posted: Tue Nov 17, 2009 5:57 am
by Apollo
In PHP, before displaying (printing) the supposedly Chinese strings, can you check that they are valid UTF-8?

For example, with: (source)

Code: Select all

// Returns true if $text consists solely of well-formed UTF-8 sequences.
// Note: ASCII control characters other than tab, LF and CR are rejected as well.
function IsCorrectUtf8( $text )
{
 return preg_match('%^(?:
   [\x09\x0A\x0D\x20-\x7E]              # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$%xs', $text) === 1;
}
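
A shorter variant of the same idea, as a sketch: preg_match() with the /u modifier returns false when the subject is not valid UTF-8, and bin2hex() shows the raw bytes that actually arrived from the database. describeBytes() is a hypothetical helper name, not something from the thread or from PHP itself.

```php
<?php
// Hypothetical helper: report UTF-8 validity and dump the raw bytes.
function describeBytes(string $text): string
{
    // With the /u modifier, preg_match() returns false (not 0) when the
    // subject string is not well-formed UTF-8.
    $valid = preg_match('//u', $text) === 1;
    return ($valid ? 'valid' : 'INVALID') . ' UTF-8, bytes: ' . bin2hex($text);
}

echo describeBytes("\xE4\xB8\xAD"), "\n";  // valid UTF-8, bytes: e4b8ad
echo describeBytes("???"), "\n";           // valid UTF-8, bytes: 3f3f3f
echo describeBytes("\xE4\xB8"), "\n";      // INVALID UTF-8, bytes: e4b8
```

The second line is the catch: a string of literal question marks is perfectly valid UTF-8, so a validity check alone would not flag it. The hex dump does, since a run of 3f bytes means the characters were already replaced before PHP received them.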

Re: Character Encoding

Posted: Tue Nov 17, 2009 6:00 am
by Apollo
Oh ok, good to see you already solved it :)
jayshields wrote: Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?
Somehow, character encoding is a complete mystery to many developers.

Many people seem to ignore this subject and think "well it works on my system in my language, so I'll just assume it's OK everywhere", which usually isn't the case :)

Re: Character Encoding

Posted: Tue Nov 17, 2009 6:01 am
by iankent
Apollo wrote: Many people seem to ignore this subject
much like security methinks