Character Encoding

jayshields
DevNet Resident
Posts: 1912
Joined: Mon Aug 22, 2005 12:11 pm
Location: Leeds/Manchester, England

Character Encoding

Post by jayshields »

Hi,

I've searched around quite a bit and can't seem to find a solution for retrieving my Chinese characters from a Unicode MySQL database table and displaying them properly.

I'm pretty sure it's PHP related because using the command line I can pull out stuff like this:

Code: Select all

+-------------------------+
| chinese                 |
+-------------------------+
| µ£¬??ô?ó??¬ì?Üä?¡öµíê
        |
| µôè????ÇÖ?úí?????î
          |
| ?????î
                  |
| ?Ç¥?¢??Ä??Ç¥
                |
| ?íî?ïò????áÿ
              |
| ?ü?µôç?¿ê?êåµî絿Ö
          |
| ?ªé??ò?ûï?ºï
              |
| ?à??¿ê?¡öµíê???µò?
          |
| ?à??¿ê?äí?¡öµíê???µò?
        |
| ??ò????òÅ?íî                |
| ?òÅ?íî
                  |
| ?Ŭ?«Ç                    |
+-------------------------+
12 rows in set (0.00 sec)
Which is probably the correct data but my Windows command line can't actually display it.

phpMyAdmin gives me the correct data when I "SELECT *" from the query window, so I guess my server is set up right (in terms of PHP config).

However, when I try to get the data in PHP on my own, all I can display is question marks, for example:

Code: Select all

<?php
 
header('Content-Type: text/html; charset=utf-8');
 
mysql_connect('localhost', 'root', 'xxx') or die(mysql_error());
mysql_select_db('blah') or die(mysql_error());
 
$result = mysql_query('SELECT * FROM `table`') or die(mysql_error());
 
while($row = mysql_fetch_assoc($result))
{
    print_r($row);
}
 
?>
Produces something like:

Code: Select all

Array ( [order] => 15 [key] => whatever [english] => Blah blah blah blah blah blah blah blah. [chinese] => ????????????,????????? [simp_chinese] => ????????????,?????????))
The odd � (white question mark on a black background) is littered throughout the normal question marks too.

The same happens if I use the CodeIgniter database library (with which I have set

Code: Select all

$db['default']['char_set'] = "utf8";
$db['default']['dbcollat'] = "utf8_unicode_ci";
in the database config).

Some extra details:
- I'm using

Code: Select all

header('Content-Type: text/html; charset=utf-8');
to make sure UTF-8 encoding is used (and Firefox tells me that it is UTF-8 encoded data being received).
- My MySQL TEXT field is using the "utf8_unicode_ci" collation, on an InnoDB engine table.
- The table with the TEXT field is actually set to latin1_swedish_ci collation (I assume this is overridden by per-field settings - likewise with database-wide collation).
- Firebug tells me the Content-Type of the HTTP response is "text/html; charset=utf-8".
- I'm using this HTML header

Code: Select all

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
with this tag inside my <head>

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
.
- As you can see, I'm not serving the document as XML (as XHTML can be); this shouldn't make a difference (?).
- I can put one of my Chinese strings into a string directly in my PHP file and echo it out fine (meaning that it can't really be anything to do with my HTML/HTTP response?).

I'm basically trying to match what phpMyAdmin is getting in the HTTP response, but don't know what this is

Code: Select all

Vary:   Accept-Encoding
or if it could make a difference (I doubt it).

I'm stumped; as far as I can tell I've covered everything that is suggested anywhere I read about this type of stuff.
papa
Forum Regular
Posts: 958
Joined: Wed Aug 27, 2008 3:36 am
Location: Sweden/Sthlm

Re: Character Encoding

Post by papa »

Have you tried with different charsets?

http://a4esl.org/c/charset.html
iankent
Forum Contributor
Posts: 333
Joined: Mon Nov 16, 2009 4:23 pm
Location: Wales, United Kingdom

Re: Character Encoding

Post by iankent »

I had a similar problem yesterday with a phpBB script I was writing - it turned out to be this simple:

Code: Select all

 
mysql_query("SET NAMES utf8;");   
 
Not sure if it's the same problem, but the effects you describe are the same as the ones I had!

hth
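
Note: in newer code the connection character set can also be fixed at connect time rather than with a separate query. The following is only a minimal sketch, assuming the PDO MySQL driver is available (the thread itself uses the old mysql_* functions); `buildMysqlDsn()` and `connectUtf8()` are hypothetical helper names, and the host and database name are the placeholders from the first post.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that names the connection charset.
// pdo_mysql honours the "charset" DSN option since PHP 5.3.6; it sets the
// connection character set, much like issuing SET NAMES after connecting.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Hypothetical helper: open the connection with that DSN.
// Not called here, because it needs a running MySQL server.
function connectUtf8(string $host, string $dbName, string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on errors
    ]);
}

echo buildMysqlDsn('localhost', 'blah'), "\n";
```

With the extensions used in this thread, the equivalent calls are `mysql_set_charset('utf8')` (ext/mysql, PHP 5.2.3 and later) or `$mysqli->set_charset('utf8')`. Both also tell the client library which charset is in use, which a plain `SET NAMES` query does not.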

Re: Character Encoding

Post by jayshields »

iankent wrote:

Code: Select all

 
mysql_query("SET NAMES utf8;");   
 
Brilliant. That's that sorted. Thanks.

Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?
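
Note: a likely explanation for the question marks, assuming the connection was left at a latin1 default: MySQL converts the utf8 column to the connection character set before sending the result, and any character without a latin1 equivalent goes out as a literal `?`. The sketch below only imitates that lossy conversion in plain PHP with the mbstring extension; it does not touch MySQL, and the sample string is made up.

```php
<?php
// Imitate a lossy utf8 -> latin1 conversion (requires the mbstring extension).
mb_substitute_character(0x3F); // use "?" for characters that cannot be converted

// Hypothetical sample: two Chinese characters (UTF-8 bytes written as escapes
// so the example does not depend on this file's own encoding) plus ASCII text.
$original = "\xE4\xB8\xAD\xE6\x96\x87 ok";
$asLatin1 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');

echo $asLatin1, "\n";         // the Chinese characters are lost, the ASCII part survives
echo strlen($original), "\n"; // byte length of the UTF-8 string
echo strlen($asLatin1), "\n"; // byte length after conversion
```

Once the characters have been replaced like this, no header or meta tag on the PHP side can bring them back, which fits the observation that a Chinese string typed directly into the PHP file displays fine.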

Re: Character Encoding

Post by iankent »

jayshields wrote:Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?
Very good question lol, I honestly have no idea - such an important part and I only found out from some helpful guy on the phpBB forums... I'll definitely be remembering it in future though :P
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character Encoding

Post by Apollo »

In PHP, before displaying (printing) the supposedly Chinese strings, can you check them for utf-8 correctness?

For example, with: (source)

Code: Select all

function IsCorrectUtf8( $text )
{  
 return preg_match('%^(?: 
   [\x09\x0A\x0D\x20-\x7E]              # ASCII 
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte 
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs 
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte 
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates 
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3 
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15 
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16 
  )*$%xs', $text);
}
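
Note: if the mbstring extension is loaded, a shorter check is possible. This is a minimal sketch under that assumption, and `isValidUtf8()` is a hypothetical name. It is not an exact replacement for the function above: the regular expression also rejects ASCII control characters other than tab, LF and CR, which `mb_check_encoding()` accepts.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8 (needs mbstring).
function isValidUtf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

var_dump(isValidUtf8("\xE4\xB8\xAD\xE6\x96\x87")); // two complete 3-byte sequences
var_dump(isValidUtf8("\xE4\xB8"));                 // truncated 3-byte sequence
var_dump(isValidUtf8("\xC0\x80"));                 // overlong encoding
```

Either check only shows whether the bytes are well-formed UTF-8. A string of plain `?` characters is valid UTF-8, so text that was already damaged on the way out of the database would still pass.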

Re: Character Encoding

Post by Apollo »

Oh ok, good to see you already solved it :)
jayshields wrote:Why does no one else mention this on the numerous blogs about encoding with PHP/MySQL?
Somehow, character encoding is a completely mystifying subject in many developers' minds.

Many people seem to ignore this subject and think "well it works on my system in my language, so I'll just assume it's OK everywhere", which usually isn't the case :)

Re: Character Encoding

Post by iankent »

Apollo wrote:Many people seem to ignore this subject
much like security methinks