Best practice character set encoding

matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Best practice character set encoding

Post by matthijs »

Searching through the forums I find many discussions about character set problems. After having done some research I have a basic understanding of the issue now. However, I wondered what people advise as best practice nowadays.

I understand that by using UTF-8 you have the best chance of being able to work well with a wide range of (international) characters. However, a lot of information is (still) encoded as latin-1 (ISO-8859-1), which leads to the familiar problem of getting weird characters on your web pages.

My gut feeling is that I should try to stick with UTF-8 everywhere. But, as I said, some data in my database might still be latin-1. Or new data coming in might be latin-1.
For example, in one of my projects I got handed a spreadsheet with data which I imported into my database. The data was latin-1. Should I be pragmatic and just set a

Code: Select all

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
in my webpage template and be done with it? Or should I try to convert the data to utf-8?
(and then use <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)

I also discovered that the Zend Framework doesn't send a specific header setting the character set to UTF-8. Is that on purpose? Or is that only done by the server?

[edit:]

Another issue, apparently, is browsers:
http://dev.mysql.com/tech-resources/art ... icode.html
If your HTML page contains a form, browsers will generally send the results back in the character set of the page. So if your page is sent in UTF-8, you will (usually) get UTF-8 results back. The default encoding of HTML documents is ISO-8859-1, so by default you will get form data encoded as ISO-8859-1, with one big exception: some browsers (including Microsoft Internet Explorer and Apple Safari) will actually send the data encoded as Windows-1252, which extends ISO-8859-1 with some special symbols, like the euro (€) and the curly quotes (“”).
Does this mean that if people use Internet Explorer, they are going to send Windows-1252 anyway, regardless of what I try to set as character encoding? If that's the case I might as well forget about using UTF-8.
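
As the quoted passage reads, the Windows-1252 exception concerns pages that fall back to ISO-8859-1, not pages served as UTF-8. To see why the distinction between the two single-byte encodings matters at all, here is a minimal sketch, assuming the mbstring extension is available. It converts the same byte, 0x80, to UTF-8 under each interpretation.

```php
<?php
// The same byte means different things in the two encodings.
$byte = "\x80";

// Windows-1252: 0x80 is the euro sign (U+20AC).
$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// ISO-8859-1: 0x80 is an invisible C1 control character (U+0080).
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromCp1252), "\n"; // e282ac
echo bin2hex($fromLatin1), "\n"; // c280
```

So if form data that really is Windows-1252 gets converted as if it were ISO-8859-1, euro signs and curly quotes end up as control characters.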
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

Interesting, thanks.
lkjkorn19 wrote: * If you want to switch from UTF-8 to ISO-8859-1 and vice versa, use the functions utf8_encode and utf8_decode. The real problem would be to detect in which encoding the data is, I guess.
I suspect my data is in ISO-8859-1, because when I set
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
the characters display fine.
lkjkorn19 wrote:You can use the accept-charset attribute in <form> and your forms will be sent in the charset you specified therein, (e.g. <form accept-charset="utf-8" ...> will ALWAYS send the data in UTF-8, even if you set the encoding of the page to ISO-8859-1 (or anything else)).
I understand that this is possible. But until today I have never seen this practice before. So I wonder, is nobody using this? And why not? Is it worth the trouble of doing this for all my forms?
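
Whatever the form declares, the server can still check what actually arrived. A minimal sketch, assuming the mbstring extension is loaded: the helper name to_utf8() is made up for this example, and treating every non-UTF-8 input as Windows-1252 is an assumption that will not hold for all data.

```php
<?php
/**
 * Hypothetical helper: return $input as UTF-8.
 * Assumption: anything that is not valid UTF-8 was sent as Windows-1252.
 */
function to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8("caf\xE9")), "\n";     // single-byte input:  636166c3a9
echo bin2hex(to_utf8("caf\xC3\xA9")), "\n"; // UTF-8 input, as-is: 636166c3a9
```

Checking validity first matters: converting a string that is already UTF-8 a second time is what produces the familiar "Ã©" garbage.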

Another question is, how to convert data in the database from one encoding to another. Simply doing:

Code: Select all

 
ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL
is not actually converting the data in the fields itself. It's only setting the character set and collate of the table.
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

Strangely enough, I discovered that the data in the db in fact seems to be utf-8.

If I go into phpMyAdmin to the db table, copy a string with a special character from one of the fields, say:
$string = 'Jöris';
and then check if it's UTF-8 with the following function:

Code: Select all

 
function is_utf8($string) {
  
    // From http://w3.org/International/questions/q ... utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
  
}
It says that it is utf-8.


However, if this same string gets pulled out of the db into the webpage, if that webpage has
<meta http-equiv="content-type" content="text/html; charset=utf-8">
it turns out as J�ris

If I set
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
it turns out correct as
Jöris

Maybe the data gets converted some where along the way?

Doing SHOW VARIABLES LIKE 'c%'; gives:

Code: Select all

 
Variable_name   Value 
character_set_client   utf8 
character_set_connection   utf8 
character_set_database   latin1 
character_set_filesystem   binary 
character_set_results   utf8 
character_set_server   latin1 
character_set_system   utf8 
collation_connection   utf8_unicode_ci 
collation_database   latin1_swedish_ci 
collation_server   latin1_swedish_ci
It seems the database default is latin1 and so even though I have set the specific database table to utf8 and collate utf8_general_ci it gets overruled or transformed.
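
One thing to keep in mind: session variables like these describe the connection that ran the query, so values seen in phpMyAdmin need not match the connection a PHP script opens. The sketch below only simulates the suspected effect, assuming the script's connection hands the name back as latin1 bytes. It uses mbstring and no database.

```php
<?php
$utf8 = "J\xC3\xB6ris"; // "Jöris" written as explicit UTF-8 bytes

// What the same name looks like once it has been converted to latin1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 4ac3b6726973
echo bin2hex($latin1), "\n"; // 4af6726973

// The lone byte 0xF6 is not valid UTF-8, so a UTF-8 page cannot render it.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```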
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

I wish I had never dived into this matter 8O

Code: Select all

 
// The string is utf8
$string = 'Jöris';
echo strlen($string); // returns 6!!
Trying to go for utf-8 is nice, but you have to deal with a lot of issues. A lot of php functions don't work as expected when changing to utf-8:
http://www.phpwact.org/php/i18n/utf-8
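
For comparison, a small sketch of the byte count next to the character count, assuming the mbstring extension is installed. The string is spelled out as bytes so the result does not depend on how the file itself is saved.

```php
<?php
$string = "J\xC3\xB6ris"; // "Jöris" as UTF-8: the ö takes two bytes

echo strlen($string), "\n";             // 6 (bytes)
echo mb_strlen($string, 'UTF-8'), "\n"; // 5 (characters)
```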
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Best practice character set encoding

Post by kaisellgren »

matthijs wrote:

Code: Select all

 
// The string is utf8
$string = 'Jöris';
echo strlen($string); // returns 6!!
An expected result. The function strlen() returns the size of the string in bytes, not the number of characters. For instance, a single 2-byte character makes it return 2.
matthijs wrote:If I go into phpMyAdmin to the db table, copy a string with a special character from one of the fields, say:
$string = 'Jöris';
and then check if it's UTF-8
How do you copy the string? Highlight it, copy and paste?
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

kaisellgren wrote:
matthijs wrote:

Code: Select all

 
// The string is utf8
$string = 'Jöris';
echo strlen($string); // returns 6!!
An expected result. The function strlen() returns the size of the string in bytes, not the number of characters. For instance, a single 2-byte character makes it return 2.
Yes indeed, expected. But not something I had thought about much before. Until now I knew I had to have the right headers sent and set the right meta tag in my html. Put it in and be done with it, so to say. Everybody was using utf-8 so I did the same. But I didn't know that if you're going to use utf-8 (for real) in your code, a whole bunch of PHP functions don't work as expected anymore (that basically means you can't use them).

At this moment I feel like I just opened Pandora's box.

What I also find strange is that there's so little being talked about working with unicode in PHP. I just checked a couple of PHP security books I have and the words "unicode" or "character_set" or "utf-8" are hardly mentioned (if at all). I was going through the documentation of the Zend framework but can hardly find any serious info on the subject. I vaguely remember reading a few blog posts about "full unicode support in php6", but never paid much attention to that.

Yet the implications can be pretty big if something about the character sets in your application is not correct. A few characters being replaced by question marks in your html are the least of your problems. If functions don't work as expected your code could have bugs and security breaches you weren't aware of.

At this point I think I'd better not try to work with utf-8 and just stick with latin-1 for the moment.

Or am I missing something?
kaisellgren wrote:
matthijs wrote:If I go into phpMyAdmin to the db table, copy a string with a special character from one of the fields, say:
$string = 'Jöris';
and then check if it's UTF-8
How do you copy the string? Highlight it, copy and paste?
Yes. But it's the same when I go into my text editor and type that string and save it (as utf-8)
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Best practice character set encoding

Post by kaisellgren »

matthijs wrote:Yes indeed, expected. But not something I had thought about much before. Until now I knew I had to have the right headers sent and set the right meta tag in my html. Put it in and be done with it, so to say. Everybody was using utf-8 so I did the same. But I didn't know that if you're going to use utf-8 (for real) in your code, a whole bunch of PHP functions don't work as expected anymore (that basically means you can't use them).
I think the general misunderstanding is that people think the function strlen() is a "strcharcount()" for all encodings :). If you want to get the char count, use either the mb_ extension's equivalent or decode the string before passing it to strlen(). With exotic encodings you have to do the decoding yourself, but for UTF-8 I recall a function like utf8_decode() that you could use.
matthijs wrote:What I also find strange is that there's so little being talked about working with unicode in PHP.
I agree. I was actually going to make a deep insight into this in my blog.
matthijs wrote:I just checked a couple of PHP security books I have and the words "unicode" or "character_set" or "utf-8" are hardly mentioned (if at all).
If functions don't work as expected your code could have bugs and security breaches you weren't aware of.
There are some security issues related to encodings, but encoding-related issues are usually bugs in applications that may or may not be easily fixable. Therefore, yes, it is a very good idea to fully understand encodings and character sets before working with them at all. In general, any function that has an mb_string counterpart will not work "as expected" on multibyte data in its non-mb_string version.
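
To make the "not as expected" part concrete, here is a minimal sketch of a byte-oriented function next to its mbstring counterpart, assuming the mbstring extension is loaded. The variable names are only for illustration.

```php
<?php
$name = "J\xC3\xB6ris"; // "Jöris" as explicit UTF-8 bytes

$cut   = substr($name, 0, 2);             // 2 bytes: "J" plus half of the "ö"
$mbCut = mb_substr($name, 0, 2, 'UTF-8'); // 2 characters: "Jö"

var_dump(mb_check_encoding($cut, 'UTF-8'));   // bool(false): broken sequence
var_dump(mb_check_encoding($mbCut, 'UTF-8')); // bool(true)

// Case conversion that knows about non-ASCII letters.
echo bin2hex(mb_strtoupper($name, 'UTF-8')), "\n"; // 4ac396524953 ("JÖRIS")
```

The truncated string from substr() is the kind of thing that later shows up as a replacement character in the page, or worse, as invalid input to some other layer.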
matthijs wrote:At this point I think I'd better not try to work with utf-8 and just stick with latin-1 for the moment.
That's your decision. You suffer the consequences. Decision... consequence... bitc*! (Day Break)

Seriously, though, I would recommend that you read a book. You mentioned http://www.cs.tut.fi/~jkorpela/chars.html (oh, it looks like you didn't; I swear someone did, maybe in another topic?). Its author, Jukka Korpela, whom I know (he is a Finn too :D), has also written a book: http://www.amazon.com/Unicode-Explained ... 980&sr=1-3

I recommend reading it!
matthijs wrote:Or am I missing something?
You are just being cautious and contemplative, which is good.
matthijs wrote:Yes. But it's the same when I go into my text editor and type that string and save it (as utf-8)
The data in the database is not UTF-8. You copy the data from phpMyAdmin, paste it into Notepad and then save it as UTF-8. That is why it becomes UTF-8: the text editor did the translation.
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

kaisellgren wrote:I agree. I was actually going to make a deep insight into this in my blog.
That would be interesting.
kaisellgren wrote:That's your decision. You suffer the consequences. Decision... consequence... bitc*! (Day Break)
Well at this point I haven't made any decision. However, reading some blog posts about all the trouble people are having trying to use utf-8 is not exactly encouraging.

Thanks for the link to that book. Definitely something worth looking into.
kaisellgren wrote:The data in the database is not UTF-8. You copy the data from phpMyAdmin, paste it into Notepad and then save it as UTF-8. That is why it becomes UTF-8: the text editor did the translation.
How do you know the data in the db is not UTF-8?

What happens if I have a .csv file, open it in my text editor (TextMate), save it (as utf-8) and then import it into the database?
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Best practice character set encoding

Post by kaisellgren »

matthijs wrote:How do you know the data in the db is not UTF-8?
However, if this same string gets pulled out of the db into the webpage, if that webpage has
<meta http-equiv="content-type" content="text/html; charset=utf-8">
it turns out as J�ris
If you tell the web browser that the content-type is UTF-8 and the text still becomes "corrupted", then the reason is either A) the data was not UTF-8, or B) the OS/browser is incapable of displaying the character(s).
matthijs wrote:What happens if I have a .csv file, open it in my text editor (TextMate), save it (as utf-8) and then import it into the database?
If the import process does not convert anything, then the data will appear as UTF-8 in the database and therefore the above mentioned situation will be "fixed".
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

kaisellgren wrote:If you tell the web browser that the content-type is UTF-8 and the text still becomes "corrupted", then the reason is either A) the data was not UTF-8, or B) the OS/browser is incapable of displaying the character(s).
You might be right. I just checked with this code:

Code: Select all

 
<?php 
 
function is_utf8($string) {
  
    // From http://w3.org/International/questions/q ... utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
  
}
 
// set header
header("Content-type: text/html; charset=utf-8");
//header("Content-Type: text/html; charset=ISO-8859-1");
 
// Pick one of these:
print('<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body>');
//print('<html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body>');
 
/* Connect to db */
mysql_connect( 'localhost', '***', '***' );
mysql_select_db( 'mydb' );
 
//mysql_query('SET NAMES utf8'); // optional, keep this disabled first
$res = mysql_query("SELECT `achternaam` FROM `leden3` WHERE `achternaam` LIKE 'Coll%' LIMIT 1");
if (!mysql_num_rows($res)) die('achternaam not found');
 
$row = mysql_fetch_row($res);
$s = $row[0];
 
print(bin2hex($s).'<br>'.$s.'<br>');
if(is_utf8($s)){
    echo 'The string is utf8<br>';
} else {
    echo 'The string is NOT utf8<br>';
}
 
If I run this code the result is:

Code: Select all

436f6c6ce965
Coll�e
The string is NOT utf8
 
Adding mysql_query('SET NAMES utf8'); does return the result I want

Code: Select all

436f6c6cc3a965
Collée
The string is utf8
But here is the weird thing: if I take the original data file (csv), copy and paste the data into a new file, save that as utf-8, import it into the db (as utf-8) and then run the test above, I get exactly the same results. Can it happen that even though the data in the table is utf-8 and the web page is displayed as utf-8 (through the headers and meta tag), the data still gets transformed into something else somewhere in between?
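
The two hex dumps above can be compared directly. A small sketch, assuming mbstring is available, that treats the first dump as ISO-8859-1 and converts it to UTF-8:

```php
<?php
$withoutSetNames = hex2bin('436f6c6ce965');   // bytes from the first run
$withSetNames    = hex2bin('436f6c6cc3a965'); // bytes from the second run

// 0xE9 is "é" in ISO-8859-1; in UTF-8 the same character is 0xC3 0xA9.
$converted = mb_convert_encoding($withoutSetNames, 'UTF-8', 'ISO-8859-1');

var_dump($converted === $withSetNames); // bool(true)
```

In other words, the first run returned the name as latin1 bytes and the second run returned the same name as UTF-8 bytes, which fits the idea that the connection setting decides what the script receives.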
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Best practice character set encoding

Post by kaisellgren »

It's been a while since I last worked on character sets and encodings, but I do remember that you need to tell MySQL to use UTF-8 for the connection (not just define the table as UTF-8). There are two statements in MySQL for this, something like SET NAMES and SET CHARACTER SET.
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Best practice character set encoding

Post by kaisellgren »

By the way, when I started rereading this topic I realized one thing. You were using a typical Latin character set and you are going to use UTF-8 now. UTF-8 is transparent for the basic Latin (ASCII) range, and MySQL converts the remaining characters for you once the connection character set is right, so you need no manual conversions. Make sure you are using a UTF-8 connection (SET NAMES 'utf8') and a UTF-8 collation.

Furthermore, send both header('Content-Type: text/html; charset=UTF-8') and the meta tag with the UTF-8 encoding.

If PHP is configured to send a non-UTF-8 charset by default (I mean it automatically sends a Content-Type header; look at default_charset in the ini), then do not worry: your header() call will replace any previously set Content-Type header. :)
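
A minimal sketch of sending both, with the charset parameter written as charset= followed by the value. The helper name content_type_header() is invented for this example and is not part of PHP.

```php
<?php
/** Hypothetical helper: build the Content-Type header line for a charset. */
function content_type_header(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() replaces an earlier header of the same name by default.
if (!headers_sent()) {
    header(content_type_header());
}

// The meta tag repeats the same charset inside the document itself.
echo '<meta http-equiv="content-type" content="text/html; charset=utf-8">', "\n";
```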
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: Best practice character set encoding

Post by matthijs »

Yes, if I set
$dbAdapter->query('SET NAMES UTF8');
the data seems to come through correctly.
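
For code that talks to MySQL through mysqli rather than a framework adapter, the same connection setting can be made with mysqli::set_charset(), which the PHP manual prefers over a raw SET NAMES query. This is only a sketch: the function name open_utf8_connection() is made up, the host, credentials and database name are placeholders supplied by the caller, and nothing here connects until the function is called.

```php
<?php
/**
 * Hypothetical helper: open a mysqli connection and switch it to UTF-8.
 * All four arguments are placeholders supplied by the caller.
 */
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = new mysqli($host, $user, $pass, $db);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }

    // Sets the client, connection and result character sets for this session.
    // 'utf8' matches the thread; newer MySQL servers also offer 'utf8mb4'.
    if (!$conn->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }

    return $conn;
}
```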

Thanks for all the input so far. It's a complicated subject. The basics are easy to understand, but the problem is it's not always transparent what is happening. As humans we only read "plain text" characters, we can't see code points.