Multilanguage Best Practices / UTF8
Posted: Wed Jan 18, 2006 7:08 am
by jaycangel
I'm designing a system that will initially handle Western character sets but will soon expand to Russian and other scripts.
I was wondering what practices will best future-proof your code and data.
I've identified the following places where character sets often change and where communication problems can arise:
php - server to browser
php - browser to server
browser - server to browser to flash
browser - flash to browser to server
xml - server to browser
database - server to php
database - php to server
After doing some research, I've identified UTF-8 as the best character set to use.
To implement this I'm going to do the following:
serving PHP pages only with the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
serving XML files only with the declaration:
<?xml version="1.0" encoding="UTF-8"?>
creating MySQL tables with the charset set to utf8
Besides using these tags, what else can we do to ensure that all encoding is done in UTF-8?
Will all data be transferred in UTF-8, or should my PHP scripts do some other formatting?
And are there any shortfalls to using UTF-8 as my base character set?
Posted: Wed Jan 18, 2006 10:03 am
by Weirdan
Specify the accept-charset attribute for your forms:
http://www.w3.org/TR/html401/interact/f ... pt-charset
Set the proper connection charset when connecting to the DB server (some RDBMSs allow the connection charset to differ from the database encoding).
Use a proper collation (http://dev.mysql.com/doc/refman/5.0/en/ ... -sets.html).
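The connection-charset and collation advice above could be sketched roughly as follows. This is only an illustration: PDO is not mentioned in the thread, the helper names `utf8_mysql_dsn()` and `utf8_table_sql()` and the `messages` table are hypothetical, and the `utf8`/`utf8_general_ci` choice simply mirrors the charset named in the first post.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that asks for a utf8 connection,
// so the connection charset matches the table charset.
function utf8_mysql_dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $database);
}

// Hypothetical helper: builds a CREATE TABLE statement with an explicit
// charset and collation. The table name is interpolated as-is, so only
// pass trusted, hard-coded names.
function utf8_table_sql(string $table): string
{
    return "CREATE TABLE `$table` (\n"
        . "  id INT AUTO_INCREMENT PRIMARY KEY,\n"
        . "  body TEXT\n"
        . ") DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci";
}

// Usage (needs a reachable MySQL server, so it is left commented out):
// $pdo = new PDO(utf8_mysql_dsn('localhost', 'example_db'), $user, $password);
// $pdo->exec(utf8_table_sql('messages'));
```

Without PDO, the same effect on the connection is usually achieved by sending `SET NAMES utf8` as the first query after connecting.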
Posted: Wed Jan 18, 2006 10:09 am
by jaycangel
Thanks, I wasn't aware that you could put a char set on a form. I had wondered how you would specify what char set data was sent to the server. I thought it would take it from the HTML Meta tag.
Posted: Wed Jan 18, 2006 10:20 am
by jaycangel
I've just read through one of those links and am a bit confused by this
The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.
Does this mean I should use
meta http-equiv="Content-Type" content="text/html; charset=utf-8"
or
meta http-equiv="Content-Type" content="text/html; charset=ISO10646"
and
form accept-charset="utf-8"
or
form accept-charset="ISO10646"
I found a list of char sets at http://www.iana.org/assignments/character-sets but am not sure which one to use. I want the widest char set, so that I never have to change it.
Posted: Wed Jan 18, 2006 10:40 am
by jaycangel
So far this is what I've found out through my research:
Header:
Content-Type: text/html; charset=UTF-8
Tag in XHTML page:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
Tag in XML document:
<?xml version="1.0" encoding="UTF-8"?>
Attribute in form tag:
accept-charset="utf-8"
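Put together in PHP, the header, meta tag and form attribute listed above might look like the sketch below. The function names `render_utf8_page()` and `send_utf8_page()` and the `comment` field are made up for illustration; nothing here is prescribed by the thread.

```php
<?php
// Hypothetical helper: wraps a form body in a page that declares UTF-8
// both in the meta tag and on the form itself.
function render_utf8_page(string $bodyHtml): string
{
    return '<html><head>' . "\n"
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . "\n"
        . '</head><body>' . "\n"
        . '<form method="post" action="" accept-charset="utf-8">' . "\n"
        . $bodyHtml . "\n"
        . '</form>' . "\n"
        . '</body></html>' . "\n";
}

// Hypothetical helper: sends the HTTP header before any output, then the page.
function send_utf8_page(string $bodyHtml): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    echo render_utf8_page($bodyHtml);
}

// Usage:
// send_utf8_page('<input type="text" name="comment" />');
```

The header() call has to run before any output is sent, which is why the sketch keeps it in one place ahead of the echo.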
Posted: Thu Jan 19, 2006 5:52 am
by Maugrim_The_Reaper
I know this may be incredibly obvious, but I often come across it on the internet (and believe me it happens; my name contains a non-English character, which makes it obvious when it is used).
If you have HTML files, CSS, etc., make sure they are actually encoded in UTF-8. It's not enough to use the headers; the actual file must be saved using the UTF-8 character encoding. Most editors should support this by now, so it's easy to check through the configuration options to be certain.
The effect of a UTF-8 header/charset on a file that is not actually UTF-8 encoded is that the client browser gets confused and starts outputting ???? in place of some characters. Like I said, it's probably obvious, but just in case.
Hotmail, for example, is incapable of displaying my name correctly...
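A quick way to check whether a string (or the contents of a file) really is UTF-8 could look like this. It is a sketch that relies on the PCRE `/u` modifier, which rejects invalid UTF-8, so it does not need mbstring; the function name `is_valid_utf8()` and the sample strings are hypothetical.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for invalid UTF-8
// and 1 for a successful (empty-pattern) match.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$savedAsUtf8   = "Jos\xC3\xA9"; // "José" with é stored as two UTF-8 bytes
$savedAsLatin1 = "Jos\xE9";     // the same name saved as ISO-8859-1

var_dump(is_valid_utf8($savedAsUtf8));   // bool(true)
var_dump(is_valid_utf8($savedAsLatin1)); // bool(false)

// For a file on disk (path is a placeholder):
// var_dump(is_valid_utf8(file_get_contents('template.html')));
```

Note that plain ASCII text also passes this check, since ASCII is a subset of UTF-8.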

Posted: Thu Jan 19, 2006 6:16 am
by jaycangel
Cool, thanks, I hadn't checked what encoding my files were saved in!
Posted: Thu Jan 19, 2006 11:44 pm
by josh
Also, htmlentities() called without a charset argument will kill your European characters.
Use:
htmlentities($string,ENT_QUOTES,'UTF-8')
After a while it may get cumbersome to type that out, so you could create a function called utf8entities() or something.
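The wrapper suggested above might be as small as the following sketch; the name `utf8entities()` comes from the post, while the type hints and sample strings are additions for illustration.

```php
<?php
// Thin wrapper so the flags and charset are always passed the same way.
function utf8entities(string $string): string
{
    return htmlentities($string, ENT_QUOTES, 'UTF-8');
}

// "\xC3\xA9" is é encoded as UTF-8, written as bytes so the example does
// not depend on how this file itself was saved.
echo utf8entities("caf\xC3\xA9 \"menu\""), "\n"; // caf&eacute; &quot;menu&quot;
echo utf8entities('<b>bold</b>'), "\n";          // &lt;b&gt;bold&lt;/b&gt;
```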
Posted: Fri Jan 20, 2006 5:02 am
by jaycangel
If you wouldn't mind telling me: how does it kill them? Does it strip them out, or does it just not convert them?
I found that htmlentities() doesn't convert some Russian characters that some of my users have entered. For example, о is how I receive data entered into a normal text field (I've done no conversion on it; I think the browser does it), and html_entity_decode() does not convert it back to the Russian character. I'm guessing this is because I didn't specify the UTF-8 charset?
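For the decoding question, here is a minimal sketch. It assumes the field arrived as a numeric character reference such as `&#1086;` (an assumption; the post shows only the rendered letter) and a PHP version whose html_entity_decode() supports multi-byte charsets, which has been the case since PHP 5.

```php
<?php
// Assumed input: the Cyrillic letter "о" (U+043E) as a numeric reference.
$received = '&#1086;';

// Passing 'UTF-8' tells PHP which encoding to produce for the decoded character.
$decoded = html_entity_decode($received, ENT_QUOTES, 'UTF-8');

echo bin2hex($decoded), "\n"; // d0be, the two UTF-8 bytes of U+043E
```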
Posted: Fri Jan 20, 2006 5:57 am
by josh
If you htmlentities() a European character without specifying the charset, it will convert the characters to random strings (well, seemingly random).
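The "seemingly random" output can be reproduced by telling htmlentities() the wrong charset explicitly. In the sketch below, ISO-8859-1 stands in for the old default (PHP versions before 5.4 defaulted to ISO-8859-1); the variable names are illustrative only.

```php
<?php
// é encoded as UTF-8 is the two bytes 0xC3 0xA9.
$eAcute = "\xC3\xA9";

// Treated as ISO-8859-1, each byte is taken as its own character
// (0xC3 is Ã, 0xA9 is ©), so one letter becomes two unrelated entities.
$wrong = htmlentities($eAcute, ENT_QUOTES, 'ISO-8859-1');

// Treated as UTF-8, the two bytes are read as the single character é.
$right = htmlentities($eAcute, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Atilde;&copy;
echo $right, "\n"; // &eacute;
```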
Posted: Fri Jan 20, 2006 6:17 am
by jaycangel
It seems that PHP will have Unicode encoding built in from v6. It's always Sod's law that what you want is in the next version.
http://www.zend.com/zend/week/php-unicode-design.txt
is an interesting article on the new Unicode support.
Posted: Fri Jan 20, 2006 10:42 am
by Maugrim_The_Reaper
An example of why character encoding matters...hey if it works on Google...
http://shiflett.org/archive/177
Posted: Mon Oct 30, 2006 3:00 am
by kamel
My problem is that I use htmlentities() with "UTF-8" as the third parameter, but the counterpart function html_entity_decode() doesn't work the same way, and PHP throws an error message like this: cannot yet handle MBCS in html_entity_decode()! in: bla bla bla
Can anyone help me with this?
Thanks in advance.
Posted: Mon Oct 30, 2006 7:01 am
by CoderGoblin
Be careful with things like substr(); you may need to look at mb_substr()...
This is something I am just starting to look at myself, as I am currently checking an error I think is caused by it...
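The substr() pitfall comes from it counting bytes rather than characters. A rough sketch of the difference follows; `utf8_substr()` is a hypothetical helper that uses mb_substr() when mbstring is loaded and otherwise falls back to splitting on the PCRE `/u` modifier, and it assumes a non-negative start offset.

```php
<?php
// Hypothetical helper: character-based substring for UTF-8 text.
function utf8_substr(string $text, int $start, int $length): string
{
    if (function_exists('mb_substr')) {
        return mb_substr($text, $start, $length, 'UTF-8');
    }
    // Fallback without mbstring: split into characters, then slice.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }
    return implode('', array_slice($chars, $start, $length));
}

// Two Cyrillic letters, "о" and "к", each two bytes in UTF-8.
$word = "\xD0\xBE\xD0\xBA";

echo strlen($word), "\n";                       // 4 (bytes, not characters)
echo bin2hex(substr($word, 0, 1)), "\n";        // d0, half a character
echo bin2hex(utf8_substr($word, 0, 1)), "\n";   // d0be, the whole first letter
```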
Posted: Mon Oct 30, 2006 8:11 am
by kamel
The main problem is that I don't have mbstring on the production server, and html_entity_decode() doesn't seem to work because PHP throws an error message when I use UTF-8 as the third parameter.
Can you give me any advice?
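One possible workaround, offered as a sketch rather than a tested fix for that server, is to decode numeric character references by hand, which needs neither mbstring nor html_entity_decode(). The function names `codepoint_to_utf8()` and `decode_numeric_entities()` are hypothetical, the code is written for current PHP syntax, and named entities such as `&eacute;` are deliberately left untouched.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace &#NNNN; and &#xHHHH; with UTF-8 characters.
function decode_numeric_entities(string $text): string
{
    return preg_replace_callback(
        '/&#([xX][0-9a-fA-F]+|[0-9]+);/',
        function (array $m): string {
            $isHex = strtolower($m[1][0]) === 'x';
            $cp = $isHex ? hexdec(substr($m[1], 1)) : (int) $m[1];
            // Leave NUL, surrogates and out-of-range values as they were.
            if ($cp < 1 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            return codepoint_to_utf8((int) $cp);
        },
        $text
    );
}

echo bin2hex(decode_numeric_entities('&#1086;')), "\n";  // d0be
echo bin2hex(decode_numeric_entities('&#x43E;')), "\n";  // d0be
```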