Multilanguage Best Practices / UTF8
I'm designing a system that will initially handle Western character sets but will soon expand to Russian and other scripts.
I was wondering which practices best future-proof your code and data.
I've identified the following places where character sets often change, and where communication problems can occur:
php - server to browser
php - browser to server
browser - server to browser to flash
browser - flash to browser to server
xml - server to browser
database - server to php
database - php to server
After doing some research I've identified UTF-8 as the best character set to use.
To implement this I'm going to do the following:
only serving php files with the following meta tag
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
only serving xml files with the declaration
<?xml version="1.0" encoding="UTF-8"?>
creating MySQL tables with charset utf8
Besides using these tags, what else can we do to ensure that all encoding is done in UTF-8?
Will all data be transferred in UTF-8, or should my PHP scripts do some other formatting?
And are there any shortfalls to using UTF-8 as my base character set?
specify accept-charset attribute for your forms:
http://www.w3.org/TR/html401/interact/f ... pt-charset
set proper connection charset when connecting to the db server (some rdbms allow the connection charset to be different from database encoding)
use proper collation (http://dev.mysql.com/doc/refman/5.0/en/ ... -sets.html)
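The connection charset point can be sketched in PHP as follows. This is a minimal illustration, not the poster's code: `utf8Dsn()` is a hypothetical helper, the host and database names are placeholders, and it assumes PDO with a MySQL server that supports `utf8mb4` (on older servers the charset name would be `utf8`, as used earlier in the thread).

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks MySQL to use a
// UTF-8 connection charset, independent of the table charset.
function utf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Usage (commented out because it needs a live server and credentials):
// $pdo = new PDO(utf8Dsn('localhost', 'app'), $user, $password);
//
// The mysqli equivalent after connecting would be:
// $mysqli->set_charset('utf8mb4');
```

Setting the charset at connection time keeps the driver's escaping and the server's interpretation of the bytes in agreement, which is the mismatch the reply warns about.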
I've just read through one of those links and am a bit confused by this:
"The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm."
Does this mean I should use
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=ISO10646">
and
<form accept-charset="utf-8">
or
<form accept-charset="ISO10646">
I found a list of character sets at http://www.iana.org/assignments/character-sets but am not sure which one to use. I want the widest character set, so that I never have to change it.
- Maugrim_The_Reaper
- DevNet Master
- Posts: 2704
- Joined: Tue Nov 02, 2004 5:43 am
- Location: Ireland
I know this may be incredibly obvious, but I often come across it on the internet (and believe me it happens; my name contains a non-English character, which makes it obvious when used).
If you have HTML files, CSS, etc., make sure they are actually encoded in UTF-8. It's not enough to use the headers: the actual file must be saved using the UTF-8 character encoding. Most editors should support this by now, so it's easy to check through the configuration options to be certain.
If a file declares UTF-8 in its header or charset but is not actually UTF-8 encoded, the client browser gets confused and starts outputting ???? in place of some characters. Like I said, it's probably obvious, but just in case.
Hotmail for example is incapable of displaying my name correctly...
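One way to check that content really is UTF-8, rather than only labelled as such, is sketched below. `isValidUtf8()` is a hypothetical helper name, the sample strings are made up for illustration, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 byte sequence.
// Assumes the mbstring extension is available.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$savedAsUtf8   = "caf\xC3\xA9"; // "café" with é stored as the two bytes C3 A9
$savedAsLatin1 = "caf\xE9";     // "café" with é stored as the single byte E9

var_dump(isValidUtf8($savedAsUtf8));
var_dump(isValidUtf8($savedAsLatin1));
```

To check a file on disk, the same helper could be applied to the result of `file_get_contents($path)`. Note that a pure ASCII file also passes, since ASCII is a subset of UTF-8.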
Also, htmlentities() will kill your European characters if you call it without a charset. Use
htmlentities($string, ENT_QUOTES, 'UTF-8')
instead. After a while it may get cumbersome to type that out, so you could create a function called utf8entities() or something.
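The wrapper suggested above might look like the following sketch. The name `utf8entities()` comes from the post; the signature and the sample input are assumptions.

```php
<?php
// Thin wrapper so the charset is always passed explicitly.
// ENT_QUOTES converts both double and single quotes.
function utf8entities(string $string): string
{
    return htmlentities($string, ENT_QUOTES, 'UTF-8');
}

// "café <b>" as UTF-8 bytes; é and the angle brackets become entities.
echo utf8entities("caf\xC3\xA9 <b>"), "\n";
```

One caveat: if the input is not valid UTF-8, htmlentities() returns an empty string unless ENT_SUBSTITUTE or ENT_IGNORE is added to the flags, so the wrapper only helps when the data is UTF-8 end to end.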
If you wouldn't mind telling me: how does it kill them? Does it strip them out, or does it just not convert them?
I found that htmlentities() doesn't convert some Russian characters that some of my users have entered. For example, &#1086; is how I receive data entered into a normal text field (I've done no conversion on it; I think the browser does it), and html_entity_decode() does not convert it back to the Russian character. I'm guessing this is because I didn't specify the UTF-8 charset?
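The guess about the charset can be checked with a short sketch. It assumes the value received was a numeric character reference such as `&#1086;` (the reference for Cyrillic "о", U+043E); `decodeUtf8Entities()` is a hypothetical helper name. In PHP versions before 5.4 the default charset for these functions was ISO-8859-1, which cannot represent Cyrillic, so the reference was left undecoded.

```php
<?php
// Hypothetical helper: decode entities with UTF-8 as the target charset,
// so references outside Latin-1 (such as Cyrillic) can be represented.
function decodeUtf8Entities(string $html): string
{
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

$received = '&#1086;';                  // assumed form of the submitted data
$decoded  = decodeUtf8Entities($received);

// Cyrillic "о" is the two bytes D0 BE in UTF-8.
var_dump($decoded === "\xD0\xBE");

// In the other direction htmlentities() leaves Cyrillic untouched,
// because HTML 4.01 defines no named entities for those letters.
var_dump(htmlentities("\xD0\xBE", ENT_QUOTES, 'UTF-8') === "\xD0\xBE");
```

So htmlentities() not converting Russian text is expected when the page itself is served as UTF-8: the characters can be sent as they are.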
It seems that PHP will have Unicode support built in from version 6. It's Sod's law that what you want is always in the next version.
http://www.zend.com/zend/week/php-unicode-design.txt
is an interesting article on the new Unicode support.
- Maugrim_The_Reaper
- DevNet Master
- Posts: 2704
- Joined: Tue Nov 02, 2004 5:43 am
- Location: Ireland
An example of why character encoding matters...hey if it works on Google...
http://shiflett.org/archive/177
- CoderGoblin
- DevNet Resident
- Posts: 1425
- Joined: Tue Mar 16, 2004 10:03 am
- Location: Aachen, Germany