Multilanguage Best Practices / UTF8

jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

I'm designing a system that will initially support only Western character sets but will soon expand to Russian and other scripts.

I was wondering what practices will best future-proof the code and data.

I've identified the following places where the character set often changes and where communication problems can occur:

php - server to browser
php - browser to server
browser - server to browser to flash
browser - flash to browser to server
xml - server to browser
database - server to php
database - php to server

After doing some research I've identified UTF-8 as the best character set to use.

To implement this I'm going to do the following:

only serving pages generated by PHP with the following meta tag
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

only serving XML files with the declaration
<?xml version="1.0" encoding="UTF-8"?>

creating MySQL tables with the charset set to utf8

Besides using these tags, what else can we do to ensure that everything is encoded in UTF-8?

Will all data be transferred in UTF-8, or should my PHP scripts do some other conversion?

And are there any shortcomings to using UTF-8 as my base character set?
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Specify the accept-charset attribute for your forms:
http://www.w3.org/TR/html401/interact/f ... pt-charset

Set the proper connection charset when connecting to the DB server (some RDBMSs allow the connection charset to differ from the database encoding).

Use a proper collation (http://dev.mysql.com/doc/refman/5.0/en/ ... -sets.html).
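
A minimal sketch of the connection-charset point, assuming PDO with the MySQL driver. The helper name `buildUtf8Dsn`, the host, the database name, and the credentials are all hypothetical. The `utf8mb4` default is an assumption that only fits newer MySQL servers; on an older server such as the 5.0 series linked above, `utf8` would be the value to pass.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that pins the connection charset,
// so the client/server link does not fall back to the server default.
function buildUtf8Dsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = buildUtf8Dsn('localhost', 'example_db');

// Assumed usage (needs a real server and credentials, so it is left commented out):
// $pdo = new PDO($dsn, 'example_user', 'example_password');
//
// The mysqli equivalent would be:
// $mysqli->set_charset('utf8mb4');
```

Setting the charset at connection time matters because table and column charsets only describe how data is stored, not how it is interpreted on the wire.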
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

Thanks, I wasn't aware that you could put a charset on a form. I had wondered how you would specify which charset the data was sent to the server in. I thought the browser would take it from the HTML meta tag.
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

I've just read through one of those links and am a bit confused by this:

"The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm."

Does this mean I should use

meta http-equiv="Content-Type" content="text/html; charset=utf-8"
or
meta http-equiv="Content-Type" content="text/html; charset=ISO10646"

and

form accept-charset="utf-8"
or
form accept-charset="ISO10646"

I found a list of charsets at http://www.iana.org/assignments/character-sets but am not sure which one to use. I want the widest charset, so that I never have to change it.
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

So far this is what I've found out through my research:

HTTP header:
Content-Type: text/html; charset=UTF-8

Tag in the XHTML page:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Declaration in the XML document:
<?xml version="1.0" encoding="UTF-8"?>

Attribute in the form tag:
accept-charset="utf-8"
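
The pieces listed above can be put together roughly as follows. This is only a sketch: the function name `renderUtf8Page` and the form field `comment` are made up for illustration, and the markup is kept to the bare minimum.

```php
<?php
// Hypothetical helper: returns a minimal page that declares UTF-8 in the
// meta tag and asks the browser to submit the form as UTF-8.
function renderUtf8Page(string $title): string
{
    // Escape with an explicit charset so multi-byte characters pass through intact.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '<title>' . $safeTitle . '</title></head><body>'
        . '<form method="post" action="" accept-charset="utf-8">'
        . '<input type="text" name="comment" /><input type="submit" />'
        . '</form></body></html>';
}

// The HTTP header has to be sent before any output is written.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$page = renderUtf8Page('Привет');
echo $page;
```

The header and the meta tag say the same thing on purpose: the header covers normal HTTP delivery, and the meta tag covers the case where the page is saved and opened from disk.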
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

I know this may be incredibly obvious, but I often come across it on the internet (and believe me, it happens; my name contains a non-English character, which makes it obvious when it goes wrong).

If you have HTML files, CSS, etc., make sure they are actually encoded in UTF-8. It's not enough to use the headers: the actual file must be saved using the UTF-8 character encoding. Most editors should support this by now, so it's easy to check through the configuration options to be certain.

The effect of a UTF-8 header/charset on a file that is not actually UTF-8 encoded is that the client browser gets confused and starts outputting ???? in place of some characters. Like I said, it's probably obvious, but just in case.

Hotmail for example is incapable of displaying my name correctly...;)
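
One way to check whether a file's bytes really are UTF-8, sketched here with only the PCRE functions so that it does not depend on mbstring. The helper name `isValidUtf8` is hypothetical. Note that a pure-ASCII file also passes, since ASCII is a subset of UTF-8.

```php
<?php
// Hypothetical helper: reports whether a byte string is well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    // With the /u modifier, preg_match() returns false (rather than 0 or 1)
    // when the subject is not valid UTF-8; the empty pattern matches anything else.
    return preg_match('//u', $bytes) === 1;
}

$savedAsUtf8   = "caf\xC3\xA9"; // "café" with é stored as two UTF-8 bytes
$savedAsLatin1 = "caf\xE9";     // "café" with é stored as one ISO-8859-1 byte

var_dump(isValidUtf8($savedAsUtf8));
var_dump(isValidUtf8($savedAsLatin1));
```

In practice the argument would be the result of `file_get_contents()` on the template or stylesheet being checked.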
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

Cool, thanks. I hadn't checked what encoding my files were saved in!
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

Also, htmlentities() will kill your European characters if you call it without a charset.

Use:

Code:

htmlentities($string, ENT_QUOTES, 'UTF-8')

After a while it may get cumbersome to type that out, so you could create a function called utf8entities() or something.
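
A sketch of the wrapper suggested above. The name `utf8entities` comes from the post; the type declarations are an addition and assume PHP 7 or later.

```php
<?php
// Thin wrapper so the flags and charset are stated once instead of at every call.
function utf8entities(string $string): string
{
    return htmlentities($string, ENT_QUOTES, 'UTF-8');
}

$escaped = utf8entities("<b>caf\xC3\xA9</b>"); // the é is written as its UTF-8 bytes
echo $escaped, "\n";
```

Characters that have no named HTML 4.01 entity, such as Cyrillic letters, are left untouched by this call, which is fine as long as the page itself is served as UTF-8.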
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

If you wouldn't mind telling me: how does it kill them? Does it strip them out, or does it not convert them?

I found that htmlentities() doesn't convert some Russian characters that some of my users have entered. For example, о is how I receive data entered into a normal text field (I've done no conversion on it; I think the browser does it), and html_entity_decode() does not convert it back to the Russian character. I'm guessing this is because I didn't specify the UTF-8 charset?
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

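If you call htmlentities() on a European character without specifying the charset, it converts the character to seemingly random strings: each byte of the multi-byte UTF-8 sequence is treated as a separate single-byte character and gets its own entity.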
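
To make the "seemingly random" output concrete, here is a small sketch that passes ISO-8859-1 explicitly, which is what older PHP versions assumed when no charset was given (newer versions default to UTF-8, so the explicit argument is needed to reproduce the effect there).

```php
<?php
$eAcute = "\xC3\xA9"; // "é" encoded as UTF-8: two bytes, 0xC3 and 0xA9

// Wrong charset: each byte is looked up as its own ISO-8859-1 character,
// so 0xC3 becomes "Ã" and 0xA9 becomes "©".
$wrong = htmlentities($eAcute, ENT_QUOTES, 'ISO-8859-1');

// Right charset: the two bytes are read as one character.
$right = htmlentities($eAcute, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Atilde;&copy;
echo $right, "\n"; // &eacute;
```

So nothing is stripped; the bytes are converted, just under the wrong interpretation.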
jaycangel
Forum Newbie
Posts: 7
Joined: Wed Jan 18, 2006 6:46 am
Location: London

Post by jaycangel »

It seems that PHP will have Unicode support built in from version 6. It's sod's law that what you want is always in the next version.

http://www.zend.com/zend/week/php-unicode-design.txt

It's an interesting article on the new Unicode support.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

An example of why character encoding matters...hey if it works on Google...;)

http://shiflett.org/archive/177
kamel
Forum Newbie
Posts: 7
Joined: Mon Jul 31, 2006 3:33 am

Post by kamel »

My problem is that I use htmlentities() with "UTF-8" as the third parameter, but the counterpart function html_entity_decode() doesn't work the same way: PHP throws an error message like this: cannot yet handle MBCS in html_entity_decode()! in: bla bla bla
Can anyone help me with this?
Thanks in advance.
CoderGoblin
DevNet Resident
Posts: 1425
Joined: Tue Mar 16, 2004 10:03 am
Location: Aachen, Germany

Post by CoderGoblin »

Be careful with functions like substr(); you may need to use mb_substr() instead...

It's something I'm just starting to look at myself, as I'm currently investigating an error that I think is caused by it...
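
A short sketch of why substr() is risky on UTF-8 text: it counts bytes, not characters. The helper `utf8Substr` is hypothetical; it uses mb_substr() when mbstring is loaded and otherwise falls back to splitting with a /u regular expression. It assumes a non-negative start and valid UTF-8 input.

```php
<?php
// Hypothetical helper: character-based substring for UTF-8 strings.
function utf8Substr(string $text, int $start, int $length): string
{
    if (function_exists('mb_substr')) {
        return mb_substr($text, $start, $length, 'UTF-8');
    }

    // Fallback without mbstring: split into characters with the /u modifier.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }

    return implode('', array_slice($chars, $start, $length));
}

$word = 'привет'; // six characters, twelve bytes in UTF-8

$byteCut = substr($word, 0, 3);     // three bytes: cuts the second letter in half
$charCut = utf8Substr($word, 0, 3); // three characters

echo strlen($word), "\n"; // 12
echo $charCut, "\n";      // при
```

The same byte-versus-character difference applies to strlen(), strpos(), and similar functions.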
kamel
Forum Newbie
Posts: 7
Joined: Mon Jul 31, 2006 3:33 am

Post by kamel »

The main problem is that I don't have mbstring on the production server, and html_entity_decode() doesn't seem to work because PHP throws an error message when I use UTF-8 as the third parameter.
Can you give me any advice?
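
As far as I know, that warning comes from older PHP releases; on PHP 5 and later, html_entity_decode($string, ENT_QUOTES, 'UTF-8') should handle UTF-8 directly. Where upgrading is not an option, numeric character references can be decoded by hand without mbstring. The sketch below is one possible workaround, not a full replacement: both function names are hypothetical, it covers only numeric references (not named ones such as &eacute;), and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes using chr() only.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replaces &#NNNN; and &#xHHHH; references with UTF-8 text.
function decodeNumericEntities(string $text): string
{
    $result = preg_replace_callback(
        '/&#([xX][0-9a-fA-F]{1,6}|[0-9]{1,7});/',
        function (array $m): string {
            $isHex = strtolower($m[1][0]) === 'x';
            $cp = $isHex ? (int) hexdec(substr($m[1], 1)) : (int) $m[1];

            // Leave out-of-range values and surrogates untouched.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            return codePointToUtf8($cp);
        },
        $text
    );

    return $result === null ? $text : $result;
}

$decoded = decodeNumericEntities('&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;');
echo $decoded, "\n";
```

Named entities would still need a lookup table, for example one built from get_html_translation_table(), which is outside the scope of this sketch.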