Another thread on UTF-8 and other encodings...


alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Another thread on UTF-8 and other encodings...

Post by alex.barylski »

I'm growing curious about multiple language support in my PHP applications...and I contacted AC who was helpful but suggested I start a new thread as this may be helpful to others as well...

As I've always understood it, there were three character encodings from a Windows development perspective...

1) ASCII which is SBCS (Single byte character set)
2) MBCS (Multi byte character set)
3) Unicode (double byte character set)

Although I never had much more of an understanding than that, outside of using the appropriate macros for conversion, etc...

I would just enter a string normally and wrap it in a macro which made the string unicode.

The difference between MBCS and Unicode is that Unicode always uses 2 bytes to represent a character, whereas MBCS will use one or two bytes as necessary...confusing, I know...not to mention horribly awkward when using functions like sizeof() or strlen()

At least with Unicode you know to just divide by two to get the number of actual characters...
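In PHP terms the same pitfall shows up: strlen() counts bytes, not characters. A quick sketch (assuming the script itself is saved as UTF-8):

```php
<?php
// strlen() counts bytes, not characters. Assuming this file
// is saved as UTF-8:
echo strlen("abc");  // 3 -- ASCII: one byte per character
echo strlen("café"); // 5 -- four characters, but "é" takes 2 bytes
```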

I'm curious though, what the heck is UTF-8? I thought it was Unicode...???

I want to support multiple languages in my application, by storing all text strings in globals under per-language directories.

en/translation.inc
de/translation.inc
...

Including the language file as needed and just referencing the GLOBAL's inside my templates, etc...

Note: This isn't exactly how I store language packs, but just for the sake of argument assume this is best practice... :P
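For the sake of illustration, that scheme might look something like this. All names here are hypothetical, and the per-language files are simulated as in-memory arrays so the snippet is self-contained; in practice each would live in en/translation.inc, de/translation.inc, etc.:

```php
<?php
// Sketch of the language-pack scheme described above. The
// per-language packs are simulated as in-memory arrays here;
// in real use you would require "$lang/translation.inc" instead.
$packs = array(
    'en' => array('GREETING' => 'Hello'),
    'de' => array('GREETING' => 'Hallo'),
);
$lang = 'de'; // e.g. chosen from user preferences
$TEXT = isset($packs[$lang]) ? $packs[$lang] : $packs['en']; // fall back to English
echo $TEXT['GREETING']; // templates then reference $TEXT (or a GLOBAL)
```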

When I write on this keyboard...characters are entered in a 1:1 relationship with bytes of memory, but if I save my file, how would I make that file Unicode?

If I read a unicode file for say Chinese, inside notepad, I would get a bunch of gibberish, correct?

While inside a hex editor, I could more accurately see the encodings...however on a Chinese enabled desktop, I would see the proper symbols???

If that's how it works I'm starting to grasp the concept... :P

However string functions...do they work with Unicode?

If I used Unicode language packs (as I don't like the idea of MBCS) how does the browser know how to render them correctly?

Is that what the following HTML does?

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 

<html lang="zh-CN"> 
UTF-8 would need to be changed to Unicode???

1) What would be the advantages of using MBCS over Unicode, aside from the obvious memory savings?

2) String functions would likely be more difficult to deal with using MBCS than Unicode, as with Unicode you can just divide by two.

3) By setting the system locale, do string functions change the way they operate on strings, taking into consideration that some languages use two bytes for characters AND some one byte...

If every language used a different encoding scheme, that would be a lot of code added to string functions: first check the locale and character encoding scheme, then adjust counting, splicing, etc. accordingly...so for me it makes sense to just use Unicode???

Thanks again Ambush ;)
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Re: Another thread on UTF-8 and other encodings...

Post by feyd »

Hockey wrote:I'm curious though, what the heck is UTF-8? I thought it was Unicode...???
It is one encoding form of the Unicode specification. There are many others, ranging from single-byte forms to multi-byte forms; two-, three-, even four-byte forms, all in various flavors.
Hockey wrote:When I write on this keyboard...characters are entered in a 1:1 relationship with bytes of memory, but if I save my file, how would I make that file Unicode?
Your editor chooses whether you are using single-byte or other encodings. The keyboard only sends the interrupt signals for the keys.
Hockey wrote:If I read a unicode file for say Chinese, inside notepad, I would get a bunch of gibberish, correct?
Your version of Notepad is written for single-byte character sets. If memory serves, the international versions of Windows have more robust versions of Notepad that will display the native character sets of the region they are in.
Hockey wrote:While inside a hex editor, I could more accurately see the encodings...however on a Chinese enabled desktop, I would see the proper symbols???
Yes.
Hockey wrote:However string functions...do they work with Unicode?
Technically they do alter the strings, but not correctly.
Hockey wrote:If I used Unicode language packs (as I don't like the idea of MBCS) how does the browser know how to render them correctly?
Based on the headers and other meta information your pages pass. If none is given it will attempt to guess. Sometimes this works, sometimes not.
Hockey wrote:UTF-8 would need to be changed to Unicode???
read above.
Hockey wrote:1) What would be the advantages to using MBSC over Unicode, minus the obvious memory savings?
Not needing to specially process certain characters.
Hockey wrote:2) String functions would be more likely difficult to deal with using MBCS than Unicode, as Unicode is just fixed by dividing by half.
The programming can be simpler for 16 bit characters. But if there are code pages, it is just as complicated.
Hockey wrote:3) By setting the system locale, do string functions change the way they operate on strings, taking into consideration that some langauges use 2 bytes for characters AND one byte...
Yes, there are adjustments made to how the functions work and understand byte strings.
Hockey wrote:If every language used a different encoding scheme, that would be alot of code added to string functions, first the check the locale, characert encoding scheme and adjust counting, splicing, etc accordingly...so for me it makes sense to just use Unicode???
Again, read above.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Another thread on UTF-8 and other encodings...

Post by alex.barylski »

Not needing to specially process certain characters
How does MBCS accommodate that?

I would figure it would require more special-character intervention?

So could I use MBCS in my web pages or does it only support Unicode?
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

OK, let's set this straight. This is a complex subject and it took me a couple of days to get my head around it, so for the benefit of everyone else, here's the score.

UTF-8 is the best encoding for anyone speaking a European language (yes, that includes American English). So as long as you aren't making sites in Japan you don't need to worry about UTF-16 and the others.

Personally I found the whole principle of Unicode and UTF-8 very difficult to understand until I read Wikipedia's UTF-8 article. Of course this depends on you knowing a bit about computer science, and binary in particular. So go and read that article now. That's right, stop reading this post for now and go read that!

UTF-8 savvy now? If you didn't understand it you should probably try reading it again; otherwise we'll have to leave that for now. OK, the next thing you need to read is this great post on SitePoint by Harry Fuecks. Once again, go read!

Hopefully things should be beginning to come clear. So here's a checklist for supporting UTF-8 successfully in PHP:
  • Make sure all your source files are saved as UTF-8 (not all editors will do this)
  • Make sure your server is sending a response header with Content-Type saying the character encoding is UTF-8. You can do this in one of three ways:

    1. Specific to individual PHP files:

    Code: Select all

    header('Content-Type: text/html; charset=UTF-8');
    2. Specific to all PHP files (in php.ini):

    Code: Select all

    default_charset = "UTF-8"
    3. Server- or directory-wide (in the Apache config or .htaccess):

    Code: Select all

    AddDefaultCharset utf-8
    JavaScript natively supports Unicode so it may be beneficial (esp. if you are doing AJAX stuff) to use option 3.
  • In addition to those you should specify a meta tag:

    Code: Select all

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    In case the file is loaded without an HTTP interaction, like with file:// for instance, and if you use a form:

    Code: Select all

    <form action="." method="post" accept-charset="UTF-8">
    Because some browsers may still want to send forms in a different encoding from the one the document was served in (Lord knows why).
  • Next you need to be aware of the limitations of the existing PHP functions. Sometimes they are OK to use, other times they are not. Have a read of this and this. The reason that not all functions have a problem is that:
    Ambush Commander wrote:PHP has a nice feature where it treats all strings as binary. This means that as long as you are looking for specific characters, this will never be a problem: you will never confuse a byte of a multibyte sequence with a full character.
    I am sure you can imagine that all the problems with the standard PHP functions, and coming up with alternatives where you can't use them, would be a major headache. The solution comes in the form of a library. PHP UTF-8 seems great and is what I am currently adapting to be OO. But even with this library you will benefit from knowing when a string function relies on multi-byte capabilities and when it does not, to minimize the number of changes you have to make to your code and to maintain some performance.
  • Finally: because of the way UTF-8 works, a hacker (or screwy software) can send you "ill-formed" UTF-8. This can mess up any UTF-8 processing and can lead to security vulnerabilities. To solve this you must validate all input. The PHP UTF-8 library comes with an is_valid() function which you can use to test that input is correctly formed. Ambush Commander instead opted for an implementation which simply removes offending characters without reporting an error, which is equally acceptable (I'm pretty sure you can achieve that in PHP UTF-8 with a couple of extra calls but I haven't got to that stage yet).

    So you will need to perform these validations on all input.
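To make that limitation concrete, here is a small sketch using PHP's mbstring extension (one alternative to the PHP UTF-8 library mentioned above; this assumes mbstring is enabled):

```php
<?php
// Byte-oriented functions vs. their multi-byte counterparts,
// using the mbstring extension (assumed to be enabled).
mb_internal_encoding('UTF-8');

$s = "Grüße"; // 5 characters, 7 bytes in UTF-8

echo strlen($s);          // 7 -- counts bytes, wrong for display logic
echo mb_strlen($s);       // 5 -- counts characters
echo mb_substr($s, 0, 3); // "Grü" -- substr() would slice "ü" in half
```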
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

One to add sorry. Set your database to UTF-8 as well :)
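For MySQL, for example, that means both the table definitions and the connection encoding (table and column names here are made up for the example):

```sql
-- Illustrative MySQL statements; names are hypothetical.
CREATE TABLE translation (
    id INT PRIMARY KEY,
    body TEXT
) DEFAULT CHARACTER SET utf8;

-- And once per connection, before any other queries:
SET NAMES 'utf8';
```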
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

ole wrote:Make sure all your source codes are saved as UTF-8 (not all editors will do this)
By source files I'm guessing you mean HTML? As I can't see how PHP source being saved as UTF-8 makes a difference?

AC also told me that UTF-8 is backwards compatible with ASCII...so using ASCII I get away with everything the way it has been for over a decade... :P

If I choose to add foreign language support...say Chinese...I would only need to make sure the language files were UTF-8...

HTML is HTML and JS is JS...if you changed those to two-byte character encodings...would that choke the parser???
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Hockey wrote:By source codes I'm guessing you mean HTML? as I can't see how PHP source being saved in UTF-8 makes a difference?
It does if you have typed UTF-8 characters such as umlauts in your source code... they will be output as UTF-8... if you had the encoding set to ISO-8859-1 they would be output as that, but render incorrectly if the browser thinks they're UTF-8.

Hockey wrote:AC also told me that UTF-8 is backwards compatible with ASCII...so using ASCII I get away with everything the way it has been for over a decade... :P
True. The characters at the lower end of the UTF-8 range are ASCII bytes. Here's the breakdown.

Code: Select all

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
  
   // From http://w3.org/International/questions/q ... utf-8.html
   return preg_match('%^(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
   )*$%xs', $string);
  
} // function is_utf8
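As an aside (not part of the function above): if the mbstring extension is available, it can do the same validity check without the regex:

```php
<?php
// Equivalent UTF-8 validity check via the mbstring extension
// (assumes mbstring is enabled).
var_dump(mb_check_encoding("caf\xC3\xA9", 'UTF-8')); // bool(true)  -- valid 2-byte "é"
var_dump(mb_check_encoding("caf\xE9", 'UTF-8'));     // bool(false) -- stray ISO-8859-1 byte
```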
Hockey wrote:If I choose to add foriegn language support...say Chinese...I would only need to make sure the language files were UTF-8...
Yes, if the encoding you are using is UTF-8, everything must match up. Just try saving a file with one encoding in your editor, then re-opening it with a different encoding. Anything except ASCII will almost always turn to gobbledygook.
Hockey wrote:HTML is HTML and JS is JS...if you changed that too two byte character encodings...would that choke the parser???
If you use JS to generate HTML, it will be sent in the character encoding used in the JS... that means if your page is set to ISO-8859-1 the characters will display wrongly.

EDIT | I'm getting really bad for typos lately. I guess drinking beer all day long doesn't help.
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

I was just looking at http://www.sitepoint.com/blogs/2006/08/ ... de-slides/ again. I've read it through about three times already and I'm still learning something new each time. If you read it you will see I've actually missed plenty of things out of my checklist.

Hockey: read and digest all the links I've posted before further discussion.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

It does if you have typed UTF-8 characters such as Umlauts in your source code... they will be output as UTF-8... if you had the encoding set to iso-8859-1 they would be output as that, but render incorrectly where the browser thinks they're UTF-8.
That's what I thought...but if I stick to just plain PHP keywords, etc., and keep the language files external...I shouldn't have a problem???
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

ole wrote:I was just looking at http://www.sitepoint.com/blogs/2006/08/ ... de-slides/ again, I've read it through like 3 times already but still learning something new each time. If you read that you will see i've actually missed plenty of things out of my checklist.

Hockey: Read and digest all the links I've posted before discussion.
I am :P
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Hockey wrote:Thats what I thought...but if I stick to just plain PHP keywords, etc and language fields external...I shouldn't have a problem???
No problem at all.
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

Ole, thanks for the write-up and great links. Had read a bit about it in the past but didn't completely understand it. A bit closer now.

Funny thing is I just found out one of my editors (HTML-Kit) saves files as ISO-8859-1. Now I understand why, when I save a file as HTML with a meta charset=UTF-8, some characters are displayed as ??. That's why I used to encode every "strange" character with HTML entities like &#233; etc. But in fact I should set my editor to save the file as UTF-8, or change the meta charset to ISO-8859-1. Either way the characters will display fine again.

Makes me think though. If an HTML page uses the UTF-8 meta tag, and a client or another developer opens the file in an editor and saves it as ISO-8859-1, probably unknowingly, that could lead to problems displaying the correct characters. The other way around is also possible: an HTML file with an ISO-8859-1 meta tag saved as UTF-8 will display gibberish too.

Cool, now I understand better why this happens. I have had clients ask me about gibberish characters in their documents (shown as webpages). Normally I would advise them to encode the characters manually with HTML entities (&#233; etc.). But in fact the correct save setting in the text editor they open the file with, combined with a correct/matching meta charset tag, would solve their problem as well.

Hmm, difficult...
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

Well, you can htmlentity things; it's actually pretty good advice because it's always going to work. But there isn't an HTML entity for everything, so yes, you should still serve your documents as UTF-8. I think Hockey came to the conclusion that if no special (non-ASCII) characters appear in your source itself you don't need to save it as UTF-8 (ISO-8859-1 is fine), but you must always serve it as UTF-8 (via meta and response headers) if that is the encoding you intend to use.
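To illustrate both points, a quick sketch (assuming the source file is saved as UTF-8):

```php
<?php
// htmlentities() must be told the input encoding; otherwise it
// assumes ISO-8859-1 and mangles UTF-8 input. Assumes this file
// is saved as UTF-8.
echo htmlentities("café", ENT_QUOTES, 'UTF-8'); // caf&eacute;

// But there is no named entity for everything: CJK characters,
// for example, pass through unchanged.
echo htmlentities("中", ENT_QUOTES, 'UTF-8');   // 中
```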

Let me explain specifically why those question marks appear. What is happening is that somehow ISO-8859-1 characters are getting into your UTF-8 page, or the other way round. This can happen if people are entering data into a database that isn't encoded the same as the pages being served, or if the form for data entry is encoded differently from the pages which will ultimately display the data being entered. Basically: ensure your DB is UTF-8, ensure your pages are being served as UTF-8, and you're doing well. Then you've got all the invalid-character testing and UTF-8-aware processing functions.
Hmm, difficult...
It really is.
Harry Fuecks (roughly) wrote:A lot of the complexities of this may be reduced when PHP 6 is released, but the issues we are faced with now aren't going to magically vanish
...that gave me a bit of hope. I was worrying that doing all this complex stuff would be a bit of a waste of time if PHP 6 was going to make it all unnecessary. Also, if I get all this UTF-8 stuff right I can take pride in the fact that I've joined a rather small club of people who can do it.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Hockey, here's another word of advice: stop using the term 'Unicode'. As a programmer, you're always going to use one of the Unicode encodings, such as UTF-*, UCS-2, etc., not 'Unicode' in general. So use specific encoding names. It will help you not confuse yourself.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

Weirdan wrote:Hockey, here's another word of advice: stop using the term 'Unicode'. As a programmer, you're always going to use one of the Unicode encodings, such as UTF-*, UCS-2, etc., not 'Unicode' in general. So use specific encoding names. It will help you not confuse yourself.
I'm coming to that conclusion slowly :P

Unicode is just a generalization...for an entire collection of multi-language encodings...

What confuses me is why there's a need for UTF-16 when UTF-8, as I understand it, using the 8th bit (allowing 256 characters), would suffice for all languages? Or does UTF-16 come in for Asian languages where the alphabet contains more than 256 characters?

I'm not sure I quite understand either how UTF-8 is backwards compatible with UTF-16 or rather the other way around, but yeah...???

So MBCS is *not* part of the Unicode specification...??? How would one use that instead of Unicode???