Character encoding and localization

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Character encoding and localization

Post by alex.barylski »

Character encodings. I understand them for the most part. ASCII sucks because it's limited to 128 characters (7 bits, stored one byte per character), which can't cover languages like Japanese that need thousands of characters and therefore more than one byte each. For this reason multi-byte character encodings have been introduced (e.g. UTF-8, UTF-16, etc.?).

Most of these reserve the range 0-127 to accommodate ASCII and allow backwards compatibility.

Do I dictate the character encoding a browser uses with the charset declared in a <meta> tag?

I'm going to read this shortly: http://www.w3.org/TR/REC-html40/charset.html

My concern is, how does the PHP script know which character set to assume when the data is POSTed in some strange encoding? I mean wouldn't a clash of charsets cause weird issues and scrambling of data? Sorting, etc???

Say a user's browser is set to a different charset and their native language is written in Cyrillic.

When they enter the name "Frank" in their native language and the PHP script assumes UTF-8, how does that data get converted on the server side? I assume PHP offers charset conversion, but if you can force the character set then you might just assume UTF-8, correct?

My problem is based mostly on the fact that I am trying to accommodate an international market all at once, and I want to understand the problems involved so I can address them.

Should I allow my users to select their character set as part of their user profile data, much like language and locale? Then they would submit data to the PHP script, which would check their charset, convert to whatever is used on the server side (UTF-8: MySQL tables, etc.) and convert again when sending back to the screen, according to the user's charset? I assume you can retrieve the charset from the browser automagically, but is the idea the same regarding conversion on the backend, or is this done for you by PHP when you properly initialize the system for internationalization?
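
For what it's worth, the convert-on-input idea could be sketched roughly like this. It assumes the mbstring extension is loaded and that the user's charset is already known (Windows-1251 is used here as an example Cyrillic encoding, e.g. taken from a profile setting). toUtf8() is a hypothetical helper name, not something PHP ships with.

```php
<?php
// Hypothetical helper: convert input from a known source charset to UTF-8.
// Assumes the mbstring extension is available.
function toUtf8(string $input, string $sourceCharset): string
{
    // Nothing to do if the data is already declared as UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $sourceCharset);
}

// "Frank" written in Cyrillic, as Windows-1251 bytes: one byte per letter.
$posted = "\xD4\xF0\xE0\xED\xEA";
$utf8   = toUtf8($posted, 'Windows-1251');

// The same five letters take two bytes each in UTF-8.
echo strlen($posted), " bytes in, ", strlen($utf8), " bytes out\n";
```

Converting once at the boundary and keeping everything UTF-8 internally means the rest of the script never has to care what the browser sent.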
alex.barylski

Re: Character encoding and localization

Post by alex.barylski »

Turns out that Thailand uses the Buddhist calendar, which apparently has 13 months in leap years...

My interface uses (what I thought were) standard month/day/year drop-downs for selecting dates...I guess that interface wouldn't port over to the Buddhist calendar very well.

I wonder if JS date pickers are smart enough to handle those kinds of cultural caveats?

Perhaps I should allow manual entry via a single TEXT field...but then how do I convert into a timestamp (if even possible) and format for re-display to someone in the Western world?
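
A rough sketch of the store-a-timestamp, format-on-display idea, using only PHP's built-in DateTimeImmutable. It rests on an assumption worth checking: that everyday dates in Thailand follow the Thai solar calendar, which keeps the Gregorian months and days and only numbers the years differently (Buddhist Era = Gregorian year + 543). formatBuddhistEra() and the input format are made up for illustration.

```php
<?php
// Hypothetical helper: show a stored Unix timestamp with a Buddhist Era year.
// Assumes the Thai solar calendar, where BE year = Gregorian year + 543.
function formatBuddhistEra(int $timestamp): string
{
    $date = (new DateTimeImmutable('@' . $timestamp))
        ->setTimezone(new DateTimeZone('UTC'));
    $beYear = (int) $date->format('Y') + 543;
    return $date->format('j/n/') . $beYear;
}

// Parse manual input in one fixed, documented format and store a timestamp.
$parsed = DateTimeImmutable::createFromFormat(
    '!Y-m-d',
    '2009-04-13',
    new DateTimeZone('UTC')
);
$timestamp = $parsed->getTimestamp();

echo $parsed->format('m/d/Y'), "\n";       // Western display: 04/13/2009
echo formatBuddhistEra($timestamp), "\n";  // Buddhist Era display: 13/4/2552
```

The stored value stays calendar-neutral; only the display step knows about eras or field order.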
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Character encoding and localization

Post by Eran »

I think you are overcomplicating things - UTF-8 covers most languages (at least, those relevant for digital information...), so you can use it across the board and not worry about 'strange' encodings. If you serve your HTML with UTF-8 headers, your forms will be submitted as UTF-8 (unless someone manually switches encoding - but then they shouldn't be able to read your site properly), and you can store that data as UTF-8 in your database.

You can also use mb_detect_encoding() to check the input encoding, and convert to UTF-8 if it differs.
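
A minimal sketch of that check-then-convert step, assuming the mbstring extension. ensureUtf8() and the ISO-8859-1 fallback are assumptions for illustration. Note that mb_detect_encoding() can only guess from the bytes, so validating with mb_check_encoding() first is usually the more reliable test.

```php
<?php
// Hypothetical helper: make sure a posted string ends up as valid UTF-8.
// $fallback is an assumed legacy charset to try when validation fails.
function ensureUtf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    // mb_check_encoding() validates the bytes; it does not guess.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // In strict mode mb_detect_encoding() only returns a candidate the
    // bytes are valid in. This is a heuristic, not proof.
    $detected = mb_detect_encoding($input, ['UTF-8', $fallback], true);
    return mb_convert_encoding($input, 'UTF-8', $detected ?: $fallback);
}

// "café" with the é as a single ISO-8859-1 byte (0xE9), invalid as UTF-8.
echo bin2hex(ensureUtf8("caf\xE9")), "\n";
```

Input that is already valid UTF-8 passes through untouched, so the helper is safe to call on every request.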
alex.barylski

Re: Character encoding and localization

Post by alex.barylski »

Everything is complex when you don't understand it...I shamelessly admit...I have a very limited understanding of what it takes to internationalize an application...

I just finished reading a few more articles and it really helped me understand some concepts better.

UTF-8 is a character encoding: a scheme for turning code points into bytes, loosely like a compression or encryption encoding
Unicode is the character set which assigns unique code points to every character of every language

When I made those two realizations (which I assume are correct) it really helped, especially the former.

I will use UTF-8, as it's what makes the most sense for an English speaker trying to support other languages. As I understand it, technically speaking it's identical to ASCII up until code point 127, which means that English text is stored as a single byte per character. Seeing as I plan to only support local clients first, this makes sense.

One thing I don't understand, though, is how the code points are stored.

I mean, if U+0041 is code point 65 (hex 41), which is ASCII A, I assume the U+ code point notation is strictly a convention?

If I passed a UTF-8 function a string of data like:

0x41 0x42 0x43

I should be returned ABC
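
A quick way to see the code point versus byte distinction, assuming PHP 7.2+ with mbstring for mb_ord() and mb_chr(). The letters ABC are decimal 65 66 67, which is 0x41 0x42 0x43 in hex.

```php
<?php
// Assumes PHP 7.2+ with the mbstring extension for mb_ord()/mb_chr().
$bytes = "\x41\x42\x43";               // three bytes, decimal 65 66 67
echo $bytes, "\n";                     // ABC

$codePoint = mb_ord('A', 'UTF-8');     // 65, written U+0041 by convention
$euro      = mb_chr(0x20AC, 'UTF-8');  // U+20AC EURO SIGN

printf("U+%04X\n", $codePoint);        // U+0041
echo bin2hex($euro), "\n";             // e282ac: one code point, three bytes
```

Below 128 the code point and the stored byte are the same number; above that, the code point is spread over several bytes.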

How does UTF-8 know when to use multiple bytes though, once past code point 127? I suppose it would work similar to Huffman compression...but man it's been a looooong while since I studied compression algorithms and I've practically forgotten everything I understood so clearly 10 years ago. :P

Basically I am curious to know how the encoding works at a low level...given a string of bytes...

I know that if a byte is greater than 127 it indicates a multi-byte code point, but how then does the decoder determine the number of bytes to consume for a given code point?
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
Zing! There is the answer I was looking for :)
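
To make the low-level part concrete: as far as I can tell, the number of bytes is signalled by the high bits of the first byte of each sequence, and every following byte starts with the bits 10. Here is a sketch of that, with made-up function names. It only looks at lead bytes, so it does not check continuation bytes or reject overlong forms the way a real validator would.

```php
<?php
// Return how many bytes a UTF-8 sequence occupies, judging only by its
// first byte. Returns 0 for a byte that cannot start a sequence.
function utf8SequenceLength(int $leadByte): int
{
    if ($leadByte < 0x80) { return 1; }            // 0xxxxxxx: ASCII
    if (($leadByte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx continuation byte, or an invalid lead byte
}

// Split a UTF-8 string into characters by walking the lead bytes.
function utf8Chars(string $s): array
{
    $chars = [];
    $i = 0;
    $n = strlen($s);
    while ($i < $n) {
        $len = utf8SequenceLength(ord($s[$i]));
        if ($len === 0 || $i + $len > $n) {
            throw new InvalidArgumentException("Malformed UTF-8 at byte $i");
        }
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

// "A", "é" (2 bytes) and the euro sign (3 bytes), shown as hex per character.
print_r(array_map('bin2hex', utf8Chars("A\xC3\xA9\xE2\x82\xAC")));
```

So unlike Huffman coding there is no table to look up: the length is readable straight off the first byte.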
alex.barylski

Re: Character encoding and localization

Post by alex.barylski »

I still have the basic question:

What happens if someone posts in UCS-2/UTF-16 but my server operates on UTF-8 -- how do I detect that and convert from one encoding to the other before using the data?

EDIT: http://ca.php.net/manual/en/function.iconv.php
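
A minimal sketch of what that could look like with iconv(), assuming the iconv extension is enabled and that the byte order is either known or signalled by a byte order mark. utf16ToUtf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert a UTF-16 payload to UTF-8 with iconv().
// Assumes the byte order is known or signalled by a BOM.
function utf16ToUtf8(string $data, string $defaultOrder = 'UTF-16LE'): string
{
    $bom = substr($data, 0, 2);
    if ($bom === "\xFF\xFE") {          // little-endian byte order mark
        $from = 'UTF-16LE';
        $data = substr($data, 2);
    } elseif ($bom === "\xFE\xFF") {    // big-endian byte order mark
        $from = 'UTF-16BE';
        $data = substr($data, 2);
    } else {
        $from = $defaultOrder;          // no BOM: fall back to the assumption
    }

    $converted = iconv($from, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException('Input is not valid ' . $from);
    }
    return $converted;
}

// "Hi" as little-endian UTF-16 with a BOM.
echo utf16ToUtf8("\xFF\xFEH\x00i\x00"), "\n";
```

Without a BOM or some out-of-band declaration there is no reliable way to tell the two byte orders apart, which is why the fallback is an explicit parameter.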