Character encoding and localization
Posted: Sat Nov 08, 2008 7:12 pm
Character encodings. I understand them for the most part. ASCII sucks because it only defines 128 characters (7 bits each, and even the 8-bit extensions stop at 256), which can't cover languages like Japanese. For this reason multibyte character encodings have been introduced (e.g. UTF-8 and UTF-16, which both encode the Unicode character set), where a single character can take several bytes (up to 4 in UTF-8).
Most of these reserve the range 0-127 to accommodate ASCII and allow backwards compatibility.
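To make the byte counts concrete, here is a minimal sketch. It is in Python rather than PHP, purely so the bytes are easy to inspect. The helper name `utf8_len` and the sample characters are made up for illustration. It shows how many bytes UTF-8 spends on a few characters, and that plain ASCII text comes out byte-for-byte identical in UTF-8.

```python
# Sketch: bytes per character in UTF-8, and ASCII compatibility.
# utf8_len and the sample list are illustrative, not from any library.

def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

# Latin letter, accented letter, Cyrillic letter, CJK ideograph, emoji.
samples = ["A", "\u00e9", "\u0416", "\u65e5", "\U0001F600"]

for ch in samples:
    # Print only the code point so the output is safe on any console.
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s) in UTF-8")

# Text made only of characters 0-127 has the same bytes in ASCII and UTF-8.
text = "Frank"
assert text.encode("ascii") == text.encode("utf-8")
```

The characters run from 1 byte (the ASCII letter) up to 4 bytes (the emoji), which is the maximum UTF-8 uses.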
Do I dictate the character encoding a browser uses with a charset declaration in a <meta> tag (or in the Content-Type header)?
I'm going to read this shortly: http://www.w3.org/TR/REC-html40/charset.html
My concern is: how does the PHP script know which character set to assume when the data is POSTed in some other encoding? Wouldn't a clash of charsets cause weird issues and scrambled data, broken sorting, etc.?
Say a user's browser is set to a different charset and their native language is written in Cyrillic. When they enter the name "Frank" in their native language and the PHP script assumes UTF-8, how does that data get converted on the server side? I assume PHP offers charset conversion, but if you can force the character set then you might just assume UTF-8, correct?
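Here is a rough sketch of the clash I have in mind. It is again in Python, only so the bytes can be inspected. The helper `decode_or_none` is hypothetical, and windows-1251 (`cp1251`) is just an assumed example of a legacy Cyrillic charset a browser might use. The same bytes are read three ways: with the right charset, as UTF-8, and as Latin-1.

```python
# Sketch: what happens when the server guesses the wrong charset.
# decode_or_none is a made-up helper; cp1251 is an assumed client charset.
from typing import Optional

def decode_or_none(raw: bytes, charset: str) -> Optional[str]:
    """Decode raw bytes with the given charset, or return None if invalid."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None

# "Frank" written in Cyrillic letters.
name = "\u0424\u0440\u0430\u043d\u043a"

# Bytes as a browser using windows-1251 would submit them: one byte per letter.
sent = name.encode("cp1251")

as_cp1251 = decode_or_none(sent, "cp1251")   # correct guess: original name
as_utf8 = decode_or_none(sent, "utf-8")      # wrong guess: not valid UTF-8
as_latin1 = decode_or_none(sent, "latin-1")  # wrong guess: decodes to garbage

# ascii() keeps the output printable on any console.
print(len(sent), ascii(as_cp1251), as_utf8, ascii(as_latin1))
```

Treating the bytes as UTF-8 fails outright, while treating them as Latin-1 "works" but yields five unrelated accented Latin letters. That silent scrambling is the case that worries me. In PHP the conversion step itself would presumably go through `mb_convert_encoding` or `iconv`.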
My problem is mostly that I am trying to accommodate an international market all at once, and I want to understand the problems involved so I can address them.
Should I allow my users to select their character set as part of their user profile data, much like language and locale? They would then submit data to the PHP script, which would check their charset, convert to whatever is used on the server side (UTF-8: MySQL tables, etc.) and convert again when sending output back to the screen, according to the user's charset. I assume you can retrieve the charset from the browser automagically, but is the idea the same regarding conversion on the backend, or does PHP do this for you when you properly initialize the system for internationalization?
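The convert-in, convert-out flow described above could be sketched roughly like this. It is in Python for illustration only. The function names, the `INTERNAL` constant and the per-user charset value are all assumptions and not part of any real framework. A PHP version would presumably lean on `mb_convert_encoding` or `iconv` for the same steps.

```python
# Sketch: normalise to one internal encoding at the edges of the application.
# All names here (INTERNAL, to_internal, to_storage, to_user) are hypothetical.

INTERNAL = "utf-8"  # assumed server-side encoding (e.g. UTF-8 MySQL tables)

def to_internal(raw: bytes, user_charset: str) -> str:
    """Decode submitted bytes using the charset stored in the user profile."""
    return raw.decode(user_charset)

def to_storage(text: str) -> bytes:
    """Encode text in the single encoding used on the server side."""
    return text.encode(INTERNAL)

def to_user(text: str, user_charset: str) -> bytes:
    """Encode text for output in the user's charset.

    Characters the charset cannot represent are replaced with "?".
    """
    return text.encode(user_charset, errors="replace")

# A user whose profile says cp1251 submits "Frank" in Cyrillic letters.
name = "\u0424\u0440\u0430\u043d\u043a"
posted = name.encode("cp1251")

text = to_internal(posted, "cp1251")
stored = to_storage(text)       # 10 bytes: each Cyrillic letter is 2 in UTF-8
echoed = to_user(text, "cp1251")

print(len(posted), len(stored), echoed == posted)
```

One thing the sketch exposes: output conversion is lossy whenever stored text contains characters the user's charset lacks. For example, `to_user("\u65e5\u672c", "cp1251")` comes back as `b"??"`.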