Another thread on UTF-8 and other encodings...
Posted: Sat Aug 26, 2006 3:57 pm
I'm growing curious about multiple language support in my PHP applications...and I contacted AC who was helpful but suggested I start a new thread as this may be helpful to others as well...
As I've always understood it, there were three character encodings from a Windows development perspective...
1) ASCII which is SBCS (Single byte character set)
2) MBCS (Multi byte character set)
3) Unicode (double byte character set)
Although I never had much more of an understanding than that, outside of using the appropriate macros for conversion, etc...
I would just enter a string normally and wrap it in a macro which made the string Unicode.
The difference between MBCS and Unicode is that Unicode always uses 2 bytes to represent a character, whereas MBCS will use one or two bytes as necessary... confusing, I know... not to mention horribly awkward when using functions like sizeof() or strlen().
At least with Unicode you know to just divide by two to get the number of actual characters...
I'm curious though, what the heck is UTF-8? I thought it was Unicode...???
I want to support multiple languages in my application by storing all text strings in globals under per-language directories.
en/translation.inc
de/translation.inc
...
Including the language file as needed and just referencing the globals inside my templates, etc...
Note: this isn't exactly how I store language packs, but for the sake of argument assume this is best practice...
When I type on this keyboard, characters are entered in a 1:1 relationship with bytes of memory, but if I save my file, how would I make that file Unicode?
If I opened a Unicode file for, say, Chinese in Notepad, I would get a bunch of gibberish, correct?
Inside a hex editor I could more accurately see the encoded bytes... but on a Chinese-enabled desktop, would I see the proper symbols?
If that's how it works, I'm starting to grasp the concept...
But what about string functions... do they work with Unicode?
If I used Unicode language packs (as I don't like the idea of MBCS), how does the browser know how to render them correctly?
Is that what the following HTML does? And would UTF-8 need to be changed to Unicode???
Code:
<html lang="zh-CN">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
1) What would be the advantages of using MBCS over Unicode, aside from the obvious memory savings?
2) String functions would likely be more difficult to deal with under MBCS than Unicode, since with Unicode you can just divide by two.
3) By setting the system locale, do string functions change the way they operate on strings, taking into consideration that some languages use two bytes for some characters and one byte for others?
If every language used a different encoding scheme, that would be a lot of code added to string functions: first checking the locale and character encoding scheme, then adjusting counting, splicing, etc. accordingly... so for me it makes sense to just use Unicode???
Thanks again, Ambush