"...when UTF-8, as I understand it, uses the 8th bit (allowing 256 characters), wouldn't that suffice for all languages?"
UTF-8 isn't a single-byte encoding. Both UTF-8 and UTF-16 may use up to 4 bytes per character. The difference between them (simplifying here to ease comprehension) is that ASCII characters in UTF-8 are represented using 1 byte, while in UTF-16 they require 2 bytes. And 256 characters would not suffice for all languages, even if you don't expect characters from different alphabets to be mixed in a single string.
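A quick way to see that ASCII width difference (a Python sketch; the little-endian UTF-16 form is used so the count doesn't include a BOM):

```python
# An ASCII character takes 1 byte in UTF-8 but 2 bytes in UTF-16.
a_utf8 = "A".encode("utf-8")
a_utf16 = "A".encode("utf-16-le")  # "-le" avoids counting a BOM
print(len(a_utf8))   # 1
print(len(a_utf16))  # 2
```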
Another thread on UTF-8 and other encodings...
Moderator: General Moderators
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
"UTF-8 isn't a single byte encoding. Both UTF-8 and UTF-16 may use up to 4 bytes per character."
What???? Now I'm formally confused... so UTF-8 and UTF-16 *may* use up to 4 bytes per character...
WTF... how is it decided? Who determines how many bytes a character uses? Why would you use 4 bytes per character? Obviously to give a MUCH greater range of possible characters, but still...
I save my language files as UTF-8... typing on an ASCII keyboard... so the file is saved as single-byte characters... but that is still valid UTF-8 as I understand it, since UTF-8 is backwards compatible with ASCII...
How would I get UTF-8 to use more than one byte per character???
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
UTF-8 uses one to four bytes (strictly, octets) per character, depending on the Unicode symbol. Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F). Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in other planes of Unicode.
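The four length classes above can be checked directly (a Python sketch; the sample characters are one arbitrary pick from each Unicode range mentioned):

```python
# One example character from each UTF-8 length class:
samples = {
    "A": 1,   # U+0041, US-ASCII (U+0000..U+007F)
    "é": 2,   # U+00E9, Latin letter with diacritic (U+0080..U+07FF)
    "€": 3,   # U+20AC, rest of the Basic Multilingual Plane
    "𝄞": 4,   # U+1D11E, outside the BMP (musical symbol G clef)
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
```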
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
The Windows XP Notepad is UTF-8 aware (well, at least the versions I have seen), that is, it can read and write UTF-8 encoded text files (and render most characters correctly). Most modern text editors are UTF-8 aware; however, not all will display other encodings, and some don't render all the characters available in UTF-8 (they just show them as boxes []).
Another thing you may encounter is the Byte-Order Mark (BOM). PHP does not strip the BOM (it gets sent as output before your script runs), thus you should not save your text files with a BOM.
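The UTF-8 BOM is just the three bytes EF BB BF prepended to the file; detecting and stripping it looks like this (a Python sketch, with a hypothetical PHP snippet as the file body):

```python
import codecs

# A UTF-8 file saved "with BOM" starts with the bytes EF BB BF.
data = codecs.BOM_UTF8 + b"<?php echo 'hello'; ?>"
print(data[:3].hex())  # efbbbf

# Strip the BOM before further processing:
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
print(data.decode("utf-8"))
```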
To display characters of different encodings correctly, you also need the corresponding code pages. The editor that happens to have the most useful code pages is EmEditor.
The Chinese, Japanese and Korean (CJK) languages are encoded with different encodings due to political issues (yes, you get politics in character encodings). Furthermore, Chinese and Japanese each have several encodings (again, probably due to politics, but I'm not sure). For example, the GB2312 encoding is adopted by the PRC, while the Big5 encoding is usually found in HK and Taiwan, etc.
In addition, it can actually convert to and from different encodings correctly (so far). That is, if someone sends you Chinese encoded using GB2312 (which is the main encoding used in the PRC), you can open this in EmEditor and then save it as UTF-8 correctly.
EmEditor supports Unicode little endian, Unicode big endian, UTF-8, UTF-7, Baltic, Central European, Chinese Simplified, Chinese Traditional, Cyrillic, Greek, Japanese (Shift-JIS), Japanese (JIS), Japanese (EUC), Korean, Thai, Turkish, Vietnamese, Western European, and all other encodings available in Windows.
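The conversion an editor performs here is a decode/re-encode round trip; a minimal Python sketch of the same GB2312-to-UTF-8 transcoding:

```python
# Transcoding GB2312 bytes to UTF-8: decode to Unicode text first,
# then re-encode. This is what re-saving a GB2312 file as UTF-8 does.
gb2312_bytes = "中文".encode("gb2312")  # as received from a PRC sender
text = gb2312_bytes.decode("gb2312")    # bytes -> Unicode text
utf8_bytes = text.encode("utf-8")       # Unicode text -> UTF-8 bytes
print(utf8_bytes.decode("utf-8"))       # 中文
```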
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
PHP i18n charsets wrote:When editing content outside of a (decent) browser, make sure to use an editor with UTF-8 support (i.e. not notepad!)
wei wrote:Another thing you may encounter is the Byte-Order Mark (BOM). PHP does not strip the BOM, thus you should not save your text files with a BOM.
In [url=http://en.wikipedia.org/wiki/Byte_Order_Mark]an article about BOM[/url], Wikipedia wrote:and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script
That's interesting, but you know what the solution is... Unicode!
wei wrote:The Chinese, Japanese and Korean (CJK) languages are encoded with different encodings due to political issues (yes, you get politics in character encodings). Furthermore, Chinese and Japanese each have several encodings (again, probably due to politics, but I'm not sure). For example, the GB2312 encoding is adopted by the PRC, while the Big5 encoding is usually found in HK and Taiwan, etc.
There are characters within CJK that are not represented in Unicode.
http://en.wikipedia.org/wiki/Han_unification
Related blog post: http://www.sitepoint.com/blogs/2006/08/ ... -dba-blog/