Another thread on UTF-8 and other encodings...

Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

when UTF-8 as I understand it using the 8th bit (allowing 256 characters) would suffice all languages?
UTF-8 isn't a single byte encoding. Both UTF-8 and UTF-16 may use up to 4 bytes per character. The difference between them (I'm simplifying here) is that ASCII characters take 1 byte in UTF-8 but 2 bytes in UTF-16. And 256 characters would not suffice for all languages, even if you don't expect characters from different alphabets to be mixed in a single string.
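
To make the size difference concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration) that counts the bytes each encoding needs for a few sample characters. The helper name encoded_lengths and the choice of samples are my own assumptions, not anything from the thread.

```python
# Sample characters: ASCII "A", e-acute, the euro sign, and an emoji
# that lives outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

ASCII comes out at 1 byte versus 2, while the character outside the BMP needs 4 bytes in both encodings.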
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

UTF-8 isn't a single byte encoding. Both UTF-8 and UTF-16 may use up to 4 bytes per character
What???? Now I'm formally confused...so if UTF-8 and UTF-16 *may* use up to 4 bytes per character...

WTF...how is it decided? Who determines what number of bytes they use? Why would you use 4 bytes/character - obviously to give a MUCH greater range of possible characters, but still...

I save my language files as UTF-8...using an ASCII keyboard...the file is saved as single byte characters...but is backwards compatible with UTF-8 as I understand it...

How would I get UTF-8 to use more than one byte per character???
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

You didn't read that Wikipedia article, did you?

Post by Weirdan »

UTF-8 uses one to four bytes (strictly, octets) per character, depending on the Unicode symbol. Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F). Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in other planes of Unicode.
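
As for how the byte count is decided: it follows from the code point's numeric range alone. The sketch below is a hand-written encoder in Python, for illustration only; utf8_encode is a hypothetical name, and in real code you would use the language's built-in conversion instead.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes by hand."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx (US-ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (planes beyond the BMP)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Latin "A", Cyrillic "Zhe", the euro sign, and an emoji outside the BMP.
for cp in (0x41, 0x416, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

A file typed entirely in ASCII therefore contains the same bytes whether it is saved as ASCII or as UTF-8; multi-byte sequences only appear once a character above U+007F is present.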

Post by alex.barylski »

ole wrote:You didn't read that Wikipedia article, did you?
I skimmed over it :P

I'll read it again... :? :oops:
wei
Forum Contributor
Posts: 140
Joined: Wed Jul 12, 2006 12:18 am

Post by wei »

The Windows XP Notepad is UTF-8 aware (well, at least the versions I have seen), that is, it can read and write UTF-8 encoded text files (and render most characters correctly). Most modern text editors are UTF-8 aware; however, not all will display other encodings, and some don't render all the characters available in UTF-8 (they just show them as boxes []).

Another thing you may encounter is the Byte-Order-Mark (BOM). PHP does not treat the BOM specially (it is passed through as output), so you should not save your PHP files with a BOM.
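
For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of a file, and PHP sends anything that appears before the opening <?php tag to the browser as output. A rough sketch for spotting and removing a BOM, in Python for illustration (has_utf8_bom and strip_utf8_bom are made-up helper names):

```python
import codecs

def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data):
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# A PHP source file as an editor might save it with a BOM in front.
with_bom = codecs.BOM_UTF8 + b"<?php echo 'hi';"
print(has_utf8_bom(with_bom))
print(strip_utf8_bom(with_bom))
```

Most editors also have a "UTF-8 without BOM" (or similarly named) save option, which avoids the problem at the source.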

To display characters of different encodings correctly, you need the specific code pages as well. The editor that happens to have most of the useful code pages is EmEditor.
EmEditor supports Unicode little endian, Unicode big endian, UTF-8, UTF-7, Baltic, Central European, Chinese Simplified, Chinese Traditional, Cyrillic, Greek, Japanese (Shift-JIS), Japanese (JIS), Japanese (EUC), Korean, Thai, Turkish, Vietnamese, Western European, and all other encodings available in Windows.
In addition, it can actually convert to and from different encodings correctly (so far). That is, if someone sends you Chinese text encoded using GB2312 (which is the main encoding used in the PRC), you can open it in EmEditor and then save it as UTF-8 correctly.
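
The same conversion can be done in code: decode the bytes using the encoding they were written in, then re-encode the resulting text as UTF-8. A small sketch in Python, for illustration only; transcode is a hypothetical helper, and the sample string is just the two characters for "Chinese (language)".

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)

# Two Chinese characters (U+4E2D U+6587), first as GB2312 bytes.
gb_bytes = "\u4e2d\u6587".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312")

print(len(gb_bytes), "bytes as GB2312")
print(len(utf8_bytes), "bytes as UTF-8")
```

The important part is knowing the source encoding up front; decoding GB2312 bytes as some other encoding will either fail or silently produce the wrong characters.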

The Chinese, Japanese and Korean (CJK) encode their languages with different encodings due to political issues (yes, you get politics in character encodings). Furthermore, the Chinese and Japanese have 3 different encodings each (again, probably due to politics but not sure). For example, the GB2312 encoding is adopted by the PRC, the Big5 encoding is usually found in HK and Taiwan, etc.

Post by Ollie Saunders »

PHP i18n charsets wrote:When editing content outside of a (decent) browser, make sure to use an editor with UTF-8 support (i.e. not notepad!)
wei wrote:Another thing you may encounter is the Byte-Order-Mark (BOM). PHP does not treat the BOM specially (it is passed through as output), so you should not save your PHP files with a BOM.
In an article about BOM (http://en.wikipedia.org/wiki/Byte_Order_Mark), Wikipedia wrote:and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script
wei wrote:The Chinese, Japanese and Korean (CJK) encode their languages with different encodings due to political issues (yes, you get politics in character encodings). Furthermore, the Chinese and Japanese have 3 different encodings each (again, probably due to politics but not sure). For example, the GB2312 encoding is adopted by the PRC, the Big5 encoding is usually found in HK and Taiwan, etc.
That's interesting but you know what the solution is...Unicode!

Post by wei »

There are characters within CJK that are not represented in Unicode.

http://en.wikipedia.org/wiki/Han_unification
Oren
DevNet Resident
Posts: 1640
Joined: Fri Apr 07, 2006 5:13 am
Location: Israel

Post by Oren »
