Unicode!

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Argh... I think the problem is a combination of my text editor's crappy Unicode support and some missing stuff.

Here's what it shows in my text editor:

Code:

<h2>President</h2>
<div>&#26366;&#26195;&#38686; (Sally Zeng)</div>
Tel. 732-723-1274
xiaoxiazng@yahoo.com

<h2>Vice President</h2>
&#26446;&#21488;&#20809; (Taikwang Lee)
Tel. 732-805-0864
taikwangmlee@yahoo.com

Here is what the source file displays as: http://www.taijiclub.org/data/Contact.html

Here is what it gets output as: http://www.taijiclub.org/Contact

Code:

$CONTENTS = file_get_contents('data/' . $PAGE . '.html');

That is essentially all that's being done, and then it's echoed (and yes, $PAGE is validated earlier).

So something's gone terribly wrong.

I'm using UltraEdit-32 8.00

Here is some more information: the binary contents of Contact.html

Code:

--file start--
FF FE 3C 00
68 00 32 00
3E 00 50 00
72 00 65 00
--snip--
FE 66 53 55 (Chinese characters)
1E 97 20 00

What I think is happening is that UltraEdit, when in Unicode mode, is padding all the characters with 00, a nonstandard implementation of Unicode. Firefox and PHP interpret them differently, so there's stuff lost in translation.
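
For what it's worth, a leading FF FE followed by 00-padded ASCII is the pattern UTF-16LE produces, so one option is to detect the byte-order mark and convert on the PHP side. The following is only a rough sketch: `detectBom()` and `toUtf8()` are hypothetical helper names, and it assumes either the mbstring or the iconv extension is available.

```php
<?php
// Hypothetical helper: guess the encoding from a leading byte-order mark.
// Only the three BOMs that come up in this thread are checked; a UTF-32LE
// file (FF FE 00 00) would be misreported as UTF-16LE.
function detectBom(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    return null; // no BOM, encoding unknown
}

// Hypothetical helper: convert BOM-marked UTF-16 data to UTF-8.
// Anything else is returned untouched.
function toUtf8(string $bytes): string
{
    $encoding = detectBom($bytes);
    if ($encoding === null || $encoding === 'UTF-8') {
        return $bytes;
    }
    $body = substr($bytes, 2); // drop the 2-byte UTF-16 BOM
    // Assumption: at least one of mbstring / iconv is installed.
    $utf8 = function_exists('mb_convert_encoding')
        ? mb_convert_encoding($body, 'UTF-8', $encoding)
        : iconv($encoding, 'UTF-8', $body);
    if ($utf8 === false) {
        throw new RuntimeException("Could not convert from $encoding");
    }
    return $utf8;
}

// The first bytes of the dump above: FF FE 3C 00 68 00 32 00 3E 00
$head = "\xFF\xFE\x3C\x00\x68\x00\x32\x00\x3E\x00";
$utf8 = toUtf8($head); // "<h2>"
```

The result of `toUtf8()` could then be echoed in place of the raw `file_get_contents()` output, provided the page itself is served as UTF-8.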

I've tried doing reading on Unicode, and I sorta understand what is going on, but I don't know what to do. Help!
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

The first problem that I see is that http://www.taijiclub.org/Contact appears to be in two different encodings, ASCII and UTF-8 (just guessing). Are you using any includes? Are all includes in the same encoding? Yeah ... maybe it's UltraEdit too. The second problem is that there is no declaration of the encoding, such as an HTTP Content-Type header (with charset) or an HTML META element. Without one of these, the browser will fall back to a default or try to auto-detect.
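
As a minimal sketch of the declaration being described, assuming the page is meant to go out as UTF-8: `contentTypeHeader()` is a made-up helper name, while `header()` and `headers_sent()` are the standard PHP functions. Declaring a charset only helps if the bytes being echoed really are in that encoding.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be sent before any output has been echoed.
if (!headers_sent()) {
    header(contentTypeHeader());
}

// The HTML META equivalent, for pages where the HTTP header is not set.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';
```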

http://www.cl.cam.ac.uk/~mgk25/unicode.html#web
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

UTF-8 should be totally compatible with ASCII. I'm thinking it's not really UTF8... but if not UTF8, then what?
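
A quick way to see the compatibility claim, and why the dump in the first post is not UTF-8, is to compare the bytes directly. This is just an illustrative sketch using `preg_match()`, `pack()` and `bin2hex()` from core PHP; the variable names are made up.

```php
<?php
$ascii = '<h2>President</h2>';

// ASCII only uses byte values 0x00-0x7F.
$isAscii = preg_match('/\A[\x00-\x7F]*\z/', $ascii) === 1;

// The /u modifier makes PCRE reject subjects that are not valid UTF-8,
// so a successful match means the same bytes are also valid UTF-8.
$isUtf8 = preg_match('//u', $ascii) === 1;

// UTF-16LE stores each of these characters as a 2-byte unit, low byte
// first ('v' in pack()), which is where the 00 padding comes from.
$utf16le = pack('v*', ...array_map('ord', str_split('<h2>')));
$hex = bin2hex($utf16le); // "3c00680032003e00"
```

The hex string matches the `3C 00 68 00 32 00 3E 00` run that follows the BOM in the dump.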
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Ambush Commander wrote: UTF-8 should be totally compatible with ASCII. I'm thinking it's not really UTF8... but if not UTF8, then what?

True. Seems to be UCS-2, at least some of it? Maybe try finding a new editor? jEdit? I think it supports UTF-8.
Last edited by Buddha443556 on Tue Nov 22, 2005 9:38 am, edited 1 time in total.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

You can usually save any file as ASCII, Unicode (which in most Windows editors means UTF-16) or UTF-8.

I haven't been using Linux editors for a while, but I found EditPlus (Windows) has good UTF-8 support. Just be certain the file is created and saved using the correct encoding (the default is ASCII). The same goes for any PHP files that contain or output the content.

You may possibly need to convert between ASCII and UTF-8.

Content headers will really let you down if you're missing the content encoding...;)
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

EditPlus is paid... :(

Okay. Here are the issues I'm having with various text editors. Each entry ends with its UTF-8 status.

UltraEdit-32 8.00 (editor of choice) - This version of UltraEdit appears to only be able to switch from ASCII to Unicode. Binary analysis seems to indicate that it uses UCS-2 and a BOM (kinda required). Copy-paste does work, but it doesn't save as UTF-8, so I'll need to program something to switch it to UTF-8. I think it doesn't have UTF-8 support. Status: no support

Crimson Editor 3.70 - I have a file opened up in UTF-8 mode without a BOM. I have Firefox (great Unicode support, by the way) displaying a character correctly. When I copy and paste into Contact.html, it gets replaced with a ?. Trying to type it in with ALT+number yields the wrong character. All other modes yield similar results. Perhaps it doesn't have intelligent font matching? Nonetheless, saving the file and then inspecting it in binary mode seems to show that the ? characters are literal ? bytes and nothing more. Status: no support

Gvim - WTF. So unintuitive...

jEdit 4.2 final - It doesn't appear to be smart enough to mix and match fonts (big bummer for Chinese writing... you'll have to switch to SimSun or something) BUT it has full support for Unicode (the blocks are still the characters...) and lots of other encodings. It doesn't add a BOM for UTF-8. Very nice, actually. Status: UTF-8 support without BOM, but no font mixing

Notepad, bundled with Windows XP - Now, Notepad is very funny. Supports font mixing. Handles UTF-8 well when displaying. The problem? It adds some weird BOM thingy... EF BB BF when you save. Status: font mixing, but adds BOM

SVN - Not sure about the implications... UTF-16 probably needs to be sent in binary mode (meh).

None of these does exactly what I want, which is:

* UTF-8 support without BOM
* Font Mixing

In the end, I think the font-mixing capabilities of Notepad will trump, and I'll just make sure PHP trims off the BOM when it is present.
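
Trimming the BOM could look roughly like this. It is a sketch only: `stripUtf8Bom()` is a hypothetical helper name, and it handles just the EF BB BF sequence Notepad writes.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text; // no BOM, leave the data alone
}

// With the loading line from the first post, usage would be roughly:
// $CONTENTS = stripUtf8Bom(file_get_contents('data/' . $PAGE . '.html'));
```

Only the start of the string is checked, so an EF BB BF sequence appearing later in the file is left as-is.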