Unicode!

Posted: Sun Nov 20, 2005 7:00 pm
by Ambush Commander
Argh... I think the problem is a combination of my text editor's crappy Unicode support and some missing stuff.

Here's what it shows in my text editor:

Code:

<h2>President</h2>
<div>&#26366;&#26195;&#38686; (Sally Zeng)</div>
Tel. 732-723-1274
xiaoxiazng@yahoo.com

<h2>Vice President</h2>
&#26446;&#21488;&#20809; (Taikwang Lee)
Tel. 732-805-0864
taikwangmlee@yahoo.com
Here is how the source file displays: http://www.taijiclub.org/data/Contact.html

Here is what it gets output as: http://www.taijiclub.org/Contact

Code:

$CONTENTS = file_get_contents('data/' . $PAGE . '.html');
That is essentially all that's being done, and then it's echoed (and yes, $PAGE is validated earlier).

So something's gone terribly wrong.

I'm using UltraEdit-32 8.00

Here is some more information: the binary composition of the Contact.html

Code:

--file start--
FF FE 3C 00
68 00 32 00
3E 00 50 00
72 00 65 00
--snip--
FE 66 53 55 (chinese characters)
1E 97 20 00
What I think is happening is that UltraEdit, when in Unicode mode, is padding all the characters with 00, a nonstandard implementation of Unicode. Firefox and PHP interpret them differently, so stuff gets lost in translation.

I've tried reading up on Unicode, and I sorta understand what is going on, but I don't know what to do. Help!

Posted: Mon Nov 21, 2005 4:00 am
by Buddha443556
The first problem I see is that http://www.taijiclub.org/Contact appears to be in two different encodings - ASCII and UTF-8 (just guessing). Are you using any includes? Are all includes in the same encoding? Yeah ... maybe it's UltraEdit too. The second problem is that there is no declaration of the encoding, such as an HTTP Content-Type header (with charset) or an HTML META element. Without one of these, the browser will fall back to a default or try to auto-detect.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#web
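
For illustration, declaring the encoding from PHP could look roughly like the sketch below. It assumes the page really is served as UTF-8, and `content_type_header()` is a hypothetical helper name, not something from the site's code.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
// The UTF-8 default is an assumption about what the page should be served as.
function content_type_header($charset = 'UTF-8')
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() has to run before any output is sent to the browser.
if (!headers_sent()) {
    header(content_type_header());
}
```

The HTML-side equivalent is a `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` element in the page head. Either declaration only helps if the bytes being sent actually are in the declared encoding.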

Posted: Mon Nov 21, 2005 8:09 pm
by Ambush Commander
UTF-8 should be totally compatible with ASCII. I'm thinking it's not really UTF-8... but if not UTF-8, then what?

Posted: Tue Nov 22, 2005 8:44 am
by Buddha443556
Ambush Commander wrote: UTF-8 should be totally compatible with ASCII. I'm thinking it's not really UTF-8... but if not UTF-8, then what?
True. Seems to be UCS-2, at least some of it? Maybe try finding a new editor? jEdit? I think it supports UTF-8.
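
The hex dump earlier in the thread starts with FF FE and has a 00 after every ASCII byte, which is what little-endian UTF-16 (of which UCS-2 is the two-byte subset) looks like. If that is what the file is, one option is to convert it to UTF-8 on read. This is a minimal sketch, assuming PHP's mbstring extension is available; `utf16_to_utf8()` is a hypothetical name.

```php
<?php
// Hypothetical helper: converts UTF-16 text that starts with a BOM to UTF-8.
// Assumes the mbstring extension is loaded. Input without a UTF-16 BOM is
// returned unchanged, since its encoding cannot be told from the BOM alone.
function utf16_to_utf8($bytes)
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {
        // Little-endian, as in the dump (FF FE 3C 00 ...).
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big-endian variant, handled for completeness.
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }
    return $bytes;
}

// Possible use with the loading line from the first post:
// $CONTENTS = utf16_to_utf8(file_get_contents('data/' . $PAGE . '.html'));
```

Converting once and saving the result as UTF-8 would avoid doing this on every request.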

Posted: Tue Nov 22, 2005 9:33 am
by Maugrim_The_Reaper
You can usually encode any file under ASCII, UNICODE or UTF-8.

I haven't been using Linux editors for a while, but I found EditPlus (Windows) has good UTF-8 support. Just be certain the file is created and saved using the correct encoding (default is ASCII). The same goes for any PHP files that contain or output the content.

You may possibly need to convert between ASCII and UTF-8.

Content headers will really let you down if you're missing the content encoding...;)

Posted: Tue Nov 22, 2005 2:38 pm
by Ambush Commander
EditPlus is paid... :(

Okay. Here are the issues I'm having with various text editors. Each entry ends with its UTF-8 status.

UltraEdit-32 8.00 (editor of choice) - This version of UltraEdit appears to only be able to switch from ASCII to Unicode. Binary analysis seems to indicate that it uses UCS-2 and a BOM (kinda required). Copy and paste does work, but it doesn't save as UTF-8, so I'll need to program something to switch it to UTF-8. I think it doesn't have UTF-8 support. Status: No support

Crimson Editor 3.70 - I have a file opened up in UTF-8 mode without a BOM. I have Firefox (great Unicode support, by the way) displaying a character correctly. When I copy and paste it into Contact.html, it gets replaced with a ?. Trying to type it in with ALT+number yields the wrong character. All other modes yield similar results. Perhaps it doesn't have intelligent font matching? Nonetheless, saving the file and then inspecting it in binary mode seems to show that the ? are ? and nothing more. Status: No support

Gvim - WTF. So unintuitive...

jEdit 4.2 final - It doesn't appear to be smart enough to mix and match fonts (big bummer for Chinese writing... you'll have to switch to SimSun or something) BUT it has full support for Unicode (the blocks are still the characters...) and lots of other encodings. It doesn't add a BOM for UTF-8. Very nice, actually. Status: UTF-8 support without BOM, but no font mixing

Notepad, bundled with Windows XP - Now, Notepad is very funny. Supports font mixing. Handles UTF-8 well when displaying. The problem? It adds some weird BOM thingy... EF BB BF when you save. Status: Font mixing, but adds BOM

SVN - Not sure about the implications... UTF-16 probably needs to be sent in binary mode (meh).

None of these does exactly what I want, which is:

* UTF-8 support without BOM
* Font Mixing

In the end, I think the font-mixing capabilities of Notepad will trump, and I'll just make sure PHP trims off the BOM when it is present.
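
Trimming the BOM in PHP could be done along these lines. This is a sketch rather than the code actually used on the site, and `strip_utf8_bom()` is a hypothetical name; it only looks for the EF BB BF sequence that Notepad writes at the start of the file.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) if present.
function strip_utf8_bom($text)
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        // The cast keeps the result a string on older PHP versions, where
        // substr() returns false when nothing is left after the BOM.
        return (string) substr($text, 3);
    }
    return $text;
}

// Possible use with the loading line from the first post:
// $CONTENTS = strip_utf8_bom(file_get_contents('data/' . $PAGE . '.html'));
```

Only the first three bytes are checked, so files saved without a BOM pass through untouched.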