UTF8 Problems

Posted: Fri Nov 20, 2009 4:04 pm
by $var
Hi!

I'm taking a blog feed, parsing it into an array, and spitting it out through my site.
The feed itself doesn't have any funky characters, but when I pass it through the site (which IS ISO-8859-1 encoded) I get this:
Last night, as the performance was about to begin, the emcee’s instructions were clear.“Hold up your cellphones, Blackberrys, iPhones or what have you in front of you;
When I run it through utf8_decode(), it does change the wacky characters ... but to question marks!

This is a common problem with the sites I work on: the blog platform uses a different encoding than the pages.
Anything you can suggest about this?

Re: UTF8 Problems

Posted: Fri Nov 20, 2009 4:49 pm
by Apollo
$var wrote: The feed itself doesn't have any funky characters,
Actually, it does. The correct representation of your text is:

... the emcee’s instructions were clear.“Hold up ...

And this contains two funky chars: the ’ (unicode 8217, i.e. U+2019, instead of the regular ' single quote) and “ (unicode 8220, i.e. U+201C, instead of the regular " double quote).
These funky characters most likely come from some noob copy/pasting stuff from MS Word, which has the tendency to replace regular quotes with funky ones.
$var wrote: but when I pass it through the site (which IS ISO-8859-1 encoded) I get this:
Well, there's your problem: those quote characters can't be represented in iso-8859-1. It's a single-byte ("ansi") encoding with only 256 code points. Just like Chinese or Klingon characters can't be expressed in iso-8859-1, neither can the exotic characters above. That's also why utf8_decode() gives you question marks: it converts UTF-8 to iso-8859-1 and substitutes a ? for anything that has no equivalent there.
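To make the question marks concrete, here is a minimal PHP sketch. It assumes PHP 7+ with the mbstring extension, and the variable names are only for illustration. It uses mb_convert_encoding() rather than utf8_decode(), but for these characters the result is the same kind of substitution:

```php
<?php
// The feed snippet as UTF-8. \u{2019} is the curly apostrophe,
// \u{201C} the curly opening quote (PHP 7+ escape syntax).
$feed = "the emcee\u{2019}s instructions were clear.\u{201C}Hold up";

// Each curly quote takes three bytes in UTF-8.
echo bin2hex("\u{2019}"), "\n"; // e28099

// Make mbstring's default substitute character ("?") explicit.
mb_substitute_character(0x3F);

// iso-8859-1 has no slot for either quote, so both get substituted.
$latin1 = mb_convert_encoding($feed, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n"; // the emcee?s instructions were clear.?Hold up
```

The garbage in the question is the other half of the same mismatch: the three UTF-8 bytes are sent untouched, and a page declared as a single-byte encoding shows each byte as its own character.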
$var wrote: Anything you can suggest about this?
Use utf-8 everywhere: in your html headers, in your content, and in your database collations.
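A rough sketch of the page side of that advice, assuming the feed text is already UTF-8. The helper name utf8_page() is made up for this example, and the database side (for instance setting the connection charset) is not shown:

```php
<?php
// Hypothetical helper: wraps an already-UTF-8 fragment in a page that declares UTF-8.
function utf8_page(string $body): string
{
    // Declare the charset in the HTTP header (skipped if output has already started).
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }

    // Escape with an explicit charset; multi-byte characters pass through untouched.
    $safe = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');

    // Repeat the declaration in the markup so saved copies of the page keep it.
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><p>{$safe}</p></body></html>";
}

echo utf8_page("the emcee\u{2019}s instructions were clear."), "\n";
```

With the header, the markup and the content all agreeing on utf-8, the curly quotes need no conversion at all.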

If you still prefer an ansi encoding, then why the heck would you pick the extremely limited iso-8859-1? (As opposed to windows-1252, for example, which contains all the printable iso-8859-1 characters plus some funky ones such as the strange quote thingies.)
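If you go the windows-1252 route, the conversion could look roughly like this (again assuming PHP 7+ with mbstring; the variable names are illustrative):

```php
<?php
$feed = "the emcee\u{2019}s instructions were clear.\u{201C}Hold up";

// Both quotes exist in windows-1252, so nothing is lost in the conversion.
$cp1252 = mb_convert_encoding($feed, 'Windows-1252', 'UTF-8');

// The curly apostrophe becomes the single byte 0x92, the opening quote 0x93.
echo bin2hex(mb_convert_encoding("\u{2019}\u{201C}", 'Windows-1252', 'UTF-8')), "\n"; // 9293

// Converting back gives the original text, so the round trip is lossless here.
var_dump(mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252') === $feed); // bool(true)
```

The page then has to declare charset=windows-1252 instead of iso-8859-1, otherwise the browser is told the wrong thing again.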

Alternatively (or on top of that), replace freaky quote chars with regular ones in any content.
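A minimal sketch of that replacement, assuming the input is UTF-8. The function name straighten_quotes() is made up here, and the list covers only the four common curly quotes:

```php
<?php
// Hypothetical helper: swaps the four common curly quotes for plain ASCII ones.
function straighten_quotes(string $utf8): string
{
    return strtr($utf8, [
        "\u{2018}" => "'", // left single quote
        "\u{2019}" => "'", // right single quote / apostrophe
        "\u{201C}" => '"', // left double quote
        "\u{201D}" => '"', // right double quote
    ]);
}

$feed = "the emcee\u{2019}s instructions were clear.\u{201C}Hold up";
echo straighten_quotes($feed), "\n"; // the emcee's instructions were clear."Hold up
```

Once the quotes are plain ASCII, this particular text is the same in utf-8, iso-8859-1 and windows-1252. Other Word leftovers (dashes, ellipses) would need their own entries in the table.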