Okay, before we confuse the original poster too much, there are two distinct problems we need to deal with:
1. Ensuring that data coming in and going out is in UTF-8
2. Transforming legacy data to UTF-8
The first is quite easy to do: use the suggestions posted earlier. The second is a little trickier. No amount of html_entity_decode() or htmlentities() is going to do the job for you if the input text is in the wrong encoding; you'll have to bring out the big guns: iconv(). First:
1. Determine what character set the page is in. I recommend opening the file in Firefox, checking whether the characters come out right, and if so, clicking "Tools > Page Info" and looking for "Encoding". Then use iconv() to convert from that character set to UTF-8.
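As a rough sketch of that conversion step, assuming the legacy page turned out to be ISO-8859-1 (the charset and the legacy_to_utf8() helper name are only illustrative; substitute whatever encoding Firefox actually reports):

```php
<?php
// Hypothetical helper: convert legacy text to UTF-8 with iconv().
// The default $from is an assumption; pass the charset you determined.
function legacy_to_utf8($text, $from = 'ISO-8859-1')
{
    $converted = iconv($from, 'UTF-8', $text);
    if ($converted === false) {
        // iconv() returns false when the input is not valid in $from.
        return null;
    }
    return $converted;
}

// "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
$legacy = "caf\xE9";
$utf8 = legacy_to_utf8($legacy);
echo bin2hex($utf8), "\n"; // 636166c3a9 (the é is now the two bytes C3 A9)
```

Run the conversion once over the stored data, not on every page view; converting text that is already UTF-8 a second time will mangle it.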
html_entity_decode(), I repeat, does not do the trick. It may seem to, though (because browsers send characters that the form's encoding can't represent as ampersand entities), but appearances are deceiving. Plus, as kamel complains, its UTF-8 support only exists in PHP 5. Here's the workaround. While its immediate purpose may not imply it, HTML Purifier comes with a solid set of UTF-8 functions that are very easy to use. Install it as shown, then use its functions like this:
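// Assumes HTML Purifier's library/ directory is on your include path.
require_once 'HTMLPurifier.auto.php';
$parser = new HTMLPurifier_EntityParser();
$newtext = $parser->substituteNonSpecialEntities($oldtext);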
And all your entities will be gone, replaced with the real characters (apart from the special ones such as &amp;amp; and &amp;lt;, which are left alone). This should not be necessary though.
Multibyte functions are not strictly necessary for most uses of UTF-8: for instance, if you're looking for an occurrence of a word in UTF-8 text, strpos() will suffice (just remember that the offset it returns is in bytes, not characters). You will, however, want to ensure that all data is well-formed UTF-8 with this: HTMLPurifier_Encoder::cleanUTF8($text);
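To illustrate both points without any library, here is a small sketch using only built-in PHP. It swaps in preg_match() with the /u modifier as a well-formedness check; that is a stand-in for cleanUTF8(), not how HTML Purifier does it, and it only detects bad input rather than cleaning it. The is_wellformed_utf8() name is made up for the example.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false on malformed UTF-8.
function is_wellformed_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

// "naïve café" written out as explicit UTF-8 bytes.
$haystack = "na\xC3\xAFve caf\xC3\xA9";
$needle = "caf\xC3\xA9";

// strpos() finds the word; the offset is counted in bytes (7), not characters (6).
var_dump(strpos($haystack, $needle));    // int(7)
var_dump(is_wellformed_utf8($haystack)); // bool(true)
var_dump(is_wellformed_utf8("caf\xE9")); // bool(false): a lone Latin-1 byte
```

The byte offset is still fine for feeding into substr() and friends, as long as you never cut in the middle of a multi-byte character.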
And we haven't even begun to talk about internationalization and localization. ::phew::