
Posted: Mon Oct 30, 2006 11:10 am
by jmut
kamel wrote: The main problem is that I don't have mbstring on the production server, and html_entity_decode() doesn't seem to work because PHP throws an error message when I use UTF-8 as the third parameter.
Can you give me any advice?
My advice is to rethink your workflow a bit, though that might be a disaster depending on how large your application is.
Usually you would not need html_entity_decode() at all, I think.
Store the data as raw data (after validation) in the DB.
Then, when you need to display the data in an HTML context, just use htmlentities().
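
To make that workflow concrete, here is a minimal sketch, assuming the stored data is UTF-8. The helper name escape_for_html() and the sample value are made up for illustration; only htmlentities() itself comes from the advice above.

```php
<?php
// Hypothetical helper: escape a raw stored value at output time only.
function escape_for_html($raw)
{
    // ENT_QUOTES also escapes single quotes; 'UTF-8' names the charset of $raw.
    return htmlentities($raw, ENT_QUOTES, 'UTF-8');
}

// Pretend this came straight from the DB ("\xC3\xA9" is e-acute in UTF-8).
$stored = "Caf\xC3\xA9 <b>menu</b>";

// The raw value stays untouched in storage; only the HTML output is escaped.
echo escape_for_html($stored), "\n";
```

The point of the design is that the DB never holds entities, so there is nothing to decode later.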

Posted: Mon Oct 30, 2006 8:00 pm
by Ambush Commander
Okay, before we confuse the original poster too much, there are two distinct problems we need to deal with:

1. Ensuring that input and output data is in UTF-8
2. Transforming legacy data to UTF-8

The first is quite easy to do: use the suggestions made earlier in the thread. The second is a little trickier. No amount of html_entity_decode() or htmlentities() is going to do the job for you if the input text is in the wrong encoding; you'll have to bring out the big guns: iconv(). First:

1. Determine what character set the page is in. I recommend opening the file in Firefox, checking whether the characters come out right, and if so, clicking "Tools > Page Info" and looking for "Encoding". Then use iconv() to transform the text from that character set to UTF-8.
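
As a rough sketch of that conversion step, assuming the legacy data turned out to be ISO-8859-1 (that charset and the sample string are only assumptions for the example; substitute whatever Firefox reports):

```php
<?php
// Hypothetical legacy value: "Cafe" with e-acute, stored as ISO-8859-1,
// where e-acute is the single byte 0xE9.
$legacy = "Caf\xE9";

// iconv(from, to, string) returns the converted string, or false on failure.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $legacy);

if ($utf8 === false) {
    echo "Conversion failed; double-check the source character set\n";
} else {
    // Hex dump, so the result does not depend on the terminal's encoding.
    echo bin2hex($utf8), "\n";
}
```

In UTF-8 the same character takes two bytes (0xC3 0xA9), which is why guessing the source character set wrong produces garbage rather than an error.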

html_entity_decode(), I repeat, does not do the trick. It may seem to do it (because browsers ampersand-encode characters that aren't supported by the form's transmission encoding), but appearances are deceiving. Plus, as kamel complains, UTF-8 as the third parameter is only supported in PHP 5. Here's the workaround. While its immediate purpose may not imply it, HTML Purifier comes with a solid set of UTF-8 functions that are very easy to use. Install it as shown, then use its function like this:

Code:

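// Assumes HTML Purifier is installed and loaded, and $oldtext holds the entity-laden string.
$parser = new HTMLPurifier_EntityParser();
// Swap the non-special entities for the characters they stand for.
$newtext = $parser->substituteNonSpecialEntities($oldtext);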
And all your entities (apart from the special ones such as &amp; and &lt;, going by the method name) will be gone, replaced with the real characters. This should not be necessary though.
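
For comparison, on PHP 5 or later, where html_entity_decode() does accept UTF-8 as its third parameter, the built-in route might look like the sketch below. The sample string is made up, and note that the built-in also decodes the special entities.

```php
<?php
// Hypothetical input containing a named entity and a special entity.
$oldtext = 'Caf&eacute; &amp; bar';

// Third parameter names the output charset; 'UTF-8' needs PHP 5 or later.
$newtext = html_entity_decode($oldtext, ENT_QUOTES, 'UTF-8');

// Hex dump, so the result does not depend on the terminal's encoding.
echo bin2hex($newtext), "\n";
```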

Multibyte functions are not strictly necessary for most uses of UTF-8: for instance, if you're looking for an occurrence of a word in UTF-8 text, strpos() will suffice. You will, however, want to ensure that all data is well-formed UTF-8 with this: HTMLPurifier_Encoder::cleanUTF8($text);
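
A small sketch of both points, using only core functions. The strings are made up, and is_valid_utf8() is a hypothetical stand-in that merely checks well-formedness with PCRE's /u modifier; unlike cleanUTF8() it does not repair anything.

```php
<?php
$haystack = "na\xC3\xAFve caf\xC3\xA9"; // "naive cafe" with i-diaeresis and e-acute, in UTF-8
$needle   = "caf\xC3\xA9";

// strpos() compares bytes. Because no UTF-8 character's bytes can start in
// the middle of another character, a valid needle only matches on real
// character boundaries.
$pos = strpos($haystack, $needle);

// Note that $pos is a byte offset, not a character offset.
echo ($pos !== false) ? "found at byte $pos\n" : "not found\n";

// Hypothetical helper: preg_match() with /u fails on malformed UTF-8,
// so an empty pattern doubles as a validity check.
function is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

echo is_valid_utf8($haystack) ? "well formed\n" : "malformed\n";
```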

And we haven't even begun to talk about internationalization and localization. ::phew::

Posted: Tue Oct 31, 2006 3:10 am
by CoderGoblin
Ambush Commander wrote: Multibyte functions are not strictly necessary for most usage of UTF-8: for instance, if you're looking for an occurrence of a word in UTF-8, strpos() will suffice...
Agreed. I would say use the standard string functions where possible, but be aware of the multibyte functions, as they may solve some issues. Testing is very important.
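
To illustrate the kind of issue that testing should catch, here is a hedged sketch of where the byte-oriented functions go wrong. The sample word is made up, and the preg_match_all() trick is just one way to count characters when mbstring is not installed.

```php
<?php
$word = "caf\xC3\xA9"; // "cafe" with e-acute: 4 characters, 5 bytes in UTF-8

// strlen() counts bytes, not characters.
$bytes = strlen($word);

// Without mbstring, counting matches of /./ in UTF-8 mode gives characters.
$chars = preg_match_all('/./su', $word, $matches);

// substr() also works on bytes, so it can cut a character in half.
$cut = substr($word, 0, 4);

// The truncated string is no longer well-formed UTF-8.
$stillValid = (preg_match('//u', $cut) === 1);

echo "$bytes bytes, $chars characters, cut valid: ",
    ($stillValid ? 'yes' : 'no'), "\n";
```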

Roll on PHP 6... (hope it doesn't break much)