Convert encoding where output cannot represent characters
Posted: Sun Aug 27, 2006 8:54 pm
I don't know the term for this, so attempts to Google have been problematic.
Let's say I have some Chinese text encoded in UTF-8. Now, for backwards-compatibility reasons, the client is unable to output text in UTF-8: everything must go out in ISO 8859-1. If the output were plaintext, I'd be totally out of luck: Latin-1 unsurprisingly has no support for Chinese glyphs.
In HTML, however, character references come to the rescue. Any Unicode character can be written as a numeric reference like &#nnnn; (where nnnn is the decimal code point), and some of the special ones even have their own readable named entities.
So, here's the jive. I need a function that converts text from UTF-8 to an arbitrary character encoding (iconv does this) and escapes any characters the target encoding cannot express with either numeric or named character references (iconv does not, to my knowledge, do this).
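A minimal sketch of one way this could be done with iconv alone, one character at a time. The function name utf8_to_charset_with_entities is hypothetical, not part of PHP or iconv. It assumes PCRE was built with UTF-8 support (for the /u modifier), that the iconv build knows the name UCS-4BE, and that the target encoding is stateless. Because some iconv builds substitute a placeholder instead of failing, each character is converted back and compared before it is accepted.

```php
<?php
// Hypothetical helper: convert UTF-8 text to $target, replacing anything
// the target cannot represent with a decimal numeric character reference.
// HTML special characters (<, >, &) are left untouched.
function utf8_to_charset_with_entities($text, $target)
{
    // Split into individual UTF-8 characters; false means malformed UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }

    $out = '';
    foreach ($chars as $char) {
        // Without //TRANSLIT or //IGNORE, iconv normally fails on a
        // character the target encoding lacks.
        $converted = @iconv('UTF-8', $target, $char);

        // Round-trip check: guards against builds that silently
        // substitute a placeholder rather than returning false.
        if ($converted !== false && $converted !== ''
            && @iconv($target, 'UTF-8', $converted) === $char) {
            $out .= $converted;
            continue;
        }

        // Get the code point via UCS-4 (big-endian, 4 bytes per character).
        $parts = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        // Convert the reference itself too, so targets that are not
        // ASCII-compatible still receive valid bytes.
        $out .= iconv('UTF-8', $target, '&#' . $parts[1] . ';');
    }
    return $out;
}
```

For example, calling it with "café 中" and 'ISO-8859-1' should give the Latin-1 bytes for "café " followed by &#20013;. Converting character by character is slow on long strings, so this is only meant to show the idea.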
Should this prove too cumbersome, the fallback is to escape all non-ASCII characters, even those the target encoding could represent raw (albeit with a different byte sequence). This should be achievable without iconv. The downside is that it won't work with encodings that are not backwards compatible with ASCII. (I could probably write this myself, but once again, the solution above is preferred.)
Extra plus if I don't have to roll a pure-PHP UTF-8 to Unicode codepoint array parser (I've already got one for another purpose, and I don't relish having to abstract it to support another operation).
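For the fallback, a rough sketch that sidesteps a hand-written UTF-8 parser by letting iconv produce the code points: convert to UCS-4BE, then read each 4-byte unit with unpack. Note that this version does lean on iconv for the decoding step. The function name utf8_escape_non_ascii is hypothetical, and the sketch assumes the iconv build supports UCS-4BE and that the output encoding is ASCII-compatible.

```php
<?php
// Hypothetical helper: keep ASCII as-is and turn every other character
// into a decimal numeric character reference. The result is pure ASCII,
// so it is valid in any ASCII-compatible output encoding.
// HTML special characters (<, >, &) are left untouched.
function utf8_escape_non_ascii($text)
{
    if ($text === '') {
        return '';
    }

    // UCS-4BE is a fixed 4 bytes per code point, big-endian, no BOM.
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $text);
    if ($ucs4 === false) {
        return false; // malformed UTF-8 input
    }

    $out = '';
    // 'N*' reads every unsigned 32-bit big-endian integer in the string.
    foreach (unpack('N*', $ucs4) as $cp) {
        $out .= ($cp < 0x80) ? chr($cp) : '&#' . $cp . ';';
    }
    return $out;
}
```

Here "aé中" would come out as a&#233;&#20013;, regardless of which ASCII-compatible encoding the page is finally served in.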
The function shouldn't escape special HTML characters, but it's not a big deal if it does, since there are multiple possible plug-points for it.
mbstring should be avoided for compatibility reasons. iconv is permissible.