Convert encoding where output cannot represent characters
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
I don't know the term for this, so attempts to Google have been problematic.
Let's say I have some Chinese text encoded in UTF-8. Now, for backwards-compatibility reasons, the client is unable to output text in UTF-8: everything must go out in ISO 8859-1. If the output were plaintext, I'd be totally out of luck: Latin-1 unsurprisingly doesn't have support for Chinese glyphs.
In HTML, however, character entities come to the rescue. Any Unicode character can be encoded like &#nnnn;, and some of the special ones even have their own readable codes.
So, here's the jive. I need a function that converts text from UTF-8 to an arbitrary character encoding (iconv does this) and escapes unexpressible characters with either numeric or character entity references (iconv does not, to my knowledge, do this).
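One way the function described above could look, as a minimal sketch rather than a tested library routine. It uses only iconv and PCRE. The name `utf8_to_charset_with_entities` is hypothetical, and it assumes valid UTF-8 input and a stateless target encoding, since each character is converted on its own. Because some iconv builds substitute a placeholder instead of failing, the sketch accepts a converted character only if it round-trips.

```php
<?php
/**
 * Convert UTF-8 text to $charset, replacing characters the target
 * cannot represent with numeric character references.
 * Hypothetical helper; assumes valid UTF-8 and a stateless target charset.
 */
function utf8_to_charset_with_entities($text, $charset)
{
    // Split into individual UTF-8 characters (PCRE only, no mbstring).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // input was not valid UTF-8
    }
    $out = '';
    foreach ($chars as $char) {
        // @ hides the notice iconv raises for unconvertible input.
        $converted = @iconv('UTF-8', $charset, $char);
        // Keep the result only if it converts back to the same character.
        if ($converted !== false
            && @iconv($charset, 'UTF-8', $converted) === $char) {
            $out .= $converted;
            continue;
        }
        // Not representable: read the code point from big-endian UCS-4.
        $parts = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
        // The reference is pure ASCII; convert it too, in case the
        // target encoding is not ASCII-compatible.
        $out .= iconv('UTF-8', $charset, '&#' . $parts[1] . ';');
    }
    return $out;
}

// "café 中" in UTF-8; é exists in Latin-1, 中 does not.
echo utf8_to_charset_with_entities("caf\xC3\xA9 \xE4\xB8\xAD", 'ISO-8859-1');
```

Calling iconv once per character is slow on long input. A cheap refinement would be to try converting the whole string first and fall back to this loop only when that fails.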
Should this prove to be too cumbersome, simply escape all non-ASCII characters even if the character encoding permits the use of that raw character, albeit with a different byte sequence (this should be achievable without iconv). Downside is it won't work with encodings that are not backwards compatible with ASCII. (I could probably write this, but once again, the above solution is preferred).
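The fallback is simpler. A rough sketch, with `utf8_escape_non_ascii` as a hypothetical name: every code point above U+007F becomes a numeric reference, so the result is pure ASCII and is valid as-is in any ASCII-compatible encoding. iconv is used here only as a convenient way to read the code point. A hand-rolled UTF-8 decoder would work just as well.

```php
<?php
/**
 * Replace every non-ASCII character in a UTF-8 string with a numeric
 * character reference. Hypothetical helper; returns null if the input
 * is not valid UTF-8 (preg_replace_callback fails in that case).
 */
function utf8_escape_non_ascii($text)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u', // any single code point outside ASCII
        function ($match) {
            // Big-endian UCS-4 is the code point as a 32-bit integer.
            $parts = unpack('N', iconv('UTF-8', 'UCS-4BE', $match[0]));
            return '&#' . $parts[1] . ';';
        },
        $text
    );
}

echo utf8_escape_non_ascii("caf\xC3\xA9 \xE4\xB8\xAD"); // input is "café 中"
```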
Extra plus if I don't have to roll a pure-PHP UTF-8 to Unicode codepoint array parser (I've already got one for another purpose, and I don't relish having to abstract it to support another operation).
The function shouldn't escape special HTML characters, but it's not a big deal if it does due to multiple possible plug-points.
mbstring should be avoided for compatibility reasons. iconv is permissible.
This data:
http://trac.akelos.org/cgi-bin/trac.cgi ... 8_mappings
should be enough (though it doesn't look complete) to calculate the difference between encodings and then change the characters that differ into HTML numeric entities.
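A table-driven version of that idea might look like the sketch below. The linked mapping tables are not reproduced here, so as an assumption the sketch derives an equivalent code point => byte map for a single-byte charset by asking iconv about each of the 256 byte values. Both function names are hypothetical, and the input is assumed to be an array of code points such as an existing UTF-8 parser would produce.

```php
<?php
// Build a "code point => byte" map for a single-byte charset.
// Assumption: stands in for a prebuilt mapping table.
function build_single_byte_map($charset)
{
    $map = array();
    for ($byte = 0; $byte < 256; $byte++) {
        $ucs4 = @iconv($charset, 'UCS-4BE', chr($byte));
        if ($ucs4 !== false && strlen($ucs4) === 4) {
            $parts = unpack('N', $ucs4);
            $map[$parts[1]] = $byte;
        }
        // Bytes the charset leaves undefined simply get no entry.
    }
    return $map;
}

// Emit the mapped byte when there is one, a numeric reference otherwise.
function encode_code_points(array $codePoints, array $map)
{
    $out = '';
    foreach ($codePoints as $cp) {
        $out .= array_key_exists($cp, $map)
            ? chr($map[$cp])
            : '&#' . $cp . ';';
    }
    return $out;
}

$latin1 = build_single_byte_map('ISO-8859-1');
// Code points for "c", "é" and "中".
echo encode_code_points(array(0x63, 0xE9, 0x4E2D), $latin1);
```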
- Ambush Commander
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
Code: Select all
iconv('UTF-8', 'someencoding', htmlentities($input, ENT_QUOTES, 'UTF-8'));
- Ambush Commander
I was so hoping that would work... nope. No good.
What happens is that htmlentities() doesn't touch characters not in its lookup table: no numeric entities out of that function besides the apostrophe. So then when you iconv it everything gets lost. If we replace the htmlentities with something that html-entity-izes everything non-ASCII, we get the second (less desirable) solution.
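The behaviour is easy to reproduce in isolation. A small sketch follows, with a sample string chosen only for illustration. The apostrophe and the é are escaped, but 中 has no named entity, so its raw UTF-8 bytes pass through and are still there when iconv sees the string.

```php
<?php
// "it's café 中" written out as UTF-8 bytes.
$input = "it's caf\xC3\xA9 \xE4\xB8\xAD";

$escaped = htmlentities($input, ENT_QUOTES, 'UTF-8');
// Named entities and the quote are handled; the CJK character is untouched.
var_dump($escaped === "it&#039;s caf&eacute; \xE4\xB8\xAD");

// The leftover character cannot be expressed in Latin-1. What happens next
// depends on the iconv build (a failure or some substitute), so the result
// is only dumped here, not asserted.
var_dump(@iconv('UTF-8', 'ISO-8859-1', $escaped));
```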
I've done more research on this, and it's honestly not worth the trouble to try to support full i18n when you're using an 8-bit character encoding: using UTF-8 magically solves all these problems. Marking this feature request (by me) WONTFIX.
Here's the test I used for it:
Code: Select all
<?php
// Serve the result in the requested charset; the form itself is UTF-8.
if (!empty($_POST) && isset($_POST['charset']) && isset($_POST['text'])) {
    $output = true;
    header('Content-type:text/html;charset=' . urlencode($_POST['charset']));
} else {
    $output = false;
    header('Content-type:text/html;charset=UTF-8');
}
?>
<html>
<head>
    <title>htmlentities + iconv Test</title>
</head>
<body>
<?php if ($output) {
    // Escape what htmlentities() knows about, then convert the rest.
    echo iconv('UTF-8', $_POST['charset'],
        htmlentities($_POST['text'], ENT_QUOTES, 'UTF-8')
    );
} else { ?>
    <p>Type in text in UTF-8, see it translated to another encoding.</p>
    <form method="post" action="2006-08-28.php">
        <fieldset>
            <textarea name="text" cols="60" rows="20"></textarea>
        </fieldset>
        <fieldset>
            <select name="charset">
                <option value="UTF-8">UTF-8</option>
                <option value="ISO-8859-1">ISO-8859-1</option>
            </select>
            <input type="submit" value="Submit" />
        </fieldset>
    </form>
<?php } ?>
</body>
</html>
- Ollie Saunders
Well that is disappointing.
Fair enough. Can I see the research you did?
- Ambush Commander
One of the links you posted, actually: http://ppewww.ph.gla.ac.uk/~flavell/cha ... -i18n.html
Essentially, it's so difficult to be multilingual on an 8-bit encoding, that those people have bigger problems to worry about than a third-party library that can't use anything besides UTF-8.
I recognize, however, that some people use sites that aren't i18n, so lossy encoding conversion done with iconv is probably acceptable.
- Ollie Saunders
Agree strongly.
[offTopic]Right, well, I'm about to start going through all my code, UTF-8ifying where necessary. Brace yourself, eyeballs, it's gonna be a long one.[/offTopic]
- Ambush Commander
Code: Select all
/**
 * Decodes a single UTF-8 char to its representation as
 * specified in the mapping array
 *
 * @access private
 * @see _Utf8StringDecode
 * @param array $chars Assoc array with chars to be decoded
 * @param integer &$id Current char position
 * @param array $mapping_array Mapping Array
 * @return string Decoded char
 */
function _Utf8ToChar($chars, &$id, $mapping_array)
{
    // A class method: $this->utf8ErrorChar is the replacement character.
    if (($chars[$id] >= 240) && ($chars[$id] <= 255)) {
        // Lead byte of a 4-byte sequence
        $utf = (intval($chars[$id] - 240) << 18)
            + (intval($chars[++$id] - 128) << 12)
            + (intval($chars[++$id] - 128) << 6)
            + (intval($chars[++$id] - 128) << 0);
    } elseif (($chars[$id] >= 224) && ($chars[$id] <= 239)) {
        // Lead byte of a 3-byte sequence
        $utf = (intval($chars[$id] - 224) << 12)
            + (intval($chars[++$id] - 128) << 6)
            + (intval($chars[++$id] - 128) << 0);
    } elseif (($chars[$id] >= 192) && ($chars[$id] <= 223)) {
        // Lead byte of a 2-byte sequence
        $utf = (intval($chars[$id] - 192) << 6)
            + (intval($chars[++$id] - 128) << 0);
    } else {
        // Single byte (ASCII)
        $utf = $chars[$id];
    }
    if (array_key_exists($utf, $mapping_array)) {
        return chr($mapping_array[$utf]);
    } else {
        return $this->utf8ErrorChar;
    }
}// -- end of &_Utf8ToChar -- //
- Ambush Commander
- Ollie Saunders
Quote: "It would probably be a better idea to serialize the lookup tables."
Oh wow, will that speed up big arrays then? Harry Fuecks would love that; his UTF-8 library has some nice big arrays.
Code: Select all
$UTF8_UPPER_TO_LOWER = array(
0x0041=>0x0061, 0x03A6=>0x03C6, 0x0162=>0x0163, 0x00C5=>0x00E5, 0x0042=>0x0062,
0x0139=>0x013A, 0x00C1=>0x00E1, 0x0141=>0x0142, 0x038E=>0x03CD, 0x0100=>0x0101,
0x0490=>0x0491, 0x0394=>0x03B4, 0x015A=>0x015B, 0x0044=>0x0064, 0x0393=>0x03B3,
0x00D4=>0x00F4, 0x042A=>0x044A, 0x0419=>0x0439, 0x0112=>0x0113, 0x041C=>0x043C,
0x015E=>0x015F, 0x0143=>0x0144, 0x00CE=>0x00EE, 0x040E=>0x045E, 0x042F=>0x044F,
0x039A=>0x03BA, 0x0154=>0x0155, 0x0049=>0x0069, 0x0053=>0x0073, 0x1E1E=>0x1E1F,
0x0134=>0x0135, 0x0427=>0x0447, 0x03A0=>0x03C0, 0x0418=>0x0438, 0x00D3=>0x00F3,
0x0420=>0x0440, 0x0404=>0x0454, 0x0415=>0x0435, 0x0429=>0x0449, 0x014A=>0x014B,
0x0411=>0x0431, 0x0409=>0x0459, 0x1E02=>0x1E03, 0x00D6=>0x00F6, 0x00D9=>0x00F9,
0x004E=>0x006E, 0x0401=>0x0451, 0x03A4=>0x03C4, 0x0423=>0x0443, 0x015C=>0x015D,
0x0403=>0x0453, 0x03A8=>0x03C8, 0x0158=>0x0159, 0x0047=>0x0067, 0x00C4=>0x00E4,
0x0386=>0x03AC, 0x0389=>0x03AE, 0x0166=>0x0167, 0x039E=>0x03BE, 0x0164=>0x0165,
0x0116=>0x0117, 0x0108=>0x0109, 0x0056=>0x0076, 0x00DE=>0x00FE, 0x0156=>0x0157,
0x00DA=>0x00FA, 0x1E60=>0x1E61, 0x1E82=>0x1E83, 0x00C2=>0x00E2, 0x0118=>0x0119,
0x0145=>0x0146, 0x0050=>0x0070, 0x0150=>0x0151, 0x042E=>0x044E, 0x0128=>0x0129,
0x03A7=>0x03C7, 0x013D=>0x013E, 0x0422=>0x0442, 0x005A=>0x007A, 0x0428=>0x0448,
0x03A1=>0x03C1, 0x1E80=>0x1E81, 0x016C=>0x016D, 0x00D5=>0x00F5, 0x0055=>0x0075,
0x0176=>0x0177, 0x00DC=>0x00FC, 0x1E56=>0x1E57, 0x03A3=>0x03C3, 0x041A=>0x043A,
0x004D=>0x006D, 0x016A=>0x016B, 0x0170=>0x0171, 0x0424=>0x0444, 0x00CC=>0x00EC,
0x0168=>0x0169, 0x039F=>0x03BF, 0x004B=>0x006B, 0x00D2=>0x00F2, 0x00C0=>0x00E0,
0x0414=>0x0434, 0x03A9=>0x03C9, 0x1E6A=>0x1E6B, 0x00C3=>0x00E3, 0x042D=>0x044D,
0x0416=>0x0436, 0x01A0=>0x01A1, 0x010C=>0x010D, 0x011C=>0x011D, 0x00D0=>0x00F0,
0x013B=>0x013C, 0x040F=>0x045F, 0x040A=>0x045A, 0x00C8=>0x00E8, 0x03A5=>0x03C5,
0x0046=>0x0066, 0x00DD=>0x00FD, 0x0043=>0x0063, 0x021A=>0x021B, 0x00CA=>0x00EA,
0x0399=>0x03B9, 0x0179=>0x017A, 0x00CF=>0x00EF, 0x01AF=>0x01B0, 0x0045=>0x0065,
0x039B=>0x03BB, 0x0398=>0x03B8, 0x039C=>0x03BC, 0x040C=>0x045C, 0x041F=>0x043F,
0x042C=>0x044C, 0x00DE=>0x00FE, 0x00D0=>0x00F0, 0x1EF2=>0x1EF3, 0x0048=>0x0068,
0x00CB=>0x00EB, 0x0110=>0x0111, 0x0413=>0x0433, 0x012E=>0x012F, 0x00C6=>0x00E6,
0x0058=>0x0078, 0x0160=>0x0161, 0x016E=>0x016F, 0x0391=>0x03B1, 0x0407=>0x0457,
0x0172=>0x0173, 0x0178=>0x00FF, 0x004F=>0x006F, 0x041B=>0x043B, 0x0395=>0x03B5,
0x0425=>0x0445, 0x0120=>0x0121, 0x017D=>0x017E, 0x017B=>0x017C, 0x0396=>0x03B6,
0x0392=>0x03B2, 0x0388=>0x03AD, 0x1E84=>0x1E85, 0x0174=>0x0175, 0x0051=>0x0071,
0x0417=>0x0437, 0x1E0A=>0x1E0B, 0x0147=>0x0148, 0x0104=>0x0105, 0x0408=>0x0458,
0x014C=>0x014D, 0x00CD=>0x00ED, 0x0059=>0x0079, 0x010A=>0x010B, 0x038F=>0x03CE,
0x0052=>0x0072, 0x0410=>0x0430, 0x0405=>0x0455, 0x0402=>0x0452, 0x0126=>0x0127,
0x0136=>0x0137, 0x012A=>0x012B, 0x038A=>0x03AF, 0x042B=>0x044B, 0x004C=>0x006C,
0x0397=>0x03B7, 0x0124=>0x0125, 0x0218=>0x0219, 0x00DB=>0x00FB, 0x011E=>0x011F,
0x041E=>0x043E, 0x1E40=>0x1E41, 0x039D=>0x03BD, 0x0106=>0x0107, 0x03AB=>0x03CB,
0x0426=>0x0446, 0x00DE=>0x00FE, 0x00C7=>0x00E7, 0x03AA=>0x03CA, 0x0421=>0x0441,
0x0412=>0x0432, 0x010E=>0x010F, 0x00D8=>0x00F8, 0x0057=>0x0077, 0x011A=>0x011B,
0x0054=>0x0074, 0x004A=>0x006A, 0x040B=>0x045B, 0x0406=>0x0456, 0x0102=>0x0103,
0x039B=>0x03BB, 0x00D1=>0x00F1, 0x041D=>0x043D, 0x038C=>0x03CC, 0x00C9=>0x00E9,
0x00D0=>0x00F0, 0x0407=>0x0457, 0x0122=>0x0123,
);
- Ollie Saunders
If you are using an opcode cache such as APC then yes.
Hmm, I think I'm going to have to install some kind of profiling extension so I can actually work out what is fast and what isn't. It seems people talk about performance all the time, myself most certainly included, and yet nobody actually knows, for a fact, what is fast and what isn't. It's all based on little hints of inaccurate information.
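Short of a profiling extension, a crude timing harness already gives a first hint. The sketch below makes several assumptions. `build_table` is a hypothetical stand-in for a large lookup table, `seconds_per_call` is a hypothetical helper, and the numbers only compare rebuilding the array in code against unserializing it within one process. The sketch says nothing about what an opcode cache would change, and no particular outcome is implied.

```php
<?php
// Hypothetical stand-in for a big lookup table like $UTF8_UPPER_TO_LOWER.
function build_table()
{
    $table = array();
    for ($cp = 0x41; $cp <= 0x5A; $cp++) {
        $table[$cp] = $cp + 0x20; // 'A'..'Z' => 'a'..'z'
    }
    return $table;
}

// Average wall-clock seconds per call of $fn over $iterations runs.
function seconds_per_call($fn, $iterations = 10000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

$serialized = serialize(build_table());

$rebuild = seconds_per_call('build_table');
$unserialize = seconds_per_call(function () use ($serialized) {
    return unserialize($serialized);
});

printf("rebuild:     %.3e s/call\n", $rebuild);
printf("unserialize: %.3e s/call\n", $unserialize);
```

Results will vary with PHP version, table size and machine, so the harness is best run against the real tables before drawing any conclusion.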