
get_html_translation_table only returns 100 entities

Posted: Wed Jul 01, 2009 11:48 am
by batfastad
Hi everyone

When I run this code:

Code:

echo count(get_html_translation_table(HTML_ENTITIES));
I only get 100 entities returned

However this page http://en.wikipedia.org/wiki/List_of_XM ... es_in_HTML says there should be 253 total valid XHTML entity codes

Is there a copy & paste or serialised PHP array anywhere I can use which has that full list of 253 entity codes and their UTF8 characters?

Cheers, B

Re: get_html_translation_table only returns 100 entities

Posted: Wed Jul 01, 2009 12:00 pm
by Eric!
The PHP functions aren't complete for some reason. You can search around the web, but I haven't seen any complete functions. There are some bits that people have posted in the manual comments for
http://php.net/htmlentities

Re: get_html_translation_table only returns 100 entities

Posted: Wed Jul 01, 2009 12:01 pm
by BornForCode
There are two new constants (HTML_ENTITIES, HTML_SPECIALCHARS) that allow you to specify the table you want.

For utf8 try:

Code:

 
function translation_table_to_utf8($arTranslationtable)
{
    //start with an empty result so an empty input table still returns an array
    $arUTFchars = array();
    //loop through the array and convert both keys and values
    foreach($arTranslationtable as $charkey => $char)
    {
        $charkey = utf8_encode($charkey);
        $arUTFchars[$charkey] = utf8_encode($char);
    }
    return $arUTFchars;
}
 
//get the translation table
$arSpecialchar     = get_html_translation_table(HTML_ENTITIES);
 
//call the function to convert to utf-8
$arUTFchars = translation_table_to_utf8($arSpecialchar);
print_r($arUTFchars);
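
For newer PHP versions, here is a minimal sketch that asks for the table in UTF-8 directly instead of converting it afterwards. It assumes PHP 5.3.4 or later, where get_html_translation_table() accepts a third encoding argument. The variable name $table is only illustrative, and the number of entries you get back depends on the PHP version.

```php
<?php
// Assumption: PHP 5.3.4+ (third "encoding" argument available).
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');

// Keys are the UTF-8 characters, values are the named entities.
echo count($table), "\n";

// Look up the euro sign (U+20AC, written here as raw UTF-8 bytes).
echo $table["\xE2\x82\xAC"], "\n";
```

On a recent PHP this should return well over 100 entries, but it is worth checking the count on your own install.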
 
 

Re: get_html_translation_table only returns 100 entities

Posted: Wed Jul 01, 2009 12:10 pm
by Eric!
Sorry, this is the page with example code for values not in the tables
http://cr.php.net/manual/en/function.ge ... -table.php

Re: get_html_translation_table only returns 100 entities

Posted: Thu Jul 02, 2009 11:19 am
by batfastad
@BornForCode... yeah, I tried the HTML_ENTITIES constant (see my code above) but it still doesn't return all 252 (HTML) or 253 (XHTML) entities.
That code you posted still only seems to return the 100 entities listed in get_html_translation_table()

@Eric
EDIT: Ah ok, yes those comments by Chris in the manual contain the missing characters. Thanks for the info.

I can combine that with get_html_translation_table() and convert all text entities into their numeric codes.
But then what's the best way to convert numeric codes to their actual characters?
It seems the mb_decode_numericentity() function only handles decimal numeric codes (&#NNNN;) and not the hex format (&#xNNNN;)

Any ideas?
Cheers, B
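
One possible approach to the hex question, as a rough sketch only: html_entity_decode() with the 'UTF-8' charset decodes both decimal (&#NNNN;) and hex (&#xNNNN;) numeric references. The sample string and variable names below are made up for illustration.

```php
<?php
// Sample input mixing a decimal reference, a hex reference and a named entity.
$text = 'Price: &#8364;5 or &#x20AC;5, caf&eacute;';

// Decode everything to UTF-8 characters in one call.
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Note that this also decodes &amp;, &lt;, &gt; and &quot;, so it does not fit if those have to stay as entities.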

Re: get_html_translation_table only returns 100 entities

Posted: Fri Jul 03, 2009 10:56 am
by batfastad
Right, after a day of faffing about, here we are!

Here's a PHP variable for all XHTML entities with their equivalent numeric char reference:

Code:

$entities_xhtml = array('&quot;'=>'&#34;', '&lt;'=>'&#60;', '&gt;'=>'&#62;', '&amp;'=>'&#38;', '&nbsp;'=>'&#160;', '&apos;'=>'&#39;', '&iexcl;'=>'&#161;', '&cent;'=>'&#162;', '&pound;'=>'&#163;', '&curren;'=>'&#164;', '&yen;'=>'&#165;', '&brvbar;'=>'&#166;', '&sect;'=>'&#167;', '&uml;'=>'&#168;', '&copy;'=>'&#169;', '&ordf;'=>'&#170;', '&laquo;'=>'&#171;', '&not;'=>'&#172;', '&shy;'=>'&#173;', '&reg;'=>'&#174;', '&macr;'=>'&#175;', '&deg;'=>'&#176;', '&plusmn;'=>'&#177;', '&sup2;'=>'&#178;', '&sup3;'=>'&#179;', '&acute;'=>'&#180;', '&micro;'=>'&#181;', '&para;'=>'&#182;', '&middot;'=>'&#183;', '&cedil;'=>'&#184;', '&sup1;'=>'&#185;', '&ordm;'=>'&#186;', '&raquo;'=>'&#187;', '&frac14;'=>'&#188;', '&frac12;'=>'&#189;', '&frac34;'=>'&#190;', '&iquest;'=>'&#191;', '&Agrave;'=>'&#192;', '&Aacute;'=>'&#193;', '&Acirc;'=>'&#194;', '&Atilde;'=>'&#195;', '&Auml;'=>'&#196;', '&Aring;'=>'&#197;', '&AElig;'=>'&#198;', '&Ccedil;'=>'&#199;', '&Egrave;'=>'&#200;', '&Eacute;'=>'&#201;', '&Ecirc;'=>'&#202;', '&Euml;'=>'&#203;', '&Igrave;'=>'&#204;', '&Iacute;'=>'&#205;', '&Icirc;'=>'&#206;', '&Iuml;'=>'&#207;', '&ETH;'=>'&#208;', '&Ntilde;'=>'&#209;', '&Ograve;'=>'&#210;', '&Oacute;'=>'&#211;', '&Ocirc;'=>'&#212;', '&Otilde;'=>'&#213;', '&Ouml;'=>'&#214;', '&times;'=>'&#215;', '&Oslash;'=>'&#216;', '&Ugrave;'=>'&#217;', '&Uacute;'=>'&#218;', '&Ucirc;'=>'&#219;', '&Uuml;'=>'&#220;', '&Yacute;'=>'&#221;', '&THORN;'=>'&#222;', '&szlig;'=>'&#223;', '&agrave;'=>'&#224;', '&aacute;'=>'&#225;', '&acirc;'=>'&#226;', '&atilde;'=>'&#227;', '&auml;'=>'&#228;', '&aring;'=>'&#229;', '&aelig;'=>'&#230;', '&ccedil;'=>'&#231;', '&egrave;'=>'&#232;', '&eacute;'=>'&#233;', '&ecirc;'=>'&#234;', '&euml;'=>'&#235;', '&igrave;'=>'&#236;', '&iacute;'=>'&#237;', '&icirc;'=>'&#238;', '&iuml;'=>'&#239;', '&eth;'=>'&#240;', '&ntilde;'=>'&#241;', '&ograve;'=>'&#242;', '&oacute;'=>'&#243;', '&ocirc;'=>'&#244;', '&otilde;'=>'&#245;', '&ouml;'=>'&#246;', '&divide;'=>'&#247;', '&oslash;'=>'&#248;', '&ugrave;'=>'&#249;', '&uacute;'=>'&#250;',
'&ucirc;'=>'&#251;', '&uuml;'=>'&#252;', '&yacute;'=>'&#253;', '&thorn;'=>'&#254;', '&yuml;'=>'&#255;', '&minus;'=>'&#8722;', '&circ;'=>'&#710;', '&tilde;'=>'&#732;', '&Scaron;'=>'&#352;', '&lsaquo;'=>'&#8249;', '&OElig;'=>'&#338;', '&lsquo;'=>'&#8216;', '&rsquo;'=>'&#8217;', '&ldquo;'=>'&#8220;', '&rdquo;'=>'&#8221;', '&bull;'=>'&#8226;', '&ndash;'=>'&#8211;', '&mdash;'=>'&#8212;', '&trade;'=>'&#8482;', '&scaron;'=>'&#353;', '&rsaquo;'=>'&#8250;', '&oelig;'=>'&#339;', '&Yuml;'=>'&#376;', '&fnof;'=>'&#402;', '&Alpha;'=>'&#913;', '&Beta;'=>'&#914;', '&Gamma;'=>'&#915;', '&Delta;'=>'&#916;', '&Epsilon;'=>'&#917;', '&Zeta;'=>'&#918;', '&Eta;'=>'&#919;', '&Theta;'=>'&#920;', '&Iota;'=>'&#921;', '&Kappa;'=>'&#922;', '&Lambda;'=>'&#923;', '&Mu;'=>'&#924;', '&Nu;'=>'&#925;', '&Xi;'=>'&#926;', '&Omicron;'=>'&#927;', '&Pi;'=>'&#928;', '&Rho;'=>'&#929;', '&Sigma;'=>'&#931;', '&Tau;'=>'&#932;', '&Upsilon;'=>'&#933;', '&Phi;'=>'&#934;', '&Chi;'=>'&#935;', '&Psi;'=>'&#936;', '&Omega;'=>'&#937;', '&alpha;'=>'&#945;', '&beta;'=>'&#946;', '&gamma;'=>'&#947;', '&delta;'=>'&#948;', '&epsilon;'=>'&#949;', '&zeta;'=>'&#950;', '&eta;'=>'&#951;', '&theta;'=>'&#952;', '&iota;'=>'&#953;', '&kappa;'=>'&#954;', '&lambda;'=>'&#955;', '&mu;'=>'&#956;', '&nu;'=>'&#957;', '&xi;'=>'&#958;', '&omicron;'=>'&#959;', '&pi;'=>'&#960;', '&rho;'=>'&#961;', '&sigmaf;'=>'&#962;', '&sigma;'=>'&#963;', '&tau;'=>'&#964;', '&upsilon;'=>'&#965;', '&phi;'=>'&#966;', '&chi;'=>'&#967;', '&psi;'=>'&#968;', '&omega;'=>'&#969;', '&thetasym;'=>'&#977;', '&upsih;'=>'&#978;', '&piv;'=>'&#982;', '&ensp;'=>'&#8194;', '&emsp;'=>'&#8195;', '&thinsp;'=>'&#8201;', '&zwnj;'=>'&#8204;', '&zwj;'=>'&#8205;', '&lrm;'=>'&#8206;', '&rlm;'=>'&#8207;', '&sbquo;'=>'&#8218;', '&bdquo;'=>'&#8222;', '&dagger;'=>'&#8224;', '&Dagger;'=>'&#8225;', '&hellip;'=>'&#8230;', '&permil;'=>'&#8240;', '&prime;'=>'&#8242;', '&Prime;'=>'&#8243;', '&oline;'=>'&#8254;', '&frasl;'=>'&#8260;', '&euro;'=>'&#8364;', '&image;'=>'&#8465;', 
'&weierp;'=>'&#8472;', '&real;'=>'&#8476;', '&alefsym;'=>'&#8501;', '&larr;'=>'&#8592;', '&uarr;'=>'&#8593;', '&rarr;'=>'&#8594;', '&darr;'=>'&#8595;', '&harr;'=>'&#8596;', '&crarr;'=>'&#8629;', '&lArr;'=>'&#8656;', '&uArr;'=>'&#8657;', '&rArr;'=>'&#8658;', '&dArr;'=>'&#8659;', '&hArr;'=>'&#8660;', '&forall;'=>'&#8704;', '&part;'=>'&#8706;', '&exist;'=>'&#8707;', '&empty;'=>'&#8709;', '&nabla;'=>'&#8711;', '&isin;'=>'&#8712;', '&notin;'=>'&#8713;', '&ni;'=>'&#8715;', '&prod;'=>'&#8719;', '&sum;'=>'&#8721;', '&lowast;'=>'&#8727;', '&radic;'=>'&#8730;', '&prop;'=>'&#8733;', '&infin;'=>'&#8734;', '&ang;'=>'&#8736;', '&and;'=>'&#8743;', '&or;'=>'&#8744;', '&cap;'=>'&#8745;', '&cup;'=>'&#8746;', '&int;'=>'&#8747;', '&there4;'=>'&#8756;', '&sim;'=>'&#8764;', '&cong;'=>'&#8773;', '&asymp;'=>'&#8776;', '&ne;'=>'&#8800;', '&equiv;'=>'&#8801;', '&le;'=>'&#8804;', '&ge;'=>'&#8805;', '&sub;'=>'&#8834;', '&sup;'=>'&#8835;', '&nsub;'=>'&#8836;', '&sube;'=>'&#8838;', '&supe;'=>'&#8839;', '&oplus;'=>'&#8853;', '&otimes;'=>'&#8855;', '&perp;'=>'&#8869;', '&sdot;'=>'&#8901;', '&lceil;'=>'&#8968;', '&rceil;'=>'&#8969;', '&lfloor;'=>'&#8970;', '&rfloor;'=>'&#8971;', '&lang;'=>'&#9001;', '&rang;'=>'&#9002;', '&loz;'=>'&#9674;', '&spades;'=>'&#9824;', '&clubs;'=>'&#9827;', '&hearts;'=>'&#9829;', '&diams;'=>'&#9830;');
 
$entities_xhtml_preserve = array('&quot;'=>'&#34;', '&lt;'=>'&#60;', '&gt;'=>'&#62;', '&amp;'=>'&#38;', '&nbsp;'=>'&#160;');
My next step was to convert all text entity codes to their numeric equivalents, apart from $entities_xhtml_preserve above... I want the user of the CMS to be able to enter the text entities &lt; &gt; &amp; &quot; and &nbsp;
But all other text entities are converted to numerics with this:

Code:

// convert entities to numerics... preserve selected
foreach ($entities_xhtml_preserve as $key => $value) {
    unset($entities_xhtml[$key]);
}
$var = str_replace( array_keys($entities_xhtml), array_values($entities_xhtml), $var);
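
To make the "preserve selected" step easier to follow, here is a small self-contained sketch on a cut-down table. It is not the code above: it swaps in array_diff_key() for the unset loop and strtr() for str_replace(), and the variable names and sample string are invented for the example.

```php
<?php
// Cut-down entity table (named entity => numeric reference).
$entities = array(
    '&amp;'    => '&#38;',
    '&nbsp;'   => '&#160;',
    '&eacute;' => '&#233;',
    '&euro;'   => '&#8364;',
);

// Entities the CMS user is allowed to keep in named form.
$preserve = array('&amp;' => '&#38;', '&nbsp;' => '&#160;');

// Drop the preserved keys, leaving only the entities to convert.
$convert = array_diff_key($entities, $preserve);

// strtr() with an array replaces each key once and never rescans its own output.
$out = strtr('caf&eacute;&nbsp;&amp; 5&euro;', $convert);

echo $out, "\n"; // caf&#233;&nbsp;&amp; 5&#8364;
```

The named &nbsp; and &amp; pass through untouched, while &eacute; and &euro; become numeric references.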
Final step is to convert all decimal numeric entities and hex numeric entities to their proper UTF8 character for saving into my DB:

Code:

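//helper used by the regex callbacks: decodes one decimal numeric entity
//(declared at top level so numeric_entities_to_chars() can be called more than once)
function utf8_entity_decode($entity){
    $convmap = array(0x0, 0x10000, 0, 0xfffff);
    return mb_decode_numericentity($entity, $convmap, 'UTF-8');
}
 
function numeric_entities_to_chars($var) {
 
    //decode decimal (preg_replace_callback instead of the /e modifier, which newer PHP no longer supports)
    $var = preg_replace_callback('/&#\d{2,5};/u', function ($m) {
        return utf8_entity_decode($m[0]);
    }, $var);
    //decode hex: convert to the decimal form first, then decode
    $var = preg_replace_callback('/&#x([a-fA-F0-9]{2,8});/u', function ($m) {
        return utf8_entity_decode('&#'.hexdec($m[1]).';');
    }, $var);
 
    return $var;
}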
 
echo numeric_entities_to_chars($var);
My thanks to Andrew Simpson for the function in his comment on the mb_decode_numericentity() man page (http://uk.php.net/manual/en/function.mb ... entity.php)... saved many hours!

Hope this helps someone out ;)

Re: get_html_translation_table only returns 100 entities

Posted: Fri Jul 03, 2009 11:10 am
by Eric!
batfastad thanks for sharing all your work on this, very nice of you. :drunk: