get_html_translation_table only returns 100 entities

Post by batfastad »

Hi everyone

When I run this code:

Code:

echo count(get_html_translation_table(HTML_ENTITIES));

I only get 100 entities returned.

However, this page http://en.wikipedia.org/wiki/List_of_XM ... es_in_HTML says there should be 253 valid XHTML entities in total.

Is there a copy-and-paste or serialised PHP array anywhere that has the full list of 253 entities and their UTF-8 characters?

Cheers, B
Last edited by batfastad on Thu Jul 02, 2009 11:23 am, edited 2 times in total.

Re: get_html_translation_table only returns 100 entities

Post by Eric! »

PHP's functions aren't complete for some reason. You can search around the web, but I haven't seen any complete functions. There are some bits that people have posted in the manual comments for htmlentities():
http://php.net/htmlentities
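
For anyone reading this on a newer PHP version, here is a minimal sketch of how the table size can depend on the arguments. It assumes PHP 5.4 or later, where get_html_translation_table() takes a flags argument (ENT_HTML401, ENT_HTML5 and so on) and an encoding argument. The variable names are only for illustration.

```php
<?php
// Assumption: PHP 5.4 or later, where get_html_translation_table()
// accepts a flags argument (document type) and an encoding argument.
$html401 = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$html5   = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Keys are the literal UTF-8 characters, values are the named entities.
echo count($html401), " entries for HTML 4.01\n";
echo count($html5), " entries for HTML5\n";

// Flip the table to look up a character by its entity name.
$byName = array_flip($html401);
echo $byName['&euro;'], "\n";
```

If the counts printed on your install are still around 100, the table is probably being built for a single-byte encoding, so it is worth checking which encoding is in effect.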

Re: get_html_translation_table only returns 100 entities

Post by BornForCode »

There are two new constants (HTML_ENTITIES, HTML_SPECIALCHARS) that allow you to specify the table you want.

For UTF-8, try:

Code:

// Convert every key and value of a translation table to UTF-8.
// Assumes the table is ISO-8859-1 encoded, which is what utf8_encode() expects.
function translation_table_to_utf8($arTranslationtable)
{
    $arUTFchars = array();
    foreach ($arTranslationtable as $charkey => $char) {
        $arUTFchars[utf8_encode($charkey)] = utf8_encode($char);
    }
    return $arUTFchars;
}

// get the translation table
$arSpecialchar = get_html_translation_table(HTML_ENTITIES);

// call the function to convert to UTF-8
$arUTFchars = translation_table_to_utf8($arSpecialchar);
print_r($arUTFchars);
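
Since utf8_encode() is deprecated as of PHP 8.2, the same idea can be sketched with mb_convert_encoding() instead. This is only an illustration: it assumes the mbstring extension is loaded and a PHP version whose get_html_translation_table() accepts an encoding argument, and table_latin1_to_utf8() is a made-up name.

```php
<?php
// Assumptions: the mbstring extension is loaded, and the PHP version
// supports the third (encoding) argument of get_html_translation_table().
// table_latin1_to_utf8() is a hypothetical helper name.
function table_latin1_to_utf8(array $table): array
{
    $out = [];
    foreach ($table as $char => $entity) {
        // Convert the key (the raw character); entity names are plain ASCII.
        $out[mb_convert_encoding($char, 'UTF-8', 'ISO-8859-1')] = $entity;
    }
    return $out;
}

// Ask for the ISO-8859-1 table explicitly so the conversion is not applied twice.
$latin1 = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'ISO-8859-1');
$utf8 = table_latin1_to_utf8($latin1);
echo $utf8["\xC3\xA9"], "\n"; // looks up the UTF-8 bytes for é
```

Requesting the table in a known encoding first matters here, because converting a table that is already UTF-8 would double-encode the keys.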
 
 

Re: get_html_translation_table only returns 100 entities

Post by Eric! »

Sorry, this is the page with example code for the values that are not in the tables:
http://cr.php.net/manual/en/function.ge ... -table.php

Re: get_html_translation_table only returns 100 entities

Post by batfastad »

@BornForCode... yeah, I tried the HTML_ENTITIES constant (see my code above), but it still doesn't return all 252 (HTML) or 253 (XHTML) entities.
The code you posted still only seems to return the 100 entities that get_html_translation_table() provides.

@Eric
EDIT: Ah ok, yes those comments by Chris in the manual contain the missing characters. Thanks for the info.

I can combine that with get_html_translation_table() and convert all text entities into their numeric codes.
But then what's the best way to convert numeric codes to their actual characters?
It seems the mb_decode_numericentity() function only handles decimal numeric codes (&#NNNN;) and not the hex format (&#xNNNN;).

Any ideas?
Cheers, B
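
One possible approach to the decimal versus hex question, as a rough sketch rather than a definitive answer: html_entity_decode() accepts both numeric forms, and a small callback can rewrite hex references as decimal for a decoder that only takes decimal. The sample strings are made up for illustration.

```php
<?php
// html_entity_decode() understands named, decimal and hexadecimal references.
$input = 'Price: &#8364;5 or &#x20AC;5 &ndash; caf&eacute;';
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n";

// If only hex references should be rewritten (for example to feed a
// decimal-only decoder), a callback can convert them to decimal first.
$decimalOnly = preg_replace_callback(
    '/&#x([0-9a-f]+);/i',
    function ($m) { return '&#' . hexdec($m[1]) . ';'; },
    '&#x20AC; and &#xE9;'
);
echo $decimalOnly, "\n";
```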

Re: get_html_translation_table only returns 100 entities

Post by batfastad »

Right, after a day of faffing about, here we are!

Here's a PHP variable for all XHTML entities with their equivalent numeric char reference:

Code:

$entities_xhtml = array('&quot;'=>'&#34;', '&lt;'=>'&#60;', '&gt;'=>'&#62;', '&amp;'=>'&#38;', '&nbsp;'=>'&#160;', '&apos;'=>'&#39;', '&iexcl;'=>'&#161;', '&cent;'=>'&#162;', '&pound;'=>'&#163;', '&curren;'=>'&#164;', '&yen;'=>'&#165;', '&brvbar;'=>'&#166;', '&sect;'=>'&#167;', '&uml;'=>'&#168;', '&copy;'=>'&#169;', '&ordf;'=>'&#170;', '&laquo;'=>'&#171;', '&not;'=>'&#172;', '&shy;'=>'&#173;', '&reg;'=>'&#174;', '&macr;'=>'&#175;', '&deg;'=>'&#176;', '&plusmn;'=>'&#177;', '&sup2;'=>'&#178;', '&sup3;'=>'&#179;', '&acute;'=>'&#180;', '&micro;'=>'&#181;', '&para;'=>'&#182;', '&middot;'=>'&#183;', '&cedil;'=>'&#184;', '&sup1;'=>'&#185;', '&ordm;'=>'&#186;', '&raquo;'=>'&#187;', '&frac14;'=>'&#188;', '&frac12;'=>'&#189;', '&frac34;'=>'&#190;', '&iquest;'=>'&#191;', '&Agrave;'=>'&#192;', '&Aacute;'=>'&#193;', '&Acirc;'=>'&#194;', '&Atilde;'=>'&#195;', '&Auml;'=>'&#196;', '&Aring;'=>'&#197;', '&AElig;'=>'&#198;', '&Ccedil;'=>'&#199;', '&Egrave;'=>'&#200;', '&Eacute;'=>'&#201;', '&Ecirc;'=>'&#202;', '&Euml;'=>'&#203;', '&Igrave;'=>'&#204;', '&Iacute;'=>'&#205;', '&Icirc;'=>'&#206;', '&Iuml;'=>'&#207;', '&ETH;'=>'&#208;', '&Ntilde;'=>'&#209;', '&Ograve;'=>'&#210;', '&Oacute;'=>'&#211;', '&Ocirc;'=>'&#212;', '&Otilde;'=>'&#213;', '&Ouml;'=>'&#214;', '&times;'=>'&#215;', '&Oslash;'=>'&#216;', '&Ugrave;'=>'&#217;', '&Uacute;'=>'&#218;', '&Ucirc;'=>'&#219;', '&Uuml;'=>'&#220;', '&Yacute;'=>'&#221;', '&THORN;'=>'&#222;', '&szlig;'=>'&#223;', '&agrave;'=>'&#224;', '&aacute;'=>'&#225;', '&acirc;'=>'&#226;', '&atilde;'=>'&#227;', '&auml;'=>'&#228;', '&aring;'=>'&#229;', '&aelig;'=>'&#230;', '&ccedil;'=>'&#231;', '&egrave;'=>'&#232;', '&eacute;'=>'&#233;', '&ecirc;'=>'&#234;', '&euml;'=>'&#235;', '&igrave;'=>'&#236;', '&iacute;'=>'&#237;', '&icirc;'=>'&#238;', '&iuml;'=>'&#239;', '&eth;'=>'&#240;', '&ntilde;'=>'&#241;', '&ograve;'=>'&#242;', '&oacute;'=>'&#243;', '&ocirc;'=>'&#244;', '&otilde;'=>'&#245;', '&ouml;'=>'&#246;', '&divide;'=>'&#247;', '&oslash;'=>'&#248;', '&ugrave;'=>'&#249;', '&uacute;'=>'&#250;', 
'&ucirc;'=>'&#251;', '&uuml;'=>'&#252;', '&yacute;'=>'&#253;', '&thorn;'=>'&#254;', '&yuml;'=>'&#255;', '&minus;'=>'&#8722;', '&circ;'=>'&#710;', '&tilde;'=>'&#732;', '&Scaron;'=>'&#352;', '&lsaquo;'=>'&#8249;', '&OElig;'=>'&#338;', '&lsquo;'=>'&#8216;', '&rsquo;'=>'&#8217;', '&ldquo;'=>'&#8220;', '&rdquo;'=>'&#8221;', '&bull;'=>'&#8226;', '&ndash;'=>'&#8211;', '&mdash;'=>'&#8212;', '&trade;'=>'&#8482;', '&scaron;'=>'&#353;', '&rsaquo;'=>'&#8250;', '&oelig;'=>'&#339;', '&Yuml;'=>'&#376;', '&fnof;'=>'&#402;', '&Alpha;'=>'&#913;', '&Beta;'=>'&#914;', '&Gamma;'=>'&#915;', '&Delta;'=>'&#916;', '&Epsilon;'=>'&#917;', '&Zeta;'=>'&#918;', '&Eta;'=>'&#919;', '&Theta;'=>'&#920;', '&Iota;'=>'&#921;', '&Kappa;'=>'&#922;', '&Lambda;'=>'&#923;', '&Mu;'=>'&#924;', '&Nu;'=>'&#925;', '&Xi;'=>'&#926;', '&Omicron;'=>'&#927;', '&Pi;'=>'&#928;', '&Rho;'=>'&#929;', '&Sigma;'=>'&#931;', '&Tau;'=>'&#932;', '&Upsilon;'=>'&#933;', '&Phi;'=>'&#934;', '&Chi;'=>'&#935;', '&Psi;'=>'&#936;', '&Omega;'=>'&#937;', '&alpha;'=>'&#945;', '&beta;'=>'&#946;', '&gamma;'=>'&#947;', '&delta;'=>'&#948;', '&epsilon;'=>'&#949;', '&zeta;'=>'&#950;', '&eta;'=>'&#951;', '&theta;'=>'&#952;', '&iota;'=>'&#953;', '&kappa;'=>'&#954;', '&lambda;'=>'&#955;', '&mu;'=>'&#956;', '&nu;'=>'&#957;', '&xi;'=>'&#958;', '&omicron;'=>'&#959;', '&pi;'=>'&#960;', '&rho;'=>'&#961;', '&sigmaf;'=>'&#962;', '&sigma;'=>'&#963;', '&tau;'=>'&#964;', '&upsilon;'=>'&#965;', '&phi;'=>'&#966;', '&chi;'=>'&#967;', '&psi;'=>'&#968;', '&omega;'=>'&#969;', '&thetasym;'=>'&#977;', '&upsih;'=>'&#978;', '&piv;'=>'&#982;', '&ensp;'=>'&#8194;', '&emsp;'=>'&#8195;', '&thinsp;'=>'&#8201;', '&zwnj;'=>'&#8204;', '&zwj;'=>'&#8205;', '&lrm;'=>'&#8206;', '&rlm;'=>'&#8207;', '&sbquo;'=>'&#8218;', '&bdquo;'=>'&#8222;', '&dagger;'=>'&#8224;', '&Dagger;'=>'&#8225;', '&hellip;'=>'&#8230;', '&permil;'=>'&#8240;', '&prime;'=>'&#8242;', '&Prime;'=>'&#8243;', '&oline;'=>'&#8254;', '&frasl;'=>'&#8260;', '&euro;'=>'&#8364;', '&image;'=>'&#8465;', 
'&weierp;'=>'&#8472;', '&real;'=>'&#8476;', '&alefsym;'=>'&#8501;', '&larr;'=>'&#8592;', '&uarr;'=>'&#8593;', '&rarr;'=>'&#8594;', '&darr;'=>'&#8595;', '&harr;'=>'&#8596;', '&crarr;'=>'&#8629;', '&lArr;'=>'&#8656;', '&uArr;'=>'&#8657;', '&rArr;'=>'&#8658;', '&dArr;'=>'&#8659;', '&hArr;'=>'&#8660;', '&forall;'=>'&#8704;', '&part;'=>'&#8706;', '&exist;'=>'&#8707;', '&empty;'=>'&#8709;', '&nabla;'=>'&#8711;', '&isin;'=>'&#8712;', '&notin;'=>'&#8713;', '&ni;'=>'&#8715;', '&prod;'=>'&#8719;', '&sum;'=>'&#8721;', '&lowast;'=>'&#8727;', '&radic;'=>'&#8730;', '&prop;'=>'&#8733;', '&infin;'=>'&#8734;', '&ang;'=>'&#8736;', '&and;'=>'&#8743;', '&or;'=>'&#8744;', '&cap;'=>'&#8745;', '&cup;'=>'&#8746;', '&int;'=>'&#8747;', '&there4;'=>'&#8756;', '&sim;'=>'&#8764;', '&cong;'=>'&#8773;', '&asymp;'=>'&#8776;', '&ne;'=>'&#8800;', '&equiv;'=>'&#8801;', '&le;'=>'&#8804;', '&ge;'=>'&#8805;', '&sub;'=>'&#8834;', '&sup;'=>'&#8835;', '&nsub;'=>'&#8836;', '&sube;'=>'&#8838;', '&supe;'=>'&#8839;', '&oplus;'=>'&#8853;', '&otimes;'=>'&#8855;', '&perp;'=>'&#8869;', '&sdot;'=>'&#8901;', '&lceil;'=>'&#8968;', '&rceil;'=>'&#8969;', '&lfloor;'=>'&#8970;', '&rfloor;'=>'&#8971;', '&lang;'=>'&#9001;', '&rang;'=>'&#9002;', '&loz;'=>'&#9674;', '&spades;'=>'&#9824;', '&clubs;'=>'&#9827;', '&hearts;'=>'&#9829;', '&diams;'=>'&#9830;');
 
$entities_xhtml_preserve = array('&quot;'=>'&#34;', '&lt;'=>'&#60;', '&gt;'=>'&#62;', '&amp;'=>'&#38;', '&nbsp;'=>'&#160;');

My next step was to convert all text entity codes to their numeric equivalents, apart from those in $entities_xhtml_preserve above. I want the user of the CMS to be able to enter the text entities &lt; &gt; &amp; &quot; and &nbsp;.
All other text entities are converted to numerics with this:

Code:

// convert entities to numerics... preserve selected
foreach ($entities_xhtml_preserve as $key => $value) {
    unset($entities_xhtml[$key]);
}
$var = str_replace( array_keys($entities_xhtml), array_values($entities_xhtml), $var);
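
The same "convert everything except a preserve list" step can also be sketched with array_diff_key() and strtr(). The two short tables below are hypothetical stand-ins for the full arrays, and the sample text is made up.

```php
<?php
// Hypothetical, shortened tables; the full ones would have about 250 entries.
$entities = [
    '&amp;'    => '&#38;',
    '&nbsp;'   => '&#160;',
    '&euro;'   => '&#8364;',
    '&eacute;' => '&#233;',
];
$preserve = ['&amp;' => '&#38;', '&nbsp;' => '&#160;'];

// array_diff_key() drops every entity whose name appears in $preserve.
$toConvert = array_diff_key($entities, $preserve);

// strtr() replaces each named entity in a single pass over the string.
$text = 'caf&eacute;&nbsp;&amp; bar: 5&euro;';
echo strtr($text, $toConvert), "\n";
```

The design difference is that strtr() never re-scans text it has already replaced, whereas str_replace() with arrays applies each pair in turn. With this particular table both give the same result, because no replacement value contains another key.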

The final step is to convert all decimal and hex numeric entities to their proper UTF-8 characters for saving into my DB:

Code:

// callback helper: decode one decimal numeric entity to its UTF-8 character
function utf8_entity_decode($entity) {
    $convmap = array(0x0, 0x10000, 0, 0xfffff);
    return mb_decode_numericentity($entity, $convmap, 'UTF-8');
}

function numeric_entities_to_chars($var) {
    // preg_replace_callback() is used because the /e modifier is not available in PHP 7+
    // decode decimal
    $var = preg_replace_callback('/&#\d{2,5};/u', function ($m) {
        return utf8_entity_decode($m[0]);
    }, $var);
    // decode hex by rewriting it as a decimal entity first
    $var = preg_replace_callback('/&#x([a-fA-F0-9]{2,8});/u', function ($m) {
        return utf8_entity_decode('&#' . hexdec($m[1]) . ';');
    }, $var);

    return $var;
}

echo numeric_entities_to_chars($var);

My thanks to Andrew Simpson for the function in his comment on the mb_decode_numericentity() man page (http://uk.php.net/manual/en/function.mb ... entity.php), which saved me many hours!

Hope this helps someone out ;)
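
For comparison, the whole pipeline (named and numeric entities to UTF-8, with a few entities left untouched) might be collapsed into one callback. This is only a sketch under assumptions: decode_entities_except() is a hypothetical helper, and it leans on html_entity_decode() rather than the hand-built table.

```php
<?php
// Hypothetical helper: decode every entity to UTF-8 except a preserve list.
function decode_entities_except(string $text, array $preserve): string
{
    return preg_replace_callback(
        // matches decimal (&#123;), hex (&#x7B;) and named (&name;) references
        '/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function ($m) use ($preserve) {
            if (in_array($m[0], $preserve, true)) {
                return $m[0]; // keep this entity exactly as typed
            }
            // unknown names come back unchanged from html_entity_decode()
            return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
        },
        $text
    );
}

$keep = ['&lt;', '&gt;', '&amp;', '&quot;', '&nbsp;'];
echo decode_entities_except('&lt;b&gt; caf&eacute; &#8364; &#x20AC;&nbsp;&amp;', $keep), "\n";
```

Note that the preserve check is case-sensitive and only matches the named forms, so a numeric reference such as &#38; would still be decoded.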

Re: get_html_translation_table only returns 100 entities

Post by Eric! »

batfastad, thanks for sharing all your work on this, very nice of you.