HTML numeric entities - different ones for same chars??
Posted: Wed Oct 18, 2006 7:30 am
Hi guys
I've been struggling with this issue for a couple of weeks now.
I'm converting a database from FileMaker to MySQL but the output I get from FileMaker has any funny characters already encoded to HTML numeric entities.
I'm trying to convert these numeric entities back to their characters - the script is made so I end up with a text file which is a bunch of SQL INSERT statements that I can just paste into phpMyAdmin as an SQL command.
The following functions I run on the data are...
Code: Select all
// Replace each numeric entity &#NNN; with chr(NNN)
function numericentitieshtml($str) {
    return preg_replace_callback('/&#(\d+);/', function ($m) {
        return chr((int) $m[1]);
    }, $str);
}
$string = strip_tags($string); // REMOVE ANY HTML FROM OUR STRING
$string = html_entity_decode($string, ENT_QUOTES); // CONVERT ALPHA HTML ENTITIES TO CHARS - LIKE &.trade, &.copy ETC
$string = numericentitieshtml($string); // CONVERT NUMERIC HTML ENTITIES TO CHARS - &.#128;
But I've noticed some of the entities coming out of the FileMaker XML output are non-standard...
Code: Select all
Euro symbol as &.#8364; instead of &.#128;
Trademark symbol as &.#8482; instead of &.#153;
3 dots as one character (like ...) come out as &.#8230; instead of &.#133;
All the above entities obviously without the periods.
And probably others, as the field in question is a notes field where people could enter pretty much any characters in the FileMaker database.
Now when I run the PHP functions above on my string, I get a quote mark (") instead of the TM, and a ¬ character instead of the Euro sign.
So numericentitieshtml() and html_entity_decode() do not understand the entities above - the &.#8xxx ones.
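A minimal sketch that reproduces those two wrong characters, on the assumption that chr() is the culprit: according to the PHP manual, chr() only works with byte values 0-255, and larger numbers wrap around modulo 256. The helper name wrapped_byte() is made up for this illustration.

```php
<?php
// Hypothetical helper: the byte value chr() ends up producing for a given
// entity number, assuming out-of-range values wrap modulo 256.
function wrapped_byte(int $codepoint): int
{
    return ord(chr($codepoint));
}

// The three entity numbers seen in the FileMaker output.
$entities = [8482 => 'trademark', 8364 => 'euro', 8230 => 'ellipsis'];

foreach ($entities as $number => $name) {
    printf("&#%d; (%s) -> byte %d\n", $number, $name, wrapped_byte($number));
}
```

Byte 34 is the double quote and byte 172 is ¬ in ISO-8859-1, which would match the quote mark and ¬ described above.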
So my question is, why are there different entity codes for the same characters?
Is there any way for me to convert the &.#8xxx ones to the &.#1xx codes above?
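One possible way to do that conversion, sketched under a few assumptions: the &.#8xxx numbers look like Unicode code points (U+20AC, U+2122, U+2026), the &.#1xx numbers match the Windows-1252 byte values for the same characters, and html_entity_decode() accepts 'cp1252' as its target charset. The function name unicode_to_cp1252_entities() is made up for this sketch, and it has not been tried against the real FileMaker export.

```php
<?php
// Hypothetical helper: rewrite numeric entities given as Unicode code
// points (e.g. &#8482;) into the Windows-1252 byte numbering (e.g. &#153;).
function unicode_to_cp1252_entities(string $str): string
{
    return preg_replace_callback('/&#(\d+);/', function (array $m): string {
        // Decode this one entity into a single Windows-1252 byte.
        $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'cp1252');

        // If the entity could not be decoded to exactly one byte
        // (for example, no Windows-1252 equivalent), leave it untouched.
        if (strlen($char) !== 1) {
            return $m[0];
        }

        return '&#' . ord($char) . ';';
    }, $str);
}

echo unicode_to_cp1252_entities('&#8364; &#8482; &#8230;'), "\n";
```

If the assumptions hold, the three entities should come back as &.#128; &.#153; &.#133; (again without the periods). Decoding straight to characters with the same html_entity_decode() call, instead of re-encoding as entities, may be the simpler route if the MySQL column uses a matching charset.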
Interestingly however, if I have a form and a script to process the form...
Code: Select all
// FORM
<html>
<head>
</head>
<body>
<form action="output.php" method="post">
<input type="hidden" name="string" value="&#8482;" />
<input type="submit" name="submit" value="submit" />
</form>
</body>
</html>
// OUTPUT PHP SCRIPT
<?php
// CONVERT CHARS TO HTML NUMERIC ENTITIES
function htmlnumericentities($str) {
    return preg_replace_callback('/[^!-%\x27-;=?-~ ]/', function ($m) {
        return '&#' . ord($m[0]) . ';';
    }, $str);
}
$string = $_POST['string'];
header('Content-type: text/plain; charset=iso-8859-1');
echo $string."\r\n";
$string2 = htmlnumericentities($string);
echo $string2."\r\n";
?>
So it appears that somewhere, either in PHP processing the form or in the browser (Firefox in this case) sending the data, these non-standard &.#8xxx entity codes are being converted to the standard &.#1xx codes that PHP understands.
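A rough sketch of what may be happening in the form case, assuming the browser submits the field as Windows-1252 (the form page declares no charset), so the trademark character travels as the single byte 0x99, URL-encoded as %99. The hard-coded request body below is an assumption for illustration, not captured traffic.

```php
<?php
// Assumed request body: what the browser might send for the hidden field
// if it encodes the trademark sign as the Windows-1252 byte 0x99.
$body = 'string=%99&submit=submit';

// parse_str() URL-decodes the body, much like PHP does when filling $_POST.
parse_str($body, $post);

// ord() reports the numeric value of the single byte that arrived.
$byte = ord($post['string']);
echo '&#' . $byte . ';', "\n";
```

That would explain why htmlnumericentities() reports &.#153; here: ord() is looking at one Windows-1252 byte rather than at the Unicode code point 8482.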
Does anyone have any idea what I'm on about here??
This has totally confused me over the past 2 weeks so typing this out has pretty much fried my brain!
Any ideas?
Thanks
Ben