HTML numeric entities - different ones for same chars??

Posted: Wed Oct 18, 2006 7:30 am
by batfastad
Hi guys

I've been struggling with this issue for a couple of weeks now.
I'm converting a database from FileMaker to MySQL but the output I get from FileMaker has any funny characters already encoded to HTML numeric entities.

I'm trying to convert these numeric entities back to their characters - the script is made so I end up with a text file which is a bunch of SQL INSERT statements that I can just paste into phpMyAdmin as an SQL command.

The functions I run on the data are:

Code:

function numericentitieshtml($str) {
        return preg_replace('/&#(\d+);/e', 'chr(str_replace(";", "", str_replace("&#","","$0")))', $str);
}

$string = strip_tags($string); // REMOVE ANY HTML FROM OUR STRING
$string = html_entity_decode($string, ENT_QUOTES); //CONVERT ALPHA HTML ENTITIES TO CHARS - LIKE &.trade, &.copy ETC
$string = numericentitieshtml($string); //CONVERT NUMERIC HTML ENTITIES TO CHARS - &.#128;

But I've noticed some of the entities coming out of the FileMaker XML output are non-standard...

Code:

Euro symbol as &.#8364; instead of &.#128;
Trademark symbol as &.#8482; instead of &.#153;
Ellipsis (3 dots as one character, like ...) as &.#8230; instead of &.#133;

All the above entities are obviously written without the periods.

And probably others, as the field in question is a notes field where people could enter pretty much any chars in the FileMaker database.

Now when I run those above PHP functions on my string, I get a quote mark " instead of the TM, and a ¬ character instead of the Euro sign.
So the numericentitieshtml() and html_entity_decode() do not understand the entities above - the &.#8xxx ones.

So my question is, why are there different entity codes for the same characters?
Is there any way for me to convert the &.#8xxx ones to the &.#1xx codes above?
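
For what it's worth, the two sets of numbers seem to come from two different character tables: 8364, 8482 and 8230 are the Unicode code points of those characters, while 128, 153 and 133 are the byte values the same characters have in the Windows-1252 code page. Below is a minimal sketch of mapping one onto the other, assuming PHP 7.1 or later with the iconv extension available. The function name entity_to_cp1252_number() is made up for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper: map a decimal numeric entity such as "&#8482;" to the
// byte value the same character has in Windows-1252 (e.g. 153 for the TM sign).
// Returns null if the entity cannot be decoded or has no Windows-1252 byte.
function entity_to_cp1252_number(string $entity): ?int
{
    // Step 1: decode the entity into a UTF-8 encoded character.
    $utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
    if ($utf8 === $entity) {
        return null; // nothing was decoded
    }

    // Step 2: re-encode as Windows-1252. iconv() returns false (and raises a
    // notice, silenced here) when the character is missing from that code page.
    $byte = @iconv('UTF-8', 'CP1252', $utf8);
    if ($byte === false || strlen($byte) !== 1) {
        return null;
    }

    // Step 3: the numeric value of that single byte.
    return ord($byte);
}

foreach (['&#8364;', '&#8482;', '&#8230;'] as $entity) {
    echo $entity, ' => ', entity_to_cp1252_number($entity), "\n";
}
```

Going the other way (from the 1xx numbers to Unicode) would be the same idea in reverse, but it only makes sense if the data really is Windows-1252, which is an assumption here.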


Interestingly however, if I have a form and a script to process the form...

Code:

// FORM
<html>
<head>
</head>
<body>

<form action="output.php" method="post">
<input type="hidden" name="string" value="&#8482;" />
<input type="submit" name="submit" value="submit">
</form>

</body>
</html>


// OUTPUT PHP SCRIPT
<?php

// CONVERT CHARS TO HTML NUMERIC ENTITIES
function htmlnumericentities($str) {
        return preg_replace('/[^!-%\x27-;=?-~ ]/e', '"&#".ord("$0").chr(59)', $str);
}

$string = $_POST['string'];

header('Content-type: text/plain; charset=iso-8859-1');

echo $string."\r\n";

$string2 = htmlnumericentities($string);
echo $string2."\r\n";
?>

Then the &.#8482; in the hidden form field gets sent to the output.php script and is output as the superscript TM character. Running the htmlnumericentities() function on it then correctly converts the superscript TM to &.#153;

So it appears that somewhere, either in PHP processing the form or in the browser (Firefox in this case) sending the data, these non-standard &.#8xxx entity codes are being converted to the standard &.#1xx codes that PHP understands.

Does anyone have any idea what I'm on about here??
This has totally confused me over the past 2 weeks so typing this out has pretty much fried my brain!

Any ideas?


Thanks

Ben

Posted: Wed Oct 18, 2006 7:54 am
by batfastad
Furthermore on this Wikipedia entry...
http://en.wikipedia.org/wiki/List_of_XM ... references

The entity references in the range &.#8xxx are the ones that are mentioned, and &.#128;

Can anyone explain this to me?
It's driving me nuts 8O

Posted: Wed Oct 18, 2006 11:53 am
by feyd
chr() can only convert numbers up to 255 (decimal) to an actual character. Beyond that, you're into the extended character sets.
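
That limit would also explain the exact characters reported earlier in the thread: chr() effectively keeps only the low 8 bits of the number, so 8482 ends up as 34 (a double quote) and 8364 ends up as 172 (the ¬ character in ISO-8859-1). Here is a small sketch to check the arithmetic. The masking is written out explicitly rather than passing an out-of-range value to chr(), and low_byte() is a made-up helper name.

```php
<?php
// Hypothetical helper: reduce a code point to the single byte value that
// chr() would effectively end up using (the low 8 bits).
function low_byte(int $codePoint): int
{
    return $codePoint & 0xFF;
}

foreach ([8482 => 'trademark', 8364 => 'euro'] as $codePoint => $name) {
    $byte = low_byte($codePoint);
    // Print the original number and the byte it collapses to.
    printf("%s: %d -> byte %d (0x%02X)\n", $name, $codePoint, $byte, $byte);
}
```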

Why not leave them as HTML entities?

Posted: Wed Oct 18, 2006 12:27 pm
by batfastad
Hi feyd

Ideally I'd like to keep the content of the fields as the pure text, rather than HTML-ised output.
However they will be output as HTML probably 99% of the time - it's an intranet contacts database system.
But I think I'll probably store the data as the pure text, without entities. Leaving them as entities would have been the backup plan if I couldn't get this going.

Funny you should suggest that though, as this afternoon I had an idea: rather than outputting that export file as text/plain, if I output it as text/html then the browser renders the entities as their characters anyway.
Then I just copy that into the phpMyAdmin SQL command window, and import all my records.

So I managed to get round this problem by letting the browser do the rendering/converting of the entities for me, then using the result of that.
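
The same conversion can probably be done in the export script itself, without the round trip through the browser. This is only a minimal sketch, assuming PHP 7 or later and a MySQL table and connection that both use UTF-8 (neither of which is confirmed anywhere in this thread). The function name decode_export_text() is made up for illustration.

```php
<?php
// Hypothetical helper: strip tags, then decode named and numeric entities
// straight to UTF-8 text, in the same order as the original script.
function decode_export_text(string $raw): string
{
    $text = strip_tags($raw);
    // With a UTF-8 target charset, entities above 255 such as &#8482;
    // can be represented, which chr() cannot do.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

$raw = 'Price: &#8364;10 &#8482; &#8230; &amp; &copy;';
echo decode_export_text($raw), "\n";
```

The output would then need to be served and imported as UTF-8 throughout, otherwise the multi-byte characters will show up as garbage again.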


Thanks

Ben