XML encoding

Posted: Sun Jan 31, 2010 6:11 am
by Jemt
Hello developers.

I have a problem with storing special characters in XML.
Until PHP gets native support for UTF-8 (so that the mb_* functions are no longer needed), I will be using ISO-8859-1. Unfortunately, today I found out that incompatible characters corrupt my XML files (e.g. the Euro sign (€)). Example:

Code: Select all

 
$xml = new DOMDocument();
$xml->loadXML("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root></root>");
$element = $xml->createElement("data");
 
$newAttribute = $xml->createAttribute("name");
$newTextNode = $xml->createTextNode(htmlentities("€"));
 
$newAttribute->appendChild($newTextNode);
$element->appendChild($newAttribute);
 
$xml->documentElement->appendChild($element);
echo $xml->saveXML(); // XML is corrupted, caused by the Euro sign
 
The result I would expect:

Code: Select all

 
<?xml version="1.0" encoding="ISO-8859-1"?>
<root><data name="&euro;" /></root>
 
The result I actually get (corrupted):

Code: Select all

 
<?xml version="1.0" encoding="ISO-8859-1"?>
<root><data name="&acirc;
 
I guess it makes sense, since the euro sign is not part of ISO-8859-1, but ISO-8859-15, so I tried saving the XML as ISO-8859-15, and invoked the htmlentities function like this: htmlentities("€", ENT_COMPAT, "ISO-8859-15")

Strangely enough, I still get corrupted XML. I have to change the XML file to UTF-8 for it to work, and I'm not interested in that.
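One thing that may explain the behaviour above: PHP's DOM extension expects the strings passed to its methods to be UTF-8, regardless of the encoding declared in the XML prolog. The following is only a sketch under that assumption (the input is taken to be UTF-8 and `htmlentities` is left out); the file itself stays declared as ISO-8859-1, and libxml2 typically writes characters that do not exist in that encoding as numeric character references when serialising.

```php
<?php
// Assumption: the value arrives as UTF-8 (the euro sign, written as raw bytes).
$value = "\xE2\x82\xAC";

$xml = new DOMDocument();
$xml->loadXML('<?xml version="1.0" encoding="ISO-8859-1"?><root></root>');

$element = $xml->createElement('data');
// DOM methods take UTF-8 strings; no htmlentities() call is used here.
$element->setAttribute('name', $value);
$xml->documentElement->appendChild($element);

// Serialised using the encoding declared in the prolog (ISO-8859-1).
$xmlStr = $xml->saveXML();

// Re-parse to confirm the output is well-formed and the value survived.
$check = new DOMDocument();
$ok = $check->loadXML($xmlStr);
$roundTrip = $check->documentElement->firstChild->getAttribute('name');

var_dump($ok, $roundTrip === $value);
```

Reading the attribute back always yields UTF-8 again, so a conversion to ISO-8859-1 would still be needed wherever the rest of the application consumes the value.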

Even if I succeed in getting the euro sign to work, I will still have problems with many other characters that exist in UTF-8 but not in ISO-8859-1, so I figured the proper solution was to detect the encoding and simply throw an exception if invalid data is being "submitted" to the XML.

I tried mb_detect_encoding(..) to determine the encoding of the data, but it seems to be very buggy.

Code: Select all

 
echo mb_detect_encoding("Here is a € (euro) sign", "ISO-8859-1, ISO-8859-15, UTF-8");
 
The code above is supposed to yield "ISO-8859-15", but what I get is "ISO-8859-1", which is not correct, so I can't use it to detect the encoding.
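A note on why detection between the two ISO encodings is hard: every byte sequence is valid in both ISO-8859-1 and ISO-8859-15, so the bytes alone cannot tell them apart. What can be checked reliably is whether a string is well-formed UTF-8. A minimal sketch, where `looksLikeUtf8` is a hypothetical helper name and not part of the original code:

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8Euro   = "\xE2\x82\xAC"; // the euro sign encoded as UTF-8 (three bytes)
$latin9Euro = "\xA4";         // the euro sign encoded as ISO-8859-15 (one byte)

var_dump(looksLikeUtf8($utf8Euro));   // bool(true)
var_dump(looksLikeUtf8($latin9Euro)); // bool(false): a lone 0xA4 is not valid UTF-8
```

The single byte 0xA4 is also a valid ISO-8859-1 character (the generic currency sign), which is why a detector that is offered ISO-8859-1 first has no reason to reject it.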

At this point I would very much appreciate any input that can help me solve this problem. Again, I only want to save data as ISO-8859-1. htmlentities should take care of encoding characters such as the euro sign into entities that are compatible with ISO-8859-1, and I would like to be able to simply throw an exception if invalid data is submitted to the XML file.

Thanks in advance! :-)

Re: XML encoding

Posted: Sun Jan 31, 2010 8:54 am
by Jemt
The best solution I have been able to come up with is using utf8_decode($str). This replaces every character that is not compatible with ISO-8859-1 with a question mark (?). People using this PHP application know that it is built for ISO-8859-1 data only, so this solution is acceptable. At least it is far better than the XML files being corrupted.
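For readers on newer PHP versions, where utf8_decode is deprecated (as of PHP 8.2), roughly the same replacement behaviour can be sketched with the mbstring extension. This assumes the input is UTF-8 and that mbstring is available; the sample string is made up for illustration.

```php
<?php
// Make the replacement character explicit instead of relying on php.ini.
mb_substitute_character(0x3F); // "?"

// Sample UTF-8 input: a euro sign and an "é", written as raw bytes.
$utf8 = "Price: 5 \xE2\x82\xAC, caf\xC3\xA9";

$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// The euro sign has no ISO-8859-1 code point, so it becomes "?";
// "é" exists in ISO-8859-1 and becomes the single byte 0xE9.
var_dump($latin1 === "Price: 5 ?, caf\xE9"); // bool(true)
```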

Hopefully this answer will be helpful to someone else :-)

Re: XML encoding

Posted: Sun Jan 31, 2010 10:14 am
by requinix
So why can't you use UTF-8 for the XML?

Re: XML encoding

Posted: Sun Jan 31, 2010 3:45 pm
by Jemt
Hello again.

Unfortunately my solution didn't work as expected. Instead I wrote a function, which is supposed to work, but only works consistently on some servers.

Code: Select all

 
function IsLatin1($str)
{
    return (preg_match("/^[\\x00-\\xFF]*$/u", $str) === 1);
}
 
This is becoming really frustrating. I'm working on a pretty large application, which unfortunately contains a bug that results in corrupted XML files. Adding support for multibyte encodings will take months, so for now I just need a fix that will prevent multibyte characters from being "committed" to the DOMDocument, and thereby to the XML file.

I'm open to suggestions that allow me to keep using ISO-8859-1. It could be a function that removes or replaces non-ISO-8859-1 characters, or one that simply detects them, so I can output an error message.
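One possible shape for the detection variant, as a sketch only: it assumes the incoming string is UTF-8 and that mbstring is available, and `fitsLatin1` is a hypothetical name. Instead of relying on PCRE's UTF-8 mode, it converts the string to ISO-8859-1 and back and checks that nothing was lost on the way.

```php
<?php
// Hypothetical helper: true when $utf8 is valid UTF-8 and every character
// in it can be represented in ISO-8859-1.
function fitsLatin1(string $utf8): bool
{
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        return false; // not well-formed UTF-8 at all
    }

    // Characters without an ISO-8859-1 equivalent are substituted during
    // conversion, so the round trip no longer equals the input.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8;
}

var_dump(fitsLatin1("caf\xC3\xA9"));           // bool(true):  "é" is in ISO-8859-1
var_dump(fitsLatin1("5 \xE2\x82\xAC"));        // bool(false): the euro sign is not
```

A caller could throw an exception when the function returns false, before the value is handed to the DOMDocument.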

Thanks in advance

Re: XML encoding

Posted: Mon Feb 01, 2010 12:23 am
by Jemt
Hello again.

I did a quick test this morning:

Code: Select all

 
<?php
 
header("content-type: text/html;charset=ISO-8859-1");
 
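// Fall back to an empty string on the first page load, when nothing has been posted yet
$str = isset($_POST["input"]) ? $_POST["input"] : "";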
 
$str = utf8_encode($str);
$str = htmlentities($str, ENT_COMPAT, "UTF-8");
 
$str = utf8_decode($str);
$str = html_entity_decode($str);
 
echo $str;
 
?>
 
<html>
<head>
<title>Test</title>
</head>
<body>
 
<form action="encoding.php" method="post">
        <input name="input">
        <input type="submit">
</form>
 
</body>
</html>
 
Notice how I make sure the page is marked as ISO-8859-1 using header(..).

The odd thing is, if I submit the euro sign (€ - ISO-8859-15), it gets written on the page just fine. I would expect it to become a question mark.

Also, it seems that the data is posted from the client as UTF-8. Is this correct, even though the page is sent as ISO-8859-1?

Re: XML encoding

Posted: Mon Feb 01, 2010 11:55 pm
by Jemt
It seems I have two options at this point. Either I accept that DOMDocument works better with UTF-8 and simply make sure data is encoded to UTF-8 before going into the DOM and decoded back to ISO-8859-1 when extracted from the DOM, or I re-parse the XML document before saving it to make sure it hasn't been corrupted:

Code: Select all

 
$xml = new DOMDocument();
$xml->loadXML("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root></root>");
$element = $xml->createElement("data");
 
$newAttribute = $xml->createAttribute("name");
$newTextNode = $xml->createTextNode(htmlentities("€"));
 
$newAttribute->appendChild($newTextNode);
$element->appendChild($newAttribute);
 
$xml->documentElement->appendChild($element);
$xmlStr = $xml->saveXML(); // XML is corrupted, caused by the Euro sign
 
$res = $xml->loadXML($xmlStr); // Parse again to make sure XML is valid
 
if ($res === true)
        // Save to file