Escaping characters for writing to xml

Posted: Sun Jan 21, 2007 10:39 pm
by zirconx
I'm working on a configuration page that reads and writes config data from an xml file.

I am running into problems with special characters. I know XML data can't contain raw ampersands or the less-than sign. I had problems when I just ran the data through htmlentities(). My copyright symbol would get written out as &copy;, but then I get PHP errors when I try to read the file back in. Running the data through htmlentities() twice seems to work well. Then I end up with &amp;copy;, which I think gets turned into &copy; when the XML is read into the SimpleXML object.

Am I doing this right, or is there a better way? It even seems to work right if I put in an ampersand as input, surprisingly.

I also came across this function on the php.net comments:

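// Replaces each byte outside the allowed ASCII set with &#N; (byte-wise, not UTF-8 aware).
// The /e modifier was removed in PHP 7, so a callback is used instead.
function htmlnumericentities($str){
    return preg_replace_callback('/[^!-%\x27-;=?-~ ]/', function ($m) {
        return '&#' . ord($m[0]) . ';';
    }, $str);
}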
Which turns the copyright symbol into &#169;, which is exactly what Apple's site gives as an example for the copyright symbol in iTunes RSS XML files.

But when I load that up in my edit page, I end up with a funny character (Â) just before the copyright symbol. If I manually edit the XML file and change it to &#169; then it works OK. Man, this is confusing! Maybe I should just put everything into CDATA! But that makes for an ugly file when you have to hand-edit it.
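
A likely explanation for the stray Â, sketched below: if the form data arrives as UTF-8, the copyright sign is two bytes (0xC2 0xA9), and a byte-wise replacement emits one numeric entity per byte. Byte 194 is Â in ISO-8859-1. Both function names here are hypothetical helpers written for illustration, and the first is a simplified stand-in for the php.net function, not a copy of it.

```php
<?php
// Hypothetical helper: replaces every non-ASCII byte with &#N;, one byte at a time.
function bytewise_numeric_entities(string $s): string
{
    return preg_replace_callback('/[\x80-\xFF]/', function (array $m): string {
        return '&#' . ord($m[0]) . ';';
    }, $s);
}

// Hypothetical helper: the /u flag makes the pattern match whole UTF-8 characters,
// and iconv to UCS-4BE yields the code point as a 32-bit big-endian integer.
function utf8_numeric_entities(string $s): string
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $codePoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $m[0]))[1];
        return '&#' . $codePoint . ';';
    }, $s);
}

$latin1 = "\xA9";      // copyright sign in ISO-8859-1: one byte
$utf8   = "\xC2\xA9";  // copyright sign in UTF-8: two bytes

echo bytewise_numeric_entities($latin1), "\n"; // &#169;
echo bytewise_numeric_entities($utf8), "\n";   // &#194;&#169;
echo utf8_numeric_entities($utf8), "\n";       // &#169;
```

So the byte-wise approach only gives the expected &#169; when the input is single-byte Latin-1.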

Thanks for any help.

Posted: Mon Jan 22, 2007 3:44 pm
by Christopher
Not sure if you had a question, but yes there is some monkeying around to get this kind of thing to work. You might also look into registering those meta characters with the XML parser (depending on which one you are using) so they are allowed.

Posted: Mon Jan 22, 2007 4:53 pm
by zirconx
I was just looking for some input on escaping for XML, and if I was doing it right. In all my searching I haven't found anyone saying to just run the data through htmlentities() twice, but it seems to work well.

Posted: Mon Jan 22, 2007 7:21 pm
by Ambush Commander
The short answer: If you want things to be as hassle-free as possible, use a library like DOM or xml for WRITING data as well as for reading it in.

The long answer: Unless you explicitly state so using an external entity file definition inclusion section (generally not a good idea), you can NOT use named character entities like &copy;. Furthermore, you may NOT directly concatenate data into an XML string unless it is UTF-8 (or you changed the encoding of the XML, which is essentially unheard of).

The steps:
1. Transform all named entities into their real characters using html_entity_decode() or use HTMLPurifier_EntityParser->escapeNonSpecialEntities with these classes: EntityParser, EntityLookup and a serialized lookup array of all the entities.
2. Use utf8_encode() if you're dealing with ISO 8859-1, or iconv, to convert to UTF-8.
3. Then use htmlspecialchars() on that part to escape special characters like angled brackets and quotes
4. Finally you can stuff it in the XML.
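
The steps above can be sketched roughly as follows. The function name xml_escape_text and its $sourceEncoding parameter are made up for illustration. One deliberate deviation: the sketch converts to UTF-8 before decoding entities, on the assumption that this lets entities outside Latin-1 be represented. It also uses iconv rather than utf8_encode() and the ENT_XML1 flag (PHP 5.4+), neither of which the post prescribes.

```php
<?php
// Hypothetical helper that follows the four steps listed above.
function xml_escape_text(string $input, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Step 2 (done first here): convert to UTF-8; iconv returns false on failure.
    $utf8 = iconv($sourceEncoding, 'UTF-8', $input);
    if ($utf8 === false) {
        throw new InvalidArgumentException('Input is not valid ' . $sourceEncoding);
    }

    // Step 1: turn named entities such as &copy; into real characters.
    $decoded = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Step 3: escape the characters that are special in XML (& < > " ').
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Step 4: the escaped text can now be placed inside an element.
echo '<title>' . xml_escape_text('Tom &amp; Jerry &copy; 2007 <b>') . '</title>', "\n";
```

With this input the named entities are decoded once and only the XML-special characters are re-escaped, so the copyright sign ends up in the output as a literal UTF-8 character.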

Posted: Mon Jan 22, 2007 7:50 pm
by zirconx
Thanks for the reply. Yes I am using a DOMDocument object to write my XML.

Questions about your steps 2 and 3:

I am creating this app from scratch, so I plan on having everything in UTF-8. My understanding is that if I make sure my page has a header identifying it as UTF-8, then when the user submits their data from the form it will be in UTF-8. Correct?

And you say to use htmlspecialchars(), but wouldn't that leave me with things like &gt;? I thought this wasn't allowed in XML. Do I need to run it through htmlspecialchars() twice so I end up with &amp;gt;?

Thanks.

Posted: Mon Jan 22, 2007 7:55 pm
by Ambush Commander
There are five special named entities that do work in XML: amp, quot, lt, gt and apos. They always work.
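
A quick way to check this, as a minimal sketch using SimpleXML (the element name t is arbitrary): the five predefined entities parse without any DTD, while an HTML-only entity such as &copy; makes the document ill-formed.

```php
<?php
// Collect parse errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The five predefined XML entities need no declaration.
$ok = simplexml_load_string('<t>&amp; &lt; &gt; &quot; &apos;</t>');

// &copy; is an HTML entity; without a DTD declaring it, parsing fails.
$bad = simplexml_load_string('<t>&copy;</t>');

var_dump((string) $ok);
var_dump($bad); // bool(false)
```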

However, you should be cognizant of the fact that DOMText($text) and other variants automatically do the escaping for you. In this case, no escaping at all is necessary.
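
A minimal sketch of that automatic escaping, assuming UTF-8 input. The element names and the $raw sample string are placeholders, and DOMDocument::createTextNode() is used here because it returns a DOMText node already attached to the document.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$channel = $dom->appendChild($dom->createElement('channel'));
$title = $channel->appendChild($dom->createElement('title'));

// Raw, unescaped UTF-8 text; stands in for something like $_POST['title'].
$raw = "Tom & Jerry \xC2\xA9 2007 <b>";

// The text node receives the raw string; DOM escapes & and < when serializing.
$title->appendChild($dom->createTextNode($raw));

$xml = $dom->saveXML();
echo $xml;
```

Loading the result back with a parser should give the original string, which is why no htmlentities() or htmlspecialchars() call appears in this sketch.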

Regarding your UTF-8 question, yes.

Posted: Mon Jan 22, 2007 11:09 pm
by zirconx
I am populating the DOMDocument object from a SimpleXML object. The SimpleXML object gets populated from the XML config file.

I take the user's input from the form, load it into a few places in the SimpleXML object, then import the SimpleXML object into a new DOMDocument object, then use $dom->save to write out the XML. I'm using SimpleXML for reading/using/updating-in-memory because it's a lot easier to use than DOMDocument. But as far as I could tell, SimpleXML cannot write XML back out, so I had to convert it to DOMDocument.

Anyway, when I update the SimpleXML object I am running stuff through htmlentities twice, like this:

$xml->channel->title = htmlentities(htmlentities($_POST['title']));

I wasn't aware of the DOMText function. Perhaps instead I should be doing:

$xml->channel->title = DOMText($_POST['title']);

Posted: Tue Jan 23, 2007 4:01 pm
by Ambush Commander
SimpleXML and DOM are different extensions. Even though they support transferring data structures between one another, you can't mix the two together, and DOMText is part of DOM. I'm not so sure about SimpleXML's structure, sorry.

BTW: DOMText is an object, and you can use ->asXML() to convert SimpleXML to a string.
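
For illustration, a minimal sketch of the SimpleXML-only route, assuming UTF-8 input. The sample feed and the $raw string are made up. Assigning a plain string to an existing element appears to escape it on output, so the double htmlentities() call should not be needed in this setup.

```php
<?php
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<rss><channel><title>Old title</title></channel></rss>'
);

// Raw, unescaped UTF-8 text; stands in for something like $_POST['title'].
$raw = "Tom & Jerry \xC2\xA9 2007 <b>";

// Plain property assignment: no manual escaping applied.
$xml->channel->title = $raw;

// asXML() with no argument returns the document as a string.
$out = $xml->asXML();
echo $out;
```

Passing a file name, as in $xml->asXML('config.xml'), writes the document to that file instead of returning it; the file name here is only an example.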

Posted: Tue Jan 23, 2007 8:20 pm
by zirconx
Wow, that ->asXML() is helpful. I looked all over the SimpleXML documentation for a way to write the XML back out, but couldn't find it.

Thanks.