Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 7:16 am
by batfastad
Hi everyone
Obviously when outputting anything to the browser in HTML format, it should be properly escaped so that the page displays/validates correctly.
When outputting HTML I use the following function:
Code:
    function html_clean($var) {
        return htmlspecialchars($var, ENT_COMPAT, 'UTF-8');
    }
All my databases and websites are in UTF-8 these days so the only characters I worry about converting to entities are the HTML special characters.
However for our intranet I've finally got round to building some RSS feeds of various data.
What characters do I need to escape for XML?
Obviously the HTML special characters (& < > " and ') have to be escaped for XML as well.
And I'm under the impression that apostrophe is optional for escaping in XHTML. We make sure that all our tags/attributes use " rather than ' anyway.
But do I need to convert other characters, e.g. non-English characters, to numeric entities for XML (RSS) output?
Or can I just leave them as native UTF-8 as they come out of our database?
Cheers, B
Re: Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 9:01 am
by Weirdan
batfastad wrote:
But do I need to convert other characters, e.g. non-English characters, to numeric entities for XML (RSS) output?
Or can I just leave them as native UTF-8 as they come out of our database?
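Any conforming parser is required to support UTF-8, as well as the full repertoire of Unicode characters (except for the surrogate blocks, U+FFFE and U+FFFF, and most control characters below U+0020). The only characters that are forbidden in attribute values are < and &, plus whichever quote character (" or ') delimits the value. The only characters that are forbidden in node contents are < and & (except when used to start another tag or reference an entity); the sequence ]]> must not appear literally in content either.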
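For illustration, the rules above could be folded into a single helper along the lines of html_clean() from the first post. This is only a sketch: the name xml_clean() is made up here, and it assumes PHP 5.4 or later for the ENT_XML1 and ENT_SUBSTITUTE flags. It escapes more than content strictly requires (quotes as well), so the same function is safe for attribute values too.

```php
<?php
// Hypothetical helper, modelled on html_clean() from the first post.
// Assumes PHP 5.4+ (ENT_XML1 and ENT_SUBSTITUTE do not exist before that).
function xml_clean($var) {
    // Drop the control characters XML 1.0 does not allow at all: everything
    // below U+0020 except tab, line feed and carriage return. These are
    // single bytes in UTF-8, so a byte-level pattern is enough.
    $var = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $var);
    // Escape & < > " and ' (the last as &apos;). Invalid UTF-8 sequences are
    // replaced with U+FFFD instead of producing an empty string.
    return htmlspecialchars($var, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8');
}

// Accented characters pass through untouched as UTF-8.
echo '<title>' . xml_clean('Tom & Jerry <café>') . "</title>\n";
```

Escaping the quotes inside node content is harmless, which is why one function can serve both cases.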
Re: Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 9:26 am
by batfastad
I'm not using a parser, just looping through records and outputting the XML straight from PHP.
I thought I'd only need a parser if I wanted to manipulate existing XML in some way?
So inside an XML node, I don't need to convert to entities " and ' ?
For accented and non-English characters, do I need to convert those to numeric entity references or can I just output them as UTF-8 characters (I've got UTF-8 encoding set for my feed)?
Re: Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 9:31 am
by Weirdan
batfastad wrote:I'm not using a parser, just looping through records and outputting the XML straight from PHP.
I thought I'd only need a parser if I wanted to manipulate existing XML in some way?
But anyone consuming your feed is using a parser - and thus those parsers must support UTF-8.
batfastad wrote:
So inside an XML node, I don't need to convert to entities " and ' ?
Right.
batfastad wrote:For accented and non-English characters, do I need to convert those to numeric entity references or can I just output them as UTF-8 characters (I've got UTF-8 encoding set for my feed)?
You don't need to convert them if your feed is using UTF-8.
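A rough sketch of what "using UTF-8" means in practice: declare the encoding in the XML prolog, escape only the XML special characters, and write everything else out exactly as it comes from the database. The function build_feed(), its parameters, the item keys and the example URL are all invented for illustration, and ENT_XML1 assumes PHP 5.4 or later.

```php
<?php
// Hypothetical sketch: build a small RSS 2.0 document as a UTF-8 string.
// build_feed() and the 'title'/'link' item keys are made up for this example.
function build_feed($title, $link, $description, array $items) {
    // Escape only the XML special characters; leave the rest as raw UTF-8.
    $esc = function ($s) {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };
    // The encoding declaration tells consuming parsers how to decode the bytes.
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0"><channel>' . "\n";
    $xml .= '<title>' . $esc($title) . "</title>\n";
    $xml .= '<link>' . $esc($link) . "</link>\n";
    $xml .= '<description>' . $esc($description) . "</description>\n";
    foreach ($items as $item) {
        $xml .= '<item><title>' . $esc($item['title']) . '</title>'
              . '<link>' . $esc($item['link']) . "</link></item>\n";
    }
    return $xml . "</channel></rss>\n";
}

$feed = build_feed('Intranet news', 'http://intranet.example/', 'Latest updates', array(
    array('title' => 'Café menu für März', 'link' => 'http://intranet.example/menu?id=1&lang=de'),
));
// When serving it, something like:
// header('Content-Type: application/rss+xml; charset=UTF-8'); echo $feed;
```

Note that the ampersand in the link still has to be escaped, while the accented characters in the title are written out unchanged.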
Re: Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 9:47 am
by batfastad
Fair enough on the parsers thing, didn't think of it that way round.
Do most parsers support UTF-8?
Great news on the entities thing, that's what I was hoping. So I can just tweak my html_clean function for that.
Ok a more RSS-specific question...
Inside the <description></description> node of an RSS feed item, can I have basic HTML code in there? Links, bold, line breaks etc?
Should I use: <![CDATA[ ]]> round the HTML content?
Thanks for all the info so far!

Cheers, B
Re: Character escaping for XML (RSS) output
Posted: Thu Mar 04, 2010 10:11 am
by Weirdan
batfastad wrote:Do most parsers support UTF-8?
Most (if not all) do. The end-user system (if you intend your feed to be read by humans) may lack appropriate fonts though, so it's wise to avoid very exotic characters. Using moderately exotic characters like Cyrillic, Hebrew and Greek tends to be safe, though some fonts lack even those.
batfastad wrote:Ok a more RSS-specific question...
Can't help you here, I'm not very familiar with RSS.
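On the RSS question left open above: as far as the RSS 2.0 specification goes, entity-encoded HTML is allowed in <description>, and wrapping the HTML in a CDATA section is a common alternative. The sketch below shows both, using made-up function names and assuming PHP 5.4 or later for ENT_XML1. The one catch with CDATA is that the content must not contain the sequence ]]>, so the second helper splits it across two sections.

```php
<?php
// Hypothetical helpers; neither name comes from the thread.

// Option 1: entity-escape the HTML so it travels as ordinary text content.
function rss_description_escaped($html) {
    return '<description>'
         . htmlspecialchars($html, ENT_QUOTES | ENT_XML1, 'UTF-8')
         . '</description>';
}

// Option 2: wrap the HTML in a CDATA section, where < and & need no escaping.
// A literal "]]>" would end the section early, so it is split in two:
// "]]" closes the first section and ">" opens the next one.
function rss_description_cdata($html) {
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $html);
    return '<description><![CDATA[' . $safe . ']]></description>';
}

echo rss_description_escaped('<b>Hi</b> & bye'), "\n";
echo rss_description_cdata('<a href="/x">link</a><br />'), "\n";
```

Either form should give the feed reader the same HTML string once the XML is parsed; which one to pick is mostly a matter of how readable you want the raw feed to be.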