xml_parse eats &lt; and &gt; from cdata - PHP/libxml bug

Posted: Sun Sep 13, 2009 11:14 am
by jørgen
I am trying to debug a problem with SimplePie (RSS/ATOM feed parser) as used in Joomla! 1.5.14 (latest). Having identical (out of the box) installations on different hosting providers I notice a strange problem with the XML parser (as used in SimplePie). I have absolutely no experience with the XML parser used in PHP by the way.

On some installations xml_parse removes '&lt;' and '&gt;' found in cdata (which is not good when sending the news feed description to the browser). On other installations '&lt;' and '&gt;' are translated to '<' and '>' as expected.

As far as I can see the installations are identical except for the libXML and PHP version numbers (libXML version 2.6.27 with PHP version 5.2.10 on the installations that work OK, and libXML version 2.7.3 with PHP version 5.2.8 on the installations having problems).

xml_parser_create_ns is used to create the parser (encoding = UTF-8, separator = ' '). The options are XML_OPTION_SKIP_WHITE = 1 and XML_OPTION_CASE_FOLDING = 0.
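
For anyone who wants to reproduce this outside Joomla!, the parser setup described above can be sketched in a few lines. This is only an illustration, not SimplePie's actual code: the function name collect_cdata_fragments is made up, the closure needs PHP 5.3 or later, and the input is assumed to carry its markup as escaped entities such as &lt;p&gt;.

```php
<?php
// Hypothetical helper (not part of SimplePie): parse $xml with the parser
// settings described above and return every fragment that is passed to
// the character data handler.
function collect_cdata_fragments($xml)
{
    $fragments = array();

    // Namespace-aware parser, UTF-8 output, ' ' as the namespace separator.
    $parser = xml_parser_create_ns('UTF-8', ' ');
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    // Record the second handler argument ($data) for each call.
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$fragments) {
            $fragments[] = $data;
        }
    );

    xml_parse($parser, $xml, true);

    return $fragments;
}

// Dump the individual fragments, then the text they add up to.
$fragments = collect_cdata_fragments('<description>&lt;p&gt;</description>');
var_dump($fragments);
echo implode('', $fragments), "\n";
```

How the text is split into fragments can differ between parser builds, so the joined string is the more useful thing to compare between servers.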

Here is a detailed example. The input to xml_parse is always the same (extract):

Code:

<description>&lt;p&gt;&lt;a href="http://www.packtpub.com/nominate-best-open-source-php-cms"&gt;  ....

On systems that work correctly, the character data handler (as registered with xml_set_character_data_handler) receives the following cdata fragments in its second parameter (string $data):

Code:

(SimplePie_Parser::tag_open tag: description - attributes: a:0:{})
SimplePie_Parser::cdata: '<'
SimplePie_Parser::cdata: 'p'
SimplePie_Parser::cdata: '>'
SimplePie_Parser::cdata: '<'
SimplePie_Parser::cdata: 'a href="http://www.packtpub.com/nominate-best-open-source-php-cms"'
SimplePie_Parser::cdata: '>'

This yields valid HTML: <p><a href="http://www.packtpub.com/nominate-best-o ... ce-php-cms">

On installations having problems it looks like this:

Code:

(SimplePie_Parser::tag_open tag: description - attributes: a:0:{})
SimplePie_Parser::cdata: 'p'
SimplePie_Parser::cdata: 'a href="http://www.packtpub.com/nominate-best-open-source-php-cms"'

As can be seen, the handler is called fewer times and the '<' and '>' fragments are simply gone!

Everything else (Joomla! etc.) works OK by the way...

Any idea why this happens?

Re: xml_parse eats &lt; and &gt; from cdata

Posted: Mon Sep 14, 2009 6:46 am
by jørgen
After some further investigation, this turns out to be a PHP / libxml bug. It only affects installations with one of these version combinations:

libxml 2.7.x on PHP < 5.2.9 and
libxml 2.7.0 to 2.7.2 on any PHP version

http://bugs.php.net/bug.php?id=45996
http://bugs.gentoo.org/show_bug.cgi?id=249703
http://blog.code-head.com/fixing-libxml ... ing-libxml
http://blog.code-head.com/fixing-libxml ... s-libexpat
https://glowhost.com/forums/general-sup ... -1574.html
https://bugzilla.redhat.com/show_bug.cgi?id=467314
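
To get a first idea of whether a server falls into the affected ranges listed above, the two version numbers can be compared in code. This is a rough sketch: the function name libxml_entity_bug_expected is made up for illustration, and the ranges are simply the ones quoted in this post. Distributions sometimes patch libxml or PHP without changing the version number, so the result is only a hint.

```php
<?php
// Hypothetical helper: returns true if the given version pair falls into
// the affected ranges quoted in this thread.
function libxml_entity_bug_expected($libxmlVersion, $phpVersion)
{
    // Only libxml 2.7.x is involved at all.
    if (version_compare($libxmlVersion, '2.7.0', '<')
        || version_compare($libxmlVersion, '2.8.0', '>=')) {
        return false;
    }

    // libxml 2.7.0 to 2.7.2: affected with any PHP version.
    if (version_compare($libxmlVersion, '2.7.3', '<')) {
        return true;
    }

    // Remaining libxml 2.7.x: affected only with PHP older than 5.2.9.
    return version_compare($phpVersion, '5.2.9', '<');
}

// LIBXML_DOTTED_VERSION is defined by PHP's libxml extension.
var_dump(libxml_entity_bug_expected(LIBXML_DOTTED_VERSION, PHP_VERSION));
```

The two setups from my first post come out as expected: 2.6.27 with 5.2.10 gives false, 2.7.3 with 5.2.8 gives true.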

Newer versions of SimplePie (version 1.2) have code to work around this bug. Unfortunately, Joomla! is still using the old SimplePie version 1.0.1.
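
If upgrading is not an option, one workaround that has circulated for this bug is to rewrite the affected predefined entities as numeric character references before the data reaches the parser, on the assumption that numeric references are not dropped. The sketch below is not the code from SimplePie 1.2, and the function name protect_predefined_entities is made up. Note that the plain string replacement also changes text inside CDATA sections, where entities are meant literally.

```php
<?php
// Hypothetical helper: rewrite &lt;, &gt; and &amp; as numeric character
// references so the parser never sees the named entities.
function protect_predefined_entities($xml)
{
    return str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('&#60;', '&#62;', '&#38;'),
        $xml
    );
}

// Parse a small document after the rewrite and print the element's text.
$values = array();
$parser = xml_parser_create();
xml_parse_into_struct(
    $parser,
    protect_predefined_entities('<foo>&lt;p&gt; &amp; more</foo>'),
    $values
);
echo $values[0]['value'], "\n";
```

In Joomla! this would have to be applied to the feed data before SimplePie hands it to xml_parse, which means patching the bundled library.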

Here is a simple test that can be used to check for this problem (save the following code to a file called "xmltest.php", upload it to the server holding your Joomla! installation and point your browser at it):

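Code:

<?php

// Parse a tiny document that contains a single predefined entity.
$parser_check = xml_parser_create();
xml_parse_into_struct($parser_check, '<foo>&amp;</foo>', $values);
xml_parser_free($parser_check);

// A working parser stores '&' as the value of <foo>. The buggy
// PHP/libxml combination drops it, so no 'value' key is set.
$xml_is_sane = isset($values[0]['value']);

if (!$xml_is_sane)
{
    echo "XML is broken!";
} else {
    echo "XML is OK!";
}

?>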