xml_parse eats &lt; and &gt; from cdata - PHP/libxml bug

jørgen
Forum Newbie
Posts: 2
Joined: Sun Sep 13, 2009 10:56 am

Post by jørgen »

I am trying to debug a problem with SimplePie (RSS/Atom feed parser) as used in Joomla! 1.5.14 (latest). With identical (out-of-the-box) installations on different hosting providers, I notice a strange problem with the XML parser that SimplePie uses. I have absolutely no experience with PHP's XML parser, by the way.

On some installations xml_parse removes '&lt;' and '&gt;' found in cdata (which breaks the markup when the news feed description is sent to the browser). On other installations '&lt;' and '&gt;' are translated to '<' and '>' as expected.

As far as I can see, the installations are identical except for the libxml and PHP version numbers: libxml 2.6.27 with PHP 5.2.10 on the installations that work, and libxml 2.7.3 with PHP 5.2.8 on the installations that have the problem.

xml_parser_create_ns is used to create the parser (encoding = UTF-8, separator = ' '), with XML_OPTION_SKIP_WHITE = 1 and XML_OPTION_CASE_FOLDING = 0.

Here is a detailed example. The input to xml_parse is always the same (extract):

Code:

<description>&lt;p&gt;&lt;a href="http://www.packtpub.com/nominate-best-open-source-php-cms"&gt;  ....

On systems that work correctly, the character data handler (registered with xml_set_character_data_handler) receives the following cdata fragments in its second parameter, "string $data":

Code:

(SimplePie_Parser::tag_open tag: description - attributes: a:0:{})
SimplePie_Parser::cdata: '<'
SimplePie_Parser::cdata: 'p'
SimplePie_Parser::cdata: '>'
SimplePie_Parser::cdata: '<'
SimplePie_Parser::cdata: 'a href="http://www.packtpub.com/nominate-best-open-source-php-cms"'
SimplePie_Parser::cdata: '>'

This yields valid HTML: <p><a href="http://www.packtpub.com/nominate-best-o ... ce-php-cms">

On the installations that have the problem, it looks like this:

Code:

(SimplePie_Parser::tag_open tag: description - attributes: a:0:{})
SimplePie_Parser::cdata: 'p'
SimplePie_Parser::cdata: 'a href="http://www.packtpub.com/nominate-best-open-source-php-cms"'

As can be seen, the handler is called fewer times, and the '<' and '>' are simply gone!
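
For anyone who wants to produce a trace like the ones above on their own server, here is a minimal sketch. It assumes the parser setup described earlier; the class CdataTrace and the function trace_cdata are hypothetical names made up for this example and are not part of SimplePie. The handlers are passed as array callables, so no closures are needed.

```php
<?php
// Hypothetical helper (not SimplePie code): records every character-data
// fragment that the XML parser hands to the handler.
class CdataTrace
{
    public $fragments = array();

    // Element handlers are required by xml_set_element_handler but unused here.
    public function tagOpen($parser, $tag, $attributes) {}
    public function tagClose($parser, $tag) {}

    public function cdata($parser, $data)
    {
        $this->fragments[] = $data;
    }
}

// Parses $xml with the options mentioned in the post and returns the
// list of cdata fragments in the order the handler received them.
function trace_cdata($xml)
{
    $trace  = new CdataTrace();
    $parser = xml_parser_create_ns('UTF-8', ' ');
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler($parser, array($trace, 'tagOpen'), array($trace, 'tagClose'));
    xml_set_character_data_handler($parser, array($trace, 'cdata'));
    xml_parse($parser, $xml, true);

    // The parser is released when $parser goes out of scope.
    return $trace->fragments;
}

$fragments = trace_cdata('<description>&lt;p&gt;text&lt;/p&gt;</description>');
foreach ($fragments as $fragment) {
    echo "cdata: '", $fragment, "'\n";
}
```

On a working installation the fragments should join back into '<p>text</p>'. How the text is split across handler calls can differ between parser builds, so compare the joined string rather than the number of calls.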

Everything else (Joomla! etc.) works OK by the way...

Any idea why this happens?
Last edited by jørgen on Mon Sep 14, 2009 6:48 am, edited 2 times in total.

Re: xml_parse eats &lt; and &gt; from cdata

Post by jørgen »

After some further investigation, this turns out to be a PHP/libxml bug. It affects only some installations:

libxml 2.7.x on PHP < 5.2.9 and
libxml 2.7.0 to 2.7.2 on any PHP version
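
The two conditions above can be turned into a quick version check. This is only a sketch: the function name libxml_entity_bug_suspected is hypothetical, and "2.7.x" is read here as "at least 2.7.0 and below 2.8.0", which is my interpretation of the list. A version check can only raise suspicion; actually parsing a test document is the more reliable check.

```php
<?php
// Hypothetical helper: encodes the two version conditions listed above.
// Returns true when the libxml/PHP combination is one of the affected ones.
function libxml_entity_bug_suspected($libxmlVersion, $phpVersion)
{
    // Assumption: "2.7.x" means 2.7.0 <= version < 2.8.0.
    $is27 = version_compare($libxmlVersion, '2.7.0', '>=')
        && version_compare($libxmlVersion, '2.8.0', '<');
    if (!$is27) {
        return false;
    }

    // libxml 2.7.0 to 2.7.2 is affected on any PHP version.
    if (version_compare($libxmlVersion, '2.7.2', '<=')) {
        return true;
    }

    // Later 2.7.x releases are affected only on PHP older than 5.2.9.
    return version_compare($phpVersion, '5.2.9', '<');
}

// LIBXML_DOTTED_VERSION is only defined when the libxml extension is loaded.
if (defined('LIBXML_DOTTED_VERSION')) {
    $suspected = libxml_entity_bug_suspected(LIBXML_DOTTED_VERSION, PHP_VERSION);
    echo $suspected ? "Affected combination suspected\n" : "Combination looks fine\n";
}
```

Fed with the two setups from the first post, this reports libxml 2.6.27 with PHP 5.2.10 as fine and libxml 2.7.3 with PHP 5.2.8 as affected.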

http://bugs.php.net/bug.php?id=45996
http://bugs.gentoo.org/show_bug.cgi?id=249703
http://blog.code-head.com/fixing-libxml ... ing-libxml
http://blog.code-head.com/fixing-libxml ... s-libexpat
https://glowhost.com/forums/general-sup ... -1574.html
https://bugzilla.redhat.com/show_bug.cgi?id=467314

Newer versions of SimplePie (version 1.2) have code to work around this bug. Unfortunately, Joomla! is still using the old SimplePie version 1.0.1.
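
For installations stuck on the old SimplePie, one workaround that is often mentioned for this bug is to rewrite the predefined entities as numeric character references before the data reaches xml_parse. Please treat the following as a sketch under that assumption: it is not taken from SimplePie's source, the function name numeric_entities is hypothetical, and I have not verified it against every affected libxml build.

```php
<?php
// Hypothetical helper: rewrites &lt; &gt; &amp; as numeric character
// references, which are reported to be unaffected by the bug.
// Limitation: it also rewrites entities inside <![CDATA[ ... ]]> sections,
// where they are literal text, so it is not safe for feeds that use CDATA.
function numeric_entities($xml)
{
    return str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('&#60;', '&#62;', '&#38;'),
        $xml
    );
}

$input = '<foo>&lt;b&gt; &amp; more</foo>';
$fixed = numeric_entities($input);
echo $fixed, "\n";

// Parse the rewritten document and show the value of <foo>, if any.
$parser = xml_parser_create();
xml_parse_into_struct($parser, $fixed, $values);
echo isset($values[0]['value']) ? $values[0]['value'] : '(no value)', "\n";
```

The replacement has to happen on the raw feed data, before any call to xml_parse. In Joomla! that would mean patching the bundled SimplePie, so upgrading PHP or libxml is the cleaner fix where the host allows it.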

Here is a simple test that can be used to check for this problem (save the following code to a file called "xmltest.php", upload it to the server holding your Joomla! installation and point your browser at it):

Code:

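<?php

// Parse a tiny document whose only content is the entity &amp;.
$parser_check = xml_parser_create();
xml_parse_into_struct($parser_check, '<foo>&amp;</foo>', $values);
xml_parser_free($parser_check);

// A working parser reports the decoded '&' as the value of <foo>;
// an affected parser drops it, so no 'value' key is set.
$xml_is_sane = isset($values[0]['value']);

if (!$xml_is_sane) {
    echo "XML is broken!";
} else {
    echo "XML is OK!";
}

?>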