XML text problem

Posted: Wed Feb 21, 2007 1:41 pm
by raghavan20
I am trying to generate an RSS feed from a feed published by nytimes. The source feed contains a few characters such as â that make the XML document invalid, and I do not know how to clean them up. I thought that if I created a DOM text node, the text would automatically be altered to make it valid, but that did not happen. How can I fix this?

Code: Select all

<item>
      <title>Lisbon Journal: A Song Form Is Updated, but Not in the Alleys of Its Originhttp://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000&en=09de80043e8dc8cc&ei=5088&partner=rssnyt&emc=rss</title>
      <link>http://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000</link>
      <description>The traditional music known as fado, which means fate, has been reinvented to become Portugal&acirc;</description>
    </item>
My code:

Code: Select all

foreach ($rssPostList as $feed) {

	// create item
	$item = $doc->createElement( "item" );

	$title = $doc->createElement( "title" );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('title') ) ) );

	$link = $doc->createElement( "link", trim( $feed->get('link') ) );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('link') ) ) );

	$description = $doc->createElement( "description", trim( $feed->get('description') ) );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('description') ) ) );

	// append title, link and description to the item
	$item->appendChild( $title );
	$item->appendChild( $link );
	$item->appendChild( $description );

	$itemList->appendChild( $item );

}
Any help is deeply appreciated.

Posted: Wed Feb 21, 2007 3:45 pm
by volka
acirc is a named entity that has to be declared before it can be used in an XML document.
XHTML is implicitly attached to a DTD that imports the declaration of acirc:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd wrote:
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"xhtml-lat1.ent">
%HTMLlat1;

http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent wrote:
<!ENTITY acirc "&#226;"> <!-- latin small letter a with circumflex,
U+00E2 ISOlat1 -->

But RSS is not XHTML, and your RSS parser does not import such a DTD or set of entities. You could change the XML document so that it imports the entities:

Code: Select all

<?xml version="1.0" ?>
<!DOCTYPE RssXhtml [
	<!ENTITY % HTMLlat1 PUBLIC
       "-//W3C//ENTITIES Latin 1 for XHTML//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
    %HTMLlat1;
	]>
<root>
	&acirc;
</root>
But I doubt that's the favoured way of dealing with the problem ;)
Instead, use numerical entities such as &#226; for â.
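
The difference between the two forms can be checked quickly with PHP's DOM extension. This is only a sketch: the helper name `parses_as_xml` is made up for illustration, and it assumes the DOM/libxml extension is enabled.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if the string parses as well-formed XML.
function parses_as_xml(string $xml): bool
{
    libxml_clear_errors();
    $doc = new DOMDocument();
    return $doc->loadXML($xml) !== false;
}

$named   = '<root>&acirc;</root>';  // named entity, never declared in this document
$numeric = '<root>&#226;</root>';   // numeric character reference, needs no declaration

var_dump(parses_as_xml($named));    // bool(false)
var_dump(parses_as_xml($numeric));  // bool(true)
```

The numeric form works without any DTD because character references are resolved by the parser itself.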

Posted: Wed Feb 21, 2007 4:29 pm
by raghavan20
Thanks for your reply.
I understand what you are saying. Is there a PHP function, or any other way, to convert to numerical entities?

Posted: Wed Feb 21, 2007 4:32 pm
by feyd
htmlentities(), htmlspecialchars(), or possibly some fancy str_replace()/preg_replace() type thing.

Posted: Wed Feb 21, 2007 4:37 pm
by raghavan20
feyd wrote:htmlentities(), htmlspecialchars(), or possibly some fancy str_replace()/preg_replace() type thing.
Wasn't he saying that HTML entities are not supported in XML and that we have to use numerical entities? When I googled, I found that numerical entities look like hex codes. feyd, are you saying HTML entities should work? I have not tried them yet.

Posted: Wed Feb 21, 2007 4:42 pm
by feyd
That all depends on what is in your text.

A regular expression (with a callback) would probably be more specifically suited.
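
A regex with a callback along those lines might look like the sketch below. The function names `utf8_code_point` and `named_to_numeric_entities` are invented for this example, and it assumes the input uses only HTML 4.01 entity names, which is what `html_entity_decode()` with `ENT_HTML401` knows about. The five entities that XML predefines are left alone.

```php
<?php
// Hypothetical helper: Unicode code point of one UTF-8 encoded character.
function utf8_code_point(string $char): int
{
    $b = array_values(unpack('C*', $char));
    switch (count($b)) {
        case 1:  return $b[0];
        case 2:  return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:  return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default: return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                      | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Replace named HTML entities with numeric character references.
function named_to_numeric_entities(string $text): string
{
    // These are predefined by XML itself and must stay untouched.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];
            }
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown entity name: leave it as it is
            }
            return '&#' . utf8_code_point($char) . ';';
        },
        $text
    );
}

echo named_to_numeric_entities('Portugal&acirc;&euro;&trade; &amp; fado'), "\n";
// Portugal&#226;&#8364;&#8482; &amp; fado
```

Unknown names pass through unchanged, so the result can still be invalid XML if the source uses entities outside the HTML 4.01 set.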

Posted: Wed Feb 21, 2007 10:29 pm
by Kieran Huggins
Just a thought: why not simply convert the offending tag's contents from a TEXT node to a CDATA node? It's intended for XHTML rendering after all, isn't it?

Code: Select all

$description = $doc->createElement( "description" );
$description->appendChild( $doc->createCDATASection( trim( $feed->get('description') ) ) );
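
To see what the CDATA approach produces, here is a small self-contained sketch. The sample string is taken from the description in the first post; the variable names are only for illustration. The output is well-formed, but an XML parser reads the entity as the literal characters `&acirc;`, so this only helps if the consumer later renders the description as HTML.

```php
<?php
// Description text as it arrives from the source feed.
$raw = 'has been reinvented to become Portugal&acirc;';

$doc = new DOMDocument('1.0', 'UTF-8');
$description = $doc->createElement('description');
$description->appendChild($doc->createCDATASection($raw));
$doc->appendChild($description);

$xml = $doc->saveXML();
echo $xml;

// The serialised document is well-formed, so it can be parsed again...
$check = new DOMDocument();
$check->loadXML($xml);

// ...but the parser hands back the literal text "&acirc;", not the character.
echo $check->documentElement->textContent, "\n";
```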

Posted: Wed Feb 21, 2007 11:31 pm
by volka
But that doesn't take care of the warnings when the "original" xml doc is read.
feyd wrote:A regular expression (with a callback) would probably be more specifically suited.
Although I'd like to let libxml take care of that, it's probably overkill, and I agree: a simple regex will do.

Code: Select all

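<?php
// Replace named Latin-1 entities (e.g. &acirc;) with the numeric
// character references declared for them in xhtml-lat1.ent.
function substitute_latin1_entities($text) {
	static $entities = null;
	if ( is_null($entities) ) {
		// local copy of http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
		$c = file_get_contents('xhtml-lat1.ent');
		if ( false === $c ) {
			return $text; // entity file not readable: leave the text unchanged
		}
		// $matches[1] holds the entity names, $matches[2] their replacement text
		$pattern = '/<!ENTITY\s+(\S+)\s+"([^"]+)"/';
		preg_match_all($pattern, $c, $matches);
		// wrap each name as &name; so it can be used as a search string
		$search = array_map(function ($x) { return "&".$x.";"; }, $matches[1]);
		$entities = array('search'=>$search, 'replace'=>$matches[2]);
	}

	return str_replace($entities['search'], $entities['replace'], $text);
}

echo substitute_latin1_entities('abc&acirc;xyz');
?>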
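
If keeping a local copy of the entity file is not desirable, one alternative (not proposed in the thread, so treat it as an assumption) is to decode the named entities into real UTF-8 characters with `html_entity_decode()` and let `createTextNode()` handle the escaping. The `$entry` array below is a hypothetical stand-in for one `$feed` object from the first post, with a shortened title and URL. The sketch also attaches each text node to its own element, whereas the loop in the first post appends the link and description text to `$title`.

```php
<?php
// Hypothetical stand-in for one feed entry.
$entry = [
    'title'       => 'Lisbon Journal: A Song Form Is Updated',
    'link'        => 'http://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000&en=09de80043e8dc8cc',
    'description' => 'has been reinvented to become Portugal&acirc;',
];

$doc  = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('item');
$doc->appendChild($item);

foreach (['title', 'link', 'description'] as $name) {
    $value = trim($entry[$name]);
    if ($name !== 'link') {
        // Turn named HTML entities into real UTF-8 characters.
        // The link is skipped so its query string is not touched.
        $value = html_entity_decode($value, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    }
    $element = $doc->createElement($name);
    // createTextNode() escapes "&" and "<" when the document is serialised.
    $element->appendChild($doc->createTextNode($value));
    $item->appendChild($element);
}

echo $doc->saveXML();
```

With the document encoding set to UTF-8, the decoded â is written as a plain character, and the `&` in the link comes out as `&amp;`.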