XML text problem


Post by raghavan20 »

I am trying to generate an RSS feed from a New York Times feed. The feed contains a few characters, such as â, that make the XML document invalid, and I do not know how to clean them up. I thought that creating a DOM text node would automatically escape the text and make it valid, but it did not. How can I fix this?

Code: Select all

<item>
      <title>Lisbon Journal: A Song Form Is Updated, but Not in the Alleys of Its Originhttp://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000&en=09de80043e8dc8cc&ei=5088&partner=rssnyt&emc=rss</title>
      <link>http://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000</link>
      <description>The traditional music known as fado, which means fate, has been reinvented to become Portugal&acirc;</description>
    </item>
My code:

Code: Select all

foreach ($rssPostList as $feed) {

	//create item
	$item = $doc->createElement( "item" );

	$title = $doc->createElement( "title" );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('title') ) ) );

	$link = $doc->createElement( "link", trim( $feed->get('link') ) );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('link') ) ) );

	$desription = $doc->createElement( "description", trim( $feed->get('description') ) );
	$title->appendChild( $doc->createTextNode( trim( $feed->get('description') ) ) );

	//append title, link, description, language, publishedDate to channel
	$item->appendChild( $title );
	$item->appendChild( $link );
	$item->appendChild( $desription );

	$itemList->appendChild( $item );

}
Any help is deeply appreciated.

Post by volka »

acirc is a named entity that has to be declared before it can be used in an XML document.
XHTML is implicitly attached to a DTD that imports the declaration of acirc:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd wrote:
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"xhtml-lat1.ent">
%HTMLlat1;
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent wrote:
<!ENTITY acirc "&#226;"> <!-- latin small letter a with circumflex, U+00E2 ISOlat1 -->
But RSS is not XHTML, and your RSS parser does not import such a DTD or set of entities. You could change the XML document so that it imports the entities:

Code: Select all

<?xml version="1.0" ?>
<!DOCTYPE RssXhtml [
	<!ENTITY % HTMLlat1 PUBLIC
       "-//W3C//ENTITIES Latin 1 for XHTML//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
    %HTMLlat1;
	]>
<root>
	&acirc;
</root>
But I doubt that's the favoured way of dealing with the problem ;)
Instead, use numeric character references such as &#226; for â.
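
To illustrate the difference described above, here is a minimal sketch using PHP's DOM extension. It assumes a default libxml setup with no DTD loading; the variable names are only for illustration. The named entity is rejected because nothing declares it, while the numeric reference parses without any DTD.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// &acirc; is not declared anywhere in this document, so parsing fails.
$named = new DOMDocument();
$namedOk = $named->loadXML('<root>&acirc;</root>');

// &#226; is a numeric character reference and needs no declaration.
$numeric = new DOMDocument();
$numericOk = $numeric->loadXML('<root>&#226;</root>');

var_dump($namedOk);
var_dump($numericOk);
echo $numeric->documentElement->textContent, "\n";
```

The entries collected by libxml_get_errors() after the first call should show which entity the parser could not resolve.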

Post by raghavan20 »

Thanks for your reply.
I understand what you are saying. Is there any PHP function or other way to convert to numeric entities?

Post by feyd »

htmlentities(), htmlspecialchars() or possibly some fancy str_replace()/preg_replace() type thing.

Post by raghavan20 »

feyd wrote:htmlentities(), htmlspecialchars() or possibly some fancy str_replace()/preg_replace() type thing.
Wasn't he saying that HTML entities are not supported in XML and that we have to use numeric entities? When I googled, I found that numeric entities look like hex codes. feyd, are you saying HTML entities should work? I have not tried them yet.

Post by feyd »

That all depends on what is in your text.

A regular expression (with a callback) would probably be more specifically suited.
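
One way the regex-with-callback idea might look is sketched below. The function name named_to_numeric_entities is hypothetical, and the sketch assumes PHP 7.2 or later with the mbstring extension (for mb_ord) and that only HTML 4.01 entity names need converting. It leaves XML's five predefined entities and any unknown names untouched.

```php
<?php
// Hypothetical helper (not from the thread): rewrite HTML 4.01 named
// entities as numeric character references.
function named_to_numeric_entities(string $text): string
{
    // These five are predefined in XML and never need a DTD.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0]; // already valid in any XML document
            }
            // Decode the single entity to its UTF-8 character.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown name: leave it alone
            }
            // Emit the decimal code point, e.g. &#226; for &acirc;.
            return '&#' . mb_ord($char, 'UTF-8') . ';';
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo named_to_numeric_entities('abc&acirc;xyz &amp; more'), "\n";
```

Relying on html_entity_decode() avoids keeping a local copy of the entity file, at the cost of depending on PHP's built-in entity table.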

Post by Kieran Huggins »

Just a thought: why not simply convert the offending tag's contents from a TEXT node to a CDATA node? After all, it's intended for XHTML rendering, isn't it?

Code: Select all

$description = $doc->createElement( "description" );
$description->appendChild( $doc->createCDATASection( trim( $feed->get('description') ) ) );
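
For comparison, a rough self-contained sketch of how a whole item could be built this way. The $fields array is a stand-in for the values returned by $feed->get() and is an assumption, not the original data source. Each text is appended to its own element (the loop in the first post appends the link and description text to $title), and no value is passed as the second argument of createElement(), since that argument is not escaped and an ampersand in it can truncate the value.

```php
<?php
// Stand-in values; the thread's $feed->get() accessor is not reproduced here.
$fields = [
    'title'       => 'Lisbon Journal',
    'link'        => 'http://www.nytimes.com/2007/02/21/world/europe/21portugal.html?ex=1329714000&en=09de80043e8dc8cc',
    'description' => 'reinvented to become Portugal&acirc;',
];

$doc  = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('item');

foreach ($fields as $name => $value) {
    $element = $doc->createElement($name);
    // CDATA keeps the description's markup verbatim; text nodes escape "&".
    $node = ($name === 'description')
        ? $doc->createCDATASection(trim($value))
        : $doc->createTextNode(trim($value));
    $element->appendChild($node);
    $item->appendChild($element);
}
$doc->appendChild($item);

$xml = $doc->saveXML($item);
echo $xml, "\n";
```

Note that the CDATA section only hides &acirc; from the XML parser. Whatever consumes the feed still has to interpret it as HTML.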

Post by volka »

But that doesn't take care of the warnings when the "original" XML doc is read.
feyd wrote:A regular expression (with a callback) would probably be more specifically suited.
Although I'd like to let libxml take care of that, it's probably overkill, and I agree: a simple regex will do.

Code: Select all

<?php
// Replace XHTML Latin-1 named entities (e.g. &acirc;) with the numeric
// character references declared for them in xhtml-lat1.ent.
function substitute_latin1_entities($text) {
	static $entities = null;
	if ( is_null($entities) ) {
		// local copy of http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
		$c = file_get_contents('xhtml-lat1.ent');
		// captures each entity name and its replacement text
		$pattern = '/<!ENTITY\s+(\S+)\s+"([^"]+)"/';
		preg_match_all($pattern, $c, $matches);
		// create_function() was removed in PHP 8; a closure does the same job
		$search = array_map(function ($x) { return "&".$x.";"; }, $matches[1]);
		$entities = array('search'=>$search, 'replace'=>$matches[2]);
	}

	return str_replace($entities['search'], $entities['replace'], $text);
}


echo substitute_latin1_entities('abc&acirc;xyz'); // abc&#226;xyz
?>