RSS parsing creating invalid XML

Posted: Sun Feb 19, 2006 4:11 pm
by jwalsh
Hi,

I'm parsing an XML feed into our generic formatting codes so that syndicated articles can be formatted to match our current web layout. The content comes directly from our partners, but occasionally my code produces invalid XML.

Here's the XML error I'm getting.

Code:

XML Parsing Error: undefined entity

to the crowd, �I'm celebrating
--------------^

I thought that since I was using htmlentities, the output would be valid XML. Here's the important part of my code.

Code:

function reverse_htmlentities($mixed) {
   $htmltable = get_html_translation_table(HTML_ENTITIES);
   foreach($htmltable as $key => $value)
   {
       $mixed = ereg_replace(addslashes($value),$key,$mixed);
   }
   return $mixed;
}

function FormatCode($document) {
	// UNDO HTMLENTITIES SO THAT WE CAN PROPERLY PARSE IMG's AND LINKS
	$document = reverse_htmlentities($document);
	
	// REPLACE CERTAIN HTML TAGS WITH OUR FORMATTING SCHEMA
	$search = array ('/\<strong\>(.*?)\<\/strong\>/is',
		'/\<i\>(.*?)\<\/i\>/is',
		'/\<u\>(.*?)\<\/u\>/is',
		'/\<a href=(.*?) target=_blank\>(.*?)\<\/a\>/is',
		'/\<img (.*?) src=\"(.*?)\" (.*?)\>/e');
		
	$replace = array('{b}$1{/b}',
		'{i}$1{/i}',
		'{u}$1{/u}',
		'{link src=$1}$2{/link}',
		"addtoimage('\\2')");
	
	$text = preg_replace($search, $replace, $document);
	
	// TURN <BR> INTO NEW LINES
	$text = str_replace("<br>", "\n", $text);
	$text = str_replace("<br />", "\n", $text);
	
	// REMOVE EXCESS HTML TAGS
	$text = strip_tags($text);
	
	// REDO HTML ENTITIES FOR VALID XML
	$text = htmlentities($text);
	
	return $text;
}

// COMING FROM AN XML PARSING LOOP
echo FormatCode($article->description);
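
For anyone reproducing this, here is a minimal sketch of one likely cause. It is not a confirmed diagnosis of the feed above. XML predefines only five named entities (&amp;amp; &amp;lt; &amp;gt; &amp;quot; &amp;apos;), while htmlentities() also emits HTML-only names such as &amp;eacute;, which an XML parser with no DTD reports as undefined. The sample string, the UTF-8 charset, the <description> wrapper and the helper is_well_formed_xml() are all assumptions made for illustration; none of them come from the code above.

```php
<?php
// Hypothetical helper (not part of the original post): returns true when
// the given string parses as well-formed XML, false otherwise.
function is_well_formed_xml($xml) {
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $ok = simplexml_load_string($xml) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Assumed sample input: the UTF-8 bytes for "café & <b>bar</b>".
$text = "caf\xC3\xA9 & <b>bar</b>";

// htmlentities() converts every character that has an HTML named entity,
// so the accented letter becomes an HTML-only entity name.
$as_html = htmlentities($text, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() escapes only &, <, > and quotes, and leaves the
// accented letter as raw UTF-8, which XML accepts.
$as_xml = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $as_html, "\n";
echo $as_xml, "\n";

// Wrap each result in an element and check whether it parses.
var_dump(is_well_formed_xml("<description>$as_html</description>"));
var_dump(is_well_formed_xml("<description>$as_xml</description>"));
```

If this is what is happening, swapping the final htmlentities() call in FormatCode() for htmlspecialchars() with the feed's real charset may be enough, though that is a guess without seeing the partner's data. PHP also ships html_entity_decode(), which might be able to replace the hand-written reverse_htmlentities() loop.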