
Probs with character encodings when creating XML feed

Posted: Mon Sep 13, 2004 12:53 pm
by allyhazell
Hi,

I've been struggling to find an answer to my problem, and I've scoured both the official PHP site and other sites for one. I have created an RSS/XML feed for a client's web site, which is regenerated automatically from the news in a MySQL database each time a new article is added. The problem is that the feed keeps failing validation because of dodgy characters such as é, ®, ò etc. The only way I've managed to reduce this is by doing a str_replace on the characters that come up often, swapping each for an encoded equivalent (&egrave;, &reg; etc.). But that only keeps the feed valid until someone enters another character that I haven't yet listed.

So my question is, is there a PHP function or a PHP script somewhere that will change these characters into ones that will work with XML? Or am I doing something wrong within the RSS/XML setup itself?

The address of the feed in question is
http://www.medicalnewstoday.com/medicalnews.xml

Thanks in advance for your help

A frustrated Alastair

Posted: Tue Sep 14, 2004 2:24 pm
by xisle
There is a finite set of special characters and their HTML entities,
so write one function to take care of them throughout your scripts.
Here is an example that replaces some specific nasty MS Word characters with their entities.

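<?php

// Convert MS Word (Windows-1252) text to HTML entities.
// Assumes $text is a single-byte Windows-1252 string, not UTF-8.
function superhtmlentities($text) {
    // Windows-1252 bytes 128-159 and their HTML entity names.
    $entities = array(
        128 => 'euro',
        130 => 'sbquo',   // single low quote (not lsquo)
        131 => 'fnof',
        132 => 'bdquo',   // double low quote (not ldquo)
        133 => 'hellip',
        134 => 'dagger',
        135 => 'Dagger',
        136 => 'circ',
        137 => 'permil',
        138 => 'Scaron',
        139 => 'lsaquo',
        140 => 'OElig',
        145 => 'lsquo',
        146 => 'rsquo',
        147 => 'ldquo',
        148 => 'rdquo',
        149 => 'bull',
        150 => 'ndash',
        151 => 'mdash',
        152 => 'tilde',
        153 => 'trade',
        154 => 'scaron',
        155 => 'rsaquo',
        156 => 'oelig',
        159 => 'Yuml',
    );

    $new_text = '';
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        $num = ord($char);
        if (isset($entities[$num])) {
            // Mapped Word character: emit the named entity directly.
            $new_text .= '&' . $entities[$num] . ';';
        } elseif ($num < 127 || $num > 159) {
            // Encode every other byte here, one at a time, so the
            // entities added above are not escaped a second time.
            $new_text .= htmlentities($char, ENT_QUOTES, 'ISO-8859-1');
        }
        // Unmapped bytes in the 127-159 range are dropped.
    }

    return $new_text;
}

?>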
Here is a page of character sets and their entities:

http://www.w3schools.com/html/html_entitiesref.asp
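
One caveat for an XML feed in particular: XML itself predefines only five named entities (&amp;, &lt;, &gt;, &quot; and &apos;), so HTML names such as &eacute; or &reg; will not validate unless a DTD declares them. Numeric character references work in any XML document. Below is a minimal sketch of that variant, not a definitive fix. It assumes the database text is Windows-1252 and that the mbstring extension is installed; the function name xml_safe_text is made up for illustration.

```php
<?php

// Hypothetical helper: make Windows-1252 text safe to embed in an XML feed.
// Assumption: $text is Windows-1252 (typical of text pasted from MS Word).
function xml_safe_text($text) {
    // 1. Convert to UTF-8 so every character has a known Unicode code point.
    $utf8 = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');

    // 2. Escape the XML markup characters (&, <, >, quotes).
    $escaped = htmlspecialchars($utf8, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // 3. Replace every non-ASCII character with a decimal numeric
    //    reference such as &#233;, which is valid in any XML document.
    //    Conversion map: start, end, offset, mask.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

// Usage: wrap each field before writing it into the feed, e.g.
// echo '<title>' . xml_safe_text($row['title']) . '</title>';
```

Because the output is plain ASCII, it should validate whatever encoding the feed declares. If the stored text is already UTF-8, step 1 should be skipped.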

Posted: Tue Sep 14, 2004 2:38 pm
by allyhazell
That's great, thanks. I shall give it a try. Bloody Word eh!