Can't remove special characters?
Posted: Fri Jun 20, 2008 10:30 am
by csdavis0906
I'm working on an RSS aggregator and am having a problem stripping a few special characters from article titles. An example is "Ice on Mars! Now you see it, now you don’t", taken from the MSNBC site. My program displays the single quote in "don't" as a series of characters, including a small TM (trademark) symbol. I've tried htmlspecialchars_decode() but can't get rid of those pesky chars.
To see the problem, you can go to http://technology.newsalizr.com and go down to the Space section.
Any assistance would be greatly appreciated. Thanking you in advance...
Re: Can't remove special characters?
Posted: Fri Jun 20, 2008 1:44 pm
by phice
Have you tried using regular expressions to remove anything other than a-zA-Z0-9\-\_\.\s and a few other characters (!, ?, etc)? I think that would be best.
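A minimal sketch of that allow-list idea, in case it helps. The function name strip_unlisted_chars and the exact set of extra punctuation (!, ? and the comma) are assumptions for illustration, not something from the original code.

```php
<?php
// Hypothetical helper: keep only an allow-list of characters in a title.
// Allowed here (an assumption): letters, digits, - _ . whitespace ! ? and comma.
function strip_unlisted_chars($title)
{
    // There is no /u modifier, so the pattern is applied byte by byte and
    // every byte of a multi-byte UTF-8 character gets removed.
    return preg_replace('/[^a-zA-Z0-9\-_.\s!?,]/', '', $title);
}

// The sample title, with the curly quote written as its UTF-8 bytes.
$title = "Ice on Mars! Now you see it, now you don\xE2\x80\x99t";
echo strip_unlisted_chars($title), "\n";
// Ice on Mars! Now you see it, now you dont
```

Note that this drops the quote entirely ("dont"), and it would also drop accented letters, so it only suits titles where losing those characters is acceptable.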
Re: Can't remove special characters?
Posted: Fri Jun 20, 2008 2:00 pm
by csdavis0906
Yes, that's where I started. I got rid of all the other unwanted characters, but not the one I posted about.
Re: Can't remove special characters?
Posted: Fri Jun 20, 2008 3:58 pm
by dml
It's not a normal single quote, it's a curly right single quote. You're emitting a UTF-8 representation of it, which is the three bytes 0xE2 0x80 0x99, but your page is declaring itself as ISO-8859-1, so those three bytes are being read as â, €, and ™. There are a few ways of fixing this. You can make sure that the encoding your data is in is consistent with the encoding declared in your HTTP headers, for example by declaring in your Content-Type header that the text is UTF-8. Another approach is to use an escape for the character: &rsquo; or &#8217;.
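Both fixes can be sketched roughly as follows. The function names send_utf8_header and escape_rsquo are made up for illustration, and using the numeric entity &#8217; as the escape is an assumption; only header(), headers_sent(), str_replace() and bin2hex() are standard PHP.

```php
<?php
// Option 1: declare the page as UTF-8 so the browser decodes the three
// bytes as one character. This must run before any output is sent.
function send_utf8_header()
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Option 2: replace the UTF-8 bytes of U+2019 (0xE2 0x80 0x99) with a
// numeric entity, which displays the same under either declared charset.
function escape_rsquo($text)
{
    return str_replace("\xE2\x80\x99", '&#8217;', $text);
}

send_utf8_header();

$title = "Now you see it, now you don\xE2\x80\x99t";
echo bin2hex("\xE2\x80\x99"), "\n"; // e28099
echo escape_rsquo($title), "\n";    // Now you see it, now you don&#8217;t
```

Option 1 is usually the cleaner choice because it fixes every non-ASCII character at once, while option 2 only handles the one character you list.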
Re: Can't remove special characters?
Posted: Fri Jun 20, 2008 5:36 pm
by csdavis0906
I've tried to remove the chars you pointed out with the code below, but the results were the same.
Code:
$apos = array("/&rsquo;/", "/&#8217;/");
$title = str_replace($apos, "", $title);
Thanks a lot for your assistance!