I need to clean source HTML and/or XML, namely get rid of special characters such as ” (non-standard quotes). The Yahoo! News RSS feeds very often contain them, and as a result the RSS parser either refuses to parse the feed at all or parses it with errors.
Has anyone needed to solve this problem before?
Thanks!
Tomas
How to get rid of non-latin characters?
tomfra
- Forum Contributor
- Posts: 126
- Joined: Wed Jun 23, 2004 12:56 pm
- Location: Prague, Czech Republic
Yes, this is probably the solution I will use (str_replace), but because I have no idea how many of these problematic characters they use in the RSS feed, it will be far from 100% reliable. Is there a class or code snippet (or a built-in PHP function) that removes all characters not present in ISO-8859-1, for example?
It would be a cleaner solution.
Tomas
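Create a list of allowed characters and build a regexp that removes any character outside the list. Hint:
$string = preg_replace('/[^abcdef]/', '', $string);
would remove any character that isn't a, b, c, d, e or f. Note the ^ at the start of the character class: without it the pattern removes exactly those six letters instead of everything else.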
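The same whitelist idea could be stretched to the ISO-8859-1 case asked about earlier in the thread. This is only a sketch under a few assumptions: the feed arrives as UTF-8, PHP 7 or later is available, and the function name strip_non_latin1 is made up for illustration. It keeps tab, newline, carriage return, printable ASCII and the Latin-1 range U+00A0 to U+00FF, and drops everything else, including the curly quotes.

```php
<?php
// Hypothetical helper (not a built-in): removes every character that has no
// printable ISO-8859-1 equivalent. Assumes $text is valid UTF-8.
function strip_non_latin1(string $text): string
{
    // Whitelist: tab, LF, CR, printable ASCII, and U+00A0..U+00FF.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{A0}-\x{FF} refers to code points, not raw bytes.
    $pattern = '/[^\x{09}\x{0A}\x{0D}\x{20}-\x{7E}\x{A0}-\x{FF}]/u';

    $clean = preg_replace($pattern, '', $text);

    // With /u, preg_replace returns null when the subject is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Usage: the curly quotes (U+201D) are dropped, the accented letter stays.
echo strip_non_latin1("He said \u{201D}hi\u{201D} in a caf\u{E9}"), "\n";
```

The result is still UTF-8, just limited to characters that also exist in ISO-8859-1. If the parser would be happier with plain quotes than with nothing, a str_replace pass mapping the known curly quotes to " could run before this step.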