How to get rid of non-latin characters?

Posted: Wed Sep 29, 2004 9:58 am
by tomfra
I need to clean source HTML and/or XML, namely to get rid of special characters such as ” (non-standard quotes). The Yahoo! News RSS feeds very often contain them, and as a result the RSS parser either refuses to parse the feed at all or parses it with errors.

Has anyone needed to solve this problem before?

Thanks!

Tomas

Posted: Wed Sep 29, 2004 10:03 am
by feyd
Create a string with all the unwanted characters in it, then either iterate over that string, using str_replace to wipe each character away, or use a regular expression to handle them all in one hit.
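
A minimal sketch of the str_replace route, assuming the feed is UTF-8 and PHP 7 or later (for the "\u{...}" escapes). The list of characters and the function name replace_known_chars are only illustrative guesses at what a feed might contain, not a complete set.

```php
<?php
// Hypothetical helper: swaps a hand-picked list of typographic
// characters for plain ASCII equivalents. Anything not in the
// list is left untouched.
function replace_known_chars(string $text): string
{
    $map = [
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];

    // str_replace accepts parallel arrays of search and replace values.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo replace_known_chars("\u{201C}Hello\u{201D} \u{2013} it\u{2019}s here\u{2026}");
// prints: "Hello" - it's here...
```

Replacing rather than deleting keeps the text readable, but it only covers characters that are in the list.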

Posted: Wed Sep 29, 2004 10:13 am
by tomfra
Yes, this is probably the solution I will use (str_replace), but because I have no idea how many of these problematic characters appear in the RSS feed, it will be far from 100% reliable. Is there a class or code snippet (or a built-in PHP function?) that removes all characters not present in iso-8859-1 (for example)?

It would be a cleaner solution.

Tomas

Posted: Wed Sep 29, 2004 10:17 am
by feyd
Not that I know of... at least not one built directly for it.
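
One way to approximate it without a dedicated function, assuming the input is valid UTF-8 and PCRE was built with UTF-8 support: ISO-8859-1 corresponds to the code points U+0000 to U+00FF, so a Unicode-aware pattern can drop everything above that range. The function name strip_non_latin1 is made up for this sketch.

```php
<?php
// Hypothetical helper: removes every code point that has no
// ISO-8859-1 equivalent (anything above U+00FF).
function strip_non_latin1(string $utf8): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{FF} means code point U+00FF rather than the byte 0xFF.
    $clean = preg_replace('/[^\x{00}-\x{FF}]/u', '', $utf8);

    // preg_replace returns null when the subject is not valid UTF-8;
    // in that case this sketch hands the input back unchanged.
    return $clean === null ? $utf8 : $clean;
}

echo strip_non_latin1("caf\u{E9} \u{201C}quoted\u{201D}");
// prints: café quoted
```

The result is still UTF-8; it just contains nothing that iso-8859-1 could not represent.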

Posted: Wed Sep 29, 2004 2:46 pm
by Weirdan
Create a list of allowed characters and build a regexp to remove any character outside the list. Hint:

$string = preg_replace('/[^abcdef]/', '', $string);
would remove any character that isn't a, b, c, d, e or f.
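
A slightly more practical sketch of the same allow-list idea, assuming that printable ASCII plus tab, carriage return and newline is an acceptable set to keep. The function name keep_ascii is hypothetical. Without the /u modifier the pattern works byte by byte, so every byte of a multibyte UTF-8 character gets removed.

```php
<?php
// Hypothetical helper: keeps printable ASCII (0x20-0x7E) plus
// tab, CR and LF, and removes every other byte.
function keep_ascii(string $text): string
{
    // The leading ^ negates the class: match whatever is NOT allowed.
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $text);
}

echo keep_ascii("Top \u{201C}story\u{201D}:\tfeed\n");
// prints: Top story:	feed (followed by a newline)
```

Note that this also strips legitimate accented letters, so it is blunter than filtering against iso-8859-1.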

Posted: Wed Sep 29, 2004 2:53 pm
by Weirdan
btw
setlocale
ctype_alnum || ctype_punct || ctype_space may be of help
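
A rough sketch of how those functions could be combined, assuming the ctype extension is available and the "C" locale is wanted, so that only ASCII letters, digits, punctuation and whitespace pass. The function name keep_ctype_chars is made up for illustration.

```php
<?php
// Hypothetical helper: keeps only bytes that the current locale
// classifies as alphanumeric, punctuation or whitespace.
function keep_ctype_chars(string $text): string
{
    // ctype_* results depend on LC_CTYPE. Pinning it to "C" limits the
    // accepted set to ASCII. This changes the locale for the whole process.
    setlocale(LC_CTYPE, 'C');

    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $c = $text[$i]; // one byte at a time
        if (ctype_alnum($c) || ctype_punct($c) || ctype_space($c)) {
            $out .= $c;
        }
    }
    return $out;
}

echo keep_ctype_chars("Yahoo! News \u{201D}feed\u{201D}");
// prints: Yahoo! News feed
```

Setting a different single-byte locale with setlocale would widen what counts as a letter, which is presumably why it was mentioned alongside the ctype functions.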