How to get rid of non-latin characters?

tomfra
Forum Contributor
Posts: 126
Joined: Wed Jun 23, 2004 12:56 pm
Location: Prague, Czech Republic

Post by tomfra »

I need to clean up source HTML and/or XML, namely to get rid of special characters such as ” (non-standard quotes). The Yahoo! News RSS feeds very often contain them, and as a result the RSS parser either refuses to parse the feed at all or parses it with errors.

Has anyone needed to solve this problem before?

Thanks!

Tomas
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Create a string containing all the characters you want gone, then either iterate over that string, using str_replace to wipe each character away, or use a regular expression to handle them all in one hit.
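
Both options can be sketched roughly as follows. This is only an illustration: the character list is a hypothetical sample (curly quotes), the function names are made up for this example, and the code assumes UTF-8 input and PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Hypothetical sample of troublesome characters (curly quotes).
// Assumption: the feed is UTF-8 encoded.
$unwanted = ["\u{201C}", "\u{201D}", "\u{2018}", "\u{2019}"];

// Option 1: str_replace() accepts an array of search strings,
// so no explicit loop over the characters is needed.
function strip_with_str_replace(string $text, array $chars): string
{
    return str_replace($chars, '', $text);
}

// Option 2: a single regular expression with a character class.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
function strip_with_regex(string $text, array $chars): string
{
    $class = preg_quote(implode('', $chars), '/');
    $result = preg_replace('/[' . $class . ']/u', '', $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8);
    // fall back to the untouched input in that case.
    return $result ?? $text;
}

echo strip_with_str_replace("He said \u{201C}hi\u{201D}", $unwanted), "\n";
echo strip_with_regex("He said \u{201C}hi\u{201D}", $unwanted), "\n";
```

Either way the result is only as good as the list, so any character missing from $unwanted still gets through.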

Post by tomfra »

Yes, this (str_replace) is probably the solution I will use, but because I have no idea how many of these problematic characters appear in the RSS feed, it will be far from 100% reliable. Is there a class, a code snippet, or a built-in PHP function that removes all characters not present in ISO-8859-1 (for example)?

It would be a cleaner solution.

Tomas
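
One possible way to do what is asked here, sketched under the assumption that the input is valid UTF-8: the Unicode code points U+0000 to U+00FF map one-to-one to the ISO-8859-1 byte values, so a negated character class over that range drops everything else. The function name strip_non_latin1 is made up for this example.

```php
<?php
// Removes every character outside U+0000..U+00FF, i.e. everything
// that has no ISO-8859-1 equivalent.
// Assumption: $text is valid UTF-8. The result is still UTF-8;
// converting it to ISO-8859-1 bytes would be a separate step.
function strip_non_latin1(string $text): string
{
    $result = preg_replace('/[^\x{00}-\x{FF}]/u', '', $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the untouched input in that case.
    return $result ?? $text;
}

echo strip_non_latin1("caf\u{E9} \u{201C}ok\u{201D}"), "\n";
```

Accented Latin letters such as é survive, while the curly quotes are dropped without having to list them in advance.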

Post by feyd »

Not that I know of, at least not one built specifically for this.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Create a list of allowed characters and build a regexp that removes any character outside the list. Hint:

$string = preg_replace('/[^abcdef]/', '', $string);
This would remove any character that isn't a, b, c, d, e or f. The leading ^ inside the brackets negates the character class.
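
A somewhat more realistic whitelist might keep printable ASCII plus common whitespace. The sketch below is one way to write that; the function name and the choice of allowed range are assumptions made for illustration, not something prescribed in this thread. Note the ^ at the start of the class, which is what turns it into "everything except these".

```php
<?php
// Keeps only printable ASCII (0x20..0x7E) plus tab, CR and LF.
// Without the /u modifier the pattern works byte by byte, so every
// byte of a multibyte UTF-8 character (all >= 0x80) is removed.
function keep_printable_ascii(string $text): string
{
    $result = preg_replace('/[^\x20-\x7E\t\r\n]/', '', $text);
    // preg_replace() returns null on error; fall back to the input.
    return $result ?? $text;
}

echo keep_printable_ascii("He said \u{201C}hi\u{201D}"), "\n";
```

This is stricter than an ISO-8859-1 filter: accented letters are removed as well.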

Post by Weirdan »

By the way, setlocale and
ctype_alnum || ctype_punct || ctype_space may be of help.
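
A minimal sketch of how those functions could be combined, assuming the ctype extension is available. The function name keep_ctype_chars is made up for this example, and pinning LC_CTYPE to the "C" locale is a choice made here so that only ASCII characters are classified as alphanumeric, punctuation or whitespace.

```php
<?php
// Assumption: in the "C" locale the ctype functions only recognise
// ASCII characters. Note that setlocale() changes process-wide state.
setlocale(LC_CTYPE, 'C');

// Walks the string byte by byte and keeps only bytes that are
// alphanumeric, punctuation or whitespace in the current locale.
function keep_ctype_chars(string $text): string
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if (ctype_alnum($ch) || ctype_punct($ch) || ctype_space($ch)) {
            $out .= $ch;
        }
    }
    return $out;
}

echo keep_ctype_chars("a\u{201D}b c!"), "\n";
```

With a different locale (for example an ISO-8859-1 one) the same loop may also keep accented letters, which is where setlocale comes in.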