[SOLVED] Detect UTF-8 charset (efficiently)?
Posted: Wed Aug 02, 2006 6:11 pm
by Chris Corbyn
If I have a string stored in a PHP variable, is there any easy way to guess what charset it is in? Most importantly, I need to tell iso-8859-1 and utf-8 apart.
It has only just occurred to me that if I set my editor to iso-8859-1 and type "PØLSE", I get a completely different string, byte for byte, from the one I get when my editor is in utf-8 mode and I type "PØLSE".
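To make the difference concrete, here is a small sketch of the bytes involved. The hex escapes stand in for what each editor mode would save to disk, and the variable names are purely illustrative.

```php
<?php
// "PØLSE" as saved by an editor in iso-8859-1 mode: Ø is the single byte 0xD8.
$latin1 = "P\xD8LSE";

// The same text saved in utf-8 mode: Ø (U+00D8) becomes the two bytes 0xC3 0x98.
$utf8 = "P\xC3\x98LSE";

echo bin2hex($latin1), "\n";   // 50d84c5345
echo bin2hex($utf8), "\n";     // 50c3984c5345

// PHP strings are plain byte sequences, so the two are neither equal nor the same length.
var_dump($latin1 === $utf8);                      // bool(false)
echo strlen($latin1), ' ', strlen($utf8), "\n";   // 5 6
```

Nothing in the string itself records which encoding was used, which is why the charset has to be either declared or guessed from the bytes.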

Posted: Wed Aug 02, 2006 6:15 pm
by feyd
tsk, tsk. Someone didn't look at the manual.
mb_detect_encoding()
Posted: Wed Aug 02, 2006 6:29 pm
by Chris Corbyn
I knew about this, sorry; I should have said. The manual notes:
mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module with the configure option. See the Install section for details.

I was hoping for something that's part of the PHP core or just written in PHP code. Sadly, you need the mbstring extension for those functions, and this is for something intended for public release. It's more of a convenience thing really, because it's for Swift: it would save having to call setCharset() unless you want to set the charset explicitly.
At the moment it just defaults to UTF-8, which comes out garbled if you typed, for example, "PØLSE" in an editor using iso-8859-1 encoding and then emailed it, so you need to call $swift->setCharset('iso-8859-1') first. If the charset could be detected fairly easily, without mbstring, I could just have it switch charsets itself.
Thanks

Posted: Wed Aug 02, 2006 6:40 pm
by Chris Corbyn
OK, you can shame me and place the dunce hat on me now. I didn't read the user comments. There's a utf-8 detection function in there which should suffice, thanks.

Posted: Wed Aug 02, 2006 6:46 pm
by feyd
d11wtq wrote: OK you can shame me and place the dunce hat on me now.
It'd be my pleasure.

Posted: Wed Aug 02, 2006 7:16 pm
by Chris Corbyn
OK, the function scans the entire string and is actually checking whether the utf-8 is valid or not... I just want to detect its existence at all. I need to speed it up, if I can, to suit my own needs.
I changed:
Code:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/q ... utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
To:
Code:
// Removed the ^ and $, changed * quantifier to + and dropped the ascii range.
function detectUTF8($string)
{
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}
Can anyone see a flaw in this? I know next to nothing about character encoding, but hopefully my pattern looks purely for any sequence of multibyte (non-ASCII) characters and can stop at the first one it finds instead of scanning the whole string. It seems to work in my tests.
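For what it's worth, one limitation can be shown with a short sketch. The function body repeats the pattern above; the test strings are made up for illustration. Because the pattern only looks for one well-formed multibyte sequence, an iso-8859-1 string whose bytes happen to form such a sequence (for example "Ã" followed by "©", bytes 0xC3 0xA9) is reported as UTF-8, and a string that mixes a valid sequence with invalid bytes still matches.

```php
<?php
// Same pattern as detectUTF8() above, repeated so this snippet runs on its own.
function detectUTF8($string)
{
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}

// UTF-8 "PØLSE": contains the sequence C3 98, so it matches.
var_dump(detectUTF8("P\xC3\x98LSE"));   // int(1)

// iso-8859-1 "PØLSE": D8 is followed by "L", not a continuation byte.
var_dump(detectUTF8("P\xD8LSE"));       // int(0)

// Bytes C3 A9 read as "é" in UTF-8 but as "Ã©" in iso-8859-1;
// the function cannot tell which one was meant.
var_dump(detectUTF8("Caf\xC3\xA9"));    // int(1)

// A valid sequence next to an invalid byte (FF) still counts as a hit.
var_dump(detectUTF8("\xC3\x98 \xFF"));  // int(1)
```

Combinations like "Ã©" are rare in real iso-8859-1 text, so as a guess the check should hold up well in practice; it is a heuristic rather than a validation.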
Posted: Thu Aug 03, 2006 4:06 am
by Chris Corbyn
OK, I can confirm that the above function does work, if anybody else ever needs it. It can detect whether a string contains UTF-8 characters, but if it doesn't, it can't tell you what the string actually is (probably iso-8859-1 if you're American or European).
It's pretty fast too, and I don't call it much: as soon as something UTF-8 is detected I never call it again.
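The "detect once, then stop checking" idea could look roughly like the sketch below. CharsetGuesser and its methods are hypothetical names and not part of Swift. To keep the snippet short, the check is a stand-in rather than the detectUTF8() pattern: it requires at least one non-ASCII byte and relies on the /u modifier, which makes preg_match() return false when the subject is not valid UTF-8. The iso-8859-1 fallback is an assumption taken from the guess in the post above.

```php
<?php
// Hypothetical helper (not part of Swift): remembers the first UTF-8 hit.
final class CharsetGuesser
{
    /** @var bool Set once any observed string has looked like UTF-8. */
    private $sawUtf8 = false;

    /** Inspect a string and return the charset guessed so far. */
    public function observe($string)
    {
        // After the first hit the regex work is skipped entirely.
        if (!$this->sawUtf8 && self::looksLikeUtf8($string)) {
            $this->sawUtf8 = true;
        }
        // Assumed fallback when nothing UTF-8 has been seen: iso-8859-1.
        return $this->sawUtf8 ? 'utf-8' : 'iso-8859-1';
    }

    /** Stand-in check: has a non-ASCII byte and is valid UTF-8 as a whole. */
    private static function looksLikeUtf8($string)
    {
        return preg_match('/[\x80-\xFF]/', $string) === 1
            && preg_match('//u', $string) === 1;
    }
}

$guesser = new CharsetGuesser();
echo $guesser->observe("Hello"), "\n";         // iso-8859-1 (pure ASCII, nothing detected)
echo $guesser->observe("P\xD8LSE"), "\n";      // iso-8859-1 (a lone 0xD8 is not valid UTF-8)
echo $guesser->observe("P\xC3\x98LSE"), "\n";  // utf-8
echo $guesser->observe("P\xD8LSE"), "\n";      // utf-8 (sticky: no further checks are run)
```

The returned value is the kind of string that could then be handed to $swift->setCharset(). The sticky behaviour also means a later iso-8859-1 string would be sent as utf-8, so an explicit setCharset() call would still be needed for mixed input.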