[SOLVED] Detect UTF-8 charset (efficiently) ?
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
If I have a string stored in a PHP variable, is there an easy way to guess which charset it is in? Most importantly, I need to tell iso-8859-1 from utf-8.
It has only just occurred to me that if I set my editor to iso-8859-1 and type "PØLSE", I get a completely different string from when my editor is in utf-8 mode and I type "PØLSE".
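To make the difference concrete, here is a minimal sketch that spells out the two byte strings by hand. The variable names are just for illustration; the bytes follow from Ø being U+00D8, which is one byte in iso-8859-1 and two bytes in utf-8.

```php
<?php
// "PØLSE" written out byte by byte, so the result does not depend on
// the encoding this file itself is saved in.
$latin1 = "P\xD8LSE";      // Ø is the single byte 0xD8 in iso-8859-1
$utf8   = "P\xC3\x98LSE";  // Ø is the two bytes 0xC3 0x98 in utf-8

echo bin2hex($latin1), "\n"; // 50d84c5345
echo bin2hex($utf8), "\n";   // 50c3984c5345

// Same text on screen, different strings as far as PHP is concerned.
var_dump($latin1 === $utf8);     // bool(false)
echo strlen($latin1), " vs ", strlen($utf8), " bytes\n"; // 5 vs 6 bytes
```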
Last edited by Chris Corbyn on Thu Aug 03, 2006 4:07 am, edited 1 time in total.
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
tsk, tsk. Someone didn't look at the manual. mb_detect_encoding()
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
feyd wrote:tsk, tsk. Someone didn't look at the manual. mb_detect_encoding()
I knew about this, sorry. I should have said that the manual notes:
mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module with the configure option. See the Install section for details.
At the moment it just defaults to UTF-8, which comes out garbled if you typed, for example, "PØLSE" in an editor with iso-8859-1 encoding and then emailed it, so you need to call $swift->setCharset('iso-8859-1') first. If the charset could be detected fairly easily without the mbstring extension, I could have it switch charsets itself.
Thanks
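For what it's worth, here is a rough sketch of how that automatic switch might look. guessCharset() is a made-up helper name, not part of Swift or PHP. It assumes only two candidates (utf-8 and iso-8859-1), uses mb_check_encoding() when mbstring happens to be loaded, and otherwise leans on the /u pattern modifier, which makes preg_match() return false for a subject that is not valid UTF-8 (this needs a PCRE build with UTF-8 support).

```php
<?php
// Hypothetical helper: guess between utf-8 and iso-8859-1 only.
function guessCharset($string)
{
    if (function_exists('mb_check_encoding')) {
        // mbstring is available, so let it validate the bytes.
        $isUtf8 = mb_check_encoding($string, 'UTF-8');
    } else {
        // No mbstring: an empty pattern with /u matches any valid
        // UTF-8 subject and fails (returns false) on an invalid one.
        $isUtf8 = (preg_match('//u', $string) === 1);
    }

    // Anything that is not valid utf-8 is assumed to be iso-8859-1.
    return $isUtf8 ? 'utf-8' : 'iso-8859-1';
}

echo guessCharset("P\xC3\x98LSE"), "\n"; // utf-8
echo guessCharset("P\xD8LSE"), "\n";     // iso-8859-1

// Possible use with the call mentioned above ($body is hypothetical):
// $swift->setCharset(guessCharset($body));
```

Plain ASCII text comes back as utf-8 here, which should be harmless because ASCII bytes are identical in both charsets.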
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
OK, the function scans the entire string and actually checks whether the utf-8 is valid or not... I just want to detect its existence at all. I need to speed it up if I can to suit my own needs.
I changed:
Code: Select all
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/q ... utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
to:
Code: Select all
// Removed the ^ and $, changed the * quantifier to + and dropped the ASCII range.
function detectUTF8($string)
{
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
OK, I can confirm that the above function works, if anybody else ever needs it. It can detect whether a string contains UTF-8 multibyte sequences, but if it doesn't, it can't tell you what the string is (probably iso-8859-1 if you're American or European).
It's pretty fast too, and I don't call it much: as soon as something UTF-8 is detected I never call it again.
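In case it helps anyone, the "never call it again" part could be sketched roughly like this. The class and its names are invented for illustration and are not how Swift itself does it; the pattern is the same multibyte check as detectUTF8() above, and the regexCalls counter exists only to show how often the regex actually runs.

```php
<?php
// Hypothetical wrapper that remembers a positive result.
class StickyUtf8Detector
{
    private $seenUtf8 = false; // flips to true once and stays true
    public $regexCalls = 0;    // demo only: counts real regex scans

    public function check($string)
    {
        if ($this->seenUtf8) {
            return true; // already decided, so skip the scan
        }

        $this->regexCalls++;
        $this->seenUtf8 = (preg_match('%(?:
              [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )%xs', $string) === 1);

        return $this->seenUtf8;
    }
}

$detector = new StickyUtf8Detector();
$first  = $detector->check("plain ascii");   // false, regex ran
$second = $detector->check("P\xC3\x98LSE");  // true, regex ran
$third  = $detector->check("anything else"); // true, regex skipped
echo $detector->regexCalls, "\n";            // 2
```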
