[SOLVED] Detect UTF-8 charset (efficiently) ?

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

If I have a string stored in a PHP variable, is there an easy way to guess what charset it's in? Most importantly, I need to distinguish between iso-8859-1 and utf-8.

It's only just occurred to me that if I set my editor to iso-8859-1 and type "PØLSE", I get a completely different string (byte-wise) from when my editor is in utf-8 mode and I type "PØLSE" :oops:
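To make the difference concrete, here is a minimal sketch that writes "PØLSE" as explicit bytes, so the result does not depend on how the script file itself is saved. The variable names are only illustrative.

```php
<?php
// The same visible text as raw bytes in each encoding.
$latin1 = "P\xD8LSE";      // "Ø" is the single byte 0xD8 in ISO-8859-1
$utf8   = "P\xC3\x98LSE";  // "Ø" is the two bytes 0xC3 0x98 in UTF-8

echo bin2hex($latin1), "\n";                       // 50d84c5345
echo bin2hex($utf8), "\n";                         // 50c3984c5345
echo strlen($latin1), " vs ", strlen($utf8), "\n"; // 5 vs 6
```

strlen() counts bytes, not characters, which is why the two strings differ in length.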
Last edited by Chris Corbyn on Thu Aug 03, 2006 4:07 am, edited 1 time in total.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

tsk, tsk. Someone didn't look at the manual. mb_detect_encoding()

Post by Chris Corbyn »

feyd wrote: tsk, tsk. Someone didn't look at the manual. mb_detect_encoding()
I knew about this, sorry. I should have said that the manual states:
"mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module with the configure option. See the Install section for details."
:( I was hoping for something that's part of the PHP core or just done in PHP code. Sadly you need the mbstring extension for those functions, and this is for something intended for public release. It's more of a convenience thing really, because it's for Swift, to save having to call setCharset() unless you want to set the charset explicitly.

At the moment it just defaults to UTF-8, which comes out garbled if you typed, for example, "PØLSE" in an editor with iso-8859-1 encoding and then emailed it, so you need to call $swift->setCharset('iso-8859-1') first. If the charset could be detected fairly easily without mbstring, I could just have it switch charsets itself.
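One possible shape for such a fallback, as a rough sketch only: use mbstring when it happens to be loaded, otherwise fall back to plain PCRE. The function name guessCharset() is hypothetical and not part of Swift. The fallback uses the /u modifier trick (preg_match() refuses a subject that is not valid UTF-8), which is a swapped-in technique and assumes PCRE was built with UTF-8 support. Treating everything that isn't valid UTF-8 as iso-8859-1 is also an assumption.

```php
<?php
// Hypothetical helper (not Swift's API): guess a charset label for a string.
function guessCharset($string)
{
    // Use mbstring only if the extension is actually available.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8') ? 'utf-8' : 'iso-8859-1';
    }

    // Plain-PCRE fallback: with /u, preg_match() does not return 1
    // when the subject contains invalid UTF-8.
    // Assumption: anything that is not valid UTF-8 is iso-8859-1.
    return preg_match('//u', $string) === 1 ? 'utf-8' : 'iso-8859-1';
}

echo guessCharset("P\xC3\x98LSE"), "\n"; // utf-8
echo guessCharset("P\xD8LSE"), "\n";     // iso-8859-1
```

Note that pure ASCII input is reported as utf-8 here, which matches the existing default.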

Thanks :)

Post by Chris Corbyn »

OK you can shame me and place the dunce hat on me now. I didn't read the user comments in the manual. There's a UTF-8 detection function in there which should suffice, thanks :)

Post by feyd »

d11wtq wrote: OK you can shame me and place the dunce hat on me now.
It'd be my pleasure.

Post by Chris Corbyn »

:lol:

OK, the function scans the entire string and is actually checking whether the UTF-8 is valid or not. I just want to detect its existence at all. I need to speed it up if I can, to suit my own needs.

I changed:

// Returns 1 if $string is valid UTF-8 and 0 otherwise (preg_match() may also return false on error).
function is_utf8($string) {
  
   // From http://w3.org/International/questions/q ... utf-8.html
   return preg_match('%^(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
   )*$%xs', $string);
  
} // function is_utf8
To:

//Removed the ^ and $, changed * quantifier to + and dropped the ascii range.
function detectUTF8($string)
{
	return preg_match('%(?:
	[\xC2-\xDF][\x80-\xBF]				# non-overlong 2-byte
	|\xE0[\xA0-\xBF][\x80-\xBF]			# excluding overlongs
	|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}	# straight 3-byte
	|\xED[\x80-\x9F][\x80-\xBF]			# excluding surrogates
	|\xF0[\x90-\xBF][\x80-\xBF]{2}		# planes 1-3
	|[\xF1-\xF3][\x80-\xBF]{3}			# planes 4-15
	|\xF4[\x80-\x8F][\x80-\xBF]{2}		# plane 16
	)+%xs', $string);
}
Can anyone see a flaw in this? I know next to nothing about character encoding, but hopefully my pattern only looks for any sequence of multibyte (non-ASCII) characters and doesn't need to validate the whole string. It seems to work in my tests.
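A quick way to probe the behaviour is to feed detectUTF8() explicit byte strings. The sketch below repeats the pattern from the post so it runs on its own. One caveat worth hedging: a string that really is iso-8859-1 can still contain a byte pair that happens to look like valid UTF-8 (for example the bytes 0xC3 0xA9, which read as "Ã©" in iso-8859-1), so a positive result is a good guess rather than proof.

```php
<?php
// Same pattern as detectUTF8() in the post, repeated so the snippet is self-contained.
function detectUTF8($string)
{
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}

var_dump(detectUTF8("PLAIN ASCII"));   // int(0): no multibyte sequence at all
var_dump(detectUTF8("P\xC3\x98LSE"));  // int(1): UTF-8 "Ø"
var_dump(detectUTF8("P\xD8LSE"));      // int(0): iso-8859-1 "Ø" is a lone high byte
var_dump(detectUTF8("caf\xC3\xA9"));   // int(1): also matches if the bytes were meant as iso-8859-1 "cafÃ©"
```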

Post by Chris Corbyn »

OK, I can confirm that the above function does work, if anybody else ever needs it. It can detect whether a string contains UTF-8 characters, but if it doesn't, it can't tell you what the encoding actually is (probably iso-8859-1 if you're American or European).

It's pretty fast too, and I don't call it much: as soon as something UTF-8 is detected I never call it again.
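The "never call it again" part could look roughly like the sketch below. The class name Utf8Watcher and its members are hypothetical and are not taken from Swift; the pattern is the one from detectUTF8(), written on one logical line without the x modifier.

```php
<?php
// Hypothetical wrapper (not Swift's actual code): remembers the first positive result.
class Utf8Watcher
{
    private $seenUtf8 = false; // flips to true once and then stays true
    public $scans = 0;         // how many times the regex actually ran

    // Returns true if this string, or any earlier one, contained a UTF-8 sequence.
    public function check($string)
    {
        if ($this->seenUtf8) {
            return true; // detected earlier, so skip the regex entirely
        }
        $this->scans++;
        $pattern = '%(?:[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
                 . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
                 . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
                 . '|\xF4[\x80-\x8F][\x80-\xBF]{2})%s';
        $this->seenUtf8 = preg_match($pattern, $string) === 1;
        return $this->seenUtf8;
    }
}

$watcher = new Utf8Watcher();
$first  = $watcher->check("plain ascii");    // false, regex ran
$second = $watcher->check("P\xC3\x98LSE");   // true, regex ran
$third  = $watcher->check("anything else");  // true, regex skipped
echo $watcher->scans, "\n";                  // 2
```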