Learn UTF8

Post by Ambush Commander

So you want to know how to read UTF-8? Here's some nice, verbose PHP code that shows how it works: it walks a UTF-8 string byte by byte and returns an array of code points.

You should also read up here:

* http://www.randomchaos.com/documents/?s ... nd_unicode
* http://en.wikipedia.org/wiki/UTF-8

Code:

function utf8_to_character_array($string) {
        
        $character_array    = array();
        $values             = array();
        $looking_for        = 1;
        $string_length      = strlen($string);
        
        //each iteration represents a byte
        for ($i = 0; $i < $string_length; $i++) {
            
            //get integer value of this
            $value = ord($string[$i]);
            
            if ($looking_for == 1) { //we are looking for the beginning
                
                if ($value < 128) { //check if byte begins with zero
                    //it does, simple ASCII character
                    $character_array[] = $value;
                } elseif ($value >= 192 && $value < 224) { //check if byte is 110xxxxx
                    $looking_for = 2;   //character is two bytes
                    $values[] = $value; //save the byte for later processing
                    continue;
                } elseif ($value >= 224 && $value < 240) { //check if byte is 1110xxxx
                    $looking_for = 3;   //character is three bytes
                    $values[] = $value; //save the byte for later processing
                    continue;
                } elseif ($value >= 240 && $value < 248) { //check if byte is 11110xxx
                    //four byte character: unimplemented, ignore
                } else {
                    //nonsensical byte, ignore
                    continue;
                } 
                
            } elseif ($looking_for == 2) { //two byte character
                
                //sanity check
                if (!($value >= 128 && $value < 192)) { //check if byte isn't 10xxxxxx
                    //broken sequence: drop it and reread this byte as a new start
                    $looking_for = 1;
                    $values = array();
                    $i--;
                    continue;
                }
                
                $values[] = $value;
                
                //extract x's from 110xxxxx 10xxxxxx
                $character_array[] = (($values[0] % 32) * 64) +
                                     ($values[1] % 64);
                
            } elseif ($looking_for == 3) { //three byte character
                
                //sanity check
                if (!($value >= 128 && $value < 192)) { //check if byte isn't 10xxxxxx
                    //broken sequence: drop it and reread this byte as a new start
                    $looking_for = 1;
                    $values = array();
                    $i--;
                    continue;
                }
                
                $values[] = $value;
                
                if (count($values) == 2) { //is there one last byte?
                    continue;
                }
                
                //extract x's from 1110xxxx 10xxxxxx 10xxxxxx
                $character_array[] = (($values[0] % 16) * 4096) +
                                     (($values[1] % 64 ) * 64) +
                                     ($values[2] % 64);
                
            } elseif ($looking_for == 4) { //four byte character
                //unimplemented, ignore
            }
            
            //cleanup
            $looking_for = 1;
            $values = array();
            
        }
    
        return $character_array;
        
    }
And if someone could tell me how to handle the four-byte characters, that would be nice. :D (Yes, I had an alternative, but I decided to try rolling my own to learn more about UTF-8's internals. I learned more about binary and hexadecimal than I ever want to for the rest of my life.)
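
The four-byte case follows the same pattern as the two- and three-byte ones: the lead byte 11110xxx carries three bits and each of the three trailing 10xxxxxx bytes carries six. Below is a minimal sketch of that step, assuming the same modulo arithmetic as the function above. The helper name utf8_decode_four_bytes is made up for this example (it is not from PEAR or from the original code), and it takes the four byte values directly rather than plugging into the state machine.

```php
<?php
// Hypothetical helper, not part of the original function: decodes one
// four-byte sequence 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, given as four
// integer byte values. Returns the code point, or false if invalid.
function utf8_decode_four_bytes($b0, $b1, $b2, $b3) {
    // lead byte must be 11110xxx, i.e. 240..247
    if ($b0 < 240 || $b0 >= 248) {
        return false;
    }
    // each trailing byte must be 10xxxxxx, i.e. 128..191
    foreach (array($b1, $b2, $b3) as $b) {
        if ($b < 128 || $b >= 192) {
            return false;
        }
    }
    // extract x's: 3 bits from the lead byte, 6 bits from each trailing byte
    $code_point = (($b0 % 8)  * 262144) + // 64 * 64 * 64
                  (($b1 % 64) * 4096)   + // 64 * 64
                  (($b2 % 64) * 64)     +
                  ($b3 % 64);
    // four-byte characters cover U+10000..U+10FFFF only; anything else
    // is an overlong or out-of-range sequence
    if ($code_point < 0x10000 || $code_point > 0x10FFFF) {
        return false;
    }
    return $code_point;
}

// U+1F600 is encoded as the bytes F0 9F 98 80
$smiley = utf8_decode_four_bytes(0xF0, 0x9F, 0x98, 0x80); // 0x1F600 (128512)
```

To wire this into the loop, the 11110xxx branch would set $looking_for = 4 and save the byte, and the $looking_for == 4 branch would collect three trailing bytes before doing the arithmetic, the same way the three-byte branch waits for its second trailing byte.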

Edit: Meh, PEAR does this too... I never thought of using the right shift...

Code:

if ($value >> 5 == 6) {
    $values[] = ($value - 192) << 6;
    $search   = 2;
} elseif ($value >> 4 == 14) {
    $values[] = ($value - 224) << 12;
    $search   = 3;
}
I really need to learn more about these bitwise operators.
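
For anyone else comparing the two styles, here is a small sketch of how the modulo-and-multiply arithmetic lines up with shifts and masks. The function names are made up for this example. The idea: shifting right by 5 throws away the five x bits of 110xxxxx and leaves 110 (which is 6), % 32 is the same as & 0x1F for byte values, and * 64 is the same as << 6. For lead bytes in 192..223, subtracting 192 as in the PEAR excerpt gives the same result as masking with 0x1F.

```php
<?php
// Sketch only: helper names are invented for illustration.

// lead-byte test for 110xxxxx, written as a range check and as a shift
function is_two_byte_lead_range($value) {
    return $value >= 192 && $value < 224;
}
function is_two_byte_lead_shift($value) {
    return ($value >> 5) == 6; // drop the five x bits; binary 110 is 6
}

// payload of a two-byte character 110xxxxx 10xxxxxx, both ways
function two_byte_modulo($b0, $b1) {
    return (($b0 % 32) * 64) + ($b1 % 64);
}
function two_byte_bitwise($b0, $b1) {
    return (($b0 & 0x1F) << 6) | ($b1 & 0x3F);
}

// "é" (U+00E9) is encoded as the bytes C3 A9
$code_point = two_byte_bitwise(0xC3, 0xA9); // 233, i.e. 0xE9
```

Worked by hand: 0xC3 is 195, 195 % 32 is 3, and 3 * 64 is 192; 0xA9 is 169, and 169 % 64 is 41; 192 + 41 is 233, which is U+00E9.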

Post by Weirdan

You may find the parse_utf8 function useful (for example, to see another implementation): viewtopic.php?p=134535#134535