php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 7:29 am
by ddragas
Hi all

I need to create a function that returns the Unicode 5.1 number of a character

for example:
if I give the character "Đ" to the function, it should return the number "0110"

(please check picture)

can somebody point me in the right direction?
what functions should I use?

thank you and kind regards

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 7:48 am
by Chris Corbyn
Until PHP 6 comes out, PHP is not Unicode aware.

Even then, I don't think we'll have a new "char" type that would hold the Unicode value you're in need of. I'm in the process of writing encoders and decoders for PHP 5 (that are unicode aware). So far I can decode UTF-8 streams into sequences of octets logically grouped by character, and a series of integers representing the unicode values (UCS-4) of those characters.

I just had a quick flick through the Multibyte String functions and can't see anything in there that returns the unicode value of the character but you may find something useful anyway.

Perhaps parsing the output of this:

http://au2.php.net/manual/en/function.m ... entity.php

What character encoding are you working with?

PHP really needs char and byte types :(
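
For anyone reading later: one way to get at the UCS-4 integers mentioned above without a hand-written decoder is to convert the string to UCS-4 and unpack it as 32-bit integers. This is only a rough sketch. It assumes the iconv extension is available, and `utf8_to_codepoints()` is a made-up helper name, not a PHP built-in.

```php
<?php
// Sketch: code points of a UTF-8 string via a UCS-4 (big-endian) conversion.
// utf8_to_codepoints() is a hypothetical helper name, not part of PHP.
function utf8_to_codepoints($utf8)
{
    if ($utf8 === '') {
        return array();
    }
    // UCS-4BE stores every character as 4 bytes; iconv() returns false on invalid input
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $utf8);
    if ($ucs4 === false) {
        return array();
    }
    // 'N' = unsigned 32-bit big-endian; array_values() makes the result 0-indexed
    return array_values(unpack('N*', $ucs4));
}

$codes = utf8_to_codepoints("\xC4\x90"); // "\xC4\x90" is 'Đ' encoded as UTF-8
printf("%04X\n", $codes[0]); // prints 0110
```

On PHP 7.2 and later the mbstring extension also has `mb_ord()`, which returns the code point of a single character directly.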

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 7:52 am
by ddragas
Hi Chris and thank you for quick reply

all characters are in utf-8 and database too.

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 7:56 am
by Chris Corbyn
I can confirm that I am able to decode that character to Hex 0110 using my code.

If you're using UTF-8 I'd be happy to share.

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 7:56 am
by Chris Corbyn
2 seconds... I'll put some code up. It's terribly unfinished, but the stuff you need is there.

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 8:18 am
by ddragas
great :D

can hardly wait

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 8:19 am
by Apollo
If you have a string representing 'Đ' in utf-8 encoding, it's no problem at all converting that to 0x110. It has nothing to do with your particular PHP version being Unicode compliant or not. Just decode the utf-8 by hand. I guess Chris Corbyn is about to post this, but if he has something else in mind, I'll post another solution.
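
To make "decode by hand" concrete for this one character, here is the bit arithmetic written out. It is a worked example for 'Đ' only, not a general decoder, and the variable names are just for illustration.

```php
<?php
// 'Đ' (U+0110) is the two bytes C4 90 in UTF-8:
//   C4 = 110 00100   lead byte of a 2-byte sequence, payload 00100
//   90 = 10 010000   continuation byte, payload 010000
$lead = 0xC4;
$cont = 0x90;

// Drop the marker bits and join the payloads: 00100 010000 = 0x110
$code = (($lead & 0x1F) << 6) | ($cont & 0x3F);

printf("%04X\n", $code); // prints 0110
```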

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 8:43 am
by Chris Corbyn
Sorry, took me a while to strip out the development stuff I'm building and to wrap it with a convenient function.

Here's how you use it:

Code: Select all

<?php
 
require_once dirname(__FILE__) . '/../get_ucs4_value.php';
 
echo dechex(get_ucs4_value('Đ')); //110
There's also a version that gets an array of unicode characters from a string:

Code: Select all

 
<?php
 
require_once dirname(__FILE__) . '/../get_ucs4_value.php';
 
$ucs4 = get_ucs4_values('??? ?? ??????? ???????? ??????????????, ??? ???? ????? ?????? ??.');
 
foreach ($ucs4 as $value) { //PHP's integers are decimal... we'll present them as hexadecimal
  printf("%08X\n", $value);
}
 
/*
0000041C
0000043E
00000433
00000020
0000043D
00000435
00000020
0000043F
0000043E
0000043C
0000043D
00000438
00000442
0000044C
00000020
0000043D
00000438
0000043A
00000430
0000043A
0000043E
 
... and so on ...
*/
NOTE: My code works but is very much a half-built development version. I haven't added the support for replacing ill-formed data yet (you'll see it commented out).

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 8:54 am
by ddragas
thank you Chris for code example

I'm getting the value 400 instead of 0110

here is complete code

Code: Select all

 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 
<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Untitled 1</title>
</head>
 
<body>
<?php
 
 require_once 'get_ucs4_value.php';
  
 echo dechex(get_ucs4_value('Đ')); //110
 
?>
</body>
</html>
 

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 8:57 am
by Chris Corbyn
In that case something is not UTF-8. I certainly get 110.

What happens if you change it to:

Code: Select all

echo dechex(get_ucs4_value(utf8_encode('Đ'))); //110
?

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 9:00 am
by ddragas
Already tried this

I get as response

d0

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 9:02 am
by ddragas
sorry - my mistake

file was not saved as utf-8 :oops:

it is working now :D

thank you for help
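
A quick way to catch this kind of problem is to dump the raw bytes of the string and check that they are valid UTF-8. The sketch below assumes the file had been saved as Windows-1250, where 'Đ' is the single byte D0. That is a guess, the thread does not say which encoding it was.

```php
<?php
// Hypothetical check: what bytes does the string literal really contain?
$fromUtf8File   = "\xC4\x90"; // 'Đ' as stored in a UTF-8 file
$fromCp1250File = "\xD0";     // 'Đ' as stored in a Windows-1250 file (assumed)

echo bin2hex($fromUtf8File), "\n";   // c490
echo bin2hex($fromCp1250File), "\n"; // d0

// preg_match() with the /u modifier fails on byte sequences that are not valid UTF-8
var_dump(preg_match('//u', $fromUtf8File) === 1);   // bool(true)
var_dump(preg_match('//u', $fromCp1250File) === 1); // bool(false)
```

Under that assumption both odd results above would make sense: a lone D0 looks like the lead byte of a two-byte sequence with payload 10000, which shifted left by 6 bits gives 400, and utf8_encode() treats D0 as the Latin-1 character U+00D0, which gives d0.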

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 9:04 am
by Chris Corbyn
No problem :)

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 9:49 am
by Apollo
Chris, I'm sure your code works fine, but it looks way more complicated than necessary. Here's my version:

Code: Select all

function ExtractUtf8Codes( $s ) // convert utf8 encoded string to array of separate unicode codes
{
    $invalid = 0x3f; // code for invalid chars (0x3f = '?')
    $codes = array();
    $len = strlen($s);
    for($i=0;$i<$len;) // stop at the real end of the string, so embedded NUL bytes are not treated as the end
    {
        $c = ord($s[$i++]);
        if (!($c & 0x80)) { $codes[] = $c; continue; } // single byte char
        $n = 0;
        while ($c & (128 >> $n)) $n++; // count leading 1 bits = length of the sequence in bytes
        if ($n<2 || $n>6) { $codes[] = $invalid; continue; } // invalid char (should be 11etc)
        $x = $c & ((1 << (8-$n))-1); // get top bits
        for(;$n>1;$n--)
        {
            $c = ($i<$len) ? ord($s[$i]) : 0; // a sequence cut off by the end of the string is invalid
            if (($c & 0xC0)!=0x80) { $codes[] = $invalid; continue 2; } // invalid char (subsequent chars should be 10etc), go on with the next char
            $x = ($x << 6) | ($c & 0x3F); // append bits
            $i++;
        }
        $codes[] = $x;
    }
    return $codes;
}
 
$example = chr(0x61).chr(0xc4).chr(0x90).chr(0xe2).chr(0x82).chr(0xac); // $example contains 'aĐ€' in utf8 encoding
$codes = ExtractUtf8Codes( $example );
// $codes is now array(0x61,0x110,0x20AC)

Re: php & Unicode 5.1 characters

Posted: Thu Apr 02, 2009 10:25 am
by Chris Corbyn
Yeah, mine's taken from a larger OOP system that needs to handle multiple character encodings where the input stream may be from different sources (file, string).

I agree that using yours for this particular problem would be better :) I didn't write mine to solve this problem, I just had it lying around from part of a much larger project.

EDIT | Yours will be a lot slower for larger strings BTW due to the repeated ord() usage. Some of the verbosity of mine is because it needs to be fast (it's part of Swift Mailer).
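
If the per-byte ord() calls are the concern, one option is to fetch all byte values in a single unpack() call and decode from that array. The following is only a sketch of that idea, not Swift Mailer's code. `decode_utf8_bytes()` is a made-up name, it does not reject overlong forms or surrogates, and whether it is actually faster would need measuring.

```php
<?php
// Sketch: decode UTF-8 from a byte array fetched once with unpack(),
// instead of calling ord() for every byte. Hypothetical helper name.
function decode_utf8_bytes($s)
{
    if ($s === '') {
        return array();
    }
    $bytes = array_values(unpack('C*', $s)); // 0-indexed array of byte values
    $count = count($bytes);
    $codes = array();
    for ($i = 0; $i < $count; ) {
        $c = $bytes[$i++];
        if ($c < 0x80) { $codes[] = $c; continue; }                  // 1 byte
        if (($c & 0xE0) == 0xC0)     { $n = 1; $x = $c & 0x1F; }     // 2 bytes
        elseif (($c & 0xF0) == 0xE0) { $n = 2; $x = $c & 0x0F; }     // 3 bytes
        elseif (($c & 0xF8) == 0xF0) { $n = 3; $x = $c & 0x07; }     // 4 bytes
        else { $codes[] = 0x3F; continue; }                          // invalid lead byte -> '?'
        for (; $n > 0; $n--) {
            // a missing or malformed continuation byte makes the whole char invalid
            if ($i >= $count || ($bytes[$i] & 0xC0) != 0x80) { $codes[] = 0x3F; continue 2; }
            $x = ($x << 6) | ($bytes[$i++] & 0x3F);
        }
        $codes[] = $x;
    }
    return $codes;
}

foreach (decode_utf8_bytes("a\xC4\x90\xE2\x82\xAC") as $code) {
    printf("%04X\n", $code); // prints 0061, 0110, 20AC on separate lines
}
```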