php & Unicode 5.1 characters

ddragas
Forum Contributor
Posts: 445
Joined: Sun Apr 18, 2004 4:01 pm

php & Unicode 5.1 characters

Post by ddragas »

Hi all

I need to create a function that returns the Unicode 5.1 code point of a character, as a hexadecimal number.

For example:
if I give the character "Đ" to the function, it should return the number "0110"

(please check picture)

Can somebody point me in the right direction?
What functions should I use?

thank you and kind regards
Attachments
ScreenShot015.jpg (129.28 KiB)
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: php & Unicode 5.1 characters

Post by Chris Corbyn »

Until PHP 6 comes out, PHP is not Unicode aware.

Even then, I don't think we'll have a new "char" type that would hold the Unicode value you're in need of. I'm in the process of writing encoders and decoders for PHP 5 (that are unicode aware). So far I can decode UTF-8 streams into sequences of octets logically grouped by character, and a series of integers representing the unicode values (UCS-4) of those characters.

I just had a quick flick through the Multibyte String functions and can't see anything in there that returns the unicode value of the character but you may find something useful anyway.

Perhaps parsing the output of this:

http://au2.php.net/manual/en/function.m ... entity.php

What character encoding are you working with?

PHP really needs char and byte types :(
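
A possible workaround with the Multibyte String functions, as a minimal sketch: convert the UTF-8 string to UCS-4BE, where every character is one 4-byte big-endian integer, then unpack those integers. This assumes the mbstring extension is loaded; `utf8_to_codepoints()` is a hypothetical helper name used only for illustration, not part of PHP or of the code discussed in this thread.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
// utf8_to_codepoints() is a hypothetical name, not a PHP built-in.
function utf8_to_codepoints($utf8)
{
    // UCS-4BE stores each character as one 4-byte big-endian integer.
    $ucs4 = mb_convert_encoding($utf8, 'UCS-4BE', 'UTF-8');
    // 'N*' reads consecutive unsigned 32-bit big-endian integers.
    // unpack() numbers its keys from 1, so re-index from 0.
    return array_values(unpack('N*', $ucs4));
}

$codes = utf8_to_codepoints("\xC4\x90"); // "\xC4\x90" is 'Đ' in UTF-8
printf("%04X\n", $codes[0]); // prints 0110
```

Writing the character as escaped bytes keeps the example independent of the encoding the source file is saved in. Newer PHP versions (7.2 and later) also provide mb_ord() for a single character.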

Post by ddragas »

Hi Chris, and thank you for the quick reply

All characters are in UTF-8, and the database is too.

Post by Chris Corbyn »

I can confirm that I am able to decode that character to Hex 0110 using my code.

If you're using UTF-8 I'd be happy to share.

Post by Chris Corbyn »

2 seconds... I'll put some code up. It's terribly unfinished, but the stuff you need is there.

Post by ddragas »

great :D

hardly waiting
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: php & Unicode 5.1 characters

Post by Apollo »

If you have a string representing 'Đ' in UTF-8 encoding, it's no problem at all to convert that to 0x110. It has nothing to do with whether your particular PHP version is Unicode compliant. Just decode the UTF-8 by hand. I guess Chris Corbyn is about to post this, but if he has something else in mind, I'll post another solution.
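
To illustrate what "by hand" means, here is a rough sketch for a single character. The lead byte says how many bytes the sequence has and carries the top payload bits; each following byte (10xxxxxx) adds six more. For 'Đ' the bytes are 0xC4 0x90: 0xC4 is 110 00100 (payload 00100), 0x90 is 10 010000 (payload 010000), and together that is 00100 010000 = 0x110. The name `utf8_first_code()` is hypothetical, and the sketch assumes well-formed input and does no validation.

```php
<?php
// Sketch only: decodes the first UTF-8 character of $s.
// utf8_first_code() is a hypothetical name; input is assumed to be well-formed UTF-8.
function utf8_first_code($s)
{
    $lead = ord($s[0]);
    if ($lead < 0x80) {
        return $lead;                      // 0xxxxxxx: plain ASCII, one byte
    } elseif (($lead & 0xE0) == 0xC0) {
        $n = 2; $code = $lead & 0x1F;      // 110xxxxx: 5 payload bits
    } elseif (($lead & 0xF0) == 0xE0) {
        $n = 3; $code = $lead & 0x0F;      // 1110xxxx: 4 payload bits
    } else {
        $n = 4; $code = $lead & 0x07;      // 11110xxx: 3 payload bits
    }
    for ($i = 1; $i < $n; $i++) {
        // each continuation byte 10xxxxxx contributes 6 payload bits
        $code = ($code << 6) | (ord($s[$i]) & 0x3F);
    }
    return $code;
}

printf("%04X\n", utf8_first_code("\xC4\x90")); // prints 0110
```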

Post by Chris Corbyn »

Sorry, took me a while to strip out the development stuff I'm building and to wrap it with a convenient function.

Here's how you use it:

<?php
 
require_once dirname(__FILE__) . '/../get_ucs4_value.php';
 
echo dechex(get_ucs4_value('Đ')); //110

There's also a version that gets an array of Unicode code points from a string:

<?php
 
require_once dirname(__FILE__) . '/../get_ucs4_value.php';
 
$ucs4 = get_ucs4_values('??? ?? ??????? ???????? ??????????????, ??? ???? ????? ?????? ??.');
 
foreach ($ucs4 as $value) { //PHP's integers are decimal... we'll present them as hexadecimal
  printf("%08X\n", $value);
}
 
/*
0000041C
0000043E
00000433
00000020
0000043D
00000435
00000020
0000043F
0000043E
0000043C
0000043D
00000438
00000442
0000044C
00000020
0000043D
00000438
0000043A
00000430
0000043A
0000043E
 
... and so on ...
*/

NOTE: My code works but is very much a half-built development version. I haven't added the support for replacing ill-formed data yet (you'll see it commented out).
Attachments
ucs4-handling.zip (12.83 KiB)

Post by ddragas »

Thank you, Chris, for the code example

I'm getting the value 400 instead of 0110

Here is the complete code

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 
<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Untitled 1</title>
</head>
 
<body>
<?php
 
 require_once 'get_ucs4_value.php';
  
 echo dechex(get_ucs4_value('Đ')); //110
 
?>
</body>
</html>
 

Post by Chris Corbyn »

In that case, something is not UTF-8. I certainly get 110.

What happens if you change the line to this?

echo dechex(get_ucs4_value(utf8_encode('Đ'))); //110

Post by ddragas »

Already tried this

I get as response

d0

Post by ddragas »

sorry - my mistake

file was not saved as utf-8 :oops:

it is working now :D

thank you for help
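
For anyone who hits the same symptom, both wrong values are consistent with the source file having been saved in a single-byte encoding such as ISO-8859-2 or Windows-1250, where 'Đ' is the lone byte 0xD0. That is an assumption about this case, not something confirmed in the thread, and how a given decoder treats a truncated sequence may differ. A small diagnostic sketch (assumes the mbstring extension is loaded; the variable names are illustrative only):

```php
<?php
// Sketch only: the single-byte encoding is an assumed cause, not a confirmed one.
$single_byte = "\xD0";     // 'Đ' in ISO-8859-2 / Windows-1250
$utf8        = "\xC4\x90"; // 'Đ' in UTF-8

// A lone 0xD0 is a 2-byte lead byte with no continuation byte, so it is not valid UTF-8.
var_dump(mb_check_encoding($single_byte, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));        // bool(true)

// 0xD0 is 110 10000: five payload bits, shifted left by 6 for the missing
// continuation byte, which matches the "400" result.
echo dechex((0xD0 & 0x1F) << 6), "\n"; // 400

// utf8_encode() treats its input as ISO-8859-1, where 0xD0 is U+00D0,
// which matches the "d0" result. The same conversion with mbstring:
echo bin2hex(mb_convert_encoding($single_byte, 'UTF-8', 'ISO-8859-1')), "\n"; // c390
```

Dumping a suspect string with bin2hex() is a quick way to see which bytes PHP actually received.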

Post by Chris Corbyn »

No problem :)

Post by Apollo »

Chris, I'm sure your code works fine, but it looks way more complicated than necessary. Here's my version:

function ExtractUtf8Codes( $s ) // convert utf8 encoded string to array of separate unicode codes
{
    $invalid = 0x3f; // code for invalid chars (0x3f = '?')
    $codes = array();
    $len = strlen($s); // loop over the real byte length instead of reading past the end
    for($i=0;$i<$len;)
    {
        $c = ord($s[$i++]);
        if (!($c & 0x80)) { $codes[] = $c; continue; } // single byte char
        $n = 0;
        while ($c & (128 >> $n)) $n++; // number of leading 1 bits = length of the sequence
        if ($n<2 || $n>4) { $codes[] = $invalid; continue; } // invalid lead byte (should be 110xxxxx, 1110xxxx or 11110xxx)
        $x = $c & ((1 << (8-$n))-1); // payload bits of the lead byte
        for(;$n>1;$n--)
        {
            $c = ($i<$len) ? ord($s[$i]) : 0;
            if (($c & 0xC0)!=0x80) { $codes[] = $invalid; continue 2; } // invalid or missing continuation byte (should be 10xxxxxx); skip to next char
            $x = ($x << 6) | ($c & 0x3F); // append bits
            $i++;
        }
        $codes[] = $x;
    }
    return $codes;
}
 
$example = chr(0x61).chr(0xc4).chr(0x90).chr(0xe2).chr(0x82).chr(0xac); // $example contains 'aĐ€' in utf8 encoding
$codes = ExtractUtf8Codes( $example );
// $codes is now array(0x61,0x110,0x20AC)

Post by Chris Corbyn »

Yeah, mine's taken from a larger OOP system that needs to handle multiple character encodings where the input stream may come from different sources (file, string).

I agree that using yours for this particular problem would be better :) I didn't write mine to solve this problem, I just had it lying around from part of a much larger project.

EDIT | Yours will be a lot slower for larger strings BTW due to the repeated ord() usage. Some of the verbosity of mine is because it needs to be fast (it's part of Swift Mailer).