Working with UTF-8 chars

Ree
Forum Regular
Posts: 592
Joined: Fri Jun 10, 2005 1:43 am
Location: LT

Post by Ree »

How do I use string manipulation functions with UTF-8 characters? Here's a little function I wrote which truncates the passed string to the indicated length without chopping off part of the last word:

Code:

function truncate($str, $chars)
{
  $str = substr($str, 0, $chars + 1);
  $length = strlen($str);
  for ($i = $length - 1; $i > 0; $i--)
  {
    if (substr($str, $i, 1) == ' ')
    {
      $check = $i;
      break;
    }
  }
  if (isset($check))
  {
    $str = substr($str, 0, $check + 1) . '...';
  } else
  {
    $str = '';
  }  
  return $str;
}

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate($str, 40);
I need to get this:

Code:

ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą ...
But I get this:

Code:

ąeęąčėę ūųūĮŠĮŠĖ ...
Of course, that's because, as feyd pointed out, it only works with single-byte chars (the above works just fine with standard chars): substr() and strlen() count bytes, and each of these letters takes two bytes in UTF-8. How do you manipulate multibyte character strings then? It's extremely important to me, since ALL of the sites I'm going to develop in the future (and the one I'm doing atm as well) will use UTF-8. Maybe there's some simple solution I don't know about.
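
To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8.
$word = 'ąčę';

// strlen() counts bytes: each of these three letters takes two bytes in UTF-8.
echo strlen($word), "\n";                      // 6

// substr() cuts at a byte offset, so this returns "ą" plus the first half of "č".
$cut = substr($word, 0, 3);
echo bin2hex($cut), "\n";                      // c485c4

// With the /u modifier, PCRE matches whole UTF-8 characters instead of bytes.
echo preg_match_all('/./su', $word, $m), "\n"; // 3
```

The dangling byte left by substr() is why the byte-based truncate() miscounts and can cut a letter in half.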
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

http://uk.php.net/manual/en/ref.mbstring.php

Or.. if it's available, use PHP5.
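
If mbstring is available, a truncate along these lines might be enough. This is only a sketch: truncate_mb is a made-up name, and it assumes the mbstring extension is loaded and that all strings are UTF-8.

```php
<?php
// Assumption: the mbstring extension is loaded and every string is UTF-8.
mb_internal_encoding('UTF-8');

// Hypothetical helper, not part of mbstring itself.
function truncate_mb($str, $chars)
{
  if (mb_strlen($str) <= $chars)
  {
    return $str;
  }
  // Keep one extra character so a space right after the limit still counts.
  $str = mb_substr($str, 0, $chars + 1);
  // Character position of the last space inside the cut.
  $pos = mb_strrpos($str, ' ');
  if ($pos === false)
  {
    return '';
  }
  return mb_substr($str, 0, $pos) . ' ...';
}

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate_mb($str, 40), "\n"; // ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą ...
```

All offsets here are character positions, so a two-byte letter is never split.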

Post by Ree »

Yes, I have found that as well, but there's one BUT.
From the manual: "mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module with the configure option."
Does that mean I may have problems with shared hosts? Or do they usually have it enabled?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

you may have an issue with hosts, yes.

Post by Ree »

I wonder why it isn't enabled by default? To make my life more difficult?? :roll:

Post by feyd »

the majority of sites out there do not deal with languages outside their character set frequently enough, I would imagine...

Post by Ree »

Not all languages are EN... In my country, almost all websites are multilingual - the same content is usually presented in Lithuanian, Russian and English. So if I can truncate EN news items, I need to be able to do the same with RU and LT ones.

Is there another way you could suggest to make the function I posted work with language-specific chars? Maybe storing strings in the db as htmlentities?

Post by feyd »

it's not all that hard to create your own utf-8 parser..

Post by onion2k »

Ree wrote: Not all languages are EN... In my country, almost all websites are multilingual - the same content is usually presented in Lithuanian, Russian and English. So if I can truncate EN news items, I need to be able to do the same with RU and LT ones.
If that's the case then it's quite likely your hosting company will have enabled mbstring. The best way to find out is to try some of the functions and see if you get an error or not.
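
Rather than waiting for an error, the check can be done up front. A rough sketch: extension_loaded() and function_exists() are standard PHP, while char_count is a made-up helper that falls back to a PCRE match when mbstring is missing.

```php
<?php
// Hypothetical helper: counts characters with mbstring when it is loaded,
// and falls back to a PCRE /u match otherwise. Assumes valid UTF-8 input.
function char_count($str)
{
  if (function_exists('mb_strlen'))
  {
    return mb_strlen($str, 'UTF-8');
  }
  // preg_match_all() returns the number of full matches, one per character here.
  return preg_match_all('/./su', $str, $unused);
}

echo extension_loaded('mbstring') ? "mbstring is available\n" : "mbstring is missing\n";
echo char_count('ąčę šėš'), "\n"; // 7
```

Wrapping the check in one helper keeps the rest of the code the same on hosts with and without the extension.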

Post by Ree »

I very much need to make it work on all standard hosts. Local hosts in my country aren't cheap, so they usually host sites of bigger companies. The cheap hosts over here are mere resellers (the physical host is in US usually).
feyd wrote: it's not all that hard to create your own utf-8 parser..
I have no idea how to do this, really :lol:

Post by feyd »

well.. there's this: viewtopic.php?t=36549 which, although not exactly what you need, has references to the texts to read about UTF8 encodings among other details.. You could also reverse its logic creating a UTF8 to HTML entity conversion..
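
For reference, a hand-rolled parser can be quite small, because the first byte of every UTF-8 sequence says how many bytes the character occupies. The sketch below is not the code from the linked topic; utf8_split is a made-up name, and it assumes the input is already valid UTF-8 (no validation is done).

```php
<?php
// Hypothetical helper: splits a UTF-8 string into an array of characters
// by reading the lead byte of each sequence. Assumes valid UTF-8 input.
function utf8_split($str)
{
  $chars = array();
  $bytes = strlen($str);
  $i = 0;
  while ($i < $bytes)
  {
    $lead = ord($str[$i]);
    if ($lead < 0x80)
    {
      $size = 1; // 0xxxxxxx: plain ASCII
    } elseif ($lead < 0xE0)
    {
      $size = 2; // 110xxxxx: two-byte sequence
    } elseif ($lead < 0xF0)
    {
      $size = 3; // 1110xxxx: three-byte sequence
    } else
    {
      $size = 4; // 11110xxx: four-byte sequence
    }
    $chars[] = substr($str, $i, $size);
    $i += $size;
  }
  return $chars;
}

$chars = utf8_split('ąčę šėš');
echo count($chars), "\n";                        // 7
echo implode('', array_slice($chars, 0, 3)), "\n"; // ąčę
```

Once the string is an array of characters, length and substring operations become count() and array_slice().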

Post by Ree »

Here's a working version of a few UTF-8 string functions:

Code:

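// Returns $length characters of $str starting at character (not byte) offset $start.
function utf8_substr($str, $start, $length = null)
{
  // With the /u modifier, "." matches one whole UTF-8 character.
  preg_match_all('/./su', $str, $chars);
  // Test for null explicitly: empty() also treats a length of 0 as "no length given".
  if ($length === null)
  {
    $chars[0] = array_slice($chars[0], $start);
  } else
  {
    $chars[0] = array_slice($chars[0], $start, $length);
  }
  $str = implode('', $chars[0]);
  return $str;
}

// Returns the number of characters (not bytes) in $str.
function utf8_strlen($str)
{
  preg_match_all('/./su', $str, $chars);
  return count($chars[0]);
}

// Truncates $str to at most $chars characters without cutting a word, then appends ' ...'.
function truncate($str, $chars)
{
  $length = utf8_strlen($str);
  // Short enough already: return it unchanged.
  if ($chars >= $length)
  {
    return $str;
  }
  // If the character right after the limit is a space, the cut falls on a word boundary.
  $str = utf8_substr($str, 0, $chars + 1);
  if (utf8_substr($str, $chars, 1) == ' ')
  {
    return utf8_substr($str, 0, $chars) . ' ...';
  }
  // Otherwise walk back to the last space inside the limit.
  $str = utf8_substr($str, 0, $chars);
  for ($i = $chars - 1; $i >= 0; $i--)
  {
    if (utf8_substr($str, $i, 1) == ' ')
    {
      $mark = $i;
      break;
    }
  }
  if (isset($mark))
  {
    $str = utf8_substr($str, 0, $mark) . ' ...';
  } else
  {
    // No space found: the first word alone is longer than the limit.
    $str = '';
  }
  return $str;
}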

Code:

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate($str, 12);

Outputs:
ąeęąčėę ...
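
One possible refinement, offered only as a sketch: every utf8_substr() call splits the whole string again, so the loop does that work many times over. Splitting once and walking the array should give the same result for the example above. truncate_once is a made-up name, and valid UTF-8 input is assumed.

```php
<?php
// Hypothetical variant of truncate(): splits the string into characters once,
// then works on the array. Assumes $str is valid UTF-8.
function truncate_once($str, $chars)
{
  preg_match_all('/./su', $str, $m);
  $all = $m[0];
  if ($chars >= count($all))
  {
    return $str;
  }
  // Walk back from the character just past the limit to the nearest space.
  for ($i = $chars; $i >= 0; $i--)
  {
    if ($all[$i] == ' ')
    {
      return implode('', array_slice($all, 0, $i)) . ' ...';
    }
  }
  // No space found: the first word alone is longer than the limit.
  return '';
}

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate_once($str, 12), "\n"; // ąeęąčėę ...
```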