Working with UTF-8 chars

Posted: Wed Oct 12, 2005 11:12 am
by Ree
How do I use string manipulation functions with UTF-8 characters? Here's a little function I wrote which truncates the passed string to the given length without chopping off part of the last word:

Code: Select all

function truncate($str, $chars)
{
  $str = substr($str, 0, $chars + 1);
  $length = strlen($str);
  for ($i = $length - 1; $i > 0; $i--)
  {
    if (substr($str, $i, 1) == ' ')
    {
      $check = $i;
      break;
    }
  }
  if (isset($check))
  {
    $str = substr($str, 0, $check + 1) . '...';
  } else
  {
    $str = '';
  }  
  return $str;
}

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate($str, 40);
I need to get this:

Code: Select all

ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą ...
But I get this:

Code: Select all

ąeęąčėę ūųūĮŠĮŠĖ ...
Of course, that's because, as feyd pointed out, it only works with single byte chars (the above works just fine with standard chars). How do you manipulate multibyte character strings then? It's extremely important to me, since ALL of the sites I'm going to develop in the future (and the one I'm doing atm as well) will use UTF-8. Maybe there's some simple solution I do not know.
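
The byte-versus-character mismatch described above can be seen in a short sketch. The variable names here are only illustrative, and the character count assumes PCRE was built with UTF-8 support:

```php
<?php
// "ąčę" written as raw UTF-8 bytes so the example does not depend on the file encoding.
$word = "\xC4\x85\xC4\x8D\xC4\x99";

// strlen() counts bytes: each of these letters takes two bytes in UTF-8.
$byteCount = strlen($word);

// With the "u" modifier, "." matches one whole UTF-8 character,
// so the number of matches is the number of characters.
$charCount = preg_match_all('/./su', $word, $matches);

echo $byteCount, " bytes, ", $charCount, " characters\n";
```

This is why substr($str, 0, 41) stops roughly halfway through the Lithuanian sample: most of its letters take two bytes each.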

Posted: Wed Oct 12, 2005 12:26 pm
by onion2k
http://uk.php.net/manual/en/ref.mbstring.php

Or.. if it's available, use PHP5.
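
For reference, a minimal sketch of the truncation written against mbstring, assuming the extension is loaded. The name truncate_mb is hypothetical, and like the original function it returns an empty string when no space is found:

```php
<?php
// Assumption: the mbstring extension is available on this host.
mb_internal_encoding('UTF-8');

// Hypothetical helper: cut $str to at most $chars characters on a word boundary.
function truncate_mb($str, $chars)
{
    if (mb_strlen($str) <= $chars) {
        return $str; // already short enough
    }
    // Take one extra character so a space sitting right at the limit counts as a boundary.
    $head = mb_substr($str, 0, $chars + 1);
    $pos = mb_strrpos($head, ' ');
    if ($pos === false) {
        return ''; // no space found, same result as the original function
    }
    return mb_substr($head, 0, $pos) . ' ...';
}

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate_mb($str, 40), "\n";
```

All positions and lengths here are counted in characters, not bytes, which is the whole difference from the substr()/strlen() version.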

Posted: Wed Oct 12, 2005 1:49 pm
by Ree
Yes, I have found that as well, but there's one BUT.
"mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module with the configure option." (PHP manual)
Does that mean I may have problems with shared hosts? Or do they usually have it enabled?

Posted: Wed Oct 12, 2005 1:54 pm
by feyd
you may have an issue with hosts yes.

Posted: Wed Oct 12, 2005 2:03 pm
by Ree
I wonder why it isn't enabled by default. To make my life more difficult?? :roll:

Posted: Wed Oct 12, 2005 2:05 pm
by feyd
the majority of sites out there do not deal with languages outside their character set frequently enough, I would imagine...

Posted: Wed Oct 12, 2005 2:20 pm
by Ree
Not all languages are EN... In my country, almost all websites are multilingual - the same content is usually presented in Lithuanian, Russian and English. So if I can truncate EN news items, I need to be able to do the same with RU and LT ones.

Is there another way you could suggest to make the function I posted work with language-specific chars? Maybe storing strings in the db as htmlentities?

Posted: Wed Oct 12, 2005 2:41 pm
by feyd
it's not all that hard to create your own utf-8 parser..

Posted: Wed Oct 12, 2005 2:49 pm
by onion2k
Ree wrote:Not all languages are EN... In my country, almost all websites are multilingual - the same content is usually presented in Lithuanian, Russian and English. So if I can truncate EN news items, I need to be able to do the same with RU and LT ones.
If that's the case then it's quite likely your hosting company will have enabled mbstring. The best way to find out is to try some of the functions and see if you get an error or not.
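
One way to run that check without triggering a fatal error is to ask PHP whether the extension is loaded. A small sketch; the helper name has_mbstring is made up for illustration:

```php
<?php
// Hypothetical helper: true when the mbstring functions can be called on this host.
function has_mbstring()
{
    return extension_loaded('mbstring') && function_exists('mb_substr');
}

echo has_mbstring()
    ? "mbstring is available\n"
    : "mbstring is missing, use a fallback\n";
```

A script can branch on this result and fall back to a regex-based or hand-written routine on hosts where mbstring is not compiled in.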

Posted: Wed Oct 12, 2005 3:02 pm
by Ree
I very much need to make it work on all standard hosts. Local hosts in my country aren't cheap, so they usually host sites of bigger companies. The cheap hosts over here are mere resellers (the physical host is in US usually).
feyd wrote:it's not all that hard to create your own utf-8 parser..
I have no idea how to do this, really :lol:

Posted: Wed Oct 12, 2005 3:12 pm
by feyd
well.. there's this: viewtopic.php?t=36549 which, although not exactly what you need, has references to the texts to read about UTF8 encodings among other details.. You could also reverse its logic creating a UTF8 to HTML entity conversion..
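
As a rough idea of what a hand-written UTF-8 parser can look like, here is a sketch that splits a string into characters by looking at the lead byte of each sequence. It is not taken from the linked topic; the name utf8_chars is hypothetical, and the code assumes well-formed input (it does not validate continuation bytes):

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters
// using only byte-oriented functions (no mbstring, no PCRE).
function utf8_chars($str)
{
    $chars = array();
    $len = strlen($str); // length in bytes
    $i = 0;
    while ($i < $len) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            $size = 1;            // 0xxxxxxx: plain ASCII
        } elseif (($byte & 0xE0) === 0xC0) {
            $size = 2;            // 110xxxxx: start of a 2-byte sequence
        } elseif (($byte & 0xF0) === 0xE0) {
            $size = 3;            // 1110xxxx: start of a 3-byte sequence
        } elseif (($byte & 0xF8) === 0xF0) {
            $size = 4;            // 11110xxx: start of a 4-byte sequence
        } else {
            $size = 1;            // unexpected byte: pass it through on its own
        }
        $chars[] = substr($str, $i, $size);
        $i += $size;
    }
    return $chars;
}

// "ąeę" as raw bytes: two 2-byte letters around an ASCII "e".
echo count(utf8_chars("\xC4\x85e\xC4\x99")), "\n";
```

With the characters in an array, count() gives the length and array_slice() plus implode() gives a substring.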

Posted: Thu Oct 13, 2005 1:06 am
by Ree
Here's a working version of a few UTF-8 string functions:

Code: Select all

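  // Returns $length characters (not bytes) of a UTF-8 string, starting at character $start.
  function utf8_substr($str, $start, $length = null)
  {
    // The "u" modifier makes "." match one whole UTF-8 character.
    preg_match_all('/./su', $str, $chars);
    // Compare against null: empty() would also treat a length of 0 as "no length given".
    if ($length === null)
    {
      $chars[0] = array_slice($chars[0], $start);
    } else
    {
      $chars[0] = array_slice($chars[0], $start, $length);
    }
    $str = implode('', $chars[0]);
    return $str;
  }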

  function utf8_strlen($str)
  {
    preg_match_all('/./su', $str, $chars);
    return count($chars[0]);
  }

  function truncate($str, $chars)
  {
    $length = utf8_strlen($str);
    if ($chars >= $length)
    {
      return $str;
    }
    $str = utf8_substr($str, 0, $chars + 1);
    if (utf8_substr($str, $chars, 1) == ' ')
    {
      return utf8_substr($str, 0, $chars) . ' ...';
    }
    $str = utf8_substr($str, 0, $chars);
    for ($i = $chars - 1; $i >= 0; $i--)
    {
      if (utf8_substr($str, $i, 1) == ' ')
      {
        $mark = $i;
        break;
      }
    }
    if (isset($mark))
    {
      $str = utf8_substr($str, 0, $mark) . ' ...';
    } else
    {
      $str = '';
    }
    return $str;
  }

Code: Select all

$str = 'ąeęąčėę ūųūĮŠĮŠĖ įšęįąę ąčęę ąčę ąą ąąą šėš';
echo truncate($str, 12);

Outputs:
ąeęąčėę ...