Reading characters in a character set *without* mb support

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Ok, so the new version of Swift Mailer is in development and one of the recurring bugs in the current version was this issue of QP encoding generating gibberish on occasion. The problem was that QP encoding works on a character-for-character basis and NOT a byte-for-byte basis, so for UTF-8 and other multibyte character sets things go a bit wacky.

I can use mb_substr() to read one character at a time, but I for one do not have mb_* compiled into my PHP installation so I need a simple fallback.

If mb_* is installed, I'll use that, otherwise I need my own implementation of just one tiny little thing: the ability to scan a string byte-for-byte until I have one complete character.
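
A minimal sketch of that "use mb_* if present, otherwise scan bytes" decision, assuming the input is well-formed UTF-8. The helper name `nextChar()` is hypothetical and not part of Swift Mailer; the fallback simply collects the lead byte plus any continuation bytes that follow it.

```php
<?php
// Hypothetical helper: returns the first complete UTF-8 character of $string.
// Assumes $string is well-formed UTF-8; no validation is attempted here.
function nextChar($string)
{
    if ($string === '') {
        return '';
    }
    if (function_exists('mb_substr')) {
        // mbstring is available: let it find the character boundary.
        return mb_substr($string, 0, 1, 'UTF-8');
    }
    // Fallback: take the lead byte, then every following continuation
    // byte (binary 10xxxxxx, i.e. 0x80-0xBF).
    $length = 1;
    $total = strlen($string);
    while ($length < $total && (ord($string[$length]) & 0xC0) === 0x80) {
        ++$length;
    }
    return substr($string, 0, $length);
}
```

Both branches should give the same result on valid input. The fallback only works for UTF-8, which is why a per-charset strategy like the one proposed below is still needed for other encodings.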

I need some character set guru to tell me if this will work however (read the comments above the method name):

Code: Select all

/**
 * Analyzes characters for a specific character set.
 * @package Swift
 * @subpackage Encoder
 * @author Chris Corbyn
 */
interface Swift_CharacterSetValidator
{

  /**
   * Returns an integer which specifies how many more bytes to read.
   * A positive integer indicates the number of more bytes to fetch before invoking
   * this method again.
   * A value of zero means this is already a valid character.
   * A value of -1 means this cannot possibly be a valid character.
   * @param string $partialCharacter
   * @return int
   */
  public function validateCharacter($partialCharacter);
  
}
So I'd have Utf8Validator, Utf16Validator, UsAsciiValidator etc etc and load the one which fits the current character set.

If I have a string and I want to split it into an array of characters I'd use this sort of algorithm ($string is a simple stream-like wrapper to make this example simpler).

Code: Select all

$chars = array();
$currentChar = '';
$byteCount = 1;
$pos = 0;
while (0 != strlen($string)) {
  
  //Shift $byteCount bytes off the start of the string
  for ($i =0; $i < $byteCount; $i++) {
    $currentChar .= substr($string, 0, 1);
    $string = substr($string, 1);
  }
  
  //See what validator says for number of more bytes to fetch
  $byteCount = $validator->validateCharacter($currentChar);
  if (-1 == $byteCount) {
    //Error: bail out, otherwise the loop would never advance
    throw new Exception('Invalid character data');
  } elseif (0 == $byteCount) {
    //This is a valid character in this charset
    $chars[] = $currentChar;
    $currentChar = '';
    $byteCount = 1;
  }
  
}

var_dump($chars);
In the case of UTF-8 the validator would repeatedly return 0, 1, 2 or 3 (or -1 if the string is corrupt) since UTF-8 characters contain 1 to 4 bytes.

Can anyone offer a faster algorithm than this? Can anyone pick holes in the viability of using this approach?

I'm no character set expert so I'm all ears :)

NOTE: All I need to be able to do is read a string (or file stream) one character at a time, if I can do that, everything else will fall into place.

EDIT | I'm away at a music festival until Monday so if there's a lack of response before then that's why ;)
Last edited by Chris Corbyn on Wed Dec 12, 2007 7:10 pm, edited 1 time in total.

Post by Chris Corbyn »

Here's the most basic one: US-ASCII.

Code: Select all

<?php

/*
 Analyzes US-ASCII characters.
 
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
 
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 */

require_once dirname(__FILE__) . '/../CharacterSetValidator.php';


/**
 * Analyzes US-ASCII characters.
 * @package Swift
 * @subpackage Encoder
 * @author Chris Corbyn
 */
class Swift_CharacterSetValidator_UsAsciiValidator
  implements Swift_CharacterSetValidator
{

  /**
   * Returns an integer which specifies how many more bytes to read.
   * A positive integer indicates the number of more bytes to fetch before invoking
   * this method again.
   * A value of zero means this is already a valid character.
   * A value of -1 means this cannot possibly be a valid character.
   * @param string $partialCharacter
   * @return int
   */
  public function validateCharacter($partialCharacter)
  {
    $bytes = unpack('C*', $partialCharacter);
    if (1 == count($bytes) && $bytes[1] >= 0 && $bytes[1] < 128)
    {
      return 0;
    }
    else
    {
      return -1;
    }
  }
  
}

Code: Select all

<?php

require_once 'Swift/CharacterSetValidator/UsAsciiValidator.php';

class Swift_CharacterSetValidator_UsAsciiValidatorTest
  extends UnitTestCase
{
  
  /*
  
  for ($c = '', $size = 1; false !== $bytes = $os->read($size); )
  {
    $c .= $bytes;
    $size = $v->validateCharacter($c);
    if (-1 == $size)
    {
      throw new Exception( ... invalid char .. );
    }
    elseif (0 == $size)
    {
      return $c; //next character in $os
    }
  }
  
  */
  
  private $_validator;
  
  public function setUp()
  {
    $this->_validator = new Swift_CharacterSetValidator_UsAsciiValidator();
  }
  
  public function testAllValidAsciiCharactersReturnZero()
  {
    for ($ordinal = 0; $ordinal < 128; ++$ordinal)
    {
      $char = pack('C', $ordinal);
      $this->assertIdentical(0, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testMultipleBytesAreInvalid()
  {
    for ($ordinal = 0; $ordinal < 128; $ordinal += 2)
    {
      $char = pack('C', $ordinal) . pack('C', $ordinal + 1);
      $this->assertIdentical(-1, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testBytesAboveAsciiRangeAreInvalid()
  {
    for ($ordinal = 128; $ordinal <= 255; ++$ordinal)
    {
      $char = pack('C', $ordinal);
      $this->assertIdentical(-1, $this->_validator->validateCharacter($char));
    }
  }
  
}
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Ogle this code a little. It will require a little adapting, but the byte value comparisons will lead you in the right direction. From the first byte you consume, you should be able to determine how long the character is.

It doesn't look like the code, as it stands, works, because $byteCount is always 1.
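
To make the "length from the first byte" idea concrete, here is a rough sketch. The function names are hypothetical, and the lead-byte ranges are the ones permitted by RFC 3629 (at most four bytes per character), which is an assumption about how strict the scanner should be. It only inspects lead bytes and does not verify that the bytes which follow are valid continuation bytes.

```php
<?php
// Hypothetical helper: number of bytes in a UTF-8 character, judged
// from its first byte alone. Ranges follow RFC 3629 (max four bytes).
// Returns -1 for bytes that can never start a character.
function utf8LengthFromLeadByte($byte)
{
    $b = ord($byte);
    if ($b <= 0x7F) {
        return 1; // 0xxxxxxx: plain ASCII
    } elseif ($b >= 0xC2 && $b <= 0xDF) {
        return 2; // 110xxxxx
    } elseif ($b >= 0xE0 && $b <= 0xEF) {
        return 3; // 1110xxxx
    } elseif ($b >= 0xF0 && $b <= 0xF4) {
        return 4; // 11110xxx
    }
    return -1; // continuation byte (0x80-0xBF) or invalid lead byte
}

// Split a string into characters by jumping from lead byte to lead byte.
function utf8Split($string)
{
    $chars = array();
    $pos = 0;
    $total = strlen($string);
    while ($pos < $total) {
        $length = utf8LengthFromLeadByte($string[$pos]);
        if ($length === -1 || $pos + $length > $total) {
            throw new Exception('Invalid UTF-8 at byte offset ' . $pos);
        }
        $chars[] = substr($string, $pos, $length);
        $pos += $length;
    }
    return $chars;
}
```

Because the length is known after one byte, the scanner never needs to ask the validator twice for the same character.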

Post by Chris Corbyn »

Ambush Commander wrote:Ogle this code a little. It will require a little adapting, but the byte value comparisons will lead you in the right direction. From the first byte you consume, you should be able to determine how long the character is.

It doesn't look like the code, as it stands, works, because $byteCount is always 1.
Thanks dude. I'll have a look at this later. I'm aware this is "one of those areas" developers are pretty ignorant about until they realise how crazily significant it is to have an in-depth understanding of it if you plan on doing operations on text.

I'll be sure to ask lots of questions whilst I get this implemented :)

Thanks to Kieran for the email too!

Post by Chris Corbyn »

Ok, now I'm feeling a lot better and a lot more clued up. So much so that I'm fairly sure Wikipedia is displaying incorrect information:

http://en.wikipedia.org/wiki/UTF-8
Width by first byte:

Code: Select all

Binary             Hexadecimal  Decimal  Width
00000000-01111111  00-7F        0-127    1 byte
11000010-11011111  C2-DF        194-223  2 bytes
11100000-11101111  E0-EF        224-239  3 bytes
11110000-11110100  F0-F4        240-244  4 bytes
However, I could swear I arrive at these ranges:

Code: Select all

00 - 7F = 1 byte
C0 - DF = 2 bytes
E0 - EF = 3 bytes
F0 - F7 = 4 bytes
F8 - FB = 5 bytes
FC - FD = 6 bytes
Can someone say which is right and which is wrong? :(
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

PHP encodes UTF-8 with up to 4 bytes, which agrees with Wikipedia. Where did you get the 5 byte range from?

The package Ambush Commander linked to is the standard solution in PHP as far as I know. You take a string and convert it to a UCS-4 array of 32-bit code points (i.e. all code points are made equal length, which simplifies a lot), and once you have the array the rest is gravy. The same technique, with variations, is common to any PHP library that has to parse UTF-8 encoded strings (e.g. encoding internationalised domain names to ASCII Punycode). Rather than going byte by byte, you just go from one UCS-4 array element to the next. Converting back and forth is simple enough.
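
As a rough illustration of the "convert to an array of code points" approach, the sketch below decodes a UTF-8 string into integers. The function name `utf8ToCodePoints()` is made up for this example and is not taken from the package mentioned above; it assumes well-formed input and skips the stricter checks a real library would make.

```php
<?php
// Hypothetical helper: decode a UTF-8 string into an array of integer
// code points (a "UCS-4 array"). Assumes well-formed input; a real
// library would also reject overlong forms and stray continuation bytes.
function utf8ToCodePoints($string)
{
    $codePoints = array();
    $total = strlen($string);
    for ($i = 0; $i < $total; ) {
        $b = ord($string[$i]);
        if ($b < 0x80) {
            $cp = $b;        $extra = 0; // 0xxxxxxx
        } elseif (($b & 0xE0) === 0xC0) {
            $cp = $b & 0x1F; $extra = 1; // 110xxxxx
        } elseif (($b & 0xF0) === 0xE0) {
            $cp = $b & 0x0F; $extra = 2; // 1110xxxx
        } elseif (($b & 0xF8) === 0xF0) {
            $cp = $b & 0x07; $extra = 3; // 11110xxx
        } else {
            throw new Exception('Invalid lead byte at offset ' . $i);
        }
        if ($i + $extra >= $total) {
            throw new Exception('Truncated character at offset ' . $i);
        }
        // Each continuation byte contributes its low six bits.
        for ($j = 1; $j <= $extra; ++$j) {
            $cp = ($cp << 6) | (ord($string[$i + $j]) & 0x3F);
        }
        $codePoints[] = $cp;
        $i += $extra + 1;
    }
    return $codePoints;
}
```

Once the string is an array like this, "one character" is simply one array element, whatever its original byte length.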

Not sure if you need it for Swift Mailer, but there are also UTF-7 variations on the same theme which convert to/from UCS-4 arrays. I'm not an expert on the mailing topic, but is that useful for SMTP in some way?

Post by Ambush Commander »

It is theoretically possible to go further than four bytes, if one follows the pattern that UTF-8 specifies. However RFC 3629 restricted UTF-8 so that the maximum size was four bytes.

Chris Corbyn: No, Wikipedia is correct. C0 and C1 would start two-byte sequences, but the code points those sequences can represent are all below 128, so they are "overlong" (i.e. using two bytes when one would do).
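
The arithmetic behind the "overlong" point can be checked with a few lines. This is only a sketch, and `decodeTwoByteSequence()` is a hypothetical name: a two-byte sequence carries five payload bits in the lead byte and six in the continuation byte.

```php
<?php
// Decode a two-byte UTF-8 sequence into its code point:
// 110xxxxx 10yyyyyy  ->  xxxxxyyyyyy
function decodeTwoByteSequence($lead, $continuation)
{
    return (($lead & 0x1F) << 6) | ($continuation & 0x3F);
}

// Largest code point reachable from lead bytes C0 and C1
// (continuation byte at its maximum, 0xBF).
$maxForC0 = decodeTwoByteSequence(0xC0, 0xBF);
$maxForC1 = decodeTwoByteSequence(0xC1, 0xBF);
// Smallest code point reachable from lead byte C2.
$minForC2 = decodeTwoByteSequence(0xC2, 0x80);

printf("C0 max: 0x%02X\n", $maxForC0);
printf("C1 max: 0x%02X\n", $maxForC1);
printf("C2 min: 0x%02X\n", $minForC2);
```

Everything reachable from C0 or C1 fits in the single-byte range 00-7F, while C2 is the first lead byte that reaches 0x80, which is why the two-byte range starts at C2.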

Post by Chris Corbyn »

Ah I see.

Just thinking about my class naming. "Validator" isn't the right word since I'm not actually looking to approve or reject a string, I'm just looking for hints on where the next character starts. I should really have called it a CharacterSetScanner or something.

Either way (before making any adjustments due to above 2 posts) this is kinda what I came up with for UTF-8 (needs more work now though). I'm curious how easy it'll be to create one for Thai-874, ISO-8859-* and most other common charsets though. Fixed width ones should be easy.

Code: Select all

<?php

/*
 Analyzes UTF-8 characters.
 
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
 
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 */

require_once dirname(__FILE__) . '/../CharacterSetValidator.php';


/**
 * Analyzes UTF-8 characters.
 * @package Swift
 * @subpackage Encoder
 * @author Chris Corbyn
 */
class Swift_CharacterSetValidator_Utf8Validator
  implements Swift_CharacterSetValidator
{
  
  /**
   * Returns an integer which specifies how many more bytes to read.
   * A positive integer indicates the number of more bytes to fetch before invoking
   * this method again.
   * A value of zero means this is already a valid character.
   * A value of -1 means this cannot possibly be a valid character.
   * @param string $partialCharacter
   * @return int
   */
  public function validateCharacter($partialCharacter)
  {
    $bytes = array_values(unpack('C*', $partialCharacter));
    
    $b = $bytes[0];
    
    if ($b >= 0x00 && $b <= 0x7F)
    {
      $expected = 1;
    }
    elseif ($b >= 0xC0 && $b <= 0xDF)
    {
      $expected = 2;
    }
    elseif ($b >= 0xE0 && $b <= 0xEF)
    {
      $expected = 3;
    }
    elseif ($b >= 0xF0 && $b <= 0xF7)
    {
      $expected = 4;
    }
    elseif ($b >= 0xF8 && $b <= 0xFB)
    {
      $expected = 5;
    }
    elseif ($b >= 0xFC && $b <= 0xFD)
    {
      $expected = 6;
    }
    else
    {
      $expected = 0;
    }
    
    $needed = $expected - count($bytes);
    if ($needed < 0)
    {
      $needed = -1;
    }
    
    return $needed;
  }
  
}

Code: Select all

<?php

require_once 'Swift/CharacterSetValidator/Utf8Validator.php';

class Swift_CharacterSetValidator_Utf8ValidatorTest
  extends UnitTestCase
{
  
  private $_validator;
  
  public function setUp()
  {
    $this->_validator = new Swift_CharacterSetValidator_Utf8Validator();
  }
  
  public function testLeading7BitOctetCausesReturnZero()
  { 
    for ($ordinal = 0x00; $ordinal <= 0x7F; ++$ordinal)
    {
      $char = pack('C', $ordinal);
      $this->assertIdentical(0, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testLeadingByteOf2OctetCharCausesReturn1()
  {
    for ($octet = 0xC0; $octet <= 0xDF; ++$octet)
    {
      $char = pack('C', $octet);
      $this->assertIdentical(1, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testLeadingByteOf3OctetCharCausesReturn2()
  {
    for ($octet = 0xE0; $octet <= 0xEF; ++$octet)
    {
      $char = pack('C', $octet);
      $this->assertIdentical(2, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testLeadingByteOf4OctetCharCausesReturn3()
  {
    for ($octet = 0xF0; $octet <= 0xF7; ++$octet)
    {
      $char = pack('C', $octet);
      $this->assertIdentical(3, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testLeadingByteOf5OctetCharCausesReturn4()
  {
    for ($octet = 0xF8; $octet <= 0xFB; ++$octet)
    {
      $char = pack('C', $octet);
      $this->assertIdentical(4, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testLeadingByteOf6OctetCharCausesReturn5()
  {
    for ($octet = 0xFC; $octet <= 0xFD; ++$octet)
    {
      $char = pack('C', $octet);
      $this->assertIdentical(5, $this->_validator->validateCharacter($char));
    }
  }
  
  public function testOctetsFEandFFAreInvalid()
  {
    $char = pack('C', 0xFE);
    $this->assertIdentical(-1, $this->_validator->validateCharacter($char));
    
    $char = pack('C', 0xFF);
    $this->assertIdentical(-1, $this->_validator->validateCharacter($char));
  }
  
}
It seemed to work fine for the UTF-8 lipsum (looked like Polish or something) I tried it on when used with my CharacterStream class. Take a look at the write() method for an algorithm example.

Code: Select all

<?php

/*
 CharacterStream implementation using an array in Swift Mailer.
 
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
 
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 */

require_once dirname(__FILE__) . '/../CharacterStream.php';
require_once dirname(__FILE__) . '/../ByteStream.php';


/**
 * A CharacterStream implementation which stores characters in an internal array.
 * @package Swift
 * @subpackage CharacterStream
 * @author Chris Corbyn
 */
class Swift_CharacterStream_ArrayCharacterStream
  implements Swift_CharacterStream
{

  /**
   * The validator (lazy-loaded) for the current charset.
   * @var Swift_CharacterSetValidator
   * @access private
   */
  private $_charsetValidator;
  
  /**
   * A factory for creating CharacterSetValidator instances.
   * @var Swift_CharacterSetValidatorFactory
   * @access private
   */
  private $_charsetValidatorFactory;
  
  /**
   * The character set this stream is using.
   * @var string
   * @access private
   */
  private $_charset;
  
  /**
   * Array of characters.
   * @var string[]
   * @access private
   */
  private $_array = array();
  
  /**
   * The current character offset in the stream.
   * @var int
   * @access private
   */
  private $_offset = 0;
  
  /**
   * Create a new CharacterStream with the given $chars, if set.
   * @param mixed $chars as string or array
   * @param string $charset used in the stream
   * @param Swift_CharacterSetValidatorFactory $factory for loading validators
   */
  public function __construct($chars = null, $charset = null,
    Swift_CharacterSetValidatorFactory $factory = null)
  {
    if (!is_null($charset))
    {
      $this->setCharacterSet($charset);
    }
    
    if (!is_null($factory))
    {
      $this->setCharacterSetValidatorFactory($factory);
    }
    
    if (is_array($chars))
    {
      $this->_array = $chars;
    }
    elseif (is_string($chars))
    {
      $this->importString($chars);
    }
  }
  
  /**
   * Set the character set used in this CharacterStream.
   * @param string $charset
   */
  public function setCharacterSet($charset)
  {
    $this->_charset = $charset;
  }
  
  /**
   * Set the CharacterSetValidatorFactory for multi charset support.
   * @param Swift_CharacterSetValidatorFactory $factory
   */
  public function setCharacterSetValidatorFactory(
    Swift_CharacterSetValidatorFactory $factory)
  {
    $this->_charsetValidatorFactory = $factory;
  }
  
  /**
   * Overwrite this character stream using the byte sequence in the byte stream.
   * @param Swift_ByteStream $os output stream to read from
   */
  public function importByteStream(Swift_ByteStream $os)
  {
    if (!isset($this->_charsetValidator))
    {
      $this->_charsetValidator = $this->_charsetValidatorFactory
        ->getValidatorFor($this->_charset);
    }
    
    $c = ''; $offset = 0; $need = 1;
    
    while (false !== $bytes = $os->read($need))
    {
      $offset += $need;
      $c .= $bytes;
      $need = $this->_charsetValidator->validateCharacter($c);
      if (0 == $need)
      {
        $need = 1;
        $this->_array[] = $c;
        $c = '';
      }
      elseif (-1 == $need)
      {
        throw new Exception(
          'Invalid ' . $this->_charset . ' data at byte offset ' . $offset .
          ' (after ' . count($this->_array) . ' chars).'
          );
      }
    }
  }
  
  /**
   * Import a string of bytes into this CharacterStream, overwriting any existing
   * data in the stream.
   * @param string $string
   */
  public function importString($string)
  {
    $this->flushContents();
    $this->write($string);
  }
  
  /**
   * Read $length characters from the stream and move the internal pointer
   * $length further into the stream.
   * @param int $length
   * @return string|false the characters read, or false at the end of the stream
   */
  public function read($length)
  {
    if ($this->_offset == count($this->_array))
    {
      return false;
    }
    
    $ret = array_slice($this->_array, $this->_offset, $length);
    $this->_offset += count($ret);
    return implode('', $ret);
  }
  
  /**
   * Write $chars to the end of the stream.
   * @param string $chars
   */
  public function write($chars)
  {
    if (!isset($this->_charsetValidator))
    {
      $this->_charsetValidator = $this->_charsetValidatorFactory
        ->getValidatorFor($this->_charset);
    }
    
    $c = ''; $offset = 0; $need = 1;
    
    while (strlen($chars) > 0)
    {
      $offset += $need;
      $c .= substr($chars, 0, $need);
      $chars = substr($chars, $need);
      $need = $this->_charsetValidator->validateCharacter($c);
      if (0 == $need)
      {
        $need = 1;
        $this->_array[] = $c;
        $c = '';
      }
      elseif (-1 == $need)
      {
        throw new Exception(
          'Invalid ' . $this->_charset . ' data at byte offset ' . $offset .
          ' (after ' . count($this->_array) . ' chars).'
          );
      }
    }
  }
  
  /**
   * Move the internal pointer to $charOffset in the stream.
   * @param int $charOffset
   */
  public function setPointer($charOffset)
  {
    if ($charOffset > count($this->_array))
    {
      $charOffset = count($this->_array);
    }
    elseif ($charOffset < 0)
    {
      $charOffset = 0;
    }
    $this->_offset = $charOffset;
  }
  
  /**
   * Empty the stream and reset the internal pointer.
   */
  public function flushContents()
  {
    $this->_offset = 0;
    $this->_array = array();
  }
  
}
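
On the earlier remark that fixed-width character sets should be easy: a minimal sketch of what such a validator might look like. `FixedWidthValidator` is a hypothetical class name, not something from Swift Mailer, and the interface is repeated here only so the example runs on its own. It checks byte counts only and makes no attempt to reject byte values that are unassigned in a given charset.

```php
<?php
// Interface as posted earlier in the thread, repeated so the sketch runs alone.
interface Swift_CharacterSetValidator
{
    public function validateCharacter($partialCharacter);
}

// Hypothetical class: handles any charset in which every character
// occupies exactly $width bytes.
class FixedWidthValidator implements Swift_CharacterSetValidator
{
    private $_width;

    public function __construct($width)
    {
        $this->_width = $width;
    }

    // Same contract as the interface: bytes still needed, 0 when the
    // character is complete, -1 when too many bytes were supplied.
    public function validateCharacter($partialCharacter)
    {
        $needed = $this->_width - strlen($partialCharacter);
        return ($needed < 0) ? -1 : $needed;
    }
}

$latin1 = new FixedWidthValidator(1); // e.g. ISO-8859-1
$ucs4 = new FixedWidthValidator(4);   // e.g. UCS-4
```

Note that UTF-16 would not fit this pattern, since characters outside the Basic Multilingual Plane use surrogate pairs and take four bytes instead of two.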