Reading characters in a character set *without* mb support
Posted: Wed Dec 12, 2007 7:02 pm
Ok, so the new version of Swift Mailer is in development and one of the recurring bugs in the current version was this issue of QP encoding generating gibberish on occasion. The problem was that QP encoding works on a character-for-character basis and NOT a byte-for-byte basis, so for UTF-8 and other multibyte character sets things go a bit wacky.
I can use mb_substr() to read one character at a time, but I for one do not have mb_* compiled into my PHP installation so I need a simple fallback.
If mb_* is installed, I'll use that, otherwise I need my own implementation of just one tiny little thing: the ability to scan a string byte-for-byte until I have one complete character.
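A minimal sketch of that "use mb_* if it is there, otherwise fall back" check. The function name splitCharacters() and the $fallback parameter are made up for illustration; only function_exists(), mb_strlen() and mb_substr() are real PHP calls. The str_split fallback used in the example is a placeholder that is only correct for single-byte character sets.

```php
<?php
// Hypothetical helper (not part of Swift Mailer): split a string into
// characters with mb_* when the extension is loaded, otherwise hand the
// job to whatever fallback callable the caller supplies.
function splitCharacters($string, $charset, $fallback)
{
    if (function_exists('mb_substr') && function_exists('mb_strlen')) {
        $chars = array();
        $length = mb_strlen($string, $charset);
        for ($i = 0; $i < $length; $i++) {
            // One character (not one byte) starting at character offset $i
            $chars[] = mb_substr($string, $i, 1, $charset);
        }
        return $chars;
    }
    // No mbstring: defer to the hand-rolled implementation
    return call_user_func($fallback, $string);
}

// str_split is only a stand-in fallback here (valid for single-byte data)
var_dump(splitCharacters('abc', 'UTF-8', 'str_split'));
```

The interesting part is what goes in place of the placeholder fallback, which is what the rest of this post is about.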
I need some character set guru to tell me whether this will work, though (read the comments above the method in the Swift_CharacterSetValidator interface listed further down).
So I'd have Utf8Validator, Utf16Validator, UsAsciiValidator etc etc and load the one which fits the current character set.
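As a rough illustration of "load the one which fits", here is a sketch with one trivial validator and a lookup by character set name. The interface is repeated from the post so the block runs on its own; the body of UsAsciiValidator, the loadValidator() function and the shape of the $map array are assumptions, not existing Swift Mailer code.

```php
<?php
// Interface repeated from the post so this sketch is self-contained.
interface Swift_CharacterSetValidator
{
    public function validateCharacter($partialCharacter);
}

// Hypothetical US-ASCII validator: every character is one byte below 0x80.
class UsAsciiValidator implements Swift_CharacterSetValidator
{
    public function validateCharacter($partialCharacter)
    {
        $length = strlen($partialCharacter);
        if (0 == $length) {
            return 1; // nothing read yet, ask for one byte
        }
        if ($length > 1) {
            return -1; // ASCII characters are never longer than one byte
        }
        return (ord($partialCharacter) < 0x80) ? 0 : -1;
    }
}

// Hypothetical loader: maps a lower-cased charset name to a class name.
function loadValidator($charset, array $map)
{
    $key = strtolower($charset);
    if (!isset($map[$key])) {
        throw new InvalidArgumentException('No validator for charset ' . $charset);
    }
    $class = $map[$key];
    return new $class();
}

$validator = loadValidator('US-ASCII', array('us-ascii' => 'UsAsciiValidator'));
var_dump($validator->validateCharacter('a'));
```

Keeping the map outside the loader means adding a character set is just one more array entry plus one more class.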
If I have a string and I want to split it into an array of characters I'd use this sort of algorithm, shown in the second listing below ($string stands in for a simple stream-like wrapper to keep the example short).
In the case of UTF-8 the validator would repeatedly return 0, 1, 2 or 3 (or -1 if the string is corrupt) since UTF-8 characters contain between 1 and 4 bytes.
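To make the UTF-8 case concrete, here is a sketch of what such a validator might look like. The class name comes from the post, but the body is an assumption. It only checks the lead byte and the 10xxxxxx pattern of continuation bytes, so it does not catch every malformed sequence (overlong three-byte forms and surrogates still slip through).

```php
<?php
// Interface repeated from the post so this sketch is self-contained.
interface Swift_CharacterSetValidator
{
    public function validateCharacter($partialCharacter);
}

// Hypothetical UTF-8 validator following the contract described in the post.
class Utf8Validator implements Swift_CharacterSetValidator
{
    public function validateCharacter($partialCharacter)
    {
        $length = strlen($partialCharacter);
        if (0 == $length) {
            return 1; // nothing read yet, fetch the lead byte
        }
        $lead = ord($partialCharacter[0]);
        if ($lead < 0x80) {
            $expected = 1; // 0xxxxxxx: single byte (ASCII)
        } elseif ($lead >= 0xC2 && $lead <= 0xDF) {
            $expected = 2; // 110xxxxx: two-byte character
        } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
            $expected = 3; // 1110xxxx: three-byte character
        } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
            $expected = 4; // 11110xxx: four-byte character
        } else {
            return -1; // stray continuation byte or invalid lead byte
        }
        if ($length > $expected) {
            return -1; // more bytes than this lead byte allows
        }
        // Every byte after the lead must look like 10xxxxxx
        for ($i = 1; $i < $length; $i++) {
            if ((ord($partialCharacter[$i]) & 0xC0) != 0x80) {
                return -1;
            }
        }
        return $expected - $length; // bytes still missing, 0 when complete
    }
}

$validator = new Utf8Validator();
var_dump($validator->validateCharacter("\xE2"));         // lead byte of a three-byte character
var_dump($validator->validateCharacter("\xE2\x82\xAC")); // the complete euro sign
```

Because the lead byte already says how long the character is, the validator never needs more than two calls per character: one for the lead byte, one for the rest.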
Can anyone offer a faster algorithm than this? Can anyone pick holes in the viability of using this approach?
I'm no character set expert so I'm all ears
NOTE: All I need to be able to do is read a string (or file stream) one character at a time, if I can do that, everything else will fall into place.
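For the file stream case, a sketch of reading one character at a time might look like this. readCharacter() and TwoByteValidator are hypothetical names; the validator is a deliberately trivial fixed-width stand-in that follows the same validateCharacter() contract. The sketch assumes fread() returns as many bytes as requested unless the stream has ended, which holds for local files and memory streams but not necessarily for network streams.

```php
<?php
// Hypothetical helper: reads one complete character from an open stream.
// Returns false at a clean end of stream.
function readCharacter($handle, $validator)
{
    $char = '';
    $need = 1;
    while ($need > 0) {
        $bytes = fread($handle, $need);
        if (false === $bytes || strlen($bytes) < $need) {
            if ('' === $char && (false === $bytes || '' === $bytes)) {
                return false; // end of stream between two characters
            }
            throw new RuntimeException('Stream ended in the middle of a character');
        }
        $char .= $bytes;
        // Ask the validator how many more bytes this character needs
        $need = $validator->validateCharacter($char);
        if (-1 == $need) {
            throw new RuntimeException('Invalid byte sequence');
        }
    }
    return $char;
}

// Trivial stand-in validator: every character is exactly two bytes wide.
class TwoByteValidator
{
    public function validateCharacter($partialCharacter)
    {
        $length = strlen($partialCharacter);
        return ($length > 2) ? -1 : 2 - $length;
    }
}

$handle = fopen('php://memory', 'w+');
fwrite($handle, "\x00\x41\x00\x42");
rewind($handle);
while (false !== ($char = readCharacter($handle, new TwoByteValidator()))) {
    echo bin2hex($char), "\n"; // prints 0041, then 0042
}
fclose($handle);
```

Reading from the stream this way also avoids chopping bytes off the front of a string over and over, since the stream keeps track of the position itself.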
EDIT | I'm away at a music festival until Monday so if there's a lack of response before then that's why
Code:
/**
 * Analyzes characters for a specific character set.
 * @package Swift
 * @subpackage Encoder
 * @author Chris Corbyn
 */
interface Swift_CharacterSetValidator
{
    /**
     * Returns an integer which specifies how many more bytes to read.
     * A positive integer indicates the number of more bytes to fetch before invoking
     * this method again.
     * A value of zero means this is already a valid character.
     * A value of -1 means this cannot possibly be a valid character.
     * @param string $partialCharacter
     * @return int
     */
    public function validateCharacter($partialCharacter);
}
Code:
$chars = array();
$currentChar = '';
$byteCount = 1;
while (0 != strlen($string)) {
    // Shift $byteCount bytes off the start of the string
    $currentChar .= substr($string, 0, $byteCount);
    $string = (string) substr($string, $byteCount);
    // See what validator says for number of more bytes to fetch
    $byteCount = $validator->validateCharacter($currentChar);
    if (-1 == $byteCount) {
        // Error: bail out, otherwise the loop would never advance
        throw new Exception('Invalid byte sequence for this character set');
    } elseif (0 == $byteCount) {
        // This is a valid character in this charset
        $chars[] = $currentChar;
        $currentChar = '';
        $byteCount = 1;
    }
}
if ('' !== $currentChar) {
    // The string ended part-way through a character
    throw new Exception('Incomplete character at end of string');
}
var_dump($chars);