I need a multibyte charset guru :(
Posted: Wed Apr 25, 2007 2:14 pm
by Chris Corbyn
I wouldn't usually do this, but because the initial thread was posted in Swift Mailer's forum I'll just link to it.
viewtopic.php?p=376608#376608
Quick summary of QP encoding:
In a UTF-8 string, the number of bytes per character can vary, but QP encoding simply turns each individual byte that isn't a permitted printable ASCII character into =XX, where XX is the byte's hexadecimal value. Lines cannot exceed 76 characters in QP encoding, and a line that has to be chopped ends with "=" followed by CRLF. An "=" followed by CRLF is simply disregarded in the decoded output, so the string appears as it was.
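To make the summary concrete, here is a minimal sketch that uses only built-in PHP functions. The sample strings are made up for illustration; nothing here is taken from Swift Mailer itself.

```php
<?php
// "é" is two bytes in UTF-8 (0xC3 0xA9); QP encodes each byte on its own.
$char = "\xC3\xA9";
$encoded = '';
for ($i = 0; $i < strlen($char); $i++) {
    $encoded .= sprintf("=%02X", ord($char[$i]));
}
// $encoded now holds "=C3=A9"

// A soft break ("=" followed by CRLF) is dropped again by the decoder.
$decoded = quoted_printable_decode("caf=C3=\r\n=A9");
// $decoded holds the original bytes "caf\xC3\xA9"
```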
Because of the line-length limit I need to break the line and add the "=" to the end (known as a soft break). However, unknown to me until now, you cannot split an individual character across multiple lines. So how can I determine what the whole characters are in a multibyte string?
I cannot use the multibyte (mb_*) functions, by the way, since this has to work on a vanilla PHP installation.
Hopefully someone can shed some light on the structure of UTF-8?
EDIT | Where's Ambush Commander when ya need him?
EDIT | w00t! I found Ambush Commander's article that I remembered, and I have been following links from it. I'm less in the dark than I was, but I still don't know how to tell what is a whole character and what is just one byte of a multibyte character. I'll explain further what I do:
I currently do an extremely basic:
Code: Select all
for ($i = 0; $i < strlen($string); $i++)
{
    $ord = ord($string[$i]);
    //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
If I could change that loop to this I'd be happy:
Code: Select all
while (($char = mb_substr($string, 0, 1)) !== '') //Compare with '' so a "0" character doesn't end the loop
{
    //I have a character, but it could be any number of bytes
    $string = mb_substr($string, 1); //Move along the string
}
So on that note, if someone could help me with an mb_substr equivalent that doesn't require the mb_* functions I'd be very happy.

Posted: Wed Apr 25, 2007 3:08 pm
by Ambush Commander
Mmm... intriguing. I know how one would brute force fix the problem, but I'm trying to think of a more elegant way to fix it.
For every UTF-8 character, you know the exact length of the character from the very first byte (because we're talking about hexadecimal pairs, this would be the first two xdigits). From 00-7F, the character is one byte and you can cut with impunity. C2-DF is two bytes, E0-EF is three bytes and F0-F4 is four bytes. So, the brute force method is to assemble the string byte by byte, ensuring that when you hit a multibyte sequence you have enough characters left to finish up the line, otherwise, you break, and then start over again.
I can't say much more than that because I have never used that encoding before.
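The lead-byte ranges above can be turned into a small mb_*-free helper. This is only a sketch: the function names `utf8CharLength` and `utf8Split` are made up for illustration, and the code assumes well-formed UTF-8 (continuation bytes are not validated).

```php
<?php
/**
 * Hypothetical helper: number of bytes in the UTF-8 character that starts
 * with the given lead byte, using the ranges listed above.
 * Returns 0 for a byte that cannot start a character (e.g. 0x80-0xBF).
 */
function utf8CharLength($leadByte)
{
    $ord = ord($leadByte);
    if ($ord <= 0x7F) return 1;
    if ($ord >= 0xC2 && $ord <= 0xDF) return 2;
    if ($ord >= 0xE0 && $ord <= 0xEF) return 3;
    if ($ord >= 0xF0 && $ord <= 0xF4) return 4;
    return 0;
}

/**
 * Hypothetical helper: split a UTF-8 string into characters without mb_*.
 * Assumes well-formed input; a stray byte is passed through on its own.
 */
function utf8Split($string)
{
    $chars = array();
    $len = strlen($string);
    for ($i = 0; $i < $len; $i += $n) {
        $n = utf8CharLength($string[$i]);
        if ($n == 0) {
            $n = 1; // not a valid lead byte: treat it as a single unit
        }
        $chars[] = substr($string, $i, $n);
    }
    return $chars;
}
```

Each element of the returned array is one whole character, so a line can be broken between any two elements.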
Posted: Wed Apr 25, 2007 3:10 pm
by stereofrog
Well, if you have a UTF-8 string, this
Code: Select all
# untested
$utf= '/[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4].../s';
preg_match_all($utf, $string, $m);
$chars = $m[0];
should split it into individual "characters", 1 to 4 bytes each.
(Hope I understood the question correctly)

Posted: Wed Apr 25, 2007 3:17 pm
by Ambush Commander
stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...
How are you chopping the string?
Posted: Wed Apr 25, 2007 3:18 pm
by Chris Corbyn
Thank you so much both of you!
Ambush completely cleared up how I know what sequences to expect, and stereofrog gave a brilliant example of how to grab characters.
I should be able to work on this over the next couple of days.
/me ponders what impact this will have on other multibyte charsets (maybe I'll do a "function_exists('mb_substr')").
UTF-8 is simple now that I know that about the starting bytes.
Posted: Wed Apr 25, 2007 3:24 pm
by Ambush Commander
Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.
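For reference, the iconv route mentioned above looks roughly like this. It is a sketch that assumes the iconv extension is loaded and that the source charset is already known; the ISO-8859-1 sample string is made up for illustration.

```php
<?php
// Assumption: the iconv extension is available and the input charset is known.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1, where "é" is the single byte 0xE9
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
// In UTF-8 the same "é" becomes the two bytes 0xC3 0xA9
```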
Posted: Wed Apr 25, 2007 3:27 pm
by Chris Corbyn
Ambush Commander wrote:stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...
How are you chopping the string?
With a basic preg_match_all() followed by an implode. I don't check anything other than violations of the QP rules (a line cannot contain an "=" fewer than 3 characters from its end unless it is part of a soft break).
Code: Select all
/**
 * Return the QP encoded version of a string with no breaks
 * @param string The input to encode
 * @param boolean True if the data we're encoding is binary
 * @return string
 */
function rawQPEncode($string, $bin=false)
{
    $ret = "";
    if (!$bin)
    {
        //Normalise all line endings to CRLF
        $string = str_replace(array("\r\n", "\r"), "\n", $string);
        $string = str_replace("\n", "\r\n", $string);
    }
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++)
    {
        $val = ord($string[$i]);
        //9, 32 = HT, SP; 10, 13 = LF, CR; 33-60 & 62-126 are ok
        // 63 is '?' and needs encoding to go in the headers
        if ((!$bin && ($val == 32 || $val == 9 || $val == 10 || $val == 13))
            || ($val >= 33 && $val <= 60)
            || ($val >= 62 && $val <= 126 && $val != 63))
        {
            $ret .= $string[$i];
        }
        else
        {
            $ret .= sprintf("=%02X", $val);
        }
    }
    return $ret;
}
I post-process the string to add the soft breaks for reasons I won't go into (related to MIME and line lengths in headers).
Code: Select all
//I don't like the use of preg.. here. Look at changing this?
preg_match_all('/.{1,'.$line_length.'}([^=]{0,3})?/', $string, $matches);
$result = implode("=" . $le, $matches[0]);
I've never been very happy with the code so I'm glad a bug has been brought to light in a way because I can hopefully write something I feel happy with after knowing what the problems are. I have no issues totally rewriting the QP encoding stuff from scratch.
Posted: Wed Apr 25, 2007 3:30 pm
by Chris Corbyn
Ambush Commander wrote:Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.
We're overlapping here so I'll give it ten mins after posting this, but I cannot change the encodings since users would probably find it too intrusive. People usually choose character encodings for one reason or another specific to their needs (or configuration) so I have to work with what I get. I should maybe start working on a class which tries to identify the encoding and then delegates jobs to appropriate functions, rather than trying to cover everything in one function. Right now, UTF-8 has my full attention.
EDIT | preg_match("/^.{1}/us", $str, $match) finds the first character, even if it's multibyte UTF-8. It won't work if the string isn't valid UTF-8 though, so I'm guessing things like Thai-874 would break it.
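The same /u trick extends from the first character to the whole string. The sketch below assumes PCRE was built with UTF-8 support; the function name `pcreUtf8Split` is made up for illustration.

```php
<?php
/**
 * Hypothetical helper: split a UTF-8 string into characters using PCRE's
 * /u modifier instead of mb_*.
 * Returns false when the input is not valid UTF-8.
 */
function pcreUtf8Split($string)
{
    // With /u, preg_match_all() returns false if the subject is malformed UTF-8
    if (preg_match_all('/./us', $string, $matches) === false) {
        return false;
    }
    return $matches[0];
}
```

A `false` return can double as a cheap "is this really UTF-8?" check before deciding how to handle other charsets.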
Posted: Wed Apr 25, 2007 4:07 pm
by Oren
d11wtq wrote:I currently do an extremely basic:
Code: Select all
for ($i = 0; $i < strlen($string); $i++)
{
    $ord = ord($string[$i]);
    //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
It's a bit off topic but... you don't really do it like this, right? I hope you do the strlen($string) part before the loop and not on each iteration again and again.

Posted: Wed Apr 25, 2007 4:08 pm
by Ambush Commander
He doesn't. Check the full code.
Working on it right now.
Posted: Wed Apr 25, 2007 4:09 pm
by Oren
Dude... that was quick!
Posted: Wed Apr 25, 2007 4:44 pm
by Ambush Commander
I took a whack at it, but we might want to pass this one off to Feyd to be safe.

The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Add in the binary handling code, which makes things a bit messier than they should be, the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.
Posted: Wed Apr 25, 2007 4:49 pm
by Chris Corbyn
Ambush Commander wrote:I took a whack at it, but we might want to pass this one off to Feyd to be safe.

The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Add in the binary handling code, which makes things a bit messier than they should be, the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.
Wow, thanks for trying, you're a legend
I'm actually fine with merging the two methods provided I can specify:
Max length of all lines
Max length of first line *
Binary or not (affects line endings more than anything)
* First line sometimes has something like "Subject: =?utf-8?Q? .... before it so the QP encoded bit is actually shorter.
I'm having a shot at it myself too
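A merged encoder along those lines might look like the sketch below. It is not Swift Mailer's code: the function name and parameters are hypothetical, it only handles UTF-8, and as a simplification it ignores the binary/text distinction by encoding every byte outside 33-126 (plus "=" and "?") as =XX, so spaces and line endings are encoded too.

```php
<?php
/**
 * Hypothetical sketch: QP-encode a UTF-8 string and insert soft breaks
 * only between whole characters.
 *
 * @param string $string      UTF-8 input (assumed well-formed)
 * @param int    $maxLen      Max length of every line, soft-break "=" included
 * @param int    $firstMaxLen Max length of the first line (null = same as $maxLen)
 * @param string $le          Line ending used after the soft-break "="
 */
function qpEncodeUtf8($string, $maxLen = 76, $firstMaxLen = null, $le = "\r\n")
{
    $limit = ($firstMaxLen === null) ? $maxLen : $firstMaxLen;
    $out = '';
    $lineLen = 0;
    $len = strlen($string);
    for ($i = 0; $i < $len; $i += $n) {
        // Sequence length from the lead byte (ranges given earlier in the thread)
        $ord = ord($string[$i]);
        if ($ord >= 0xF0 && $ord <= 0xF4) {
            $n = 4;
        } elseif ($ord >= 0xE0 && $ord <= 0xEF) {
            $n = 3;
        } elseif ($ord >= 0xC2 && $ord <= 0xDF) {
            $n = 2;
        } else {
            $n = 1;
        }

        // Encode every byte of this one character into $chunk
        $chunk = '';
        foreach (str_split(substr($string, $i, $n)) as $byte) {
            $val = ord($byte);
            $safe = ($val >= 33 && $val <= 126 && $val != 61 && $val != 63);
            $chunk .= $safe ? $byte : sprintf("=%02X", $val);
        }

        // Break before the character if it would not fit; one column is
        // reserved for the "=" of the soft break itself
        if ($lineLen > 0 && $lineLen + strlen($chunk) > $limit - 1) {
            $out .= "=" . $le;
            $lineLen = 0;
            $limit = $maxLen; // only the first line uses $firstMaxLen
        }
        $out .= $chunk;
        $lineLen += strlen($chunk);
    }
    return $out;
}
```

Because a character's bytes are encoded as one chunk and the break decision is made per chunk, a soft break can only ever land between two characters.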

Posted: Thu Apr 26, 2007 2:48 pm
by Chris Corbyn
This is obvious, and I don't need to worry about what the charset is. If I know QP encoding turns high-value bytes into =XX, surely I just need to make sure there's no "=[0-9A-F]{2}" at the end of a line when I cut it, because any other sequence at the end of a line will just be an ASCII character.

Obviously I have to make sure I don't chop "=XX" in half itself too.
This task just became so much clearer and much less scary, unless I've completely overlooked something
EDIT | No, it was a good idea, but if you don't have *any* ASCII characters in your string then this won't work. Hmmff
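The failure case from the EDIT is easy to reproduce. The sketch below uses a made-up sample string with no ASCII characters and counts the cut positions that the "never cut right after =XX, never cut inside =XX" rule would still allow.

```php
<?php
// "日本語" in UTF-8: nine bytes, every one of them >= 0x80
$string = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
$encoded = '';
for ($i = 0; $i < strlen($string); $i++) {
    $encoded .= sprintf("=%02X", ord($string[$i]));
}

// Count cut positions that neither split an "=XX" nor directly follow one
$allowed = 0;
for ($pos = 1; $pos < strlen($encoded); $pos++) {
    $head = substr($encoded, 0, $pos);
    $splitsTriplet = preg_match('/=[0-9A-F]?$/', $head);
    $followsTriplet = preg_match('/=[0-9A-F]{2}$/', $head);
    if (!$splitsTriplet && !$followsTriplet) {
        $allowed++;
    }
}
// $allowed stays at 0: the rule leaves nowhere to break this string
```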

Posted: Fri Nov 07, 2008 6:34 am
by 1mdm
By the way, this problem is still burning. And it's not only about QP encoding; Base64 is affected too.
I attempted to hack into Swift_Message_Encoder, but I didn't find a way to fix it with little effort…
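For Base64 the same idea applies one step earlier: group whole characters into chunks first, then encode each chunk separately. The sketch below is not a patch for Swift_Message_Encoder; the function name and the 45-byte default (which gives 60 Base64 characters per chunk) are assumptions, and it only handles UTF-8.

```php
<?php
/**
 * Hypothetical sketch: Base64-encode a UTF-8 string in chunks that never
 * split a character. $maxBytes is the raw (unencoded) byte budget per chunk.
 * Returns an array of Base64 strings, one per chunk.
 */
function base64EncodeWholeChars($string, $maxBytes = 45)
{
    // stereofrog's pattern from earlier in the thread, plus a catch-all
    // so that stray bytes are kept rather than dropped
    $utf = '/[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4]...|./s';
    preg_match_all($utf, $string, $m);

    $chunks = array();
    $current = '';
    foreach ($m[0] as $char) {
        // Close the chunk before adding a character that would overflow it
        if ($current !== '' && strlen($current) + strlen($char) > $maxBytes) {
            $chunks[] = base64_encode($current);
            $current = '';
        }
        $current .= $char;
    }
    if ($current !== '') {
        $chunks[] = base64_encode($current);
    }
    return $chunks;
}
```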