
I need a multibyte charset guru :(

Posted: Wed Apr 25, 2007 2:14 pm
by Chris Corbyn
I wouldn't usually do this, but because the initial thread was posted in Swift Mailer's forum I'll just link to it.

viewtopic.php?p=376608#376608

Quick summary of QP encoding:

In a UTF-8 string, the number of bytes per character can vary, but QP encoding works on bytes: each byte that isn't a permitted printable ASCII character is turned into =XX, where XX is its hexadecimal value. Lines cannot exceed 76 characters in QP encoding; if a line has to be chopped, it ends with "=" followed by CRLF. An "=" followed by CRLF is simply disregarded in the decoded output so that the string appears as it was.
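
To make the soft break concrete, here is a minimal sketch using PHP's built-in quoted_printable_decode(). The sample string and the variable names are only for illustration; the point is that "=" followed by CRLF vanishes on decoding.

```php
<?php
// "caf\xC3\xA9" is "café" in UTF-8; the "é" is the two bytes C3 A9.
$original = "caf\xC3\xA9";

// The same text in QP, with a soft break ("=" + CRLF) placed before the "é".
$encoded = "caf=\r\n=C3=A9";

// The soft break is dropped on decoding, so the original bytes come back.
$decoded = quoted_printable_decode($encoded);
var_dump($decoded === $original);
```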

Because of the line-length limit I need to break the line and add the "=" to the end (known as a soft break). However, unknown to me until now, you cannot split an individual character across multiple lines. So how can I determine what are whole characters in a multibyte string? :(

I cannot use the multibyte (mb_*) functions by the way, since this has to work on a vanilla PHP installation.

Hopefully someone can shed some light on the structure of UTF-8?

EDIT | Where's Ambush Commander when ya need him? :P

EDIT | w00t! I found the Ambush Commander article I seemed to remember and have been following links from that. I'm less in the dark than I was, but I still do not know how to figure out what is a character and what is just a byte from a multibyte character. I'll explain further what I do:

I currently do an extremely basic:

Code: Select all

for ($i = 0; $i < strlen($string); $i++)
{
  $ord = ord($string[$i]);
  //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
If I could change that loop to this I'd be happy:

Code: Select all

while (($char = mb_substr($string, 0, 1)) !== '') //Compare against '' so a "0" character doesn't end the loop
{
  //I have a character, but it could be any number of bytes
  $string = mb_substr($string, 1); //Move along the string
}
So on that note if someone could help me with a mb_substr equivalent that doesn't require the mb_.. functions I'd be very happy :)

Posted: Wed Apr 25, 2007 3:08 pm
by Ambush Commander
Mmm... intriguing. I know how one would brute force fix the problem, but I'm trying to think of a more elegant way to fix it.

For every UTF-8 character, you know the exact length of the character from the very first byte (because we're talking about hexadecimal pairs, this would be the first two xdigits). From 00-7F, the character is one byte and you can cut with impunity. C2-DF is two bytes, E0-EF is three bytes and F0-F4 is four bytes. So, the brute force method is to assemble the string byte by byte, ensuring that when you hit a multibyte sequence you have enough characters left to finish up the line, otherwise, you break, and then start over again.
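
A minimal sketch of the lead-byte rule above, without any mb_* functions. The function names utf8SequenceLength() and utf8Split() are made up for this example, and passing an invalid lead byte through as a single byte is just one possible choice, not something the thread prescribes.

```php
<?php
/**
 * Hypothetical helper: how many bytes the UTF-8 character starting with
 * $byte occupies. Returns 0 if $byte cannot start a character
 * (continuation bytes 80-BF, and the never-valid C0, C1, F5-FF).
 */
function utf8SequenceLength($byte)
{
    $ord = ord($byte);
    if ($ord <= 0x7F) { return 1; }                  // 00-7F: one byte
    if ($ord >= 0xC2 && $ord <= 0xDF) { return 2; }  // C2-DF: two bytes
    if ($ord >= 0xE0 && $ord <= 0xEF) { return 3; }  // E0-EF: three bytes
    if ($ord >= 0xF0 && $ord <= 0xF4) { return 4; }  // F0-F4: four bytes
    return 0;
}

/**
 * Hypothetical helper: split a string into whole characters by looking
 * only at each lead byte. Continuation bytes are not validated.
 */
function utf8Split($string)
{
    $chars = array();
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $n = utf8SequenceLength($string[$i]);
        if ($n === 0) {
            $n = 1; // Not a valid lead byte: pass it through on its own
        }
        $chars[] = substr($string, $i, $n);
        $i += $n;
    }
    return $chars;
}
```

For example, utf8Split("a\xC3\xA9\xE2\x82\xAC") gives three elements: "a", the two-byte "é" and the three-byte "€".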

I can't say much more than that because I have never used that encoding before.

Posted: Wed Apr 25, 2007 3:10 pm
by stereofrog
Well, if you have a UTF-8 string, this

Code: Select all

# untested
$utf= '/[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4].../s';
preg_match_all($utf, $string, $m);
$chars = $m[0];
should split it into individual "characters", 1 to 4 bytes each.

(Hope I understood the question correctly) ;)

Posted: Wed Apr 25, 2007 3:17 pm
by Ambush Commander
stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...

How are you chopping the string?

Posted: Wed Apr 25, 2007 3:18 pm
by Chris Corbyn
Thank you so much both of you! :)

Ambush completely cleared up how I know what sequences to expect, and stereofrog gave a brilliant example for how to grab characters :)

I should be able to work on this over the next couple of days.

/me ponders what impact this will have on other multibyte charsets (maybe I'll do a "function_exists('mb_substr')").
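
A rough sketch of that function_exists() idea, assuming the mb_* route is taken when available and the regex from earlier in the thread is the UTF-8 fallback. The name firstCharacter() and the last-resort single-byte fallback for other charsets are assumptions made for illustration only.

```php
<?php
/**
 * Hypothetical helper: return the first character of $string.
 * Uses mb_substr() when the mbstring extension is loaded; otherwise falls
 * back to a lead-byte regex for UTF-8, or to a single byte for anything else.
 */
function firstCharacter($string, $charset = 'UTF-8')
{
    if (function_exists('mb_substr')) {
        return mb_substr($string, 0, 1, $charset);
    }
    $utf = '/^(?:[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4]...)/s';
    if (strtoupper($charset) === 'UTF-8' && preg_match($utf, $string, $m)) {
        return $m[0];
    }
    // Unknown charset without mbstring: we can only treat it as single-byte
    return (string) substr($string, 0, 1);
}
```

The single-byte fallback is exactly the case that would still split characters in other multibyte charsets, so it marks where per-charset handling would have to go.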

UTF-8 is simple now that I know that about the starting bytes.

Posted: Wed Apr 25, 2007 3:24 pm
by Ambush Commander
Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.

Posted: Wed Apr 25, 2007 3:27 pm
by Chris Corbyn
Ambush Commander wrote:stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...

How are you chopping the string?
With a basic preg_match_all() followed by an implode. I don't check anything other than violation of QP rules (a line cannot have an "=" less than 3 characters from its end unless it is part of a soft break).

Code: Select all

/**
   * Return the QP encoded version of a string with no breaks
   * @param string The input to encode
   * @param boolean True if the data we're encoding is binary
   * @return string
   */
  function rawQPEncode($string, $bin=false)
  {
    $ret = "";
    if (!$bin)
    {
      $string = str_replace(array("\r\n", "\r"), "\n", $string);
      $string = str_replace("\n", "\r\n", $string);
    }
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++)
    {
      $val = ord($string[$i]);
      //9, 32 = HT, SP; 10, 13 = LF, CR; 33-60 & 62-126 are ok (61 is '=' and is always encoded)
      // 63 is '?' and needs encoding to go in the headers
      if ((!$bin && ($val == 32 || $val == 9 || $val == 10 || $val == 13))
        || ($val >= 33 && $val <= 60)
        || ($val >= 62 && $val <= 126 && $val != 63))
      {
        $ret .= $string[$i];
      }
      else
      {
        $ret .= sprintf("=%02X", $val);
      }
    }
    return $ret;
  }
I post-process the string to add the soft breaks for reasons I won't go into (related to MIME and line lengths in headers).

Code: Select all

//I don't like the use of preg.. here. Look at changing this?
preg_match_all('/.{1,'.$line_length.'}([^=]{0,3})?/', $string, $matches);
$result = implode("=" . $le, $matches[0]);
I've never been very happy with the code so I'm glad a bug has been brought to light in a way because I can hopefully write something I feel happy with after knowing what the problems are. I have no issues totally rewriting the QP encoding stuff from scratch.

Posted: Wed Apr 25, 2007 3:30 pm
by Chris Corbyn
Ambush Commander wrote:Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.
We're overlapping here so I'll give it ten mins after posting this, but I cannot change the encodings since users would probably find it too intrusive. People usually choose character encodings for one reason or another specific to their needs (or configuration) so I have to work with what I get. I should maybe start working on a class which tries to identify the encoding then delegates jobs to appropriate functions rather than trying to cover everything in one function. Right now, UTF-8 has my full attention.

EDIT | preg_match("/^.{1}/us", $str, $match) finds the first character, even if it's multibyte UTF-8. It won't work if the encoding is invalid though, which I'm guessing things like Thai-874 would break.
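
As a sketch of how that /u match could replace the mb_substr() loop: the function name splitWithUnicodeModifier() is made up, and returning false on invalid input is an assumption about what the caller would want.

```php
<?php
/**
 * Hypothetical helper: peel whole characters off the front of a UTF-8
 * string using the PCRE "u" modifier. Returns false if the input is not
 * valid UTF-8 (preg_match() then returns false rather than 1).
 */
function splitWithUnicodeModifier($string)
{
    $chars = array();
    while ($string !== '') {
        if (!preg_match('/^./us', $string, $m)) {
            return false;
        }
        $chars[] = $m[0];
        // Move along the string by the byte length of the character found
        $string = (string) substr($string, strlen($m[0]));
    }
    return $chars;
}
```

Unlike the lead-byte regex, this rejects malformed input outright instead of guessing, which matters if non-UTF-8 data is ever passed in by mistake.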

Re: I need a multibyte charset guru :(

Posted: Wed Apr 25, 2007 4:07 pm
by Oren
d11wtq wrote:I currently do an extremely basic:

Code: Select all

for ($i = 0; $i < strlen($string); $i++)
{
  $ord = ord($string[$i]);
  //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
It's a bit off topic but... you don't really do it like this right? I hope you do the strlen($string) part before and not on each iteration again and again 8O

Posted: Wed Apr 25, 2007 4:08 pm
by Ambush Commander
He doesn't. Check the full code. ;-)

Working on it right now.

Posted: Wed Apr 25, 2007 4:09 pm
by Oren
Dude... that was quick!

Posted: Wed Apr 25, 2007 4:44 pm
by Ambush Commander
I took a whack at it, but we might want to pass this one off to Feyd to be safe. :P The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Adding in the binary handling code makes things a bit messier than they should be. Add the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.

Posted: Wed Apr 25, 2007 4:49 pm
by Chris Corbyn
Ambush Commander wrote:I took a whack at it, but we might want to pass this one off to Feyd to be safe. :P The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Adding in the binary handling code makes things a bit messier than they should be. Add the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.
Wow, thanks for trying, you're a legend :)

I'm actually fine with merging the two methods provided I can specify:

Max length of all lines
Max length of first line *
Binary or not (affects line endings more than anything)

* First line sometimes has something like "Subject: =?utf-8?Q? .... before it so the QP encoded bit is actually shorter.

I'm having a shot at it myself too :)
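
For what it's worth, here is a rough, simplified sketch of what a merged function could look like for UTF-8 only. It is not Swift Mailer's code: the name qpEncodeUtf8() and its parameters are invented, binary mode is left out, the input is assumed to be valid UTF-8 with CRLF line endings, and trailing whitespace before a hard break is not specially encoded.

```php
<?php
/**
 * Hypothetical merged encoder: QP-encode a UTF-8 string and insert soft
 * breaks only between whole characters.
 *
 * @param string $string          Valid UTF-8, line endings already CRLF
 * @param int    $maxLength       Max length of all lines
 * @param int    $firstLineLength Max length of the first line (null = $maxLength)
 */
function qpEncodeUtf8($string, $maxLength = 76, $firstLineLength = null)
{
    $limit = ($firstLineLength === null) ? $maxLength : $firstLineLength;
    // Split into whole characters; a CRLF pair is kept together as a hard break
    $utf = '/\r\n|[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4].../s';
    preg_match_all($utf, $string, $m);

    $out = '';
    $lineLength = 0;
    foreach ($m[0] as $char) {
        if ($char === "\r\n") {
            $out .= $char;
            $lineLength = 0;
            $limit = $maxLength;
            continue;
        }
        // Encode every byte of this character into one unbreakable token
        $token = '';
        for ($i = 0, $n = strlen($char); $i < $n; $i++) {
            $val = ord($char[$i]);
            $literal = $val == 9 || $val == 32
                || ($val >= 33 && $val <= 126 && $val != 61 && $val != 63);
            $token .= $literal ? $char[$i] : sprintf('=%02X', $val);
        }
        // Keep one column free for the "=" of the soft break
        if ($lineLength > 0 && $lineLength + strlen($token) > $limit - 1) {
            $out .= "=\r\n";
            $lineLength = 0;
            $limit = $maxLength;
        }
        $out .= $token;
        $lineLength += strlen($token);
    }
    return $out;
}
```

Because each character is encoded as a single token before the length check, a break can never land inside "=XX" or between the bytes of one character. Something like qpEncodeUtf8($subject, 76, 60) would leave room on the first line for a header prefix; the 60 is an arbitrary example value.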

Posted: Thu Apr 26, 2007 2:48 pm
by Chris Corbyn
This is obvious and I don't need to worry about what the charset is. If I know QP encoding turns high value bytes into =XX, surely I just need to make sure there's no "=[0-9A-F]{2}" at the end of a line when I cut it, because any other sequence at the end of a line will just be an ASCII character :) Obviously I have to make sure I don't chop "=XX" in half itself too.

This task just became so much clearer and much less scary, unless I've completely overlooked something :P

EDIT | No, it was a good idea but if you don't have *any* ascii characters in your string then this won't work. Hmmff :(

Re: I need a multibyte charset guru :(

Posted: Fri Nov 07, 2008 6:34 am
by 1mdm
By the way, this problem is still burning. And it's not about QP encoding only. Base64 is also affected.
I attempted to hack into Swift_Message_Encoder, but I didn't find the way to fix it with small effort…