I need a multibyte charset guru :(

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

I wouldn't usually do this, but because the initial thread was posted in Swift Mailer's forum I'll just link to it.

viewtopic.php?p=376608#376608

Quick summary of QP encoding:

In a UTF-8 string, the number of bytes per character can vary, but QP encoding simply turns each individual byte that isn't printable ASCII into =XX, where XX is the byte's hexadecimal value. Encoded lines cannot exceed 76 characters, so a line that has to be chopped ends with "=" followed by CRLF. An "=" followed by CRLF is simply disregarded in the decoded output so that the string appears as it was.

Because of the line-length limit I need to break the line and add the "=" to the end (known as a soft break). However, unknown to me until now, you cannot split an individual character across multiple lines. So how can I determine what the whole characters are in a multibyte string? :(
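To make the problem concrete, here is a minimal sketch. The helper name `qp_bytes` is made up for illustration, and it only covers the =XX step (it also encodes spaces, and it adds no line breaks). "é" is two bytes in UTF-8, so it becomes two =XX triplets, and a cut between them leaves half a character on each line.

```php
<?php
// Hypothetical helper: QP-encode every byte that is not printable ASCII (or is "=").
function qp_bytes($string)
{
  $out = '';
  $len = strlen($string);
  for ($i = 0; $i < $len; $i++)
  {
    $ord = ord($string[$i]);
    if ($ord >= 33 && $ord <= 126 && $ord != 61)
    {
      $out .= $string[$i]; // printable ASCII other than "=" passes through
    }
    else
    {
      $out .= sprintf("=%02X", $ord); // everything else becomes =XX
    }
  }
  return $out;
}

$encoded = qp_bytes("caf\xC3\xA9"); // "café" as UTF-8 bytes
echo $encoded, "\n";                // caf=C3=A9

// A cut after 6 characters lands between the two bytes of "é":
// the first line ends in "=C3" plus the soft break, the next starts with "=A9".
echo substr($encoded, 0, 6), "=\r\n", substr($encoded, 6), "\n";
```

Each line on its own is legal QP, which is why a byte-level encoder does not notice anything wrong.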

I cannot use the multibyte functions by the way since this has to work on a vanilla PHP installation.

Hopefully someone can shed some light on understanding the structure of UTF-8?

EDIT | Where's Ambush Commander when ya need him? :P

EDIT | w00t! I found the Ambush Commander article I seemed to remember and have been following links from that. I'm less in the dark than I was but I still do not know how to figure out what is a character and what is just a byte from a multibyte character. I'll explain further what I do:

I currently do an extremely basic:

Code: Select all

for ($i = 0; $i < strlen($string); $i++)
{
  $ord = ord($string[$i]);
  //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
If I could change that loop to this I'd be happy:

Code: Select all

//Compare against '' so that a "0" character doesn't end the loop early
while (($char = mb_substr($string, 0, 1)) !== '')
{
  //I have a character, but it could be any number of bytes
  $string = mb_substr($string, 1); //Move along the string
}
So on that note if someone could help me with an mb_substr equivalent that doesn't require the mb_.. functions I'd be very happy :)
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Mmm... intriguing. I know how one would brute force fix the problem, but I'm trying to think of a more elegant way to fix it.

For every UTF-8 character, you know the exact length of the character from the very first byte (because we're talking about hexadecimal pairs, this would be the first two xdigits). From 00-7F, the character is one byte and you can cut with impunity. C2-DF is two bytes, E0-EF is three bytes and F0-F4 is four bytes. So, the brute force method is to assemble the string byte by byte, ensuring that when you hit a multibyte sequence you have enough characters left to finish up the line, otherwise, you break, and then start over again.
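A minimal sketch of that lead-byte rule, using only byte-level functions so it runs without the mb_* extension. The function names `utf8_char_length` and `utf8_split` are made up for illustration, and treating a stray or invalid byte as its own "character" is an assumption, not something the rule itself dictates.

```php
<?php
// Hypothetical helper: length in bytes of the UTF-8 character that starts with
// $byte, or 0 if $byte cannot start a character (continuation or invalid lead).
function utf8_char_length($byte)
{
  $ord = ord($byte);
  if ($ord <= 0x7F) return 1;
  if ($ord >= 0xC2 && $ord <= 0xDF) return 2;
  if ($ord >= 0xE0 && $ord <= 0xEF) return 3;
  if ($ord >= 0xF0 && $ord <= 0xF4) return 4;
  return 0;
}

// Walk a string character by character without mb_* functions.
function utf8_split($string)
{
  $chars = array();
  $len = strlen($string);
  $i = 0;
  while ($i < $len)
  {
    $n = utf8_char_length($string[$i]);
    if ($n == 0) $n = 1; // assumption: a stray byte is kept on its own
    $chars[] = substr($string, $i, $n);
    $i += $n;
  }
  return $chars;
}

// Three elements: "a" (1 byte), "é" (2 bytes), "€" (3 bytes)
print_r(utf8_split("a\xC3\xA9\xE2\x82\xAC"));
```

This does not validate the continuation bytes; it only trusts the first byte, which is enough to decide where a line may be cut.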

I can't say much more than that because I have never used that encoding before.
Last edited by Ambush Commander on Wed Apr 25, 2007 3:17 pm, edited 1 time in total.
stereofrog
Forum Contributor
Posts: 386
Joined: Mon Dec 04, 2006 6:10 am

Post by stereofrog »

Well, if you have a UTF-8 string, this

Code: Select all

# untested
$utf= '/[\x00-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4].../s';
preg_match_all($utf, $string, $m);
$chars = $m[0];
should split it into individual "characters", 1 to 4 bytes each.

(Hope I understood the question correctly) ;)
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...

How are you chopping the string?
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Thank you so much both of you! :)

Ambush completely cleared up how I know what sequences to expect, and stereofrog gave a brilliant example for how to grab characters :)

I should be able to work on this over the next couple of days.

/me ponders what impact this will have on other multibyte charsets (maybe I'll do a "function_exists('mb_substr')").

UTF-8 is simple now that I know that about the starting bytes.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Ambush Commander wrote:stereofrog's method will work, but it's not terribly efficient. Since, according to your edit which I missed previously, you're already doing byte by byte processing, the checks could be run in parallel with the loop. However...

How are you chopping the string?
With a basic preg_match_all() followed by an implode. I don't check anything other than violation of QP rules (lines cannot end with a "=" less than 3 characters from the end of the string unless it is part of a soft-break).

Code: Select all

  /**
   * Return the QP encoded version of a string with no breaks
   * @param string The input to encode
   * @param boolean True if the data we're encoding is binary
   * @return string
   */
  function rawQPEncode($string, $bin=false)
  {
    $ret = "";
    if (!$bin)
    {
      $string = str_replace(array("\r\n", "\r"), "\n", $string);
      $string = str_replace("\n", "\r\n", $string);
    }
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++)
    {
      $val = ord($string[$i]);
      //9, 32 = HT, SP; 10, 13 = LF, CR; 33-60 & 62-126 are ok (61 is '=')
      // 63 is '?' and needs encoding to go in the headers
      if ((!$bin && ($val == 32 || $val == 9 || $val == 10 || $val == 13))
        || ($val >= 33 && $val <= 60)
        || ($val >= 62 && $val <= 126 && $val != 63))
      {
        $ret .= $string[$i];
      }
      else
      {
        $ret .= sprintf("=%02X", $val);
      }
    }
    return $ret;
  }
I post-process the string to add the soft breaks for reasons I won't go into (related to MIME and line lengths in headers).

Code: Select all

//I don't like the use of preg.. here. Look at changing this?
preg_match_all('/.{1,'.$line_length.'}([^=]{0,3})?/', $string, $matches);
$result = implode("=" . $le, $matches[0]);
I've never been very happy with the code so I'm glad a bug has been brought to light in a way because I can hopefully write something I feel happy with after knowing what the problems are. I have no issues totally rewriting the QP encoding stuff from scratch.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Ambush Commander wrote:Hmm... that's going to be tedious. You'll have to research each of the encodings you want to support and determine their byte lengths. There really ought to be a better way. If you use mb_* things would work out very nicely, due to its multiple character encodings support. If you use iconv, you could convert everything to UTF-8 and then encode it accordingly.
We're overlapping here so I'll give it ten mins after posting this, but I cannot change the encodings since users would probably find it too intrusive. People usually choose character encodings for one reason or another specific to their needs (or configuration) so I have to work with what I get. I should maybe start working on a class which tries to identify the encoding then delegates jobs to appropriate functions rather than trying to cover everything in one function. Right now, UTF-8 has my full attention.

EDIT | preg_match("/^.{1}/us", $str, $match) finds the first character, even if it's multibyte UTF-8. It won't work if the input isn't valid UTF-8 though, which I'm guessing is what would happen with things like Thai-874.
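A rough sketch of how that preg_match() call could stand in for the mb_substr loop. The function name `utf8_shift` is made up for illustration. It relies on preg_match() returning false when the /u modifier is given a subject that is not valid UTF-8, so a false result here means either "nothing left" or "not UTF-8".

```php
<?php
// Hypothetical helper: returns array($firstChar, $rest), or false when the
// string is empty or PCRE rejects it (for example because it is not valid UTF-8).
function utf8_shift($str)
{
  if (preg_match('/^./us', $str, $match) !== 1)
  {
    return false;
  }
  return array($match[0], substr($str, strlen($match[0])));
}

$str = "\xC3\xA9t\xC3\xA9"; // "été" as UTF-8 bytes
while (($pair = utf8_shift($str)) !== false)
{
  list($char, $str) = $pair;
  echo strlen($char), " byte(s): ", $char, "\n";
}
```

It copies the remainder of the string on every step, so for long input a single pass over the bytes is likely to be cheaper.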
Oren
DevNet Resident
Posts: 1640
Joined: Fri Apr 07, 2006 5:13 am
Location: Israel

Re: I need a multibyte charset guru :(

Post by Oren »

d11wtq wrote:I currently do an extremely basic:

Code: Select all

for ($i = 0; $i < strlen($string); $i++)
{
  $ord = ord($string[$i]);
  //Check if it's a permitted byte, then either append it to $result, or sprintf("=%02X", $ord);
}
It's a bit off topic but... you don't really do it like this right? I hope you do the strlen($string) part before and not on each iteration again and again 8O
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

He doesn't. Check the full code. ;-)

Working on it right now.
Oren
DevNet Resident
Posts: 1640
Joined: Fri Apr 07, 2006 5:13 am
Location: Israel

Post by Oren »

Dude... that was quick!
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I took a whack at it, but we might want to pass this one off to Feyd to be safe. :P The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Add in the binary handling code, which makes things a bit messier than they should be, the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Ambush Commander wrote:I took a whack at it, but we might want to pass this one off to Feyd to be safe. :P The main problem I see is that we can't use my method without merging the rawQPEncode and QPEncode functions, since rawQPEncode, which is doing the byte-by-byte analysis, needs to know when to insert a break (some funky parameter passing would be necessary). Add in the binary handling code, which makes things a bit messier than they should be, the fact that I'm still not exactly certain how QP encoding works (too lazy to read into it in depth) and the fact that I don't really have a good battery of unit tests, and I think you'll have to fix this one yourself. Sorry.
Wow, thanks for trying, you're a legend :)

I'm actually fine with merging the two methods provided I can specify:

Max length of all lines
Max length of first line *
Binary or not (affects line endings more than anything)

* First line sometimes has something like "Subject: =?utf-8?Q? .... before it so the QP encoded bit is actually shorter.

I'm having a shot at it myself too :)
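One way the merged version could look, as a sketch only. The function name `qp_encode_utf8` and its parameters are made up for illustration, it assumes the input is UTF-8, it encodes spaces and "?" as well, and it leaves out the binary flag and hard line breaks entirely. Each character is encoded as a unit, and a soft break is inserted before a character that would not fit.

```php
<?php
// Hypothetical sketch: QP-encode UTF-8 text, inserting soft breaks ("=" + CRLF)
// only between whole characters. Hard line breaks in the input are not handled.
function qp_encode_utf8($string, $maxLine = 76, $firstLine = 76)
{
  $lines = array('');
  $limit = $firstLine - 1; // leave room for the trailing "=" of a soft break
  $len = strlen($string);
  $i = 0;
  while ($i < $len)
  {
    // Character length from the lead byte (ranges given earlier in the thread)
    $ord = ord($string[$i]);
    if ($ord >= 0xF0 && $ord <= 0xF4) $n = 4;
    elseif ($ord >= 0xE0 && $ord <= 0xEF) $n = 3;
    elseif ($ord >= 0xC2 && $ord <= 0xDF) $n = 2;
    else $n = 1; // ASCII, or a stray byte treated on its own

    // Encode every byte of this character
    $chunk = '';
    for ($j = $i; $j < $i + $n && $j < $len; $j++)
    {
      $b = ord($string[$j]);
      $safe = ($b >= 33 && $b <= 126 && $b != 61 && $b != 63);
      $chunk .= $safe ? $string[$j] : sprintf("=%02X", $b);
    }

    // Start a new line if the whole character does not fit on this one
    $cur = count($lines) - 1;
    if ($lines[$cur] !== '' && strlen($lines[$cur]) + strlen($chunk) > $limit)
    {
      $lines[] = '';
      $cur++;
      $limit = $maxLine - 1; // later lines use the normal limit
    }
    $lines[$cur] .= $chunk;
    $i += $n;
  }
  return implode("=\r\n", $lines);
}

// Ten euro signs, 20 characters per line, only 12 on the first line
echo qp_encode_utf8(str_repeat("\xE2\x82\xAC", 10), 20, 12), "\n";
```

With those limits the first line holds one euro sign and the following lines hold two each, so no line ever ends partway through a character.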
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

This is obvious and I don't need to worry about what the charset is. If I know QP encoding turns high-value bytes into =XX, surely I just need to make sure there's no "=[0-9A-F]{2}" at the end of a line when I cut it, because any other sequence at the end of a line will just be an ASCII character :) Obviously I have to make sure I don't chop "=XX" in half itself too.

This task just became so much clearer and much less scary, unless I've completely overlooked something :P

EDIT | No, it was a good idea but if you don't have *any* ascii characters in your string then this won't work. Hmmff :(
1mdm
Forum Newbie
Posts: 1
Joined: Fri Nov 07, 2008 6:28 am

Re: I need a multibyte charset guru :(

Post by 1mdm »

By the way, this problem is still burning, and it's not only about QP encoding: Base64 is affected too.
I attempted to hack into Swift_Message_Encoder, but I didn't find the way to fix it with small effort…