[Tricky] Regex for adding soft-breaks to QP encoded strings

Posted: Sun May 06, 2007 6:53 am
by Chris Corbyn
Overview

QP (quoted-printable) encoded strings represent some bytes in the string as =XX, where XX is the ordinal value of the byte in hex. For example, \n (line feed) would be =0A. The maximum length of a line is 76 characters. Encoded lines MUST end with \r\n (CRLF). A line ending with =\r\n is a wrapped line, where the =\r\n is known as a "soft-break". When the string is decoded, the "=" and the line break get stripped, therefore:

Code:

Hello =
world!
Is the same as

Code:

Hello world!
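
The decoding behaviour described above can be checked with PHP's built-in quoted_printable_decode(). This is only a minimal sketch; the variable names are made up for illustration, and it assumes the input uses CRLF line endings as the overview requires.

```php
<?php
// A wrapped line: "=" followed by CRLF is a soft-break and vanishes on decode.
$wrapped = "Hello =\r\nworld!";
$decoded = quoted_printable_decode($wrapped);

// An encoded byte: =0A is the line feed character.
$lineFeed = quoted_printable_decode("=0A");

var_dump($decoded === "Hello world!"); // bool(true)
var_dump($lineFeed === "\n");          // bool(true)
```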
The problem

OK, now you know as much as you need to know about QP encoding to tackle this problem :) I receive my encoded string as one long line, without the soft-breaks required to keep each line within 76 characters. I need to be able to work my way along this string, deciding where to add the soft-breaks.

The only rules for this part:

1. The soft break MUST happen before 76 characters are present on the line.
2. The soft break cannot be placed directly between two encoded bytes (e.g. between =C3 and =A1), nor inside a single =XX sequence.
3. Ideally it should be greedy and get as many of those 76 chars on the line as it can, without breaking rules 1 & 2.

So from this string:

Code:

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A9ny_logomix
Which decodes to (UTF-8):

Code:

Varování_před_expirací_domény_logomix
These would be valid:

Code:

Varov=C3=A1n=C3=AD=
_p=C5=99ed_expirac=C3=AD_dom=C3=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99=
ed_expirac=C3=AD_dom=C3=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expir=
ac=C3=AD_dom=C3=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A9=
ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_d=
om=C3=A9ny_logomix
But these would not:

Code:

Varov=C3=
=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=
=99ed_expirac=C3=AD_dom=C3=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=
=A9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A=
9ny_logomix

Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C=
3=AD_dom=C3=A9ny_logomix
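
To make the valid/invalid distinction concrete, here is a rough sketch of a checker for a single candidate break position. The function name isValidSoftBreak() and its parameters are hypothetical, and it assumes every "=" in the input starts an =XX sequence (a literal "=" would itself be encoded as =3D). Rule 1 is read here as "the text plus the trailing = must fit in 76 characters", which is my interpretation rather than something spelled out above.

```php
<?php
// Hypothetical helper: may a soft-break be inserted after $pos characters?
function isValidSoftBreak(string $encoded, int $pos, int $maxLineLength = 76): bool
{
    // Rule 1 (as interpreted here): the text plus the trailing "=" must fit.
    if ($pos < 1 || $pos >= strlen($encoded) || $pos + 1 > $maxLineLength) {
        return false;
    }

    $before = substr($encoded, 0, $pos);
    $after  = substr($encoded, $pos);

    // Never split inside =XX: the text before the break ends in "=" or "=X".
    if (preg_match('/=[A-F0-9]?\z/', $before)) {
        return false;
    }

    // Rule 2: never break directly between two encoded bytes.
    if (preg_match('/=[A-F0-9]{2}\z/', $before) && preg_match('/^=[A-F0-9]{2}/', $after)) {
        return false;
    }

    return true;
}

$subject = 'Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A9ny_logomix';

// Break positions taken from the first valid and first invalid examples.
var_dump(isValidSoftBreak($subject, strlen('Varov=C3=A1n=C3=AD'))); // bool(true)
var_dump(isValidSoftBreak($subject, strlen('Varov=C3')));           // bool(false)
```

This does not try to be greedy (rule 3); it only answers yes or no for one position.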
I only need a pattern which gives me the first 1-76 characters in the string which satisfy those rules because I keep trimming the string myself and re-running the pattern until no string is left.

Here's a pattern which works in PHP5.2:

Code:

preg_match('/^.{1,' . $length . '}(?<=[^=])[^=](?!=[A-F0-9]{2})/', $string, $matches);
But in PHP4 I get this error:
Unexpected PHP error [Compilation failed: lookbehind assertion is not fixed length at offset 16] severity [E_WARNING] in [/Users/d11wtq/public_html/swiftmailer/trunk/php4/lib/Swift/Message/Encoder.php line 195]
The error is stupid because the lookbehind is fixed-width; it's just not a fixed value. Anyway, I need some regex gurus to throw in some more patterns because I've spent too long on this now :)

Don't worry about strings where you only have =XX=YY=AA=BB=CC repeated hundreds of times in sequence, because rule 2 can never be satisfied there. I'll worry about that myself; I just need to pick your brains on this pattern first :)
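
For what it's worth, that edge case is easy to reproduce. The sketch below runs the pattern from above against an artificial run of back-to-back encoded bytes; the test string and the value 74 for $length are my own choices for illustration, and it assumes a PCRE build that accepts the lookbehind (as PHP 5.2 does).

```php
<?php
// 300 characters consisting of nothing but encoded bytes.
$run = str_repeat('=C3=A1', 50);

$length  = 74; // arbitrary value chosen for this demonstration
$pattern = '/^.{1,' . $length . '}(?<=[^=])[^=](?!=[A-F0-9]{2})/';

// Every place the match could end is either inside =XX or directly
// followed by another =XX, so no break point is found.
$found = preg_match($pattern, $run, $matches);

var_dump($found); // int(0)
```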

Cheers.

Posted: Sun May 06, 2007 7:15 am
by Chris Corbyn
I've done it again. Did you know you can refactor regex by the way? :P

Code:

preg_match('/^.{1,' . $length . '}[^=]{2}(?!=[A-F0-9]{2})/', $string, $matches);
Not quite the same, but does the same job for me anyway :)
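
For anyone wanting to see the trim-and-rerun loop in one piece, here is a rough sketch built around the lookbehind-free pattern. The function addSoftBreaks() and its parameters are hypothetical and not taken from the actual Encoder.php. Because .{1,$length} is followed by [^=]{2}, a match can be up to $length + 2 characters long, so the sketch assumes $length should be the line limit minus 3 to leave room for the trailing "=". It is written for current PHP (type declarations, short array syntax), not PHP4.

```php
<?php
// Hypothetical helper: wrap one long QP-encoded line with soft-breaks.
function addSoftBreaks(string $encoded, int $maxLineLength = 76): string
{
    // Match is at most $length + 2 chars; the soft-break "=" adds one more.
    $length  = $maxLineLength - 3;
    $pattern = '/^.{1,' . $length . '}[^=]{2}(?!=[A-F0-9]{2})/';

    $lines = [];
    while (strlen($encoded) > $maxLineLength) {
        if (!preg_match($pattern, $encoded, $matches)) {
            // No usable break point (e.g. a long run of =XX=YY...).
            // Handling that case is left open here, as in the post.
            break;
        }
        $lines[] = $matches[0];
        // Trim the matched part off and run the pattern again.
        $encoded = substr($encoded, strlen($matches[0]));
    }
    $lines[] = $encoded;

    // Join with the soft-break sequence "=" CRLF.
    return implode("=\r\n", $lines);
}

$subject = 'Varov=C3=A1n=C3=AD_p=C5=99ed_expirac=C3=AD_dom=C3=A9ny_logomix';

// A short limit of 30 is used only to force some wrapping on this sample.
$wrapped = addSoftBreaks($subject, 30);
echo $wrapped, "\r\n";
```

If the loop bails out early, the last line may still be longer than the limit, so a real implementation would need a fallback for that case.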