regex'ing multi line ending in a double carriage return

ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

I am trying to insert a line at the top of an email using ereg_replace. The idea is to replace everything from
"Content-Type: text/plain;"
to the double carriage return (\n\n)
between "quoted-printable" and the original start of the email
with the matched data + "the line I want to add",
so that the example here:
  • ------_=_NextPart_002_01C8B165.D1A0F25B
    Content-Type: text/plain;
    charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable

    The original start of the email text
becomes:
  • ------_=_NextPart_002_01C8B165.D1A0F25B
    Content-Type: text/plain;
    charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable

    the line I want to add
    The original start of the email text
My limited experience with regex is failing me.
Anyone got the answer?
Thanks
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

Maybe I should add to this.

I have the body of the email as a string, which I extracted using Net_POP3:

$msgBody = htmlspecialchars($pop3->getBody(1));
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: regex'ing multi line ending in a double carriage return

Post by GeertDD »

Code: Select all

preg_replace('~(?<=\n\n)~', 'line you want to add'."\n", $string, 1);
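
A side note on the one-liner above: the zero-width lookbehind inserts the text right after the first "\n\n", but mail fetched over POP3 normally uses CRLF line endings, in which case "\n\n" never occurs in the string. Below is a minimal sketch of a variant that accepts both. The function name insert_after_first_blank_line is made up for illustration, and a callback is used so that "$" or "\" in the inserted text is not read as a backreference.

```php
<?php
// Hypothetical helper (not from the thread): insert $line after the first
// blank line, whether the text uses LF or CRLF line endings.
function insert_after_first_blank_line($text, $line)
{
    // Zero-width match at the position right after "\n\n" or "\r\n\r\n".
    $pattern = '~(?<=\n\n|\r\n\r\n)~';

    // The callback returns the inserted text verbatim, so sequences such
    // as "$1" in $line are not treated as backreferences.
    return preg_replace_callback(
        $pattern,
        function ($m) use ($line) {
            return $line . "\n";
        },
        $text,
        1 // only the first blank line
    );
}

$lf   = "Content-Type: text/plain;\n\nOriginal start";
$crlf = "Content-Type: text/plain;\r\n\r\nOriginal start";

echo insert_after_first_blank_line($lf, 'Added line'), "\n";
echo insert_after_first_blank_line($crlf, 'Added line'), "\n";
```

Note that this targets the first blank line in the whole string, so it only helps when the text/plain part comes first.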
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex'ing multi line ending in a double carriage return

Post by prometheuzz »

ZaphodQB wrote:I am trying to insert a line at the top of an email using ereg_replace to replace ...
Is there a specific need to use ereg_replace instead of preg_replace? The latter is more widely used, if I'm not mistaken, and also the one I know how to work with ; )
Here's a way:

Code: Select all

#!/usr/bin/php
<?php
$text = <<< BLOCK
------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
The original start of the email text
BLOCK;
print preg_replace('/(Content-Transfer-Encoding:.*?\n)\n+/', "$1\n\n", $text);
?>
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

I don't know if this will help but here is a very stripped down version of my attempt to just get the match on the text/plain portion of the message body.

Code: Select all

<?php
// Make the PEAR packages installed under the hosting account loadable.
set_include_path(get_include_path() . PATH_SEPARATOR . "/home/myHostedsite/php");

require_once 'Net/POP3.php';
require_once 'Mail/mimeDecode.php';

$msgNumber = 2;

$pop3 = new Net_POP3();

if ($pop3->connect('mail.MyDomain.com', 110)) { echo "connected<br>"; } else { echo "NOT connected<br>"; }

if ($pop3->login('MyeMailAddress@MyDomain.com', 'MyPassword')) { echo "logged in<br>"; } else { echo "NOT logged in<br>"; }

$headersArray = $pop3->getParsedHeaders($msgNumber);

// Escaped for display in the browser; the regex below runs on this escaped copy.
$msgBody = htmlspecialchars($pop3->getBody($msgNumber));

$pop3->disconnect();

echo "content type is " . $headersArray['Content-Type'] . "<br><br>";

// Pull the MIME boundary out of the top-level Content-Type header.
$pattern = "/boundary\s*=\s*[\"'](.*)[\"']/";

preg_match($pattern, $headersArray['Content-Type'], $matches);

echo "<pre> boundary = " . $matches[1] . "</pre>";

$boundary = $matches[1];

//$pattern="/(Content-Type:.*?text\/plain;.*?\n)\n+?/"; echo $pattern;

// Attempt to match the text/plain header block up to the blank line.
preg_match('/(Content-Type:.*?text\/plain;.*?\n)\n+/', $msgBody, $matches);

echo "<pre>"; print_r($matches); echo "</pre>";

echo "<pre> msg body = " . $msgBody . "</pre><br /><br /><hr>";

?>

Which produces the following output:


  • connected
    logged in
    content type is multipart/alternative; boundary="ABCD-T0TH053F055AA6EB3A97633D853E00-EFGH"


    boundary = ABCD-T0TH053F055AA6EB3A97633D853E00-EFGH
    /(Content-Type:.*?text\/plain;.*? ) +?/
    Array
    (
    )

    msg body = --ABCD-T0TH053F055AA6EB3A97633D853E00-EFGH
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: quoted-printable

    Capital One--what's in your wallet?
    http://email.capitalone.com/
    <snip>
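
One likely reason for the empty match array above: without the "s" modifier, "." does not match a newline, so ".*?\n" cannot get past the end of the Content-Type line, and the pattern then demands another "\n" right away, where the sample has the Content-Transfer-Encoding header instead (and, with CRLF line endings, a "\r"). Here is a small sketch on a shortened copy of the sample body. The sample string, the placeholder boundary and the variable names are illustrative only, and the CRLF endings are an assumption about what POP3 delivered.

```php
<?php
// Shortened, illustrative copy of the message body, with CRLF line endings
// as typically delivered over POP3.
$body = "--BOUNDARY\r\n"
      . "Content-Type: text/plain; charset=us-ascii\r\n"
      . "Content-Transfer-Encoding: quoted-printable\r\n"
      . "\r\n"
      . "Capital One--what's in your wallet?\r\n";

// Pattern from the post: "." stops at each newline, so the match cannot
// span both header lines.
$without_s = preg_match('/(Content-Type:.*?text\/plain;.*?\n)\n+/', $body, $m1);

// With the "s" modifier "." also matches newlines, and "\r?\n\r?\n"
// accepts either LF or CRLF for the blank line that ends the headers.
$with_s = preg_match('/Content-Type:\s*text\/plain;.*?\r?\n\r?\n/is', $body, $m2);

var_dump($without_s); // int(0)
var_dump($with_s);    // int(1)
```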
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

My bad, I miscopied it.
Last edited by ZaphodQB on Sun May 11, 2008 12:46 am, edited 1 time in total.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex'ing multi line ending in a double carriage return

Post by prometheuzz »

ZaphodQB wrote: ...
This does not work, it raises an error during parsing.

Parse error: syntax error, unexpected '<' in /home/a2426477/public_html/test.php on line 2
It works fine. With the following input:

Code: Select all

------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
The original start of the email text
I get the following output:

Code: Select all

------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
 
The original start of the email text
Perhaps you left the "#!/usr/bin/php" part in when copy-pasting it? That line only needs to stay if you're executing the script from the shell (and the 'php' executable resides in /usr/bin/).

And in case the <<< is causing problems, you can try:

Code: Select all

<?php
$text = "------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset=\"us-ascii\"
Content-Transfer-Encoding: quoted-printable
 
The original start of the email text";
print preg_replace('/(Content-Transfer-Encoding:.*?\n)\n+/', "$1\n\n", $text);
?>
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

prometheuzz wrote: ...
I got it working, but (maybe I am missing something) it does not actually change or insert anything; the text remains the same.
That is true even when I add a line of text to the replacement string, like so:

Code: Select all

<?php
$text = <<< BLOCK
------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
The original start of the email text
BLOCK;
print "<pre>".preg_replace('/(Content-Transfer-Encoding:.*?\n)\n+/', "$1\n\nMy line of text", $text)."</pre>";
 
?>
The output is still unchanged:
------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

The original start of the email text
Also, when I play with it to make the match start at "Content-Type: text/plain;", it fails to match more than the "Content-Type: text/plain;" line. I need the match to start there because that header is what tells me I am inserting the text into the correct part of the multi-part email.

Let me try to explain again.

Emails can be multi-part, containing both plain text and HTML portions.

I want to be able to regex the entire body of the email and
1. find the header which says this is the plain text portion
2. find the end of that header/start of the actual message (by the specs that is 2 line breaks in a row, i.e. an empty line)
3. insert a line of text directly after that empty line.

Also:

I want to be able to regex the entire body of the email and
1. find the header which says this is the HTML portion
2. in this section, go a little deeper into the message and find the <body> tag
3. insert a line of text directly after that <body> tag.

So it is imperative that the match starts at the Content-Type header field, in order to know which part of the email I am dealing with, and that it matches all the way to the empty line or the <body> tag, inclusive. Once I have extracted a copy of that data, I can append my line of text and then replace the existing "several lines" with the new data, which has my new line of text on the end.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex'ing multi line ending in a double carriage return

Post by prometheuzz »

ZaphodQB wrote:...
I got it working, but, and maybe I am missing something, it is not really changing the file and inserting anything, the text remains the same.
Even when adding a line of text to the replacement string like so
...
No, you're right. Apparently I posted another version of the regex that did work. Sorry about that.
Here's one that does add an empty line.

Code: Select all

#!/usr/bin/php
<?php
$text = "------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset=\"us-ascii\"
Content-Transfer-Encoding: quoted-printable

The original start of the email text";
print "text =\n$text\n\n";
print "adjusted text =\n" . preg_replace('/(Content-Transfer-Encoding:[^\n]+)\n/', "$1\n\n", $text);                   
/* output:
text =
------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
The original start of the email text
 
adjusted text =
------_=_NextPart_002_01C8B165.D1A0F25B
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
 
 
The original start of the email text
*/
?>
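
When the printed text looks unchanged, it can help to check whether the pattern matched at all. preg_replace() takes an optional fifth argument that receives the number of replacements made. The sketch below runs the earlier pattern on two shortened, illustrative strings. That the unchanged output came from CRLF line endings in the test file is only an assumption here; the thread does not say.

```php
<?php
$pattern = '/(Content-Transfer-Encoding:.*?\n)\n+/';

// LF line endings: the pattern matches, and since $1 already ends in "\n",
// the replacement "$1\n\n" adds one extra blank line.
$lf = "Content-Transfer-Encoding: quoted-printable\n\nOriginal start";
$out_lf = preg_replace($pattern, "$1\n\n", $lf, -1, $count_lf);

// CRLF line endings: after "...\r\n" the next character is "\r", not "\n",
// so the pattern never matches and the text comes back untouched.
$crlf = "Content-Transfer-Encoding: quoted-printable\r\n\r\nOriginal start";
$out_crlf = preg_replace($pattern, "$1\n\n", $crlf, -1, $count_crlf);

var_dump($count_lf, $out_lf !== $lf);       // int(1) bool(true)
var_dump($count_crlf, $out_crlf === $crlf); // int(0) bool(true)
```

A count of 0 points at the pattern rather than at the replacement string.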

ZaphodQB wrote:...
Let me try to explain again;
...
I'm sure you've explained it properly, but I am not familiar with the format of e-mail messages, so the above explanation is not so clear for me.
Some example in- and output would help. Like this: given the following e-mail message as a string:
...

I want to match the following parts of that message:
...

and want to replace part X with this:
...

resulting in the final string:
...

But perhaps you can adjust the snippet in this message with something that suits your needs.
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

OK I will supply some samples (short ones since emails in the raw form can be hundreds of lines long)
That being said this is still going to be long.

Here is the MIME header and very top of the message for a text/plain section of an email;
  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    This is the original top of the message as it is received before I add my line of text.

This is the text/html section of the same email message

  • ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/html;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML><HEAD>
    <META http-equiv=3DContent-Type content=3D"text/html; =
    charset=3Diso-8859-1">
    <META content=3D"MSHTML 6.00.6000.16640" name=3DGENERATOR>
    <STYLE></STYLE>
    </HEAD>
    <BODY bgColor=3D#ffffff>
    <DIV><FONT face=3DArial size=3D2>This is the original top of the message =
    as it is=20
    received before I add my line of text.</FONT></DIV></BODY></HTML>

Here is another eMail of the multi part type showing how wildly different the two portions can be

  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0000_01C8B36B.C84F9370
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: 7bit

    This is the original top of the message

This is the text/html portion of the second email

  • ------=_NextPart_000_0000_01C8B36B.C84F9370
    Content-Type: text/html;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    <html xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
    xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
    xmlns=3D"http://www.w3.org/TR/REC-html40">

    <head>
    <meta http-equiv=3DContent-Type content=3D"text/html; =
    charset=3Diso-8859-1">
    <meta name=3DProgId content=3DWord.Document>
    <meta name=3DGenerator content=3D"Microsoft Word 10">
    <meta name=3DOriginator content=3D"Microsoft Word 10">
    <link rel=3DFile-List href=3D"cid:filelist.xml@01C8B36B.C8080320">
    <!--[if gte mso 9]><xml>
    <o:OfficeDocumentSettings>
    <o:DoNotRelyOnCSS/>
    </o:OfficeDocumentSettings>
    </xml><![endif]--><!--[if gte mso 9]><xml>
    <w:WordDocument>
    <w:View>Print</w:View>
    <w:SpellingState>Clean</w:SpellingState>
    <w:GrammarState>Clean</w:GrammarState>
    <w:EnvelopeVis/>
    <w:Compatibility>
    <w:BreakWrappedTables/>
    <w:SnapToGridInCell/>
    <w:WrapTextWithPunct/>
    <w:UseAsianBreakRules/>
    </w:Compatibility>
    <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
    </w:WordDocument>
    </xml><![endif]-->
    <style>
    <!--
    /* Style Definitions */
    p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0in;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:12.0pt;
    font-family:"Times New Roman";
    mso-fareast-font-family:"Times New Roman";}
    @page Section1
    {size:8.5in 11.0in;
    margin:1.0in 1.25in 1.0in 1.25in;
    mso-header-margin:.5in;
    mso-footer-margin:.5in;
    mso-paper-source:0;}
    div.Section1
    {page:Section1;}
    -->
    </style>
    <!--[if gte mso 10]>
    <style>
    /* Style Definitions */=20
    table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
    mso-style-parent:"";
    mso-padding-alt:0in 5.4pt 0in 5.4pt;
    mso-para-margin:0in;
    mso-para-margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:10.0pt;
    font-family:"Times New Roman";}
    </style>
    <![endif]-->
    </head>

    <body lang=3DEN-US style=3D'tab-interval:.5in'>
    This is the original top of the message

As you can see, they can vary greatly, but there are some things they have in common due to the specs outlined for standard email packaging, which lend themselves to regexing.

So here we go: in either of the preceding emails I want to use
preg_match($pattern,$emailbody,$matches);
to load everything from the "content-type: text/plain;" all the way to the \n\n (two returns)
which (per email specs) separate the MIME header from the actual message.
Then use preg_replace($matches[0],$matches[0]."My new Line of extra text",$emailbody);

So that the two text/plain portions become:


  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    My new Line of extra text
    This is the original top of the message as it is received before I add my line of text.

and

  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0000_01C8B36B.C84F9370
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: 7bit

    My new Line of extra text
    This is the original top of the message
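
A caution on the plan above: preg_replace() treats its first argument as a pattern, so passing the matched text $matches[0] there would not work as written (it has no delimiters, and characters such as "." or "?" would be read as regex syntax). Below is a minimal sketch of two alternatives: a literal str_replace(), or escaping the text with preg_quote() first. The sample string is shortened and illustrative, and the "\r\n" line endings are an assumption.

```php
<?php
$emailbody = "Content-Type: text/plain;\r\n"
           . "\tcharset=\"iso-8859-1\"\r\n"
           . "Content-Transfer-Encoding: 7bit\r\n"
           . "\r\n"
           . "This is the original top of the message\r\n";

$pattern = '/content-type:\s*text\/plain;.*?\r?\n\r?\n/is';

if (preg_match($pattern, $emailbody, $matches)) {
    $extra = "My new Line of extra text\r\n";

    // Option 1: literal search and replace, no regex syntax involved.
    $result1 = str_replace($matches[0], $matches[0] . $extra, $emailbody);

    // Option 2: escape the matched text so it is safe inside a pattern.
    // A callback supplies the replacement, so it is inserted verbatim.
    $literal = '/' . preg_quote($matches[0], '/') . '/';
    $result2 = preg_replace_callback(
        $literal,
        function ($m) use ($extra) {
            return $m[0] . $extra;
        },
        $emailbody,
        1
    );

    var_dump($result1 === $result2); // bool(true)
    echo $result1;
}
```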

Now, to keep this from getting even longer:
I want to do the same to the "text/html" portions as well, so that whichever way the recipient reads it, plain text or HTML, they will see the extra text.
However, for the "text/html" portions the match can't stop at the end of the header (\n\n); it has to match all the way down to the <body> tag, so that when "My new Line of extra text" is appended to the end of $matches[0], it falls inside the body of the HTML document.

Like so (first example only):



This is the text/html section of the same email message

  • ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/html;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML><HEAD>
    <META http-equiv=3DContent-Type content=3D"text/html; =
    charset=3Diso-8859-1">
    <META content=3D"MSHTML 6.00.6000.16640" name=3DGENERATOR>
    <STYLE></STYLE>
    </HEAD>
    <BODY bgColor=3D#ffffff>
    My new Line of extra text
    <DIV><FONT face=3DArial size=3D2>This is the original top of the message =
    as it is=20
    received before I add my line of text.</FONT></DIV></BODY></HTML>

Hope that all helps.
and thanks for the help as well.
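
For the text/html part described above, one possible approach is to match lazily from the Content-Type header to the first opening <body> tag and append the line there. This is a minimal sketch under a few assumptions: the function name insert_after_html_body is made up, the tag is matched case-insensitively, and the <body ...> tag is assumed not to be split by a quoted-printable soft line break ("=" at the end of a line), which this pattern does not handle.

```php
<?php
// Hypothetical helper: insert $html right after the opening <body> tag of
// the first text/html part found in $msgBody.
function insert_after_html_body($msgBody, $html)
{
    // i: case-insensitive (BODY or body); s: "." also matches newlines.
    // "<body\b[^>]*>" matches the tag with or without attributes.
    $pattern = '/content-type:\s*text\/html;.*?<body\b[^>]*>/is';

    return preg_replace_callback(
        $pattern,
        function ($m) use ($html) {
            return $m[0] . "\r\n" . $html;
        },
        $msgBody,
        1
    );
}

$part = "Content-Type: text/html;\r\n"
      . "\tcharset=\"iso-8859-1\"\r\n"
      . "Content-Transfer-Encoding: quoted-printable\r\n"
      . "\r\n"
      . "<HTML><HEAD></HEAD>\r\n"
      . "<BODY bgColor=3D#ffffff>\r\n"
      . "<DIV>Original top of the message</DIV></BODY></HTML>\r\n";

echo insert_after_html_body($part, 'My new Line of extra text');
```

If an HTML part has no <body> tag at all, the lazy match could run on into a later part, so checking the result against the part boundary would be a sensible extra step.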
ZaphodQB
Forum Newbie
Posts: 7
Joined: Thu Apr 24, 2008 6:23 pm

Re: regex'ing multi line ending in a double carriage return

Post by ZaphodQB »

Whooo hoo! Got the 1st part figured out for the text/plain portions!

I can change:

  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    This is the original top of the message as it is received before I add =
    my line of text.
    ------=_NextPart_000_0005_01C8B365.50073EA0
using:

Code: Select all

preg_match("/(content-type:\s*text\/plain;.*?)\r?\n\r?\n/is",$msgBody,$matches);
$replacement=$matches[0]."My new line of text\n";
$msgBody = preg_replace("/(content-type:\s*text\/plain;.*?)\r?\n\r?\n/is",$replacement,$msgBody);
into:
  • This is a multi-part message in MIME format.

    ------=_NextPart_000_0005_01C8B365.50073EA0
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    My new line of text
    This is the original top of the message as it is received before I add =
    my line of text.
    ------=_NextPart_000_0005_01C8B365.50073EA0
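
The two calls above can probably be folded into one: in a preg_replace() replacement string, "${0}" stands for the whole match, so the separate preg_match() is not needed. This is a sketch, assuming the added line itself contains no "$" or "\" followed by digits (otherwise a callback is the safer route). The limit of 1 restricts the change to the first text/plain part, and the sample string is shortened and illustrative.

```php
<?php
$msgBody = "------=_NextPart_000\r\n"
         . "Content-Type: text/plain;\r\n"
         . "\tcharset=\"iso-8859-1\"\r\n"
         . "Content-Transfer-Encoding: quoted-printable\r\n"
         . "\r\n"
         . "This is the original top of the message.\r\n";

$pattern = '/content-type:\s*text\/plain;.*?\r?\n\r?\n/is';

// '${0}' is the entire matched header block, up to and including the
// blank line; the new text is appended directly after it.
$msgBody = preg_replace($pattern, '${0}' . "My new line of text\r\n", $msgBody, 1);

echo $msgBody;
```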
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex'ing multi line ending in a double carriage return

Post by prometheuzz »

Yes, that cleared things up, thanks.
Have a look at this example:

Code: Select all

#!/usr/bin/php
<?php
$messages = array(
"------=_NextPart_000_0005_01C8B365.50073EA0
Content-Type: text/plain;
charset=\"iso-8859-1\"
Content-Transfer-Encoding: quoted-printable

This is the original top of the message as it is received before I add my line of text."
,
"------=_NextPart_000_0005_01C8B365.50073EA0
Content-Type: text/html;
charset=\"iso-8859-1\"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D\"text/html; =
charset=3Diso-8859-1\">
<META content=3D\"MSHTML 6.00.6000.16640\" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2>This is the original top of the message =
as it is=20
received before I add my line of text.</FONT></DIV></BODY></HTML>"
,
"------=_NextPart_000_0000_01C8B36B.C84F9370
Content-Type: text/plain;
charset=\"iso-8859-1\"
Content-Transfer-Encoding: 7bit

This is the original top of the message"
);
$regex = '{
    (
        ^[^\n]+\n+
        Content-Type:[^\n]+\n+
        charset=[^\n]+\n+
        Content-Transfer-Encoding:[^\n]+\n
        (?:
            \n*
            <!DOCTYPE\s+HTML[^\n]+\n+
            (?:[^\n]*\n)+
            <BODY\s+[^>]+>
        )?
    )
}ix';
 
foreach($messages as $message) {
    $adjusted = preg_replace($regex, "$1\nEXTRA-TEXT", $message);
    print "$adjusted\n==================================================\n";
}
 
/* output:
------=_NextPart_000_0005_01C8B365.50073EA0
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
 
EXTRA-TEXT
This is the original top of the message as it is received before I add my line of text.
==================================================
------=_NextPart_000_0005_01C8B365.50073EA0
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.6000.16640" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
EXTRA-TEXT
<DIV><FONT face=3DArial size=3D2>This is the original top of the message =
as it is=20
received before I add my line of text.</FONT></DIV></BODY></HTML>
==================================================
------=_NextPart_000_0000_01C8B36B.C84F9370
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
 
EXTRA-TEXT
This is the original top of the message
==================================================
*/
?>