Parsing MIME messages into generic structure

Posted: Tue Feb 17, 2009 4:58 am
by georgeoc
Hi all,

I'd appreciate some advice on a few questions I have about MIME messages. They have arisen for the rewrite of m2f, a 'message hub' which imports and disseminates internet messages between an unlimited number of applications and formats. Currently, the input/output channels include RSS, phpBB & email, but the list will expand in future.

For all received messages, the first step is to convert them to a proprietary format which we have defined for the application. This enables filters to be run on the messages, and provides a base for converting the message to the formats required by the export channels.

The proprietary format (let's call it M2f_Message) has only one 'body' field, but it may be expressed in any suitable language - unformatted text, HTML, Markdown, whatever.

When taking emails as the input, we are faced with the somewhat complex area of MIME. I am starting to define some logic rules which will govern how we select the parts of the MIME email which will become the 'body' field of an M2f_Message object. For example:
  • For a text/plain message, just use the email body
  • For a text/html message, just use the email body
  • For a multipart/alternative message with plain text followed by HTML, ignore the plain text and use the HTML as the message body (as by definition it is a better representation of the user's desired formatting)
  • For a multipart/mixed message consisting of HTML parts and inline images, save the images to the filesystem (giving them a new public URL), rewrite the HTML parts to reflect the new URLs and use this modified HTML as the proprietary message body
  • ...
  • ...
(N.B. - M2f_Message also has an "attachments" field, so any attached files which do not appear inline will be added to the message as attachments)
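The URL-rewriting step for inline images could look roughly like the sketch below. It assumes the HTML refers to its images through `cid:` URLs that match each image part's Content-ID, which is the usual convention but is not stated above. The function name `rewriteInlineImageUrls()` and the `$cidToUrl` map are hypothetical, and saving the files themselves is left out.

```php
<?php
// Hypothetical helper: swaps cid: image references in HTML for public URLs.
// $cidToUrl maps a Content-ID (without angle brackets) to the URL the image
// was saved under; building that map is assumed to happen elsewhere.
function rewriteInlineImageUrls(string $html, array $cidToUrl): string
{
    return preg_replace_callback(
        '/\bsrc\s*=\s*(["\'])cid:([^"\']+)\1/i',
        function (array $m) use ($cidToUrl): string {
            $cid = $m[2];
            if (!isset($cidToUrl[$cid])) {
                return $m[0]; // unknown image: leave the reference untouched
            }
            // Re-use the original quote character around the new URL.
            return 'src=' . $m[1] . htmlspecialchars($cidToUrl[$cid], ENT_QUOTES) . $m[1];
        },
        $html
    );
}

$html = '<p>Hello</p><img src="cid:logo@example.org" alt="logo">';
$map  = ['logo@example.org' => 'http://example.org/m2f/files/logo.png'];
echo rewriteInlineImageUrls($html, $map), "\n";
```

References with no entry in the map are left alone, so a missing image does not corrupt the rest of the HTML.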

I'm happy with the list so far, but as far as I can see, the MIME format allows for unlimited combinations and nesting of these possibilities, and others.

This has got me wondering whether there are existing solutions for obtaining the "intended message content" from a MIME email. What I really want is for the body field of the M2f_Message to be the same as the message an email client would display visually to a user - i.e. plain text, HTML, HTML with inline images, text with additional attached files, etc. Email clients must need to define a similar set of rules for which parts to use, and which to ignore.

I'm using Zend_Mail to parse the emails into MIME parts, so I have enough information about the Content-Type, etc., of each part. I wonder if there's an existing solution in PHP for deciding which parts to use and which to ignore, recognising whether the parts represent inline content or an attachment, and recognising the transfer encoding of each part. I know I can do it myself, but I don't want to reinvent the wheel!

Many thanks!

Re: Parsing MIME messages into generic structure

Posted: Tue Feb 17, 2009 9:49 am
by georgeoc
I've worked my way through a few megabytes of old mbox files, and I have put together a (non-exhaustive) list of the MIME structures I encountered. If I can't find an existing solution (one in PHP, or at least one I can port to PHP), then I'll have to write my own based on these (and further) use cases:

Code: Select all

Use Cases (* is what will become M2f_Message::body content, + is what will become an M2f_Message::attachment)
 
    * text/plain
    
    * text/html
    
    multipart/alternative
        * text/plain
    
    multipart/alternative
        text/plain
        * text/html
 
    multipart/mixed
        * text/plain
 
    multipart/mixed
        * text/html
 
    multipart/mixed
        + application/pdf [attachment]
 
    multipart/mixed
        * text/plain
        + application/pdf [attachment]
 
    multipart/mixed
        + application/msword [attachment]
        + application/pdf [attachment]
 
    multipart/mixed
        multipart/alternative
            text/plain
            * text/html
        + application/pdf
 
    multipart/mixed
        multipart/alternative
            text/plain
            * text/html
        message/rfc822 (inline)
            multipart/alternative
                text/plain
                * text/html [add to message, as it's marked as 'inline']
                
    multipart/mixed
        multipart/alternative
            text/plain
            * text/html
        + message/rfc822 (attachment) [add as attachment - if no filename is given, use "Subject.eml"]
            multipart/alternative
                text/plain
                text/html
 
 
I'm well aware this doesn't cover all possibilities by any stretch of the imagination. This is why I want an existing solution! However, if I'm doing it on my own, can you suggest any more email structures which I'm likely to encounter when looking at the average mailing list?
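One way to avoid enumerating every structure is to walk the tree recursively and apply a rule per container type. The sketch below does this over a hypothetical array model of a parsed message (it is not Zend_Mail's API). It assumes alternatives are ordered from plainest to richest, that unrecognised multipart subtypes can be handled like multipart/mixed, and that only text/plain and text/html are usable as body content.

```php
<?php
// Hypothetical model (not Zend_Mail's API): each part is an array with a
// 'type', an optional 'disposition', and either child 'parts', an enclosed
// 'message', or a 'content' string.
function selectParts(array $part, array &$body, array &$attachments): bool
{
    $type = strtolower($part['type'] ?? 'text/plain');
    $disposition = $part['disposition'] ?? null;

    if ($disposition === 'attachment') {
        $attachments[] = $part;
        return false;
    }

    if ($type === 'multipart/alternative') {
        // Alternatives run from plainest to richest, so try the last first.
        foreach (array_reverse($part['parts'] ?? []) as $child) {
            $b = [];
            $a = [];
            if (selectParts($child, $b, $a)) {
                // Keep only what the chosen alternative produced.
                $body = array_merge($body, $b);
                $attachments = array_merge($attachments, $a);
                return true;
            }
        }
        return false;
    }

    if (strpos($type, 'multipart/') === 0) {
        // multipart/mixed, plus unrecognised subtypes handled the same way.
        $found = false;
        foreach ($part['parts'] ?? [] as $child) {
            if (selectParts($child, $body, $attachments)) {
                $found = true;
            }
        }
        return $found;
    }

    if ($type === 'message/rfc822' && $disposition === 'inline') {
        // An inline forwarded message contributes its own body parts.
        return selectParts($part['message'], $body, $attachments);
    }

    if ($type === 'text/plain' || $type === 'text/html') {
        $body[] = $part;
        return true;
    }

    $attachments[] = $part; // everything else becomes an attachment
    return false;
}

$email = [
    'type'  => 'multipart/mixed',
    'parts' => [
        ['type' => 'multipart/alternative', 'parts' => [
            ['type' => 'text/plain', 'content' => 'Hello'],
            ['type' => 'text/html',  'content' => '<p>Hello</p>'],
        ]],
        ['type' => 'application/pdf', 'disposition' => 'attachment', 'content' => '...'],
    ],
];

$body = [];
$attachments = [];
selectParts($email, $body, $attachments);
echo $body[0]['type'], ' / ', $attachments[0]['type'], "\n";
```

Collecting each alternative into temporary arrays means a rejected alternative leaves nothing behind, so an unsupported rich format does not end up as a stray attachment.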

Re: Parsing MIME messages into generic structure

Posted: Wed Feb 18, 2009 8:45 pm
by georgeoc
Updated the above post with further use cases. I've started implementing the processing of these MIME structures, but I'm still nervous that it's trial and error unless I can build a comprehensive list.

Is there no one here who can help? Would this thread be better suited to Theory and Design?

Re: Parsing MIME messages into generic structure

Posted: Wed Feb 18, 2009 11:23 pm
by Chris Corbyn
Read RFC 2822 and RFCs 2045-2049. Then start building classes that follow a composite approach and can nest inside one another.

I'm planning on doing this with Swift Mailer. Something along the lines of:

Code: Select all

$messageParser = new Swift_MessageParser();
$message = $messageParser->parse($messageSource);
 
echo $message->getSubject();
 
foreach ($message->getChildren() as $child) {
  printf("Child of content type %s\n", $child->getContentType());
}

Re: Parsing MIME messages into generic structure

Posted: Thu Feb 19, 2009 6:39 am
by georgeoc
Thanks Chris. I've been reading the RFCs for a while already.

Here's what I've done so far:

Code: Select all

    private function _setBody($message, $email)
    {
        if ($email->isMultipart())
        {
            $this->_processMultipart($message, $email);
        }
        else
        {
            $this->_processPart($message, $email);
        }
    }
    
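    private function _processMultipart($message, $multipart)
    {
        $numParts = $multipart->countParts();
        if (!$numParts) return false;
        
        // Report back whether any body content was found, so that a nested
        // multipart can satisfy an enclosing multipart/alternative.
        $found = false;
        switch ($this->_getHeaderField($multipart, 'contentType'))
        {
            case 'multipart/alternative':
                // Alternatives are ordered from least to most preferred, so
                // work backwards and stop at the first part we can use.
                do
                {
                    $part = $multipart->getPart($numParts--);
                    $found = $this->_processPart($message, $part);
                }
                while (!$found AND $numParts);
                break;
            
            case 'multipart/mixed':
                for ($i = 1; $i <= $numParts; $i++)
                {
                    $part = $multipart->getPart($i);
                    if ($this->_processPart($message, $part)) $found = true;
                }
                break;
            
            default: // @todo: throw Exception?
                break;
        }
        return $found;
    }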
    
    private function _processPart($message, $part)
    {
        if ($part->isMultipart())
        {
            return $this->_processMultipart($message, $part);
        }
 
        switch ($this->_getHeaderField($part, 'contentType'))
        {
            case 'text/html':
                $message->setBody($part->getContent(), M2f_Message::FORMAT_HTML);
                return true;
            
            case 'text/plain':
                $message->setBody($part->getContent());
                return true;
                
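            case '':
                // No Content-Type header: RFC 2045 says to assume text/plain
                $message->setBody($part->getContent());
                return true;
 
            default:
                // @todo: the disposition is fetched but not yet acted upon
                $this->_getHeaderField($part, 'contentDisposition');
                
                $this->_addAttachment($message, $part);
                return false;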
        }
    }
    
    private function _addAttachment($message, $part)
    {
        $attachment = $message->newAttachment();
        $attachment->setData($part->getContent());
    }
 
I think as a general rule, I want to add anything marked as "Content-Disposition: inline" to the message body, plus any text/plain or text/html parts not marked as "Content-Disposition: attachment". Then most other parts will become attachments, except when they are superseded by a more complex alternative.

Does that sound right?
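Written out as code, that rule of thumb might look like the sketch below. `classifyPart()` and `mainValue()` are hypothetical helpers, and the "superseded by a more complex alternative" exception is deliberately not covered here, since it depends on the enclosing multipart rather than on the part itself.

```php
<?php
// Hypothetical helper: returns the header value without its parameters,
// e.g. "attachment; filename=doc.pdf" becomes "attachment".
function mainValue(?string $header): string
{
    if ($header === null) {
        return '';
    }
    return strtolower(trim(explode(';', $header)[0]));
}

// Hypothetical helper expressing the rule of thumb: returns 'body' or
// 'attachment' for a single, non-multipart part.
function classifyPart(?string $contentType, ?string $contentDisposition): string
{
    $type = mainValue($contentType);
    $disposition = mainValue($contentDisposition);

    if ($disposition === 'attachment') {
        return 'attachment'; // an explicit attachment always wins
    }
    if ($disposition === 'inline') {
        return 'body'; // anything marked inline goes into the body
    }
    // No disposition given: only text parts (or parts with no type at all,
    // which default to text/plain) are treated as body content.
    if ($type === '' || $type === 'text/plain' || $type === 'text/html') {
        return 'body';
    }
    return 'attachment';
}

echo classifyPart('text/plain; charset=utf-8', null), "\n";
echo classifyPart('application/pdf; name=doc.pdf', 'attachment; filename=doc.pdf'), "\n";
```

Checking the disposition before the content type is the design choice that matters: a text/plain part sent as an attachment (a log file, say) stays an attachment.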

Re: Parsing MIME messages into generic structure

Posted: Thu Feb 19, 2009 4:04 pm
by Chris Corbyn
georgeoc wrote: I think as a general rule, I want to add anything marked as "Content-Disposition: inline" to the message body, plus any text/plain or text/html parts not marked as "Content-Disposition: attachment". Then most other parts will become attachments, except when they are superseded by a more complex alternative.

Does that sound right?
Sort of. multipart/mixed is different though. If you have:

Code: Select all

Content-Type: multipart/mixed; boundary=mix
 
--mix
Content-Type: text/plain
 
One
--mix
Content-Type: text/plain
 
Two
--mix
Content-Type: text/plain
 
Three
--mix
Content-Type: application/pdf; name=doc.pdf
Content-Disposition: attachment; filename=doc.pdf
Content-Transfer-Encoding: base64
 
<base64 data here>
--mix--
Then all of the text/plain parts will be displayed in the order in which they appear.

There is also a "partial" content type (message/partial), described in RFC 2046 if I remember correctly.
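To make the ordering point concrete, here is a minimal sketch that splits one multipart level on its boundary and joins the text/plain parts in the order they appear. `splitMultipart()` and `joinTextParts()` are hypothetical names; the code assumes well-formed input and ignores nesting, folded headers and transfer encodings, so it is an illustration rather than a parser.

```php
<?php
// Minimal sketch of splitting one multipart level on its boundary.
// Assumes well-formed input; nesting and folded headers are not handled.
function splitMultipart(string $body, string $boundary): array
{
    $body = str_replace("\r\n", "\n", $body); // normalise line endings
    $chunks = explode("\n--" . $boundary, "\n" . $body);
    array_shift($chunks); // discard the preamble before the first delimiter

    $parts = [];
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // "--boundary--" closes the multipart
        }
        $eol = strpos($chunk, "\n");
        if ($eol === false) {
            continue; // delimiter with nothing after it
        }
        // Drop the rest of the delimiter line, then split headers from body
        // at the first blank line (the leading "\n" covers header-less parts).
        $chunk = (string) substr($chunk, $eol + 1);
        [$rawHeaders, $content] = array_pad(explode("\n\n", "\n" . $chunk, 2), 2, '');

        $headers = [];
        foreach (explode("\n", $rawHeaders) as $line) {
            if (strpos($line, ':') === false) {
                continue;
            }
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
        $parts[] = ['headers' => $headers, 'content' => $content];
    }
    return $parts;
}

// Joins every text/plain part that is not an attachment, in original order.
function joinTextParts(array $parts): string
{
    $texts = [];
    foreach ($parts as $part) {
        $type = $part['headers']['content-type'] ?? 'text/plain';
        $disposition = $part['headers']['content-disposition'] ?? '';
        if (stripos($type, 'text/plain') === 0 && stripos($disposition, 'attachment') !== 0) {
            $texts[] = $part['content'];
        }
    }
    return implode("\n", $texts);
}

$raw = implode("\n", [
    '--mix', 'Content-Type: text/plain', '', 'One',
    '--mix', 'Content-Type: text/plain', '', 'Two',
    '--mix', 'Content-Type: text/plain', '', 'Three',
    '--mix', 'Content-Type: application/pdf; name=doc.pdf',
    'Content-Disposition: attachment; filename=doc.pdf',
    'Content-Transfer-Encoding: base64', '', 'JVBERi0=',
    '--mix--', '',
]);

echo joinTextParts(splitMultipart($raw, 'mix')), "\n";
```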

Re: Parsing MIME messages into generic structure

Posted: Fri Feb 20, 2009 1:01 pm
by georgeoc
OK, from further RFC research I'm fairly happy now with the message structure.

However, I have another question about decoding each part, with regard to the Content-Transfer-Encoding header. According to the RFC, I could expect to see the following encodings:
  • 7bit
  • quoted-printable
  • base64
  • 8bit
  • binary
If I want to save the attachments as files in a directory, I'll need to decode them. This really isn't my area of expertise, and I'm not sure exactly how to do that.

Quoted-printable
Zend_Mime_Decode offers a decodeQuotedPrintable() static method, as follows:

Code: Select all

       return iconv_mime_decode($string, ICONV_MIME_DECODE_CONTINUE_ON_ERROR);
Is this any different to using:

Code: Select all

return quoted_printable_decode($string);
or this, from PEAR Mime_Decode:

Code: Select all

    function _quotedPrintableDecode($input)
    {
        // Remove soft line breaks
        $input = preg_replace("/=\r?\n/", '', $input);
 
        // Replace encoded characters
        $input = preg_replace('/=([a-f0-9]{2})/ie', "chr(hexdec('\\1'))", $input);
 
        return $input;
    }
 
???
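A quick way to compare the candidates is to run them on the same sample. The sketch below ports the PEAR-style routine to preg_replace_callback() (the /e modifier it relies on was removed in PHP 7) under the hypothetical name `quotedPrintableDecodeManual()`, and checks it against the built-in quoted_printable_decode(). iconv_mime_decode() is left out of the comparison because the PHP manual describes it as decoding MIME header fields (the `=?charset?Q?...?=` encoded-word form), so it is probably not a like-for-like choice for body content.

```php
<?php
// Port of the PEAR-style decoder; the name is hypothetical.
function quotedPrintableDecodeManual(string $input): string
{
    // Remove soft line breaks ("=" at the very end of a line)
    $input = preg_replace("/=\r?\n/", '', $input);

    // Replace each =XX escape with the byte it stands for
    return preg_replace_callback(
        '/=([a-f0-9]{2})/i',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $input
    );
}

$sample = "Caf=C3=A9 au lait, a long line that was wrapped=\r\n here with a soft break";

// Compare the manual port with PHP's built-in decoder on the same input.
var_dump(quotedPrintableDecodeManual($sample) === quoted_printable_decode($sample));
```

Agreement on one sample is not proof that the two behave identically on malformed input, so the built-in function is the safer default.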

Base64
I assume I'm OK using base64_decode() ???

7bit, 8bit & binary
I don't know what to do with these. Do they need decoding? Or are they already in the correct format for saving as a file?


Sorry if these are stupid questions, but this is a bit of a mystery to me!

Re: Parsing MIME messages into generic structure

Posted: Fri Feb 20, 2009 10:55 pm
by Chris Corbyn
7bit, 8bit and binary should all be fine as they are: none of them transforms the data, so there is nothing to decode (in practice you'll never see binary... most MX servers won't transport binary data without messing it up).

quoted_printable_decode() should work fine and base64_decode() too.
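Put together, a decoding step keyed on the Content-Transfer-Encoding header might look like the sketch below. `decodePartContent()` is a hypothetical name; the empty-string case assumes the header was absent, for which RFC 2045 specifies 7bit as the default.

```php
<?php
// Hypothetical helper: returns the raw bytes of a part, given its content
// and the value of its Content-Transfer-Encoding header.
function decodePartContent(string $content, string $transferEncoding): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            // Strip line breaks and other whitespace before decoding.
            $decoded = base64_decode(preg_replace('/\s+/', '', $content), true);
            if ($decoded === false) {
                throw new InvalidArgumentException('Invalid base64 data');
            }
            return $decoded;

        case 'quoted-printable':
            return quoted_printable_decode($content);

        case '':       // header absent: RFC 2045 default is 7bit
        case '7bit':
        case '8bit':
        case 'binary':
            return $content; // identity encodings: nothing to decode

        default:
            throw new InvalidArgumentException(
                'Unsupported transfer encoding: ' . $transferEncoding
            );
    }
}

echo decodePartContent("SGVsbG8s\r\nIHdvcmxkIQ==", 'base64'), "\n";
echo decodePartContent('Hello, world!', '7bit'), "\n";
```

The result can then be written to disk with file_put_contents(); for text parts, converting from the part's charset is a separate step.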

Re: Parsing MIME messages into generic structure

Posted: Sat Feb 21, 2009 2:50 am
by georgeoc
Great - thanks so much Chris!

It's certainly easier to decode QP than encode it! (as I see from your Swift 4 optimization discussions).

Re: Parsing MIME messages into generic structure

Posted: Sat Feb 21, 2009 3:01 am
by Chris Corbyn
georgeoc wrote: Great - thanks so much Chris!

It's certainly easier to decode QP than encode it! (as I see from your Swift 4 optimization discussions).
Yeah it's because when you encode you have to pay really close attention to the character encodings. It'd have been useful if PHP had quoted_printable_encode() but I'm pretty sure the reason it doesn't is because it's not unicode-aware.