Parsing MIME messages into generic structure
Posted: Tue Feb 17, 2009 4:58 am
Hi all,
I'd appreciate some advice on a few questions about MIME messages. They arose during the rewrite of m2f, a 'message hub' that imports and disseminates internet messages between any number of applications and formats. Currently, the input/output channels include RSS, phpBB and email, but the list will expand in future.
For all received messages, the first step is to convert them to a proprietary format which we have defined for the application. This enables filters to be run on the messages, and provides a base for converting the message to the formats required by the export channels.
The proprietary format (let's call it M2f_Message) has only one 'body' field, but it may be expressed in any suitable language - unformatted text, HTML, Markdown, whatever.
When taking emails as the input, we are faced with the somewhat complex area of MIME. I am starting to define some logic rules which will govern how we select the parts of the MIME email that will become the 'body' field of an M2f_Message object. For example:
- For a text/plain message, just use the email body
- For a text/html message, just use the email body
- For a multipart/alternative message with plain text followed by HTML, ignore the plain text and use the HTML as the message body (as by definition it is a better representation of the user's desired formatting)
- For a multipart/mixed message consisting of HTML parts and inline images, save the images to the filesystem (giving them a new public URL), rewrite the HTML parts to reflect the new URLs and use this modified HTML as the proprietary message body
- ...
- ...
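To make the first three rules concrete, here is a rough sketch of the selection logic. It is written in Python with the standard `email` package purely as a stand-in for illustration; it is not the PHP/Zend_Mail code m2f uses, and `select_body` is a hypothetical helper name. It assumes that the last usable alternative in a multipart/alternative is the preferred one, and that for any other multipart the first non-attachment part that yields text is the body. Rewriting inline image URLs is not covered.

```python
from email import message_from_string
from email.message import Message


def select_body(msg: Message):
    """Return (content_type, text) for the part chosen as the body, or None.

    Hypothetical helper: a sketch of the rules listed above, not a library API.
    """
    ctype = msg.get_content_type()
    if ctype in ("text/plain", "text/html"):
        # decode=True undoes base64 / quoted-printable transfer encoding.
        data = msg.get_payload(decode=True) or b""
        charset = msg.get_content_charset() or "utf-8"
        try:
            return ctype, data.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            return ctype, data.decode("utf-8", errors="replace")
    if not msg.is_multipart():
        return None  # e.g. a bare image or binary part
    parts = msg.get_payload()
    if ctype == "multipart/alternative":
        # Alternatives run from plainest to richest, so try the last one first.
        parts = list(reversed(parts))
    for part in parts:
        if part.get_content_disposition() == "attachment":
            continue
        chosen = select_body(part)  # recurse to handle arbitrary nesting
        if chosen is not None:
            return chosen
    return None


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello\n"
    "--b\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Hello</p>\n"
    "--b--\n"
)
chosen = select_body(message_from_string(raw))
```

Because the function recurses into every multipart container, nesting such as a multipart/alternative inside a multipart/mixed is handled by the same few rules rather than by one rule per combination.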
I'm happy with the list so far, but as far as I can see, the MIME format allows for unlimited combinations and nesting of these possibilities, and others.
This has got me wondering whether there are existing solutions for obtaining the "intended message content" from a MIME email. What I really want is for the body field of the M2f_Message to be the same as the message an email client would display visually to a user - i.e. plain text, HTML, HTML with inline images, text with additional attached files, etc. Email clients must define a similar set of rules for which parts to use and which to ignore.
I'm using Zend_Mail to parse the emails into MIME parts, so I have enough information about the Content-Type, etc., of each part. I wonder if there's an existing solution in PHP for deciding which parts to use and which to ignore, recognising whether the parts represent inline content or an attachment, and recognising the transfer encoding of each part. I know I can do it myself, but I don't want to reinvent the wheel!
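For illustration, this is roughly the kind of per-part classification I have in mind, again sketched in Python with the standard `email` package rather than Zend_Mail. The role names and the `classify_parts` helper are hypothetical, and the fallback rules (an image with a Content-ID counts as inline, anything unlabelled and non-text counts as an attachment) are my own assumptions, not something the MIME specifications mandate.

```python
from email import message_from_string


def classify_parts(msg):
    """Yield (role, content_type, decoded_bytes) for every leaf part.

    Hypothetical helper; the role names are illustrative only.
    """
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        disposition = part.get_content_disposition()  # 'inline', 'attachment' or None
        maintype = part.get_content_maintype()
        if disposition == "attachment":
            role = "attachment"
        elif maintype == "image" and part.get("Content-ID"):
            role = "inline-image"  # assumption: referenced from HTML by its Content-ID
        elif maintype == "text":
            role = "body-candidate"
        else:
            role = "attachment"  # assumption: unlabelled non-text parts are attachments
        # decode=True reverses the Content-Transfer-Encoding (base64, quoted-printable).
        yield role, part.get_content_type(), part.get_payload(decode=True)


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attached.\n"
    "--b\n"
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="a.bin"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAEC\n"
    "--b--\n"
)
parts = list(classify_parts(message_from_string(raw)))
```

Here `parts` holds one tuple per leaf part, with the payload already decoded, so the body-selection rules can work on roles and bytes instead of raw headers.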
Many thanks!