Parsing Emails

Posted: Mon Dec 18, 2006 5:16 pm
by aliasxneo
Here is the current code I'm using for parsing emails:

<?php

class mail
{
	var $content;
	var $header;
	var $from;
	var $to;
	var $subject;
	var $message;
	
	var $sql_host = "localhost";
	var $sql_user = "";
	var $sql_pass = "";
	var $sql_db = "";
	
	function read()
	{
		$handle = fopen("php://stdin", "r");
		
		while (!feof($handle)) {
			$this->content .= fread($handle, 1024);
		}
		fclose($handle);
		
		$this->parse();
	}
	
	function parse()
	{
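		if (empty($this->content))
		{
			$this->new_error("No content received");
			return;
		}
		
		// explode() replaces the deprecated split(); no regex is needed for a plain "\n"
		$lines = explode("\n", $this->content);
		// Becomes TRUE once the blank line that ends the headers has been seen
		$gmessage = FALSE;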
		
		foreach ($lines as $line)
		{
			if ($gmessage)
			{
				$this->message .= $line . "\n";
				continue;
			}
			
			if (preg_match("/^Subject: (.*)/", $line, $matches))
			{
				$this->subject = $matches[1];
			}
			
			if (preg_match("/^From: (.*)/", $line, $matches)) 
			{
				$this->from = $matches[1];
			}
			
			if (preg_match("/^To: (.*)/", $line, $matches)) 
			{
				$this->to = $matches[1];
			}
			
			if (trim($line) == "")
			{
				$gmessage = TRUE;
			} else {
				$this->header .= $line . "\n";
			}
		}
		
		if (mysql_connect($this->sql_host, $this->sql_user, $this->sql_pass))
		{
			if (mysql_select_db($this->sql_db))
			{
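				// Escape every field taken from the email, not just the message body
				$from = mysql_escape_string($this->from);
				$to = mysql_escape_string($this->to);
				$subject = mysql_escape_string($this->subject);
				$message = mysql_escape_string($this->message);
				$sql = "INSERT INTO `messages` (`id`, `from`, `to`, `subject`, `message`, `read`) VALUES ";
				$sql .= "('', '" . $from . "', '" . $to . "', '" . $subject . "', '" . $message . "', 0)";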
				mysql_query($sql);
			}
		} else {
			$this->new_error("Error connecting to database");
		}
	}
	
	function new_error($message)
	{
		$fh = fopen("/home/coolmail/public_html/error.txt","w+");
		fwrite($fh,$message);
		fclose($fh);
	}
	
}

$mail = new mail();
$mail->read();

?>
It works fine and everything gets inserted, but I have one problem with the message part. Here is what the data inserted into the database looks like for the message part:
------=_Part_24188_30459578.1166483533205
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

I am just testing this new mal script!

It's cool eh?

------=_Part_24188_30459578.1166483533205
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

I am just testing this new mal script!<br><br>It's cool eh?<br>

------=_Part_24188_30459578.1166483533205--
It shouldn't be like that, and the tutorial that I read said that the message starts after a blank line. Do all emails contain something like this? I sent the email using Gmail; is this specific to Gmail? Also, does anyone have any good ideas on how I should parse this to get the HTML message out? Thanks.

Cheers,
- Josh

Posted: Mon Dec 18, 2006 5:48 pm
by feyd
That's how emails look in reality. If you're not convinced, you can see the same in Gmail: while reading an email, there is a down arrow to the right of the reply button. Clicking it opens a menu that contains "show original", which opens the raw message in a new window/tab.

Posted: Mon Dec 18, 2006 5:51 pm
by aliasxneo
So all emails contain a plain text version and an html version? I just want to be sure before I incorporate it into my script since I will be receiving emails from a variety of different systems.

Posted: Mon Dec 18, 2006 5:54 pm
by feyd
No, not every email contains them. Email has been around far longer than web pages and HTML. ;)

Posted: Mon Dec 18, 2006 6:02 pm
by aliasxneo
So my question is how can I ensure that the message will be displayed properly without things like the content-type tags I showed in my first post?

Posted: Mon Dec 18, 2006 11:09 pm
by feyd
If it's handed (with the full headers) to an email client, it should work, as the client will understand which parts to use. If this is for web-based email, the part you extract automatically would depend on the person's preference for text versus HTML emails. The code required to extract the relevant section(s) can be complicated, but on a basic level it is quite simple. In the headers a "boundary" will be defined. Each new part starts with a line made of "--" followed by that boundary, and the last part is closed by the same line with an extra "--" on the end.
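
As a rough sketch of that basic level, something like the following could pull the parts out. The function name extract_parts() is made up for illustration and is not part of the class posted above. It assumes a single level of multipart (no nested parts), does not decode base64 or quoted-printable bodies, and keeps only one part per content type.

```php
<?php

// Hypothetical helper: returns an array mapping each part's content
// type (e.g. "text/html") to its body. A sketch, not a full MIME parser.
function extract_parts($raw)
{
	// Normalise line endings, then split headers from body at the first blank line.
	$raw = str_replace("\r\n", "\n", $raw);
	$pieces = explode("\n\n", $raw, 2);
	$header = $pieces[0];
	$body = isset($pieces[1]) ? $pieces[1] : '';

	// Look for boundary="..." in the headers (the quotes are optional).
	if (!preg_match('/boundary="?([^";\s]+)"?/i', $header, $m)) {
		// No boundary defined: assume a single-part plain text message.
		return array('text/plain' => trim($body));
	}

	$parts = array();

	// Every part is preceded by "--" followed by the boundary string.
	foreach (explode('--' . $m[1], $body) as $chunk) {
		$chunk = ltrim($chunk, "\n");

		// Skip an empty preamble and the "--" left over from the closing boundary.
		if ($chunk === '' || strpos($chunk, '--') === 0) {
			continue;
		}

		// Each part has its own small header block, then a blank line, then the body.
		$sub = explode("\n\n", $chunk, 2);
		if (count($sub) < 2) {
			continue;
		}

		if (preg_match('/^Content-Type:\s*([^;\s]+)/mi', $sub[0], $t)) {
			$parts[strtolower($t[1])] = trim($sub[1]);
		}
	}

	return $parts;
}
```

In the posted class this could be called as extract_parts($this->content) once reading is finished. From the result you would pick 'text/html' if it is set and fall back to 'text/plain' otherwise, or the other way round, depending on the user's preference.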