[Solved] Parsing incoming mail -- charsets

Posted: Tue Apr 04, 2006 6:46 pm
by Bommee
I am trying to use two public license classes to receive and parse incoming mail through a POP3 server. It works great for us-ascii messages; however, as soon as any special characters are present -- that is, as soon as the charset changes -- I get broken text.

The class that I am using for retrieving and reading the messages is pop3class from http://www.phpclasses.org/browse/package/2.html. It does a handy job of getting the content and headers etc.

I am reading the content type out of the header, and then, as pop3class reads the body of the message, I am using the charset identified in the header to attempt to convert from ASCII using ConvertCharset from http://www.hotscripts.com/Detailed/37274.html.

In short it isn't working.

I am a Java programmer who is in the process of cross-training in PHP for its flexibility. In Java, when declaring a string, you can specify the character set. This doesn't appear to be possible in PHP4. Are strings stored as raw bytes and then by default assumed to be ASCII? If you do a character conversion, for example:

Code: Select all

$my_utf_string = ConvertCharset::Convert($my_raw_variable, "us-ascii", "utf8");
does the variable now contain a UTF8 string? Does the server know to treat it that way?

Does anyone know if fgets() returns us-ascii?

Thanks in advance.

Posted: Wed Apr 05, 2006 5:27 am
by Weirdan
Are strings stored as raw bytes?
Exactly.
and then by default assumed to be ASCII?
Not really. They are assumed to be raw bytes. Some functions, given these bytes, assume that they represent ASCII or one of its single-byte extensions (e.g. strlen); other functions accept an additional argument indicating the encoding of the string argument(s) (e.g. mb_strlen).
Does anyone know if fgets() returns us-ascii?
fgets() is documented as 'binary-safe' (since PHP 4.3). It reads at most length - 1 bytes from a file pointer, up to and including the byte with the value 0x0A (LF), or until EOF (whichever comes first).
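
To make the "raw bytes" point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the php://memory stream wrapper is available (PHP 5.1 or later, so not the PHP 4.3 mentioned in this thread); the variable names are only for illustration.

```php
<?php
// "é" encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
$utf8 = "\xC3\xA9";
// The same character in ISO-8859-1 is the single byte 0xE9.
$latin1 = "\xE9";

// strlen() counts bytes and knows nothing about encodings.
$byteCount = strlen($utf8);              // 2

// mb_strlen() counts characters once it is told the encoding.
// Assumption: the mbstring extension is loaded.
$charCount = mb_strlen($utf8, 'UTF-8');  // 1

// fgets() hands back whatever bytes are in the stream, untouched.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "caf" . $latin1 . "\nnext line\n");
rewind($fp);
$line = fgets($fp);                      // "caf\xE9\n", still ISO-8859-1 bytes
fclose($fp);

echo $byteCount, ' ', $charCount, ' ', bin2hex($line), "\n";
```

The string itself carries no charset label; only the caller knows which encoding the bytes are in, which is why the charset from the mail header has to be passed along explicitly.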

Posted: Wed Apr 05, 2006 11:17 am
by Bommee
Thanks for your reply

Here is my algorithm for handling messages on the server:
  • connect to the server and retrieve the list of messages
  • retrieve each message individually, parsing the header and body
  • determine the content-type and charset from the header
  • modify the displaying HTML page to use the same charset
  • display the body content
It is really simple, and yet it isn't working. I have also tried htmlentities(), both passing the charset from the header and passing no charset at all, with no success either.
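
The "determine the charset" and "use the same charset on the page" steps could look roughly like the sketch below. extract_charset() is a hypothetical helper, not part of pop3class, and the regular expression only covers the common charset=... and charset="..." forms.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value, falling back to us-ascii (the MIME default for text/plain).
function extract_charset($contentType)
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return 'us-ascii';
}

$charset = extract_charset('text/plain; charset="ISO-8859-1"');

// Tell the browser to interpret the page bytes in the same charset.
// The guard avoids a warning if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . $charset);
}
echo $charset, "\n";
```

This only labels the bytes correctly; it does not undo any transfer encoding that was applied to the body before it was sent.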

(As a side note, the htmlentities() solution doesn't work for the application's purpose; I was only using it to test whether the string was parsable by a known and tested function. The application is intended to process a catch-all account, forwarding messages appropriately based on criteria. The goal is to rewrite the header and pass the message forward without corrupting the content, either by wrapping it in HTML or with broken characters, along with logging, etc.)

There has to be something that I am missing here. As you say, as of PHP 4.3 fgets is binary-safe (I am using PHP 4.3.10). But even if I echo the string at the moment it is read to an HTML page with the specified charset, or try using htmlentities on the string, I get broken characters.

Posted: Wed Apr 05, 2006 12:33 pm
by Bommee
Um...never mind. Thanks for the help.

It turns out that it was my lack of understanding of how 8-bit characters are inserted into email messages. I was expecting to find the content as a series of bytes (8 bits, not 7), and for the most part it is. In the case of extended characters, however, the character is escaped with the '=' sign followed by its hex value (the quoted-printable transfer encoding). Thus the character 'é' is represented as '=E9'. preg_match and str_replace should have me on my way.
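
For anyone finding this later: besides preg_match and str_replace, PHP has a built-in quoted_printable_decode() for this. A minimal sketch, assuming the body is quoted-printable, the header charset is ISO-8859-1, and the iconv extension is available for the optional conversion step:

```php
<?php
// Body text as it arrives when Content-Transfer-Encoding is quoted-printable.
$encoded = 'caf=E9';

// Built-in decoder: turns each =XX escape back into the raw byte.
$raw = quoted_printable_decode($encoded);        // "caf\xE9"

// Optional: convert from the charset named in the header to UTF-8.
// Assumptions: the iconv extension is available and the header said ISO-8859-1.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);      // "caf\xC3\xA9"

echo bin2hex($raw), ' ', bin2hex($utf8), "\n";
```

For the forwarding case, decoding is probably not needed at all: if the body and its Content-Transfer-Encoding header are passed along unchanged, the receiving mail client does the decoding.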

Thanks again.