A "StripHTML()" method to produce a plain-text version?

HHahn
Forum Commoner
Posts: 43
Joined: Mon Mar 02, 2009 9:16 am
Location: Veldhoven, Netherlands

Post by HHahn »

I made a function StripHTML(), which receives the HTML message block as a parameter and strips off all HTML, replacing certain tags with e.g. tab characters to simulate a "table".
The function is probably still fairly incomplete, but it does work.
In 3.3.3 I had it in an extension class to the Swift class. In 4.0.0 I made it a separate function, but it would be better if it were a method within a class.

Might it be a good idea to have something like that as a standard facility in SwiftMailer?

Usage:

Code:

$Msg->setBody($Content, "text/html");
$Msg->addPart(StripHtml ($Content), "text/plain"); 

Here is the current version of the code. It is still a bit makeshift, but it works:

Code:

//------ StripHtml(): ------
function StripHtml ($Text)
{
  $T = str_replace  ("\r\n",                      "\n",    $Text);    // "\r\n" to "\n" (plain string search, so no regex delimiters)
  $T = preg_replace ("/[\x20\x9]*<td[^>]*>\n/"  , "\t",    $T);       // "<td>\n" to "\t"
  $T = preg_replace ("/[\x20\x9]*<td[^>]*>/",     "\t",    $T);       // "<td>" to "\t"
  $T = preg_replace ("/[\x20\x9]*<\/tr[^>]*>\n/", "\n",    $T);       // "</tr>\n" to "\n"
  $T = preg_replace ("/[\x20\x9]*<\/tr[^>]*>/",   "\n",    $T);       // "</tr>" to "\n"
  $T = preg_replace ("/<\/t[^>]*>\n/",            "<xxx>", $T);       // mark "</td>\n", "</table>\n" etc.
  $T = preg_replace ("/<\/t[^>]*>/",              "<xxx>", $T);       // mark "</td>", "</table>" etc.
  $T = preg_replace ("/<t[^>]+>\n/",              "<xxx>", $T);       // mark "<table>\n", "<tr>\n" etc.
  $T = preg_replace ("/[\x20\x9]*<xxx>/",         "<xxx>", $T);       // remove indentation before the "<xxx>" marks
  $T = preg_replace ("/<br[^>]*>\n/",             "\n",    $T);       // "<br>\n" to "\n"
  $T = preg_replace ("/<\/p[^>]*>\n/",            "\n",    $T);       // "</p>\n" to "\n"
  $T = preg_replace ("/<\/p[^>]*>/",              "\n",    $T);       // "</p>" to "\n"
  $T = preg_replace ("/<\/h\d[^>]*>\n?/",         "\n",    $T);       // "</h1>" etc. to "\n"
  $T = preg_replace ("/<\/?b>/",                  "*",     $T);       // "<b>" and "</b>" to "*"
  $T = preg_replace ("/<\/?i>/",                  "/",     $T);       // "<i>" and "</i>" to "/"
  $T = preg_replace ("/<[^>]*>/",                 "",      $T);       // remove all other HTML-tags, ...
                                                                      // ... including the temporary "<xxx>"
  return ($T);
}    // "StripHtml()"
 
//--------------------------
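
To make the table handling easier to follow, here is a reduced sketch that keeps only the cell and row rules. The function name TableRowsToText() and the sample markup are made up for illustration; this is not the full StripHtml() above, just the tab/newline idea in isolation.

```php
<?php
// Hypothetical reduced variant: only the table rules, nothing else.
function TableRowsToText($Html)
{
    $T = str_replace("\r\n", "\n", $Html);                                   // normalise line endings
    $T = preg_replace('/<\/t[dh]\b[^>]*>\n?/i', '', $T);                     // drop "</td>", "</th>" and a following line break
    $T = preg_replace('/[ \t]*<t[dh]\b[^>]*>\n?/i', "\t", $T);               // "<td>", "<th>" (with indentation) to a tab
    $T = preg_replace('/[ \t]*<\/tr\b[^>]*>\n?/i', "\n", $T);                // "</tr>" ends the text line
    $T = preg_replace('/[ \t]*<\/?(table|tbody|tr)\b[^>]*>\n?/i', '', $T);   // drop the remaining table tags
    return $T;
}

$Html = "<table>\n  <tr>\n    <td>Name</td>\n    <td>Qty</td>\n  </tr>\n"
      . "  <tr>\n    <td>Apple</td>\n    <td>3</td>\n  </tr>\n</table>\n";

// Prints two lines; every cell is preceded by a tab character.
echo TableRowsToText($Html);
```

Each table row becomes one line of text and each cell starts with a tab, so a mail client with reasonable tab stops shows the data in rough columns.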
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: A "StripHTML()" method to produce a plain-text version?

Post by Chris Corbyn »

If I add anything like this, it will most likely be in the form of utility classes but it's something that would be useful yep :)

There is actually another HTML -> Text converter available that is pretty well established from what I remember.
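
As a rough idea of what such a utility class could look like, here is a minimal sketch. The class name HtmlToTextConverter, its Convert() method and the rule list are assumptions made for this example; nothing like it ships with Swift Mailer as far as this thread shows. It relies on preg_replace() accepting arrays of patterns and replacements, which are applied in order.

```php
<?php
// Hypothetical utility class for illustration; it is not part of Swift Mailer.
class HtmlToTextConverter
{
    // Ordered regex => replacement rules. The catch-all that removes any
    // remaining tag must stay last.
    private $Rules = array(
        '/\r\n/'             => "\n",   // "\r\n" to "\n"
        '/<br\b[^>]*>\n?/i'  => "\n",   // "<br>" to "\n"
        '/<\/p\b[^>]*>\n?/i' => "\n",   // "</p>" to "\n"
        '/<\/?b>/i'          => '*',    // "<b>" and "</b>" to "*"
        '/<\/?i>/i'          => '/',    // "<i>" and "</i>" to "/"
        '/<[^>]*>/'          => '',     // remove all other tags
    );

    // Extra rules are placed in front, so they run before the built-in ones.
    public function __construct(array $ExtraRules = array())
    {
        $this->Rules = $ExtraRules + $this->Rules;
    }

    public function Convert($Html)
    {
        // Each pattern is applied to the result of the previous one.
        return preg_replace(array_keys($this->Rules), array_values($this->Rules), $Html);
    }
}

$Converter = new HtmlToTextConverter();
echo $Converter->Convert("<p>Hello <b>world</b></p>");   // prints "Hello *world*" followed by a newline
```

With a class like this, the usage from the first post would become $Msg->addPart($Converter->Convert($Content), "text/plain"); and custom rules can be passed to the constructor instead of editing the function body.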
xdecock
Forum Commoner
Posts: 37
Joined: Tue Mar 18, 2008 8:16 am

Re: A "StripHTML()" method to produce a plain-text version?

Post by xdecock »

I personally use lynx to produce the text version, but this requires permission to execute external programs from PHP (exec() and friends) and the lynx binary on the server.

Perhaps this idea helps you; otherwise, search for an HTML => Markdown converter.
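
A sketch of the lynx approach, under a few assumptions: a "lynx" binary is on the PATH, proc_open() is not disabled on the host, and the function name HtmlToTextViaLynx() is invented for this example. The options used (-stdin, -dump, -nolist, -force_html) are standard lynx command-line switches.

```php
<?php
// Sketch only: pipe the HTML into lynx and read the rendered text back.
// Returns the text, or false if lynx cannot be run.
function HtmlToTextViaLynx($Html)
{
    if (!function_exists('proc_open')) {
        return false;                       // process functions disabled on this host
    }
    $Spec = array(
        0 => array('pipe', 'r'),            // stdin of lynx
        1 => array('pipe', 'w'),            // stdout of lynx
        2 => array('pipe', 'w'),            // stderr of lynx
    );
    $Proc = @proc_open('lynx -stdin -dump -nolist -force_html', $Spec, $Pipes);
    if (!is_resource($Proc)) {
        return false;
    }
    @fwrite($Pipes[0], $Html);              // send the HTML ...
    fclose($Pipes[0]);                      // ... and signal end of input
    $Text = stream_get_contents($Pipes[1]);
    fclose($Pipes[1]);
    fclose($Pipes[2]);
    $Exit = proc_close($Proc);              // exit status of the command
    return ($Exit === 0) ? $Text : false;
}

$Text = HtmlToTextViaLynx("<h1>Order</h1><p>Thank you for your order.</p>");
if ($Text === false) {
    echo "lynx is not available here\n";
} else {
    echo $Text;
}
```

The false return value leaves room for a fallback such as a regex-based function when lynx is missing.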
HHahn
Forum Commoner
Posts: 43
Joined: Mon Mar 02, 2009 9:16 am
Location: Veldhoven, Netherlands

Re: A "StripHTML()" method to produce a plain-text version?

Post by HHahn »

I have googled for "HTML markdown converter", but "Markdown" seems to be a kind of (very) simplified markup language (more or less like the one used in Wikipedia). The markdown converters I saw seem to convert from HTML to "Markdown" coding or the reverse. That is not what I need.

The strip_tags() function in PHP isn't usable either, as it does not do very much about avoiding unnecessary blank lines, unnecessary whitespace, reasonably rendering tables, etc.
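
A small demonstration of the whitespace problem, using a made-up table fragment. strip_tags() only removes the tags themselves, so the indentation and the line break after every tag stay in the result:

```php
<?php
$Html  = "<table>\n  <tr>\n    <td>Name</td>\n    <td>Qty</td>\n  </tr>\n</table>\n";
$Plain = strip_tags($Html);

// json_encode() makes the leftover whitespace visible.
echo json_encode($Plain), "\n";
// prints "\n  \n    Name\n    Qty\n  \n\n"
```

The two cells end up on separate indented lines, surrounded by lines that contain nothing but whitespace.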

The function I wrote works relatively well (in the meantime I have improved it further). I mentioned it here for three reasons:
1. I may have overlooked some aspects (I already found some!).
2. I am far from an expert in regular expressions.
3. Others may be interested too.

The function is pretty simple. It does not bother about script tags etc., as it is not intended for incoming e-mails, but only for e-mails I generate on a website. They do not have any scripts, simply because I am not putting scripts in.
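
For anyone whose generated mails do contain script or style blocks: removing only the tags would leave the script or CSS source in the plain-text part. A possible pre-processing step is sketched below; the function name DropScriptAndStyle() is invented for this example, and a regex like this is a convenience for self-generated markup, not a sanitizer for untrusted input.

```php
<?php
// Hypothetical helper: remove <script> and <style> elements including their
// content, to be called before the tag-stripping rules run.
function DropScriptAndStyle($Html)
{
    // \1 refers back to the tag name, so "<script>" is only closed by
    // "</script>"; "s" lets "." span line breaks, "i" ignores case.
    return preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $Html);
}

$Html = "Hello<script type=\"text/javascript\">\nalert(1);\n</script> world<style>p { color: red; }</style>!";
echo DropScriptAndStyle($Html), "\n";   // prints "Hello world!"
```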

[EDIT:]
Chris Corbyn wrote: "If I add anything like this, it will most likely be in the form of utility classes but it's something that would be useful yep"
I agree it should be in some class. In SwiftMailer 3.3.3 I extended the main Swift class with it. In 4.0.0 I am missing such a class, so it would be nice if you could add a class that can be used for custom extensions as well.