
A "StripHTML()" method to produce a plain-text version?

Posted: Tue Mar 10, 2009 7:29 am
by HHahn
I made a function StripHTML(), which receives the HTML message block as a parameter and strips off all HTML, replacing certain tags by e.g. tab characters to simulate a "table".
The function is probably still quite incomplete, but it does work.
In 3.3.3 I had it in an extension class to the Swift class. In 4.0.0 I made it a separate function, but it would be better if it were a method within a class.

Might it be a good idea to have something like that as a standard facility in SwiftMailer?

Usage:

Code:

$Msg->setBody($Content, "text/html");
$Msg->addPart(StripHtml($Content), "text/plain");

Here is the current version of the code. It is still a bit makeshift, but it works:

Code:

//------ StripHtml(): ------
function StripHtml ($Text)
{
  $T = str_replace  ("\r\n",                      "\n",    $Text);    // "\r\n" to "\n" (plain string, no regex delimiters)
  $T = preg_replace ("/[\x20\x9]*<td[^>]*>\n/"  , "\t",    $T);       // "<td>\n" to "\t"
  $T = preg_replace ("/[\x20\x9]*<td[^>]*>/",     "\t",    $T);       // "<td>" to "\t"
  $T = preg_replace ("/[\x20\x9]*<\/tr[^>]*>\n/", "\n",    $T);       // "</tr>\n" to "\n"
  $T = preg_replace ("/[\x20\x9]*<\/tr[^>]*>/",   "\n",    $T);       // "</tr>" to "\n"
  $T = preg_replace ("/<\/t[^>]*>\n/",            "<xxx>", $T);       // mark "</td>\n", "</table>\n" etc.
  $T = preg_replace ("/<\/t[^>]*>/",              "<xxx>", $T);       // mark "</td>", "</table>" etc.
  $T = preg_replace ("/<t[^>]+>\n/",              "<xxx>", $T);       // mark "<table>\n", "<tr>\n" etc.
  $T = preg_replace ("/[\x20\x9]*<xxx>/",         "<xxx>", $T);       // drop indentation before the "<xxx>" marks
  $T = preg_replace ("/<br[^>]*>\n/",             "\n",    $T);       // "<br>\n" to "\n"
  $T = preg_replace ("/<\/p[^>]*>\n/",            "\n",    $T);       // "</p>\n" to "\n"
  $T = preg_replace ("/<\/p[^>]*>/",              "\n",    $T);       // "</p>" to "\n"
  $T = preg_replace ("/<\/h\d[^>]*>\n?/",         "\n",    $T);       // "</h1>" etc. to "\n"
  $T = preg_replace ("/<\/?b>/",                  "*",     $T);       // "<b>" and "</b>" to "*"
  $T = preg_replace ("/<\/?i>/",                  "/",     $T);       // "<i>" and "</i>" to "/"
  $T = preg_replace ("/<[^>]*>/",                 "",      $T);       // remove all other HTML-tags, ...
                                                                      // ... including the temporary "<xxx>"
  return ($T);
}    // "StripHtml()"
 
//--------------------------

Re: A "StripHTML()" method to produce a plain-text version?

Posted: Tue Mar 10, 2009 7:37 am
by Chris Corbyn
If I add anything like this, it will most likely be in the form of utility classes but it's something that would be useful yep :)

There is actually another HTML -> Text converter available that is pretty well established from what I remember.

Re: A "StripHTML()" method to produce a plain-text version?

Posted: Tue Mar 10, 2009 12:01 pm
by xdecock
I personally use lynx to produce the text version, but this requires permission to call the exec-type functions and an installed lynx binary.
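The lynx approach might look roughly like the sketch below. The function name htmlToTextViaLynx is hypothetical, and the sketch assumes a lynx binary on the PATH and a host that allows proc_open(); it returns null when either is missing. The options used (-stdin, -dump, -nolist, -force_html) are standard lynx command-line switches.

```php
<?php
// Hypothetical helper: render HTML to text by piping it through lynx.
// Returns null if proc_open() is unavailable or lynx cannot be run.
function htmlToTextViaLynx(string $html): ?string
{
    if (!function_exists('proc_open')) {
        return null;
    }
    // -stdin: read the document from standard input
    // -dump: print the rendered page to standard output and exit
    // -nolist: leave out the numbered list of links
    // -force_html: treat the input as HTML
    $cmd  = 'lynx -stdin -dump -nolist -force_html';
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];

    $proc = @proc_open($cmd, $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }
    @fwrite($pipes[0], $html);   // send the HTML to lynx
    fclose($pipes[0]);
    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    $status = proc_close($proc); // non-zero when lynx is missing or failed

    return ($status === 0 && $text !== false) ? $text : null;
}
```

A caller could fall back to a pure-PHP conversion whenever the function returns null.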

Maybe this idea helps you; otherwise, search for an HTML => Markdown converter.

Re: A "StripHTML()" method to produce a plain-text version?

Posted: Fri Mar 13, 2009 10:29 am
by HHahn
I have googled for "HTML markdown converter", but "Markdown" seems to be a kind of (very) simplified markup language (more or less like the one used in Wikipedia). The Markdown converters I saw seem to convert from HTML to "Markdown" coding or the reverse. That is not what I need.

The strip_tags() function in PHP isn't usable either, as it does not do very much about avoiding unnecessary blank lines, unnecessary whitespace, reasonably rendering tables, etc.
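A small illustration of that limitation: strip_tags() removes only the tags, so the source indentation and the line breaks between table cells survive. The sample markup and the helper tidyWhitespace() are made up for this sketch; the helper only tidies whitespace and does nothing to keep the cells of a row on one line.

```php
<?php
$html = "<table>\n  <tr>\n    <td>Name</td>\n    <td>Price</td>\n  </tr>\n</table>\n<p>Thanks!</p>\n";

// Only the tags disappear; indentation and empty lines are left behind.
$plain = strip_tags($html);

// Hypothetical cleanup helper, not part of PHP or SwiftMailer.
function tidyWhitespace(string $text): string
{
    $text = preg_replace('/^[ \t]+|[ \t]+$/m', '', $text); // trim every line
    $text = preg_replace('/\n{3,}/', "\n\n", $text);       // allow at most one blank line in a row
    return trim($text) . "\n";
}

$tidy = tidyWhitespace($plain);
```

Even after tidying, "Name" and "Price" end up on separate lines, which is why the table tags need their own handling.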

The function I wrote works relatively well (meanwhile I have further improved it). For three reasons I mentioned it here:
1. I may have overlooked some aspects (I already found some!).
2. I am far from an expert in regular expressions.
3. Others may be interested too.

The function is pretty simple. It does not bother about script tags etc., as it is not intended for incoming e-mails, but only for e-mails I generate on a website. They do not contain any scripts simply because I am not putting scripts in.
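For anyone who does need to handle HTML that may contain scripts or style sheets, a pre-processing step along these lines could run before the tag stripping. This is only a sketch: the name dropScriptAndStyle is hypothetical, and a regular expression like this is not a sanitizer for untrusted input.

```php
<?php
// Hypothetical helper: remove <script> and <style> elements together with their content,
// so that their source code does not end up in the plain-text part.
function dropScriptAndStyle(string $html): string
{
    // i = case-insensitive, s = "." also matches line breaks; \1 refers to the opening tag name
    $result = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);
    return $result ?? $html; // preg_replace() returns null on error; keep the input in that case
}
```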

[EDIT:]
Chris Corbyn wrote: "If I add anything like this, it will most likely be in the form of utility classes but it's something that would be useful yep"
I agree it should be in some class. In SwiftMailer 3.3.3 I extended the main Swift class with it. In 4.0.0 I am missing such a class, so it would be nice if you could add a class that can be used for custom extensions as well.
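Until such a class exists, the conversion could live in a small standalone utility class. The sketch below is a condensed variant of the function from the first post, not SwiftMailer code; the class name HtmlToText and the method convert() are hypothetical.

```php
<?php
// Hypothetical utility class; neither the name nor the method belongs to SwiftMailer.
final class HtmlToText
{
    public static function convert(string $html): string
    {
        $t = str_replace("\r\n", "\n", $html);                         // normalize line endings
        $t = preg_replace('/^[ \t]+/m', '', $t);                       // drop source indentation
        $t = preg_replace('/<td[^>]*>\n?/i', "\t", $t);                // start of a cell becomes a tab
        $t = preg_replace('/<\/tr[^>]*>\n?/i', "\n", $t);              // end of a row becomes a line break
        $t = preg_replace('/<\/?t[^>]*>\n?/i', '', $t);                // remaining table tags vanish
        $t = preg_replace('/<(?:br|\/p|\/h\d)[^>]*>\n?/i', "\n", $t);  // <br>, </p>, </h1> etc. become line breaks
        return strip_tags($t);                                         // remove whatever tags are left
    }
}

$html = "<table>\n  <tr>\n    <td>Name</td>\n    <td>Price</td>\n  </tr>\n</table>\n<p>Thanks!</p>\n";
$text = HtmlToText::convert($html);

// With the message object from the first post, the call would then read:
// $Msg->addPart(HtmlToText::convert($Content), "text/plain");
```

A static method keeps the helper independent of the mailer classes, so it would not be affected by changes between SwiftMailer versions.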