
Detect html tags

Posted: Thu Mar 23, 2006 11:09 am
by matthijs
I'm looking for a function that detects whether a string contains any kind of HTML tags. I want to use it on forms submitted by users, so that I can give feedback (an error message) that HTML is not allowed. Of course I know I could just use strip_tags to get rid of any HTML. But as you know, that function is a bit too greedy: it can strip out too much and confuse or frustrate a user who sees part of his content removed.

Also, of course I use htmlentities when outputting data back to HTML, so I'm reasonably safe in that respect. It's just that, from a usability viewpoint, I would like to warn people who use HTML that it is not allowed. For example, some people might assume that tags like <b> can be used. Then, when they view their submitted entry, they see either a) their tags stripped or b) the entity-encoded markup shown as literal text.

Feyd showed some regex with which to strip tags:

Code:

function megaStripTags($source)
{
    $p = array(
        // convert <script> and <style> containers to a single space
        '#<\s*(style|script)[^>]*>.*?<\s*/\s*\\1[^>]*>#si' => ' ',
        // convert all remaining tags to a space
        '#<(?:\s*/)?\s*[a-z]+(\s*[a-z]+\s*=\s*(["\']?)(.*?)\\2)*[^>]*>#si' => ' ',
        // convert &nbsp; to a space
        '#&nbsp;#i' => ' ',
    );
    $source = preg_replace(array_keys($p), array_values($p), $source);
    // convert &#0-255 into the character literal
    // (preg_replace_callback stands in for the /e modifier, which PHP 7 removed)
    $source = preg_replace_callback(
        '/&#(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]);/',
        function ($m) { return chr(intval($m[1])); },
        $source
    );
    // convert &amp; entity into the literal &
    return preg_replace('/&amp;/', '&', $source);
}

$test = '<TD WIDTH="14%" BACKGROUND="images.jpg">
<A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" 
ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

var_dump(strip_tags($test),megaStripTags($test));
Now my idea was to convert this function into one that detects tags instead of stripping them.
I came up with this:

Code:

<?php 

function dump($array) {
   echo '<pre>';
   print_r($array);
   echo '</pre>';
} 

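// the same patterns as in megaStripTags, used here for matching only
$p = array(
        '#<\s*(style|script)[^>]*>.*?<\s*/\s*\\1[^>]*>#si',
        '#<(?:\s*/)?\s*[a-z]+(\s*[a-z]+\s*=\s*(["\']?)(.*?)\\2)*[^>]*>#si',
        '#&nbsp;#i',
        // the /e modifier is dropped: it only made sense for preg_replace
        '/&#(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]);/',
        '/&amp;/',
);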
		
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg">
<A HREF="http://something.xxx"><IMG SRC="image.gif" BORDER="0" 
ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots" 
WIDTH="45">testestets>blah</A></TD>';
		
//$test = '<b>test';        // result:  HTML found
//$test = '<script> this';   // result: HTML found
//$test = '<a href="somelinke">some</a>';  // result: HTML found
//$test = '<h 2>'; // result HTML found
//$test = 'And this is > then this or < then that';  // no HTML found
$test = 'And this is < then this or > then that';  // HTML found

echo 'The teststring is:  ' . htmlentities($test) . '<br>';

foreach ( $p as $pattern ) {

  if(preg_match($pattern,$test,$matches))
  {
    // $matches[0] is the full match, the remaining entries are capture groups
    foreach($matches as $match)
    {
       echo '<br>HTML found: <br>';
       dump(htmlentities($match));
    }
  }
}
?>
From the few tests I did it seemed to work quite well. However, as the last example shows, it can be too greedy: plain text that merely contains < and > is reported as HTML. I guess that's almost impossible to avoid completely. So what do you think?
And does anyone know of other regexes I could use?
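One way to cut down on false positives like the last test string might be to require a letter directly after the `<` (or `</`). The following is only a rough sketch: the helper name `findHtmlTags` is made up for illustration and is not part of Feyd's code, and the pattern is deliberately simple.

```php
<?php
// Hypothetical helper (not from the thread): returns every tag-like
// substring found in $text. An empty array means nothing was found.
function findHtmlTags($text)
{
    // "<" or "</" followed immediately by a letter, then anything up to ">".
    // Because a letter must follow "<" directly, text such as
    // "this is < then this or > then that" is not reported.
    preg_match_all('#</?[a-z][a-z0-9]*\b[^>]*>#i', $text, $matches);
    return $matches[0];
}

$found = findHtmlTags('<b>test');
if ($found) {
    // Encode before echoing so the browser shows the tags as text
    echo 'HTML found: ' . htmlentities(implode(' ', $found));
}
```

It is still not perfect: input like `a<b and c>d` would be reported as a tag, so it reduces the greediness rather than removing it.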

Posted: Thu Mar 23, 2006 11:23 am
by feyd
The only "safe" way is the whitelist approach: detect actual tags as opposed to illegal "tags"; anything not in the whitelist would then be run through htmlentities. Or the reverse can be done: run the text through htmlentities first, then parse the entity-encoded text for valid tags and convert them back.
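The reverse variant (encode everything first, then convert valid tags back) could look roughly like this. It is a sketch under assumptions: the function name `allowTags` and the default tag list are invented for illustration, and only bare tags without attributes are restored.

```php
<?php
// Hypothetical helper (name and defaults are assumptions): entity-encode
// the whole string, then turn whitelisted tags back into real markup.
function allowTags($text, array $allowed = array('b', 'i'))
{
    $safe = htmlentities($text, ENT_QUOTES, 'UTF-8');

    foreach ($allowed as $tag) {
        // Match the encoded opening or closing tag, e.g. &lt;b&gt; or &lt;/b&gt;
        $pattern = '#&lt;(/?)' . preg_quote($tag, '#') . '&gt;#i';
        // ${1} is the optional slash; the tag name is written back in the
        // form given in the whitelist
        $safe = preg_replace($pattern, '<${1}' . $tag . '>', $safe);
    }
    return $safe;
}

echo allowTags('<b>bold</b> & <script>alert(1)</script>');
```

Because the input is encoded before anything is restored, a tag that is not in the list stays harmless text, and tags carrying attributes (such as onclick) never match the pattern.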

Posted: Thu Mar 23, 2006 1:09 pm
by matthijs
Thanks Feyd, those are good ideas. So if I understand correctly, I can do something like:

Code:

<?php
$somestring =  '<b>bold and <i> text';
$htmlstring = htmlentities($somestring, ENT_QUOTES, 'UTF-8');

// whitelist, or blacklist, depending on how you see it
$whitelist = array('<b>','<i>','<h1>','<h2>');// etc etc

foreach ($whitelist as $value)
{
  // $htmlstring is entity-encoded, so search for the encoded form of the tag
  // (a raw '<b>' can never occur in it)
  $encoded = htmlentities($value, ENT_QUOTES, 'UTF-8');
  $pos = strpos($htmlstring,$encoded);
  if ($pos === false) {
     echo "The string '$encoded' was not found in the string '$htmlstring'";
  } else {
     echo "The string '$encoded' was found in the string '$htmlstring'";
  }
}
?>
And then I can choose to a) allow certain tags or b) generate an error message saying that HTML tags are not allowed.
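Options a) and b) could perhaps be combined: collect the tag names that occur in the input, subtract the allowed ones, and build the error message from whatever is left. A minimal sketch, assuming a hypothetical function `disallowedTags` and a simple tag pattern that will not catch every edge case:

```php
<?php
// Hypothetical helper (not from the thread): returns the lowercased names
// of all tags in $text that are not listed in $allowed.
function disallowedTags($text, array $allowed = array())
{
    // Capture the tag name of every opening or closing tag
    preg_match_all('#</?([a-z][a-z0-9]*)\b[^>]*>#i', $text, $m);
    $names = array_unique(array_map('strtolower', $m[1]));
    // array_values() reindexes the result from 0
    return array_values(array_diff($names, $allowed));
}

$input = '<b>bold</b> and <script>alert(1)</script>';
$bad   = disallowedTags($input, array('b', 'i'));

if ($bad) {
    // Option b): tell the user which tags to remove
    $list = htmlentities('<' . implode('>, <', $bad) . '>');
    echo 'Sorry, these HTML tags are not allowed: ' . $list;
}
```

Passing an empty whitelist makes every tag an error, which matches the "no HTML at all" case.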