Detect html tags

matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

What I'm trying to do is find a function that detects whether a string contains any kind of HTML tags. I would like to use it for forms submitted by users, so that I can give feedback (an error message) that HTML is not allowed. Of course I know I could just use strip_tags to get rid of any HTML. But as you know, that function is a bit too greedy, so it might strip out too much and confuse or frustrate a user who sees his content stripped.

Also, of course I use htmlentities when outputting data back to HTML, so I'm reasonably safe in that respect. It's just that from a usability viewpoint, I would like to warn people who use HTML that it is not allowed. For example, some people might assume that tags like <b> can be used. Then, when they view their submitted entry, they see either a) their tags stripped or b) entity-encoded code.

Feyd showed some regex with which to strip tags:

function megaStripTags($source)
{
    $p = array(
        // convert <script> and <style> containers to a single space
        '#<\s*(style|script)[^>]*>.*?<\s*/\s*\\1[^>]*>#si' => ' ',
        // convert all remaining tags to a space
        '#<(?:\s*/)?\s*[a-z]+(\s*[a-z]+\s*=\s*(["\']?)(.*?)\\2)*[^>]*>#si' => ' ',
        // convert &nbsp; to a space
        '#&nbsp;#i' => ' ',
        // convert &#0-255 into the character literal
        // (the /e modifier was removed in PHP 7; newer versions need preg_replace_callback)
        '/&#(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]);/e' => 'chr(intval("\\1"))',
        // convert &amp; entity into the literal &
        '/&amp;/' => '&',
    );
    return preg_replace(array_keys($p), array_values($p), $source);
}

$test = '<TD WIDTH="14%" BACKGROUND="images.jpg">
<A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" 
ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

var_dump(strip_tags($test),megaStripTags($test));
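
A note for anyone running this on PHP 7 or later: the /e modifier used for the numeric entities no longer exists there. Below is a minimal sketch of the same replacements with the /e step moved into preg_replace_callback. The name megaStripTagsModern() is my own and not part of Feyd's code, and the order of the steps is assumed to match the array above.

```php
<?php
// Sketch only: megaStripTagsModern() is a hypothetical name, not from the original function.
function megaStripTagsModern($source)
{
    // The three plain replacements, applied in the same order as before.
    $p = array(
        '#<\s*(style|script)[^>]*>.*?<\s*/\s*\\1[^>]*>#si' => ' ',
        '#<(?:\s*/)?\s*[a-z]+(\s*[a-z]+\s*=\s*(["\']?)(.*?)\\2)*[^>]*>#si' => ' ',
        '#&nbsp;#i' => ' ',
    );
    $out = preg_replace(array_keys($p), array_values($p), $source);

    // Stand-in for the /e pattern: convert &#0; .. &#255; through a callback.
    $out = preg_replace_callback(
        '/&#(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]);/',
        function ($m) {
            return chr(intval($m[1]));
        },
        $out
    );

    // Last step: convert the &amp; entity into a literal &.
    return str_replace('&amp;', '&', $out);
}

var_dump(megaStripTagsModern('<b>Tom</b>&nbsp;&amp;&#32;Jerry'));
```
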
Now my idea was to convert this function into one that detects tags instead of stripping them.
I came up with this:

<?php 

function dump($array) {
   echo '<pre>';
   print_r($array);
   echo '</pre>';
} 

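$p = array(
    '#<\s*(style|script)[^>]*>.*?<\s*/\s*\\1[^>]*>#si',                  // <script> and <style> containers
    '#<(?:\s*/)?\s*[a-z]+(\s*[a-z]+\s*=\s*(["\']?)(.*?)\\2)*[^>]*>#si',  // all remaining tags
    '#&nbsp;#i',                                                         // &nbsp; entity
    '/&#(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]);/',           // &#0; to &#255; (no /e: it only applies to preg_replace)
    '/&amp;/',                                                           // &amp; entity
);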
		
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg">
<A HREF="http://something.xxx"><IMG SRC="image.gif" BORDER="0" 
ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots" 
WIDTH="45">testestets>blah</A></TD>';
		
//$test = '<b>test';        // result:  HTML found
//$test = '<script> this';   // result: HTML found
//$test = '<a href="somelinke">some</a>';  // result: HTML found
//$test = '<h 2>'; // result HTML found
//$test = 'And this is > then this or < then that';  // no HTML found
$test = 'And this is < then this or > then that';  // HTML found

echo 'The teststring is:  ' . htmlentities($test) . '<br>';

foreach ($p as $pattern) {
    if (preg_match($pattern, $test, $matches)) {
        // $matches[0] is the full match; the other entries are capture groups
        foreach ($matches as $match) {
            echo '<br>HTML found: <br>';
            dump(htmlentities($match));
        }
    }
}
?>
From the few tests I did, it seems to work quite well. However, as the last example shows, it can be too greedy: plain text that happens to contain a < followed later by a > is reported as HTML. I guess that is almost impossible to avoid completely. So what do you think?
And does anyone know of other regexes I could use?
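
One possible way to make the check less greedy, as a rough sketch rather than a complete answer: require a letter (or a slash and then a letter) directly after the <, with no whitespace in between. Browsers only start a tag in that case, so "this < that > other" would no longer count. The helper name containsHtmlTag() is made up for this example, and the pattern deliberately ignores comments like <!-- --> and tags that are never closed with >.

```php
<?php
// Sketch only: containsHtmlTag() is a hypothetical helper name.
function containsHtmlTag($text)
{
    // "<", an optional "/", then a letter immediately,
    // then anything that is not ">" up to the closing ">".
    return preg_match('#</?[a-z][^>]*>#i', $text) === 1;
}

$tests = array(
    '<b>test',
    '<a href="somelinke">some</a>',
    'And this is < then this or > then that',
);

foreach ($tests as $test) {
    echo htmlentities($test) . ': ' . (containsHtmlTag($test) ? 'HTML found' : 'no HTML found') . '<br>';
}
```

Something like '1<a and b>2' would still be flagged, but a browser would read that as a tag as well, so a warning seems fair there.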
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

The only "safe" way is the whitelist approach to tell actual tags from illegal "tags": anything not in the whitelist is then run through htmlentities. Or the reverse can be done: run the text through htmlentities first, then parse the entitied text for valid tags, converting them back.
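
A minimal sketch of the second (reverse) variant, under some assumptions: the function name restoreAllowedTags() and its $allowed parameter are invented for illustration, and only bare tags without attributes are converted back. Anything else stays entity-encoded.

```php
<?php
// Sketch only: restoreAllowedTags() and $allowed are hypothetical names.
function restoreAllowedTags($text, array $allowed)
{
    // Escape everything first, so no markup survives by default.
    $escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');
    if (!$allowed) {
        return $escaped;
    }

    // Build an alternation such as "b|i" from the whitelist.
    $names = implode('|', array_map(function ($name) {
        return preg_quote($name, '#');
    }, $allowed));

    // Convert only bare whitelisted tags back, e.g. &lt;b&gt; or &lt;/b&gt;.
    return preg_replace('#&lt;(/?)(' . $names . ')&gt;#i', '<$1$2>', $escaped);
}

echo restoreAllowedTags('<b>bold</b> and <script>alert(1)</script>', array('b', 'i'));
```

Because the whitelist is matched against the already-encoded text, a tag with attributes (say an onclick on a <b>) does not match and is left encoded.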
matthijs

Post by matthijs »

Thanks Feyd, those are good ideas. So if I understand it correctly, I can do something like:

<?php
$somestring = '<b>bold and <i> text';
$htmlstring = htmlentities($somestring, ENT_QUOTES, 'UTF-8');

// whitelist, or blacklist, depending on how you see it
$whitelist = array('<b>', '<i>', '<h1>', '<h2>'); // etc etc

foreach ($whitelist as $value)
{
    // $htmlstring is already entity-encoded, so search for the encoded form of the tag
    $encoded = htmlentities($value, ENT_QUOTES, 'UTF-8');
    $pos = strpos($htmlstring, $encoded);
    if ($pos === false) {
        echo "The string '$encoded' was not found in the string '$htmlstring'";
    } else {
        echo "The string '$encoded' was found in the string '$htmlstring'";
    }
}
?>
And then I can choose to a) allow certain tags or b) generate an error message saying that HTML tags are not allowed.
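
For option b), here is a rough sketch of how the error message could name the offending tags. The function findDisallowedTags() and its $allowed parameter are hypothetical, and the pattern assumes a tag name starts with a letter directly after < or </.

```php
<?php
// Sketch only: findDisallowedTags() and $allowed are hypothetical names.
function findDisallowedTags($input, array $allowed = array())
{
    // Capture the tag name that follows "<" or "</".
    preg_match_all('#</?([a-z][a-z0-9]*)[^>]*>#i', $input, $m);

    // Normalise case, drop duplicates, then remove whitelisted names.
    $found = array_unique(array_map('strtolower', $m[1]));
    return array_values(array_diff($found, array_map('strtolower', $allowed)));
}

$bad = findDisallowedTags('<b>bold and <i> text', array('b'));
if ($bad) {
    echo 'HTML is not allowed here. Please remove: ' . htmlentities(implode(', ', $bad));
}
```

With an empty $allowed list this covers the "no HTML at all" case; with a few names in it, it covers option a) as well.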