Preventing HTML code from ruining your site

Posted: Mon Jul 17, 2006 11:36 am
by dirgeshp
I have made a site where users can post HTML tags to display pictures and links.

I want to make sure they don't misuse it. Is there any code or technique I can use to prevent this?

Please help.

Posted: Mon Jul 17, 2006 11:48 am
by Ward
Look up regular expressions. You can use them to filter a string and strip out all tags except a, p, b, i, etc. Disallow script tags and the like.

Posted: Mon Jul 17, 2006 11:58 am
by dirgeshp
So would it be a good idea to block all <script> tags?

Posted: Mon Jul 17, 2006 12:20 pm
by RobertGonzalez
You can look at using some form of BBCode also.
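
As a rough illustration of the BBCode idea, here is a minimal sketch. It assumes you escape all user input first and then translate a small set of bracket tags back into HTML. The function name bbcode_to_html and the supported tags ([b], [i], [url=...]) are made up for this example, and a real BBCode parser handles nesting and many more tags.

```php
<?php
// Hypothetical helper: converts a tiny BBCode subset to HTML.
function bbcode_to_html(string $text): string
{
    // Escape everything first so that no user-supplied HTML survives.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Simple paired tags map one-to-one onto HTML elements.
    $html = preg_replace('#\[(b|i)\](.*?)\[/\1\]#is', '<$1>$2</$1>', $html);

    // Links: only http:// and https:// URLs are accepted as the target.
    $html = preg_replace(
        '#\[url=(https?://[^\]\s"]+)\](.*?)\[/url\]#is',
        '<a href="$1">$2</a>',
        $html
    );

    return $html;
}

echo bbcode_to_html('[b]Hi[/b] <script>alert(1)</script>'), "\n";
// The [b] pair becomes <b>...</b>; the <script> tag is shown as escaped text.
```

Because the user never writes raw HTML, there are no attributes for them to abuse; the only HTML in the output is what the replacement strings produce.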

Posted: Mon Jul 17, 2006 12:31 pm
by bokehman
dirgeshp wrote: So would it be a good idea to block all <script> tags?
That's a good start, but there are plenty of other ways to insert nasty pieces of JavaScript.

Posted: Mon Jul 17, 2006 12:33 pm
by RobertGonzalez
This subject was raised a few weeks ago (I can't think of the thread at the moment), and what it boils down to is that allowing HTML to be added to your site without some seriously careful cleansing is very risky. The hard part is knowing all of the attributes of each HTML tag. For example, someone can insert an image with an onMouseover attribute and tie malicious JavaScript to the onMouseover event of that tag. That is bad stuff.
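
To make the onMouseover problem concrete, here is a small sketch of a naive filter that removes only script blocks. The function name strip_script_tags and the sample input are invented for the example.

```php
<?php
// Naive blacklist: remove <script>...</script> blocks and nothing else.
function strip_script_tags(string $html): string
{
    return preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);
}

$input  = '<script>alert(1)</script><img src="x.png" onmouseover="alert(2)">';
$output = strip_script_tags($input);

echo $output, "\n";
// The <script> block is gone, but the onmouseover handler is untouched
// and would still run when a visitor moves the mouse over the image.
```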

Posted: Mon Jul 17, 2006 12:34 pm
by daedalus__
Instead of trying to block a lot of tags, try to allow only a few.
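
A quick sketch of what "allow a few" can look like with PHP's built-in strip_tags(), whose second argument is a list of tags to keep. The sample input is made up. Note the caveat in the comments: strip_tags() leaves the attributes of allowed tags alone, so on its own it is not a complete answer.

```php
<?php
// Sample input mixing harmless markup with things we do not want.
$input = '<p>Hello <b onmouseover="alert(2)">world</b></p>'
       . '<script>alert(1)</script><iframe src="http://example.com/"></iframe>';

// Second argument: the whitelist of tags to keep.
$output = strip_tags($input, '<p><b><i><a>');

echo $output, "\n";
// The <script> and <iframe> tags are removed (their inner text is kept),
// but the onmouseover attribute on the allowed <b> tag is still there.
```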

Posted: Mon Jul 17, 2006 12:35 pm
by RobertGonzalez
Yeah, I think the suggestion of a 'Whitelist' is what resolved the last thread, too.

Posted: Mon Jul 17, 2006 12:36 pm
by Weirdan
No, you don't have to know all of the attributes of each tag. You just have to know the safe minimum which is enough for your purposes.
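
One way to read "safe minimum" is to allow a handful of tags with no attributes at all. The sketch below escapes everything and then restores only bare tags from a whitelist. The function name allow_basic_tags and its default tag list are assumptions for the example. Links and images need attributes (href, src), so they would need separate, carefully validated handling.

```php
<?php
// Hypothetical helper: keeps only attribute-free tags from a whitelist.
function allow_basic_tags(
    string $input,
    array $allowed = ['b', 'i', 'em', 'strong', 'p']
): string {
    // Escape everything, so every tag becomes harmless text.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Restore only bare opening/closing tags such as <b> or </em>.
    // A tag that carries any attribute does not match and stays escaped.
    $names = implode('|', array_map('preg_quote', $allowed));
    return preg_replace('#&lt;(/?)(' . $names . ')&gt;#i', '<$1$2>', $html);
}

echo allow_basic_tags('<b>bold</b> <b onclick="evil()">x</b> <img src=x onerror=evil()>'), "\n";
// Only the first <b>...</b> pair comes back as real markup.
```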

Posted: Mon Jul 17, 2006 12:55 pm
by Ward
Yes, I agree with the whitelist idea. It will take much less time to determine the allowable tags than the disallowable ones. For example, you can use a regular expression to match everything within these tags:

Code:

<a></a>
<p></p>
<b></b>
<i></i>
<strong></strong>
<em></em>

With a regex something like this (it might not be perfect; I'm not great at regular expressions):

Code:

<(a|p|b|i|strong|em)\b[^>]*>.*?</\1>

Posted: Mon Jul 17, 2006 7:57 pm
by bokehman
I wrote this in a moment of regex madness. It removes all tags and attributes that are not allowed. It tries to find any hidden JavaScript too.

Code:

function CleanUp($input)
{
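	// list of allowed tags (guarded so a second call to CleanUp() does not redefine the constant)
	if(!defined('__HTML__')) define('__HTML__', 'a|b|br|i|img|p|span');
	
	// list of allowed attributes
	if(!defined('__ATTRIBUTES__')) define('__ATTRIBUTES__', 'src|alt|href|title|class|id');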
	
	if(!function_exists('DisallowedTagsCallback'))
	{
		function DisallowedTagsCallback($input)
		{
		    $input[0] = strip_tags($input[0]);
			return htmlentities($input[0]);
		}
	}
	
	if(!function_exists('DisallowedAttributesCallback'))
	{
		function DisallowedAttributesCallback($input)
		{
		    $regex = '/\s*\b(?!(?:'.__ATTRIBUTES__.'))[a-z]+\b\s*[=]\s*([\'"])'.
		             '(((?!\1).)|((?<=[\\\])\1))*\1/is';
			return preg_replace($regex, '', $input[0]);
		}
	}
	
	// strip any javascript
	$regex = array('/\<script\b[^>]*\>((?!\<\/script\b[^>]*\>).)*\<\/script\b[^>]*\>/is', 
	               '/\s*on[a-z]+\s*[=]\s*(["\'])(((?!\1).)|((?<=[\\\])\1))*\1/i',
	               '/href+\s*[=]\s*(["\'])((?!\1).)*javascript((?!\1).)*\1/is');
	$replace = array('', '', 'href="#"');              
	$input = preg_replace($regex, $replace, $input);
	
	// strip disallowed tags
	$regex= " @(((?<=^)|(?<=[>]))(?![<]/?(".__HTML__.")\b[^>]*[>])".
	        "([^<]|((?![<]/?(".__HTML__.")\b[^>]*[>])[<]))+)@i";
	$input = preg_replace_callback($regex, 'DisallowedTagsCallback', $input);
	
	// strip disallowed attributes
	$regex = '/(?<=[<])[^>]+(?=[>])/';
	return preg_replace_callback($regex, 'DisallowedAttributesCallback', $input);
}
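
The href check in the function above looks for the literal word javascript inside a quoted value. A complementary idea, sketched here under my own assumptions, is to whitelist URL schemes instead of hunting for bad ones. The function name is_safe_url and the list of allowed schemes are made up for the example, and it does not cover every encoding trick a browser may accept.

```php
<?php
// Hypothetical helper: accepts relative URLs and a few known schemes.
function is_safe_url(string $url): bool
{
    // Decode entities such as &#115; so they cannot hide the scheme.
    $decoded = html_entity_decode($url, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Drop whitespace and control characters before inspecting the scheme.
    $clean = preg_replace('/[\x00-\x20]+/', '', $decoded);

    // If the value starts with "scheme:", the scheme must be whitelisted.
    if (preg_match('#^([a-z][a-z0-9+.\-]*):#i', $clean, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https', 'mailto'], true);
    }

    // No scheme at all: treat it as a relative URL.
    return true;
}

var_dump(is_safe_url('http://example.com/pic.png'));   // allowed scheme
var_dump(is_safe_url('JaVaScRiPt:alert(1)'));          // rejected scheme
```

A check like this would be applied to each href and src value after the tag and attribute filtering, replacing anything that fails with a harmless value such as "#".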