Preventing HTML code from ruining your site

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

dirgeshp
Forum Newbie
Posts: 23
Joined: Thu Jun 22, 2006 11:40 am

Post by dirgeshp »

I have made a site where users can post HTML tags and use them to display pictures and links.

I want to make sure they don't misuse this. Is there any code or technique I can use to prevent it?

Please help.
Ward
Forum Commoner
Posts: 74
Joined: Thu Jul 13, 2006 10:01 am

Post by Ward »

Look up regular expressions. You can use them to filter a string and strip out all tags except a, p, b, i, and so on. In particular, disallow script tags.
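
A minimal sketch of this kind of filter, assuming PHP's built-in strip_tags() is acceptable in place of a hand-written regular expression. The helper name keepBasicTags is hypothetical and not from the thread:

```php
<?php
// Hypothetical helper: keep only a handful of formatting tags.
function keepBasicTags(string $html): string
{
    // strip_tags() removes every tag that is not listed in the second
    // argument. Only the tags themselves are removed, so the text that
    // was inside a removed tag stays in the output.
    return strip_tags($html, '<a><p><b><i>');
}

echo keepBasicTags('<p>Hi <b>there</b><script>alert(1)</script></p>');
// prints: <p>Hi <b>there</b>alert(1)</p>
```

The script tags are gone, but their content is now shown as plain text, so this is only a first step and not a complete filter.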
dirgeshp
Forum Newbie
Posts: 23
Joined: Thu Jun 22, 2006 11:40 am

Post by dirgeshp »

So would it be a good idea to block all <script> tags?
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

You could also look at using some form of BBCode.
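
A rough sketch of what a BBCode approach could look like, assuming a tiny tag set ([b], [i], [url], [img]) chosen purely for illustration. The function name renderBBCode is hypothetical. The idea is to escape everything the user typed first and then turn only the known BBCode tags back into HTML:

```php
<?php
// Hypothetical helper: render a very small BBCode subset as HTML.
function renderBBCode(string $text): string
{
    // Escape first, so any raw HTML typed by the user is displayed as text.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Pattern => replacement. Anything not listed here stays escaped text.
    $rules = [
        '~\[b\](.*?)\[/b\]~is' => '<b>$1</b>',
        '~\[i\](.*?)\[/i\]~is' => '<i>$1</i>',
        // Only http:// and https:// URLs are accepted, which leaves
        // javascript: links unconverted.
        '~\[url=(https?://[^\]\s]+)\](.*?)\[/url\]~is' => '<a href="$1">$2</a>',
        '~\[img\](https?://[^\[\s]+)\[/img\]~i' => '<img src="$1" alt="">',
    ];

    // preg_replace() returns null on a PCRE error; fall back to the escaped text.
    return preg_replace(array_keys($rules), array_values($rules), $html) ?? $html;
}

echo renderBBCode('[b]hi[/b] <script>');
// prints: <b>hi</b> &lt;script&gt;
```

Because the user never writes real HTML, there are no attributes to police. The only markup in the output is what the rules above generate.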
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

dirgeshp wrote: So would it be a good idea to block all <script> tags?
That's a good start, but there are plenty of other ways to insert nasty pieces of JavaScript.
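
To illustrate the point, here is a small sketch of a filter that only removes script blocks, run against a few inputs that contain no script tag at all. The function name stripScriptTags and the sample payloads are made up for this example:

```php
<?php
// Naive filter: removes <script>...</script> blocks and nothing else.
function stripScriptTags(string $html): string
{
    return preg_replace('~<script\b[^>]*>.*?</script\s*>~is', '', $html) ?? '';
}

$payloads = [
    '<img src="x" onerror="alert(1)">',        // event-handler attribute
    '<a href="javascript:alert(1)">click</a>', // javascript: URL
    '<body onload="alert(1)">',                // handler on another tag
];

foreach ($payloads as $payload) {
    // None of these contain a <script> tag, so each one comes back unchanged.
    var_dump(stripScriptTags($payload) === $payload); // bool(true)
}
```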
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

This subject was raised a few weeks ago (I can't think of the thread at the moment), and what it boils down to is that allowing HTML to be added to your site without some seriously careful cleansing is very risky. The hard part is knowing all of the attributes of each HTML tag. For example, someone can insert an image with an onMouseover attribute and tie malicious JavaScript to the onMouseover event of that tag. That is bad stuff.
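
A short sketch of that image case, assuming the site filters tags with PHP's strip_tags() and has img on its allowed list. The file name cat.jpg is a placeholder:

```php
<?php
$input = '<img src="cat.jpg" onmouseover="alert(document.cookie)">';

// <img> is allowed, so strip_tags() returns the tag exactly as it was
// written, including the event-handler attribute.
$output = strip_tags($input, '<img>');

var_dump($output === $input); // bool(true)
```

Filtering by tag name alone is therefore not enough. The attributes on the allowed tags need to be checked as well.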
daedalus__
DevNet Resident
Posts: 1925
Joined: Thu Feb 09, 2006 4:52 pm

Post by daedalus__ »

Instead of trying to block a lot of tags, try to allow a few.
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Yeah, I think the suggestion of a 'Whitelist' is what resolved the last thread, too.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

No, you don't have to know all of the attributes of each tag. You just have to know the safe minimum that is enough for your purposes.
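
One possible way to express such a "safe minimum" is a table of allowed tags with the attributes allowed on each. The sketch below uses PHP's DOM extension instead of regular expressions. That is a swapped-in technique and not something proposed in this thread, and the names ALLOWED_HTML and sanitizeHtml as well as the tag list itself are assumptions for illustration:

```php
<?php
// Hypothetical allowlist: tag name => attributes permitted on that tag.
const ALLOWED_HTML = [
    'a'   => ['href', 'title'],
    'b'   => [],
    'i'   => [],
    'p'   => [],
    'img' => ['src', 'alt'],
];

function sanitizeHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    // The meta tag tells the parser that the input is UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Copy the live node list first, because the loop modifies the tree.
    foreach (iterator_to_array($body->getElementsByTagName('*'), false) as $el) {
        $tag = strtolower($el->nodeName);
        if (!array_key_exists($tag, ALLOWED_HTML)) {
            // Tag is not on the list: drop it together with its contents.
            if ($el->parentNode !== null) {
                $el->parentNode->removeChild($el);
            }
            continue;
        }
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            $name  = strtolower($attr->nodeName);
            $isUrl = ($name === 'href' || $name === 'src');
            // Remove attributes that are not listed, and URLs that are not http(s).
            if (!in_array($name, ALLOWED_HTML[$tag], true)
                || ($isUrl && !preg_match('~^https?://~i', (string) $attr->nodeValue))) {
                $el->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialize only the children of <body>, not the wrapper document.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo sanitizeHtml('<p onclick="x()">Hi <b>there</b></p><script>alert(1)</script>');
```

Anything that is not in the table is removed, so a new tag or attribute is unsafe only if someone deliberately adds it to the list.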
Ward
Forum Commoner
Posts: 74
Joined: Thu Jul 13, 2006 10:01 am

Post by Ward »

Yes, I agree with the whitelist idea. It will take much less time to determine the allowable tags than the disallowable ones. For example, you can use a regular expression to match everything within these tags:

Code:

<a></a>
<p></p>
<b></b>
<i></i>
<strong></strong>
<em></em>
With a regex something like this (might not be perfect, I'm not great at regular expressions):

Code:

<(a|p|b|i|strong|em)\b[^>]*>.*?</\1>
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

I wrote this in a moment of regex madness. It removes all tags and attributes that are not allowed. It also tries to find any hidden JavaScript.

Code:

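function CleanUp($input)
{
	// List of allowed tags. define() is guarded so that calling
	// CleanUp() more than once does not try to redefine the constant.
	if(!defined('__HTML__'))
	{
		define('__HTML__', 'a|b|br|i|img|p|span');
	}
	
	// List of allowed attributes
	if(!defined('__ATTRIBUTES__'))
	{
		define('__ATTRIBUTES__', 'src|alt|href|title|class|id');
	}
	
	if(!function_exists('DisallowedTagsCallback'))
	{
		// Receives a run of text and disallowed tags found between allowed
		// tags: the tags are stripped and the remaining text is entity-encoded.
		function DisallowedTagsCallback($input)
		{
			$input[0] = strip_tags($input[0]);
			return htmlentities($input[0]);
		}
	}
	
	if(!function_exists('DisallowedAttributesCallback'))
	{
		// Receives the inside of one tag (the text between < and >) and removes
		// every quoted attribute whose name is not in the allowed list. The \b
		// after the list makes the name match exactly, so that a name such as
		// "srcdoc" is not accepted just because it starts with "src".
		function DisallowedAttributesCallback($input)
		{
			$regex = '/\s*\b(?!(?:'.__ATTRIBUTES__.')\b)[a-z]+\b\s*[=]\s*([\'"])'.
			         '(((?!\1).)|((?<=[\\\])\1))*\1/is';
			return preg_replace($regex, '', $input[0]);
		}
	}
	
	// Strip any javascript: <script> blocks, quoted on* event handlers, and
	// href values that contain "javascript". Note that these patterns only
	// match quoted attribute values; an unquoted value (onclick=foo) is not caught.
	$regex = array('/\<script\b[^>]*\>((?!\<\/script\b[^>]*\>).)*\<\/script\b[^>]*\>/is',
	               '/\s*on[a-z]+\s*[=]\s*(["\'])(((?!\1).)|((?<=[\\\])\1))*\1/i',
	               '/href+\s*[=]\s*(["\'])((?!\1).)*javascript((?!\1).)*\1/is');
	$replace = array('', '', 'href="#"');
	$input = preg_replace($regex, $replace, $input);
	
	// Strip disallowed tags
	$regex = "@(((?<=^)|(?<=[>]))(?![<]/?(".__HTML__.")\b[^>]*[>])".
	         "([^<]|((?![<]/?(".__HTML__.")\b[^>]*[>])[<]))+)@i";
	$input = preg_replace_callback($regex, 'DisallowedTagsCallback', $input);
	
	// Strip disallowed attributes from the tags that remain
	$regex = '/(?<=[<])[^>]+(?=[>])/';
	return preg_replace_callback($regex, 'DisallowedAttributesCallback', $input);
}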