Combined BBCode and HTML

Posted: Thu Aug 09, 2007 7:24 pm
by GameMusic
I want to let users enter both BBCode and HTML at the same time. It works in normal use, but if somebody did something weird like [bbcode]</div>text[/bbcode] the output messes up if [bbcode] converts to a div. If the user is malicious they might be able to use codes to improperly nest the HTML.

The system I am using right now tokenizes HTML, validates it, then tokenizes BBCode. The indexes of each code in the token array are used to match and convert them with nesting. I can't think of any effective way to validate both simultaneously.

Posted: Thu Aug 09, 2007 8:27 pm
by superdezign
Then, you should tokenize them together and validate the groupings. Your current method of tokenizing likely uses regex, so you can just give it a '([bbcode]|<html>)' kind of pattern for tokenization.
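A minimal sketch of that combined tokenization, assuming PHP 7 or later. The names tokenizeMarkup and isProperlyNested are hypothetical, not from the code in this thread, and void tags such as <br> are not handled. One pattern splits out both kinds of tag, and a single stack then checks that BBCode and HTML tags close in the order they were opened.

```php
<?php
// Hypothetical helper: split input into BBCode tags, HTML tags and text.
function tokenizeMarkup(string $input): array
{
    // One alternation covers [tag], [/tag], [tag=value], <tag attrs> and </tag>.
    $pattern = '~(\[/?\w+(?:=[^\]]*)?\]|</?\w+(?:\s[^<>]*)?>)~';
    $parts = preg_split($pattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = array();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even indexes hold the text between tags (possibly empty).
            if ($part !== '') {
                $tokens[] = array('kind' => 'text', 'name' => '', 'close' => false, 'raw' => $part);
            }
            continue;
        }
        // Odd indexes hold the captured tags.
        preg_match('~^[\[<](/?)(\w+)~', $part, $m);
        $tokens[] = array(
            'kind'  => $part[0] === '[' ? 'bbcode' : 'html',
            'name'  => strtolower($m[2]),
            'close' => $m[1] === '/',
            'raw'   => $part,
        );
    }
    return $tokens;
}

// Hypothetical check: one stack for both syntaxes, so a stray </div>
// inside a BBCode pair is reported as a nesting error.
function isProperlyNested(array $tokens): bool
{
    $stack = array();
    foreach ($tokens as $tok) {
        if ($tok['kind'] === 'text') {
            continue;
        }
        $key = $tok['kind'] . ':' . $tok['name'];
        if (!$tok['close']) {
            $stack[] = $key;
        } elseif (array_pop($stack) !== $key) {
            return false;
        }
    }
    return count($stack) === 0;
}
```

With this sketch, isProperlyNested(tokenizeMarkup('[b]</div>text[/b]')) returns false, because the </div> arrives while [b] is the innermost open tag.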

Posted: Fri Aug 10, 2007 3:18 am
by stereofrog
GameMusic wrote: The system I am using right now tokenizes HTML, validates it, then tokenizes BBCode. The indexes of each code in the token array are used to match and convert them with nesting. I can't think of any effective way to validate both simultaneously.
Convert BBCode to html first, then validate the resulting html.
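A rough sketch of the convert-first idea, assuming PHP 7 or later. The functions bbcodeToHtml and htmlIsBalanced and the [quote] to div mapping are made up for illustration and are not part of the code in this thread. Once the BBCode has become HTML, a single balance check catches tags that the user crossed with the generated ones.

```php
<?php
// Hypothetical converter: only [b], [i] and [quote] are handled here.
function bbcodeToHtml(string $text): string
{
    $map = array(
        '[b]' => '<b>', '[/b]' => '</b>',
        '[i]' => '<i>', '[/i]' => '</i>',
        '[quote]' => '<div class="quote">', '[/quote]' => '</div>',
    );
    return str_ireplace(array_keys($map), array_values($map), $text);
}

// Returns true when every opening tag is closed in the right order.
// Void tags such as <br> are not handled in this sketch.
function htmlIsBalanced(string $html): bool
{
    preg_match_all('~<(/?)(\w+)(?:\s[^<>]*)?>~', $html, $tags, PREG_SET_ORDER);
    $open = array();
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[1] === '') {
            $open[] = $name;                      // opening tag: remember it
        } elseif (array_pop($open) !== $name) {
            return false;                         // closing tag does not match
        }
    }
    return count($open) === 0;
}
```

For the example from the first post, '[quote]</div>text[/quote]' becomes '<div class="quote"></div>text</div>', and htmlIsBalanced rejects it because the last </div> has nothing left to close.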

Posted: Fri Aug 10, 2007 12:37 pm
by GameMusic
I don't want to bother the gurus here with this question, but I'm kind of stuck.

What's the fastest/lowest memory way to do this?

Here are the functions I'm currently using.

1. Recursive regexp (yikes!) to find the non-HTML formatting BBCodes, i.e. nobr, noemot, nocode, nohtml, which quote such codes so that users can turn them off. A function takes the data and runs the regexp, the callback handles the codes, then calls the main function on the text in each code.

2. HTML regexp with a handler that keeps track of every code and marks each one that is decided to be passable, then a recursive regexp that unquotes correct HTML.

3. Tokenizer that regexps for BBCode, and a loop that uses push/pop arrays to detect how deep each code is and convert the correctly nested codes.

If I do the BBCode-first technique, I'd probably mark the generated HTML so that it's prioritized in the HTML functions.

So, as you can see, these are some hefty functions if the text is extremely complex. Regexp is often faster than people think because it's native, but a recursive regexp is probably a memory hog, which can be a big deal in PHP.

Posted: Sat Aug 11, 2007 3:28 am
by stereofrog
GameMusic wrote: What's the fastest/lowest memory way to do this?
Hard to say without seeing the code (btw posting a couple of examples would be helpful).
Besides the regexp approach, there are also different parsing techniques you may find useful.

Posted: Sat Aug 11, 2007 6:38 am
by superdezign
Why use all of those techniques when you are already tokenizing the data? You have pretty set-in-stone left and right delimiters...

Posted: Sat Aug 11, 2007 9:41 am
by GameMusic
Well, I'm tokenizing the BBCode. The method I'm using with HTML is sort of tokenizing. However, I am not too experienced with this kind of technique, so I have not come up with an effective way to analyze the token nesting, especially with multiple types. Where might I find info on a good parsing technique for doing this?

I will post some examples when I clean up the code.

Thanks

Posted: Sun Aug 12, 2007 10:59 am
by GameMusic
The HTML is quoted first (<>) and then unquoted when approved. HTMLFilter is the main function.

Code:

function HTMLFilter(&$string) {
	// handle tags of the form <x att="">content</x>
	$string = UnquoteDualTags(FindTags($string));
}

function FindTags($string) {
	global $HTMLFilterTags;

	// $HTMLFilterTags[div] = count of how many divs are open, used in TagHandler
	// init tag counts each time a string is sent to this function to be filtered
	$HTMLFilterTags = array();

	// add matching codes to approved dual tags
	$string = preg_replace_callback('/(<)(.*?)(>)/s', 'TagHandler', $string);
	return $string;
}

/* TagHandler($matches): Adds a <number> code to matching approved HTML tags
so that tags of the same type are nested properly
ARGUMENTS: $matches(0 = whole tag, 1 = <, 2 = tag contents, 3 = >)
REFERENCES: Called from html filter
IMPROVEMENTS: Test if attributes are style, too big, etc.*/
function TagHandler($matches) {
	global $HTMLFilterTags;
	$tagCounts =& $HTMLFilterTags; // $tagCounts[div] = count of how many divs are open

	// get tag name and attributes $tag = array(whole contents, open/close, name, attributes)
	if(!preg_match('/^((?:\\/\\s*)?)(\\w+?)((?:\\s+\\w+=".*?")*?)\\s*$/is', $matches[2], $tag)) {
		// not a tag, return
		return $matches[0];
	}
	$tagname = $tag[2];

	static $HTMLFilterTagLegalList =
		array('b', 'i', 'u', 'div', 'span');
	if(!in_array(strtolower($tagname), $HTMLFilterTagLegalList)) {
		// not a legal tag, return
		return $matches[0];
	}
	if($tagname && $tag[1] == '') {
		// it's a properly formed opening tag
		settype($tagCounts[$tagname], 'int');
		$tagCounts[$tagname]++; // increase counter for this tag type
		return $matches[1] . '<' . $tagCounts[$tagname] . '>' . $matches[2] . $matches[3];
	}
	else if($tagname && $tag[3] == '') {
		// it's a properly formed closing tag
		if($tagCounts[$tagname] > 0) {
			// there's an open tag
			settype($tagCounts[$tagname], 'int');
			$tagCounts[$tagname]--; // decrease counter for this tag type
			return $matches[1] . '<' . $tagCounts[$tagname] . '>' . $matches[2] . $matches[3];
		}
	}
	return $matches[0];
}

function UnquoteDualTags($string) {
	// unquote any properly formed dual tags <<number>tag att="variable"><<number>/tag>
	// they should be marked with <number> by FindTags
	$string = preg_replace_callback(
		'/(<((?:<\\d*>)?)(\\w+?)((?:\\s+\\w+=".*?")*?)>(.*?))<\\2\\/\\3>/is',
		'UnquoteHTML', $string);
	// remove tag matching codes <x>
	$string = preg_replace('/<\\d*>/is','', $string);
	return $string;
}

/* UnquoteHTML($matches): Rebuilds one approved tag pair as real HTML and
recursively unquotes any approved pairs nested in its text
ARGUMENTS: $matches(1 = opening tag + text, 2 = <number>, 3 = tagname, 4 = attributes, 5 = text)
REFERENCES: Called from UnquoteDualTags
IMPROVEMENTS: Test if attributes are style, too big, etc.*/
function UnquoteHTML($matches) {
	return '<'.$matches[3].$matches[4].'>'.UnquoteDualTags($matches[5]).'</'.$matches[3].'>';
}
I tested it and code like <b><i></b></i> validates. That doesn't make sense because <b><i></b> should match in UnquoteDualTags, so it must be running a second pass to catch the <i></i>? Could somebody more familiar with regexp explain this? Thanks.

Posted: Sun Aug 12, 2007 11:57 am
by superdezign
Well, I'm too lazy to read the code as of yet :P, however, there are three ways (that I know of) to handle nesting.
  • Working from the outside in, handling nesting as you come across it by skipping the nested elements.
  • Working from the inside out, finding the deepest tokens and parsing them first.
  • Recursively handling tokens as you come across them.
I tend to favor the last method, but sometimes it's easier to do the others.
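As an illustration of the inside-out option from the list above, here is a small sketch assuming PHP 7 or later. The function name convertInsideOut and the three-tag whitelist are hypothetical. It escapes the text once, then keeps converting the innermost pairs (those whose body contains no square brackets) until nothing changes, so crossed codes are simply left as literal text.

```php
<?php
// Hypothetical inside-out converter for a tiny BBCode whitelist.
function convertInsideOut(string $text): string
{
    // Escape everything first; square brackets are not affected.
    $text = htmlspecialchars($text, ENT_QUOTES);

    // An innermost pair is [tag]body[/tag] where body has no brackets at all.
    $pattern = '~\[(b|i|u)\]([^\[\]]*)\[/\1\]~';

    do {
        // Each pass converts the current innermost pairs; the next pass
        // then sees their parents as innermost.
        $text = preg_replace($pattern, '<$1>$2</$1>', $text, -1, $count);
    } while ($count > 0);

    return $text;
}
```

In this sketch '[b][i]x[/i][/b]' needs two passes, while '[b][i]x[/b][/i]' never matches and comes back unchanged. The cost is one full regexp scan per nesting level.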

Posted: Sun Aug 12, 2007 1:28 pm
by stereofrog
GameMusic wrote: I tested it and code like <b><i></b></i> validates. That doesn't make sense because <b><i></b> should match in UnquoteDualTags, so it must be running a second pass to catch the <i></i>? Could somebody more familiar with regexp explain this? Thanks.
Looks like you're decrementing the counter in the wrong place in the closing-tag branch of TagHandler(). The marker has to be built before the counter is decremented, so the code should probably be:

Code:

$q = $matches[1] . '<' . ($tagCounts[$tagname]) . '>' . $matches[2] . $matches[3];
$tagCounts[$tagname]--; // decrease counter for this tag type 
return $q;
However, you might be better off making your stack-based parser more explicit. Read tokens one by one, push opening tags onto the stack and pop them off when you see a matching closing tag. In pseudocode:

Code:

while tok = get_token {
	switch tok->type
		case open_tag
			push(stack, tok)
		
		case close_tag
			if stack[top]->tag_name == tok->tag_name
				print <tok->tag_name> stack[top]->content </tok->tag_name>
				pop(stack)
			else
				nesting error
		
		default	
			stack[top]->content .= escape(tok)
}

if !empty(stack)
	nesting error
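
One possible runnable version of the pseudocode above, as a sketch assuming PHP 7.1 or later. The function name renderNested is hypothetical, attributes are not handled, and the whitelist is copied from the filter posted earlier in the thread. Unlike the pseudocode, a closed element is appended to its parent's content rather than printed at once, which keeps the output in order, and a root frame collects top-level text.

```php
<?php
// Hypothetical stack-based filter: returns safe HTML, or null on a nesting error.
function renderNested(string $input): ?string
{
    $allowed = array('b', 'i', 'u', 'div', 'span');

    // Split into tag tokens and text tokens (attribute-less tags only).
    $tokens = preg_split('~(</?\w+>)~', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Each frame holds an open tag name and the content collected so far.
    // Frame 0 is a root frame for top-level output.
    $stack = array(array('name' => null, 'content' => ''));

    foreach ($tokens as $tok) {
        $top = count($stack) - 1;
        if (preg_match('~^<(/?)(\w+)>$~', $tok, $m) && in_array(strtolower($m[2]), $allowed, true)) {
            $name = strtolower($m[2]);
            if ($m[1] === '') {
                // Opening tag: push a new frame.
                $stack[] = array('name' => $name, 'content' => '');
            } elseif ($top > 0 && $stack[$top]['name'] === $name) {
                // Matching closing tag: pop and hand the element to its parent.
                $frame = array_pop($stack);
                $stack[$top - 1]['content'] .= "<$name>" . $frame['content'] . "</$name>";
            } else {
                return null; // nesting error
            }
        } else {
            // Plain text, or a tag that is not whitelisted: keep it escaped.
            $stack[$top]['content'] .= htmlspecialchars($tok, ENT_QUOTES);
        }
    }

    // Anything still open at the end is also a nesting error.
    return count($stack) === 1 ? $stack[0]['content'] : null;
}
```

Here renderNested('<b><i>x</b></i>') returns null, since </b> arrives while <i> is on top of the stack. Rejecting the whole input is only one policy; escaping the offending tag and carrying on would be another.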