Combined BBCode and HTML
I want to let users enter both BBCode and HTML at the same time. It works in normal use, but if somebody enters something odd like [bbcode]</div>text[/bbcode], the output breaks whenever [bbcode] converts to a div. A malicious user might be able to use such codes to nest the HTML improperly.
The system I am using right now tokenizes HTML, validates it, then tokenizes BBCode. The indexes of each code in the token array are used to match and convert them with nesting. I can't think of any effective way to validate both simultaneously.
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
- stereofrog
- Forum Contributor
- Posts: 386
- Joined: Mon Dec 04, 2006 6:10 am
Re: Combined BBCode and HTML
GameMusic wrote: The system I am using right now tokenizes HTML, validates it, then tokenizes BBCode. The indexes of each code in the token array are used to match and convert them with nesting. I can't think of any effective way to validate both simultaneously.
Convert BBCode to HTML first, then validate the resulting HTML.
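As a rough illustration of that ordering, here is a minimal sketch in PHP (7.1 or later assumed). The function names `bbcodeToHtml` and `isWellNested` and the code-to-tag map are hypothetical, since the thread never lists the real codes. It ignores attributes and void tags such as `<br>`, so it shows the idea rather than a finished filter.

```php
<?php
// Hypothetical map from BBCode names to HTML tag names (an assumption).
function bbcodeToHtml(string $text): string
{
    $map = ['b' => 'b', 'i' => 'i', 'box' => 'div'];
    return preg_replace_callback(
        '/\[(\/?)(\w+)\]/',
        function (array $m) use ($map): string {
            $name = strtolower($m[2]);
            // Unknown codes are left untouched.
            return isset($map[$name]) ? '<' . $m[1] . $map[$name] . '>' : $m[0];
        },
        $text
    );
}

// True when every closing tag closes the most recently opened tag
// and nothing is left open at the end.
function isWellNested(string $html): bool
{
    $stack = [];
    preg_match_all('/<(\/?)(\w+)[^>]*>/', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[1] === '') {
            $stack[] = $name;                    // opening tag
        } elseif (array_pop($stack) !== $name) {
            return false;                        // wrong or missing opener
        }
    }
    return $stack === [];
}
```

With this sketch, `isWellNested(bbcodeToHtml('[box]</div>text[/box]'))` returns `false`, because after conversion the stray `</div>` is seen by the same check as the generated tags.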
I don't want to bother the gurus here with this question, but I'm kind of stuck.
What's the fastest/lowest memory way to do this?
Here are the functions I'm currently using.
1. Recursive regexp (yikes!) to find non-HTML formatting BBCodes, i.e. nobr, noemot, nocode, nohtml, which quote such codes so that users can turn them off. A function takes the data and runs the regexp, the callback handles the codes, and then the main function is called on the text inside each code.
2. HTML regexp with a handler that keeps track of every code and marks each one that is decided to be passable, then a recursive regexp that unquotes correct HTML.
3. Tokenizer that regexps for BBCode, and a loop that uses push/pop arrays to detect how deep each code is and convert the correctly nested codes.
If I do the BBCode-first technique, I'd probably mark the generated HTML so that it's prioritized in the HTML functions.
So, as you can see, these are some hefty functions if the text is extremely complex. Regexp is often faster than people think because it's native, but a recursive regexp is probably a memory hog, which can be a big deal in PHP.
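One possible way to sidestep the recursive regexp is to tokenize BBCode and HTML together in a single flat pass and leave nesting to a later loop. The sketch below assumes PHP 7 or later; the function name `tokenize` and the token array layout are hypothetical, not taken from the code in this thread. No speed or memory figures are claimed for it.

```php
<?php
// Split text into tag tokens and text tokens in one non-recursive pass.
function tokenize(string $text): array
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, the result
    // alternates: text, tag, text, tag, ..., text.
    $parts = preg_split(
        '/(\[\/?\w+\]|<\/?\w+[^>]*>)/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    $tokens = [];
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even positions are plain text; skip empty pieces.
            if ($part !== '') {
                $tokens[] = ['type' => 'text', 'raw' => $part];
            }
            continue;
        }
        // Odd positions are always tags matched by the split pattern.
        preg_match('/^[\[<](\/?)(\w+)/', $part, $m);
        $tokens[] = [
            'type'   => $m[1] === '/' ? 'close' : 'open',
            'syntax' => $part[0] === '[' ? 'bbcode' : 'html',
            'name'   => strtolower($m[2]),
            'raw'    => $part,
        ];
    }
    return $tokens;
}
```

Because both syntaxes end up in one token list, a single push/pop loop can then check nesting across BBCode and HTML at the same time.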
Well, I'm tokenizing the BBCode. The method I'm using with HTML is a sort of tokenizing too. However, I am not very experienced with this kind of technique, so I have not come up with an effective way to analyze the token nesting, especially with multiple token types. Where might I find info on a good parsing technique for doing this?
I will post some examples when I clean up the code.
Thanks
The HTML is quoted first (<>) and then unquoted when approved. HTMLFilter is the main function.
I tested it and code like <b><i></b></i> validates. That doesn't make sense, because <b><i></b> should match in UnquoteDualTags, so it must be running a second pass to catch the <i></i>? Could somebody more familiar with regexp explain this? Thanks.
Code: Select all
function HTMLFilter(&$string) {
    // handle tags of the form <x att="">content</x>
    $string = UnquoteDualTags(FindTags($string));
}

function FindTags($string) {
    global $HTMLFilterTags;
    // $HTMLFilterTags[div] = count of how many divs are open, used in TagHandler
    // init tag counts each time a string is sent to this function to be filtered
    $HTMLFilterTags = array();
    // add matching codes to approved dual tags
    $string = preg_replace_callback('/(<)(.*?)(>)/s', 'TagHandler', $string);
    return $string;
}

/* TagHandler($matches): Adds a <number> code to matching approved HTML tags
   so that tags of the same type are nested properly
   ARGUMENTS: $matches(0 = whole tag, 1 = <, 2 = tag contents, 3 = >)
   REFERENCES: Called from html filter
   IMPROVEMENTS: Test if attributes are style, too big, etc.*/
function TagHandler($matches) {
    global $HTMLFilterTags;
    $tagCounts =& $HTMLFilterTags; // $tagCounts[div] = count of how many divs are open
    // get tag name and attributes $tag = array(whole contents, open/close, name, attributes)
    if(!preg_match('/^((?:\\/\\s*)?)(\\w+?)((?:\\s+\\w+=".*?")*?)\\s*$/is', $matches[2], $tag)) {
        // not a tag, return
        return $matches[0];
    }
    $tagname = $tag[2];
    static $HTMLFilterTagLegalList =
        array('b', 'i', 'u', 'div', 'span');
    if(!in_array(strtolower($tagname), $HTMLFilterTagLegalList)) {
        // not a legal tag, return
        return $matches[0];
    }
    if($tagname && $tag[1] == '') {
        // it's a properly formed opening tag
        settype($tagCounts[$tagname], 'int');
        $tagCounts[$tagname]++; // increase counter for this tag type
        return $matches[1] . '<' . $tagCounts[$tagname] . '>' . $matches[2] . $matches[3];
    }
    else if($tagname && $tag[3] == '') {
        // it's a properly formed closing tag
        if($tagCounts[$tagname] > 0) {
            // there's an open tag
            settype($tagCounts[$tagname], 'int');
            $tagCounts[$tagname]--; // decrease counter for this tag type
            return $matches[1] . '<' . $tagCounts[$tagname] . '>' . $matches[2] . $matches[3];
        }
    }
    return $matches[0];
}

function UnquoteDualTags($string) {
    // unquote any properly formed dual tags <<number>tag att="variable"><<number>/tag>
    // they should be marked with <number> by FindTags
    $string = preg_replace_callback(
        '/(<((?:<\\d*>)?)(\\w+?)((?:\\s+\\w+=".*?")*?)>(.*?))<\\2\\/\\3>/is',
        'UnquoteHTML', $string);
    // remove tag matching codes <x>
    $string = preg_replace('/<\\d*>/is','', $string);
    return $string;
}

/* UnquoteHTML($matches): Adds a code to matching approved HTML tags
   so that tags of the same type are nested properly
   ARGUMENTS: $matches(1 = opening tag + text, 2 = <number>, 3 = tagname, 4 = attributes, 5 = text)
   REFERENCES: Called from html filter
   IMPROVEMENTS: Test if attributes are style, too big, etc.*/
function UnquoteHTML($matches) {
    return '<'.$matches[3].$matches[4].'>'.UnquoteDualTags($matches[5]).'</'.$matches[3].'>';
}
Well, I'm too lazy to read the code as of yet. However, there are three ways (that I know of) to handle nesting.
- Work from the outside in, handling nesting as you come across it by skipping the nested elements.
- Work from the inside out by finding the deepest tokens and parsing them first.
- Recursively handle tokens as you come across them.
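The third approach, recursive handling, might look roughly like the sketch below (PHP 7.1 or later assumed). The names `parseNodes` and `renderBbcode`, the restriction to [b], [i] and [u], and the policy of printing unmatched codes as literal text are all assumptions made for illustration.

```php
<?php
// Parse tokens from $pos until the closing code named $until is found.
// Returns [rendered HTML, whether the expected closing code was seen].
function parseNodes(array $tokens, int &$pos, ?string $until = null): array
{
    $out = '';
    while ($pos < count($tokens)) {
        $tok = $tokens[$pos++];
        if (!preg_match('/^\[(\/?)(b|i|u)\]$/', $tok, $m)) {
            $out .= htmlspecialchars($tok);            // plain text
        } elseif ($m[1] === '/') {
            if ($m[2] === $until) {
                return [$out, true];                   // closes our opener
            }
            $out .= htmlspecialchars($tok);            // stray close: keep as text
        } else {
            // Opening code: recurse to collect everything up to its close.
            [$inner, $closed] = parseNodes($tokens, $pos, $m[2]);
            $out .= $closed
                ? "<{$m[2]}>{$inner}</{$m[2]}>"
                : htmlspecialchars($tok) . $inner;     // never closed: keep as text
        }
    }
    return [$out, $until === null];
}

function renderBbcode(string $text): string
{
    $tokens = preg_split('/(\[\/?(?:b|i|u)\])/', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $pos = 0;
    return parseNodes($tokens, $pos)[0];
}
```

Each opening code only becomes HTML once its own closing code has been found, so the emitted tags are always balanced. For example, `renderBbcode('[b][i]x[/b][/i]')` gives `[b]<i>x[/b]</i>`.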
GameMusic wrote: I tested it and code like <b><i></b></i> validates. That doesn't make sense because <b><i></b> should match in UnquoteDualTags, so it must be running a second pass to catch the <i></i>? Could somebody more familiar with regexp explain this? Thanks.
Looks like you're decrementing the counter in the wrong place in the closing-tag branch of TagHandler(). The code should probably be
Code: Select all
$q = $matches[1] . '<' . ($tagCounts[$tagname]) . '>' . $matches[2] . $matches[3];
$tagCounts[$tagname]--; // decrease counter for this tag type
return $q;
Code: Select all
while tok = get_token {
    switch tok->type
        case open_tag
            push(stack, tok)
        case close_tag
            if stack[top]->tag_name == tok->tag_name
                print <tok->tag_name> stack[top]->content </tok->tag_name>
                pop(stack)
            else
                nesting error
        default
            stack[top]->content .= escape(tok)
}
if !empty(stack)
    nesting error
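For readers who want something runnable, the following is one possible PHP rendering of the pseudocode above (PHP 7.1 or later assumed). It differs from the pseudocode in two places, both assumptions of this sketch: a sentinel root frame holds text that sits outside any tag, and a closed element is appended to its parent's content instead of being printed immediately. The name `renderNested` and the default tag list are hypothetical, and attributes are not supported.

```php
<?php
// Returns the rendered HTML, or null on a nesting error.
function renderNested(string $text, array $allowed = ['b', 'i', 'u']): ?string
{
    // Odd positions of the split result are candidate tags, even ones are text.
    $parts = preg_split('/(<\/?\w+>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    // Frame 0 is a sentinel root so top-level text has somewhere to go.
    $stack = [['name' => null, 'content' => '']];
    foreach ($parts as $i => $part) {
        $top = count($stack) - 1;
        $isTag = $i % 2 === 1
            && preg_match('/^<(\/?)(\w+)>$/', $part, $m)
            && in_array(strtolower($m[2]), $allowed, true);
        if (!$isTag) {
            // Text and disallowed tags are escaped into the current frame.
            $stack[$top]['content'] .= htmlspecialchars($part);
        } elseif ($m[1] === '') {
            $stack[] = ['name' => strtolower($m[2]), 'content' => ''];
        } elseif ($stack[$top]['name'] === strtolower($m[2])) {
            // Matching close: wrap the frame and hand it to its parent.
            $frame = array_pop($stack);
            $stack[$top - 1]['content'] .=
                "<{$frame['name']}>{$frame['content']}</{$frame['name']}>";
        } else {
            return null; // nesting error
        }
    }
    // Anything still open besides the root is also a nesting error.
    return count($stack) === 1 ? $stack[0]['content'] : null;
}
```

Under these assumptions `renderNested('<b><i></b></i>')` returns `null`, while `renderNested('<b>a<i>b</i></b>')` returns the input unchanged.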