Nonrecursive, stack based BBCode parser
Moderator: General Moderators
Nonrecursive, stack based BBCode parser
My goal is to write a stack based BBCode parser that works via a single pass. That is, it only tokenizes once. Also, this parser should have to be able to handle the weird case of [quote="[b]jeff[/b]"]hey[/quote] without resorting to calling a new instance of the parser... Any ideas?
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Well, as long as you filter it properly...The users can't be trusted, who knows what evil they might decide to submit?
I assume you want this parser in order to handle the fringe case. Otherwise, you probably should go with a prewritten package (PEAR's got one: http://pear.php.net/package/HTML_BBCodeParser)
So, do you know where to start? If not, I suggest you take a look at existing code to get a handle on what doing stack based parsing will entail.
I would recommend first parsing the text into an array of text and bbcode tags.
feyd | Please use
feyd | Please use
Code: Select all
,Code: Select all
and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: [url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url] to learn how to do it too.[/color]
Yes, I have looked at many, many existing solutions.
Parsing it into an array of text and BBCode tags does nothing for the fringe case because it would be considered to be an attribute...
Here is a quick BBCode to array thing using just a regex.Code: Select all
<?php
$data = '[b]as[i]da[/i]sd[/b]';
$regex = '#\[(/\w+|\w+\s*(="[^"]*")?)]#';
preg_match_all($regex, $data, $result);
$split = preg_split($regex, $data);
$stack = array();
for ($i = 0, $max = sizeof($split); $i < $max; ++$i) {
$stack[] = array_shift($split);
$stack[] = array_shift($result[0]);
}
?>feyd | Please use
Code: Select all
,Code: Select all
and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: [url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url] to learn how to do it too.[/color]
Last edited by mator on Sat Aug 19, 2006 5:22 pm, edited 1 time in total.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Okay, now you need to iterate through all the tags and attempt to make them match up. You accumulate a stack showing which tags you currently are in. For that bbcode sample, it'll look like
1. b (first bold tag)
2. b, i (first italic tag)
3. b (hit end italic tag, it was open, so allow it)
4. (hit end bold tag, same procedure.
As for the attribute problem, I've thought about it, and a better way would probably be like this:
You have a fundamental problem with the attribute thing because HTML simply is designed in that fashion.
1. b (first bold tag)
2. b, i (first italic tag)
3. b (hit end italic tag, it was open, so allow it)
4. (hit end bold tag, same procedure.
As for the attribute problem, I've thought about it, and a better way would probably be like this:
Code: Select all
[quoteblock]
[quotename]Na[b]me[/b][/quotename]
[quotebody]Text[/quotebody]
[/quoteblock]
Last edited by Ambush Commander on Sat Aug 19, 2006 5:24 pm, edited 1 time in total.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Hrmm...
No multiple parser instances, but you want to have nested tags.
Any recursive function can be made non-recursive, but often the price paid is reduced code maintainability and more convoluted call structures. Here's what you want to do:
1. Ditch the regexps and go through the string using strpos, strcspn and friends.
2. When you hit a bracket, register the tag name, and start parsing the attributes
3. At this point, you've clicked into a local register of tokens, while parsing the attribute, assign tokens to that local register as if it were the real one.
4. Once you hit the end of the attribute, check for the ending bracket, switch back to the global register.
5. Your choice: either convert that bbcode to HTML right here or save it as bbcode for later parsing.
6. Then, while generating HTML from the bbcode, maintain the state "inside attribute" or "outside attribute". When you're inside an attribute, direct the HTML output to wherever it needs to go in the final HTML output.
No multiple parser instances, but you want to have nested tags.
Any recursive function can be made non-recursive, but often the price paid is reduced code maintainability and more convoluted call structures. Here's what you want to do:
1. Ditch the regexps and go through the string using strpos, strcspn and friends.
2. When you hit a bracket, register the tag name, and start parsing the attributes
3. At this point, you've clicked into a local register of tokens, while parsing the attribute, assign tokens to that local register as if it were the real one.
4. Once you hit the end of the attribute, check for the ending bracket, switch back to the global register.
5. Your choice: either convert that bbcode to HTML right here or save it as bbcode for later parsing.
6. Then, while generating HTML from the bbcode, maintain the state "inside attribute" or "outside attribute". When you're inside an attribute, direct the HTML output to wherever it needs to go in the final HTML output.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Yep, you won't be able to do it with regexps (well, maybe you could, but it'd be a big regexp).
Let me show you what the code might look like:
Let me show you what the code might look like:
Code: Select all
$cursor = 0; // our location in the text
$inside_tag = false; // whether or not we're parsing the inside of a tag
$array = array(); // result array
// infinite loop protection
// has to be pretty big, since html docs can be big
// we're allow two hundred thousand tags... more than enough?
$loops = 0;
while(true) {
// infinite loop protection
if (++$loops > 200000) return array();
$position_next_lt = strpos($string, '<', $cursor);
$position_next_gt = strpos($string, '>', $cursor);
// triggers on "<b>asdf</b>" but not "asdf <b></b>"
if ($position_next_lt === $cursor) {
$inside_tag = true;
$cursor++;
}
if (!$inside_tag && $position_next_lt !== false) {
// We are not inside tag and there still is another tag to parse
$array[] = new
HTMLPurifier_Token_Text(
$this->parseData(
substr(
$string, $cursor, $position_next_lt - $cursor
)
)
);
$cursor = $position_next_lt + 1;
$inside_tag = true;
continue;
} elseif (!$inside_tag) {
// We are not inside tag but there are no more tags
// If we're already at the end, break
if ($cursor === strlen($string)) break;
// Create Text of rest of string
$array[] = new
HTMLPurifier_Token_Text(
$this->parseData(
substr(
$string, $cursor
)
)
);
break;
} elseif ($inside_tag && $position_next_gt !== false) {
// We are in tag and it is well formed
// Grab the internals of the tag
$strlen_segment = $position_next_gt - $cursor;
$segment = substr($string, $cursor, $strlen_segment);
// Check if it's a comment
if (
substr($segment, 0, 3) == '!--' &&
substr($segment, $strlen_segment-2, 2) == '--'
) {
$array[] = new
HTMLPurifier_Token_Comment(
substr(
$segment, 3, $strlen_segment - 5
)
);
$inside_tag = false;
$cursor = $position_next_gt + 1;
continue;
}
// Check if it's an end tag
$is_end_tag = (strpos($segment,'/') === 0);
if ($is_end_tag) {
$type = substr($segment, 1);
$array[] = new HTMLPurifier_Token_End($type);
$inside_tag = false;
$cursor = $position_next_gt + 1;
continue;
}
// Check if it is explicitly self closing, if so, remove
// trailing slash. Remember, we could have a tag like <br>, so
// any later token processing scripts must convert improperly
// classified EmptyTags from StartTags.
$is_self_closing= (strpos($segment,'/') === $strlen_segment-1);
if ($is_self_closing) {
$strlen_segment--;
$segment = substr($segment, 0, $strlen_segment);
}
// Check if there are any attributes
$position_first_space = strcspn($segment, $this->_whitespace);
if ($position_first_space >= $strlen_segment) {
if ($is_self_closing) {
$array[] = new HTMLPurifier_Token_Empty($segment);
} else {
$array[] = new HTMLPurifier_Token_Start($segment);
}
$inside_tag = false;
$cursor = $position_next_gt + 1;
continue;
}
// Grab out all the data
$type = substr($segment, 0, $position_first_space);
$attribute_string =
trim(
substr(
$segment, $position_first_space
)
);
if ($attribute_string) {
$attributes = $this->parseAttributeString(
$attribute_string
);
} else {
$attributes = array();
}
if ($is_self_closing) {
$array[] = new HTMLPurifier_Token_Empty($type, $attributes);
} else {
$array[] = new HTMLPurifier_Token_Start($type, $attributes);
}
$cursor = $position_next_gt + 1;
$inside_tag = false;
continue;
} else {
$array[] = new
HTMLPurifier_Token_Text(
'<' .
$this->parseData(
substr($string, $cursor)
)
);
break;
}
break;
}
return $array;Hmm, http://php.net/strtok#43381 has a pretty darn good approach, i just can't seem to morph it into what you suggested... could you give it a shot?
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
You can't use strtok because the function doesn't return information on which delimeter is used. For instance, if you're in an attribute, and the two possible delimiters are [ and ", one of them opening a tag, and the other one closing the attribute. strtok will grab the segment for you, but then you'll be in the dark on what to do next.
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
Here's one I wrote a while back. It checks all tags are closed etc, and if not it doesn't parse them.
It's far from being well-written but it does work.
EDIT | It also supports the nesting you're referring to with things like [quote="[url=xxx ] foo [/b ][/url ]"[/quote ]
It's far from being well-written but it does work.
EDIT | It also supports the nesting you're referring to with things like [quote="[url=xxx ] foo [/b ][/url ]"[/quote ]
Code: Select all
<?php
class BBCode
{
protected
$source,
$output,
$tokens = array(),
$tags = array(
'b' => '\[b\]',
'u' => '\[u\]',
'i' => '\[i\]',
'size' => '\[size=([^\]]+)\]',
'color' => '\[color=([^\]]+)\]',
'img' => '\[img\]',
'url' => '\[url=([^\]]+)\]',
'quote' => '\[quote(?:="([^"]+)")?\]',
'code' => '\[code(?:=([^\]]+))?\]',
'list' => '\[list(?:=([^\]]+))?\]',
'smilies' => ':dense:|:drunk:|:\)|:-\)|:smile:|;\)|;-\)|:wink:|:p|:P|:-p|:-P|:\(|:-\(|:sad:|:razz:|:\?|:confused:|:\||:-\||:blank:|:nod:|:shake:|:reading:|:cool:|:D|:-D|:grin:|:hit:|:smart:|:laugh:|:lol:|:thumbsup:|:thumbsdown:|:cry:|:wavecry:|:arrow:|:angry:|:woot:|:love:|:rolleyes:|:unsure:|:angel:|:clap:|:buzz:|:shocked:|:food:|:sleep:|:smirk:|:wiz:|:wizard:'
);
private $smiliePath = 'sys/img/smilies';
function __construct($input)
{
$this->setSource($input);
}
//Tag name => RegExp
public function setBBTag($tag, $eval)
{
$this->tags[$tag] = $eval;
}
protected function setSource($source)
{
$this->source = $source;
}
//Recursive
protected function tokenize($tag, $text=false, $ignore=array(), $stack=array(), $recursing=false)
{
if (!$text && !$recursing) $text = $this->source;
elseif (!$text && $recursing) return $stack; //Nothing left to do
$block = '';
foreach ($ignore as $t) $block .= $this->tags[$t].'.*?\[/'.$t.'\]|';
$re = '@'.$block.$this->tags[$tag].'|\[/'.$tag.'\]@is'; //Opening tag or closing tag
if (preg_match($re, $text, $matches, PREG_OFFSET_CAPTURE))
{
$chunks = $matches[0];
$offset = 0;
if ($chunks[1] > 0)
{
$offset = $chunks[1]; //Preg offset (substr)
$plain_text = substr($text, 0, $offset);
$stack[] = $plain_text; //Text before the tag
}
$stack[] = $chunks[0]; //The actual tag
$text = substr($text, (strlen($chunks[0])+$offset)); //Drop chunk off the string since already processed
return $this->tokenize($tag, $text, $ignore, $stack, 1); //Recurse
}
else
{
$stack[] = $text; //No match, nothing left to do
return $stack;
}
}
//Nothing more than a cascade through all the handlers
public function parseComplete($source='')
{
if (empty($source)) $source = $this->source;
$source = $this->parseSmilies($source);
$source = $this->parseBold($source);
$source = $this->parseItalic($source);
$source = $this->parseUnderline($source);
$source = $this->parseQuotes($source);
$source = $this->parseUrl($source);
$source = $this->parseCode($source);
$this->output = $source;
return $this->output;
}
public function parseSmilies($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('smilies', $source); //Fetch tokens
$ret = '';
foreach ($this->tokens as $tok)
{
switch ($tok)
{
case ':)':
case ':-)':
case ':smile:':
$ret .= '<img src="'.$this->smiliePath.'/smile1.gif" alt="smile" />';
break;
case ';)':
case ';-)':
case ':wink:':
$ret .= '<img src="'.$this->smiliePath.'/wink.gif" alt="wink" />';
break;
case ':p':
case ':P':
case ':-p':
case ':-P':
case ':razz:':
$ret .= '<img src="'.$this->smiliePath.'/tongue.gif" alt="razz" />';
break;
case ':(':
case ':-(':
case ':sad:':
$ret .= '<img src="'.$this->smiliePath.'/sad.gif" alt="sad" />';
break;
case ':nod:':
$ret .= '<img src="'.$this->smiliePath.'/yes.gif" alt="yes" />';
break;
case ':shake:':
$ret .= '<img src="'.$this->smiliePath.'/no.gif" alt="no" />';
break;
case ':reading:':
$ret .= '<img src="'.$this->smiliePath.'/coffee.gif" alt="coffee" />';
break;
case ':cool:':
$ret .= '<img src="'.$this->smiliePath.'/cool2.gif" alt="cool" />';
break;
case ':D':
case ':-D':
case ':grin:':
$ret .= '<img src="'.$this->smiliePath.'/grin.gif" alt="grin" />';
break;
case ':laugh:':
case ':lol:':
$ret .= '<img src="'.$this->smiliePath.'/laugh.gif" alt="laugh" />';
break;
case ':hit:':
$ret .= '<img src="'.$this->smiliePath.'/hit.gif" alt="wollop" />';
break;
case ':thumbsup:':
$ret .= '<img src="'.$this->smiliePath.'/thumbsup.gif" alt="thumbs up" />';
break;
case ':thumbsdown:':
$ret .= '<img src="'.$this->smiliePath.'/thumbsdown.gif" alt="thumbs down" />';
break;
case ':dense:':
$ret .= '<img src="'.$this->smiliePath.'/dense.gif" alt="dense" />';
break;
case ':smart:':
$ret .= '<img src="'.$this->smiliePath.'/smartass.gif" alt="smart" />';
break;
case ':?':
case ':confused:':
$ret .= '<img src="'.$this->smiliePath.'/huh.gif" alt="confused" />';
break;
case ':arrow:':
$ret .= '<img src="'.$this->smiliePath.'/arrow.gif" alt="arrow" />';
break;
case ':cry:':
$ret .= '<img src="'.$this->smiliePath.'/cry.gif" alt="cry" />';
break;
case ':wavecry:':
$ret .= '<img src="'.$this->smiliePath.'/wavecry.gif" alt="waving crying" />';
break;
case ':angry:':
$ret .= '<img src="'.$this->smiliePath.'/wag.gif" alt="angry" />';
break;
case ':love:':
$ret .= '<img src="'.$this->smiliePath.'/wub.gif" alt="lovey dovey" />';
break;
case ':woot:':
$ret .= '<img src="'.$this->smiliePath.'/w00t.gif" alt="woot!" />';
break;
case ':rolleyes:':
$ret .= '<img src="'.$this->smiliePath.'/rolleyes.gif" alt="rolling eyes" />';
break;
case ':unsure:':
$ret .= '<img src="'.$this->smiliePath.'/hmmm.gif" alt="hmmm" />';
break;
case ':angel:':
$ret .= '<img src="'.$this->smiliePath.'/angel.gif" alt="angel" />';
break;
case ':clap:':
$ret .= '<img src="'.$this->smiliePath.'/clap2.gif" alt="clapping" />';
break;
case ':drunk:':
$ret .= '<img src="'.$this->smiliePath.'/drunk.gif" alt="drunk" />';
break;
case ':buzz:':
$ret .= '<img src="'.$this->smiliePath.'/mml.gif" alt="buzzing" />';
break;
case ':|':
case ':-|':
case ':blank:':
$ret .= '<img src="'.$this->smiliePath.'/noexpression.gif" alt="blank" />';
break;
case ':shocked:':
$ret .= '<img src="'.$this->smiliePath.'/ohmy.gif" alt="shocked" />';
break;
case ':food:':
$ret .= '<img src="'.$this->smiliePath.'/pizza.gif" alt="pizza" />';
break;
case ':sleep:':
$ret .= '<img src="'.$this->smiliePath.'/sleeping.gif" alt="sleeping" />';
break;
case ':smirk:':
$ret .= '<img src="'.$this->smiliePath.'/smirk.gif" alt="smirk" />';
break;
case ':wiz:':
case ':wizard:':
$ret .= '<img src="'.$this->smiliePath.'/wizard.gif" alt="wizard" />';
break;
default: $ret .= $tok;
}
}
return $ret;
}
public function parseUrl($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('url', $source); //Fetch tokens
if (!$this->checkClosure('url'))
{
$this->output = $source;
return $this->output;
}
$ret = '';
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags['url'].'@is', $tok, $matches))
{ //Opening quote
$href = $this->makeAbsoluteUrl($matches[1]);
$ret .= '<a href="'.$href.'" target="_blank">';
}
elseif (preg_match('@\[/url\]@is', $tok, $matches))
{
$ret .= '</a>'; //Close the quote box
}
else
{
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
private function checkClosure($tag)
{
$opened = 0;
$closed = 0;
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags[$tag].'@is', $tok, $matches))
$opened++;
elseif (preg_match('@\[/'.$tag.'\]@is', $tok, $matches))
$closed++;
}
if ($opened === $closed) return true;
}
private function makeAbsoluteUrl($url)
{
if (!preg_match('@^[a-z]+://@i', $url)) $url = 'http://'.$url;
return $url;
}
public function parseQuotes($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('quote', $source, array('code')); //Fetch tokens
if (!$this->checkClosure('quote'))
{
$this->output = $source;
return $this->output;
}
$ret = '';
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags['quote'].'@is', $tok, $matches))
{ //Opening quote
$info = 'Quote';
if (!empty($matches[1])) $info = $matches[1].' wrote'; //Name parameter given
$ret .= '<div style="font-style: italic; border: 1px dotted #777777; background: #FFFFF8; padding: 4px; margin: 4px;">
<div style="font-weight: bold; font-style: normal;">'.$info.':</div>';
}
elseif (preg_match('@\[/quote\]@is', $tok, $matches))
{
$ret .= '</div>'; //Close the quote box
}
else
{
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
public function parseBold($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('b', $source, array('code')); //Fetch tokens
$ret = '';
if (!$this->checkClosure('b'))
{
$this->output = $source;
return $this->output;
}
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags['b'].'@is', $tok, $matches))
{ //Opening tag
$ret .= '<strong>';
}
elseif (preg_match('@\[/b\]@is', $tok, $matches))
{
$ret .= '</strong>'; //Close the tag
}
else
{
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
public function parseItalic($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('i', $source, array('code')); //Fetch tokens
$ret = '';
if (!$this->checkClosure('i'))
{
$this->output = $source;
return $this->output;
}
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags['i'].'@is', $tok, $matches))
{ //Opening tag
$ret .= '<em>';
}
elseif (preg_match('@\[/i\]@is', $tok, $matches))
{
$ret .= '</em>'; //Close the tag
}
else
{
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
public function parseUnderline($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('u', $source, array('code')); //Fetch tokens
$ret = '';
if (!$this->checkClosure('u'))
{
$this->output = $source;
return $this->output;
}
foreach ($this->tokens as $tok)
{
if (preg_match('@'.$this->tags['u'].'@is', $tok, $matches))
{ //Opening tag
$ret .= '<u>';
}
elseif (preg_match('@\[/u\]@is', $tok, $matches))
{
$ret .= '</u>'; //Close the tag
}
else
{
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
public function parseCode($source='')
{
if (empty($source)) $source = $this->source;
$this->tokens = $this->tokenize('code', $source); //Fetch tokens
$ret = '';
if (!$this->checkClosure('code'))
{
$this->output = $source;
return $this->output;
}
$type = false;
$last = 0;
$i = 0;
foreach ($this->tokens as $tok)
{
$i++;
if (preg_match('@'.$this->tags['code'].'@is', $tok, $matches))
{ //Opening code block
if (!empty($matches[1]))
{
switch (strtolower($matches[1]))
{
case 'javascript':
case 'js':
$type = 'js';
break;
default: $type = false;
}
$last = $i;
}
else $type = false;
$ret .= '<div style=" border: 1px solid #AAAAAA; padding: 4px; margin: 4px; background: #FFFFFF;">
<code style="white-space: pre; font-family: courier,monospace; color: #007700;">';
}
elseif (preg_match('@\[/code\]@is', $tok, $matches))
{
$ret .= '</code></div>'; //Close the code box
}
else
{
switch ($type)
{
case 'js':
$js = new JSHighlight(str_replace('<br />', '', $tok));
if ($last == $i-1) $tok = $js->Generate(1);
break;
}
$ret .= $tok; //Insert the text
}
}
$this->output = $ret;
return $this->output;
}
//Just for debugging
public function dumpTokens()
{
echo '<pre>'.print_r($this->tokens, 1).'</pre>';
}
//Internal
protected function getOutput()
{
return $this->output;
}
public function fetchResult()
{
return $this->getOutput();
}
}
?>It tokenizes multiple times and is not stack based, this would not be very efficient with large amounts of text. It also does not allow me to apply features gained from a stack like being able to close open tags, determine if a tag is being used in such a way that it should not be permitted (like nesting url tags)
if i just wanted a parser, i would have written a few regular expressions and be done with it.
if i just wanted a parser, i would have written a few regular expressions and be done with it.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Why not use this http://hp.jpsband.org/ and be done with it? Not BBCode, but I assume that you've already got some BBCode parser in place, so migration would be fairly harmless.