[SOLVED] Need to Add Another Recursion

Posted: Sat Dec 18, 2010 11:41 am
by Jonah Bron
Hello, world!

I'm working on the CryoBB parser over here, and we've hit a bit of a problem. Here is the expression we have now:

Code: Select all

#\[(%s)(=.+?)?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si
This regex matches BBCode tags (while accounting for nesting properly). Many thanks to ridgerunner, who I believe helped get it working. Now we want to allow nested BBCode tags inside the parameter. Here's what I tried:

Code: Select all

#\[(%s)(=(.|(?R))+?)?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si
But it behaved exactly the same. How can this be accomplished?
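
For anyone following along, here is a minimal sketch of how the first pattern can be exercised on a nested tag. It assumes the %s placeholder is filled in with sprintf() using an alternation of tag names. The tag list and the sample text are made up for illustration and are not taken from the CryoBB source.

```php
<?php
// Assumption: %s is replaced with an alternation of allowed tag names.
// This tag list is hypothetical.
$tags = 'quote|b|i';

// Same pattern as posted above. In a single-quoted PHP string, \\1 becomes \1.
$pattern = sprintf(
    '#\[(%s)(=.+?)?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si',
    $tags
);

// Made-up sample: a quote nested inside another quote.
$text = '[quote=Bob]outer [quote]inner[/quote] text[/quote]';

$found = preg_match($pattern, $text, $m);

echo $m[1], "\n"; // tag name
echo $m[2], "\n"; // parameter, including the leading "="
echo $m[3], "\n"; // contents, with the nested tag left intact
```

The inner [quote] is consumed by the (?R) branch, so the outer match ends at the last [/quote] rather than the first one.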

Re: Need to Add Another Recursion

Posted: Sat Dec 18, 2010 12:10 pm
by Jonah Bron
Okay, moving forward. I reversed...

Code: Select all

(.|(?R))+?
...to...

Code: Select all

((?R)|.)+?
When I do that, I can put a parameter in the tag like this:

Code: Select all

[quote=[i ]Bob[/i]]Hi[/quote]
And it works. But if I add more text, like this:

Code: Select all

[quote=[i ]Bob[/i] Johnson]Hi[/quote]
It breaks and shows the parameter as being "n".

Re: Need to Add Another Recursion

Posted: Sat Dec 18, 2010 1:28 pm
by Jonah Bron
More progress. I copy-pasted everything between the ] and the [ (for the bbcode tag) and put it in the parameter area. Here's what I have now.

Code: Select all

#\[(%s)(=((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+))?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si
This processes the parameter properly, but it breaks tags inside of it.

Re: Need to Add Another Recursion

Posted: Sat Dec 18, 2010 1:44 pm
by Jonah Bron
Even more progress. Here's the current issue. This regex:

Code: Select all

#\[(%s)(=((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+?))?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si
                                             ^
Works with nested tags. This regex:

Code: Select all

#\[(%s)(=((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+))?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si
Works with tags in the parameter. The only difference is the non-greedy +? (marked in the first one). I tried several configurations, but none of them behaves as wanted.
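
To show what greedy versus non-greedy changes, here is a stripped-down sketch. It is not the pattern above, just a plain .+ and .+? applied to a made-up parameter that contains a tag. It assumes the space in "[i ]" in the earlier examples was only there to stop the forum from rendering the tag.

```php
<?php
// Made-up input with a tag inside the parameter.
$text = '[quote=[i]Bob[/i] Johnson]Hi[/quote]';

// Lazy: the parameter ends at the first "]" that lets the match succeed.
preg_match('#\[quote=(.+?)\]#s', $text, $lazy);

// Greedy: the parameter runs to the last "]" that lets the match succeed.
preg_match('#\[quote=(.+)\]#s', $text, $greedy);

echo $lazy[1], "\n";   // [i
echo $greedy[1], "\n"; // [i]Bob[/i] Johnson]Hi[/quote
```

Neither capture is the wanted "[i]Bob[/i] Johnson", which suggests that toggling greediness alone cannot find the right closing bracket. The parameter has to be described in terms of balanced brackets.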

Re: Need to Add Another Recursion

Posted: Sat Dec 18, 2010 10:06 pm
by ridgerunner
Why in the world would you want to allow *nested* BBCodes within a BBCode attribute/parameter?

I have a few suggestions for things to do before adding any further complexity:
  • Write out the regex in verbose free-spacing mode with proper indentation for the parentheses levels and add lots of comments (i.e. a comment on every line). (Note that I provided this and several other suggestions on the original thread, all of which still apply: matching all bb tags - regex almost works!)
  • Test your regex on badly formed BBCode (missing and/or extra square brackets in various locations). It's relatively easy to get a regex to perform well on well-formed input which matches. Be sure that it performs well with non-matches/near-matches as well (non-matches are typically where you can get into trouble with catastrophic backtracking).
  • Test your regex on large subject texts, both well-formed and badly formed (with certain classes of regex, PHP/PCRE blows up quickly with excessive recursion when applied to larger strings).
You are stretching the limits of a single regex (to put it mildly). I hesitate to use the phrase: "I think this is impossible" - but on first glance, I'm tempted to say just that...

That said, I have to say you are braver than I!
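
The testing suggestions above can be sketched roughly as follows. This is only an illustration: try_match() is a hypothetical helper, "quote" stands in for %s, and the sample inputs are made up. The one thing it relies on is that preg_last_error() reports whether PCRE hit a limit instead of finishing the match.

```php
<?php
// Hypothetical helper: run a pattern and report whether PCRE gave up.
function try_match(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject);
    return [
        'matched' => $result === 1,
        // PREG_NO_ERROR unless a limit was hit or the input was invalid.
        'error'   => preg_last_error(),
    ];
}

// The pattern from the first post, with "quote" standing in for %s.
$pattern = '#\[(quote)(=.+?)?\]((?:[^[]|\[(?!/?\\1(=.+?)?\])|(?R))+)\[/\\1\]#si';

// Made-up test subjects: one good, several bad.
$cases = [
    'well-formed'     => '[quote]hello[/quote]',
    'missing close'   => '[quote]never closed',
    'stray bracket'   => '[quote]a ] b [ c[/quote]',
    // Raise the repeat count to stress the engine harder.
    'many unclosed'   => str_repeat('[quote]x', 200),
];

foreach ($cases as $name => $subject) {
    $r = try_match($pattern, $subject);
    printf("%-16s matched=%s error=%d\n", $name, $r['matched'] ? 'yes' : 'no', $r['error']);
}
```

A non-zero error code (for example PREG_BACKTRACK_LIMIT_ERROR or PREG_RECURSION_LIMIT_ERROR) means the "no match" result cannot be trusted, which is exactly the case worth catching before the parser goes live.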

Re: Need to Add Another Recursion

Posted: Sun Dec 19, 2010 12:24 pm
by ridgerunner
Well I couldn't resist...

I've re-formatted your regex in free-spacing mode and added comments. I've also written a new regex which corrects some problems and inefficiencies. It adds the ability to place non-nested BBCodes within the attribute. Here is a script which demonstrates both regexes in action:

Code: Select all

<?php // File: BBCodeAttribute.php Rev:20101219_1000
$data = '[quote=This [i]is[/i] a test!]quote contents here[/quote]';
// Original regex:
$re_orig = '%# Rev:Orig_20101218_1241_A
    # Post: http://forums.devnetwork.net/viewtopic.php?f=38&t=125859#p637912
    \[                     # BBCode opening tag opening delimiter.
    (\w+)                  # $1: Tag name = TAGNAME.
    (=.+?)?                # $2: Optional attribute (can be *anything*!).
    \]                     # BBCode opening tag closing delimiter.
    (                      # $3: Tag contents.
      (?:                  # Non-capture group for content alternatives.
        [^[]               # Anything other than the start of a tag.
      | \[                 # Start of an open or close tag, but only
        (?!/?\1(=.+?)?\])  # if it does not have name=TAGNAME.
      | (?R)               # or recursively match: [TAGNAME*]...[/TAGNAME].
      )+                   # One or more of the content alternatives.
    )                      # End $3. Tag contents.
    \[/\1\]                # BBcode closing tag (name = TAGNAME).
    %six';
// New regex:
$re_new = '%# Rev:New_20101219_1000
    \[                     # BBCode opening tag opening delimiter.
    (\w++|\*)              # $1: Tag name = TAGNAME.
    (=                     # $2: Optional attribute (with non-nested BBCodes).
      [^\]\[]*+            # Any non-[] characters (normal*)
      (?:                  # Use "Unrolling the loop" efficiency technique.
        \[[^\]\[]*+\]      # Allow [matching square brackets] (special)
        [^\]\[]*+          # More non-[] characters (normal*)
      )*+                  # (See: "Mastering Regular Expressions")
    )?+                    # End $2: Optional attribute.
    \]                     # BBCode opening tag closing delimiter.
    (                      # $3: Tag contents.
      (?:                  # Non-capture group for content alternatives.
        [^[]++             # Anything other than the start of a tag.
      | \[                 # Start of an open or close tag, but only
        (?!/?\1\b)         # if it does not have name=TAGNAME.
      | (?R)               # or recursively match: [TAGNAME*]...[/TAGNAME].
      )++                  # One or more of the content alternatives.
    )                      # End $3. Tag contents.
    \[/\1\]                # BBcode closing tag (name = TAGNAME).
    %x';
printf("Testing original regex:\n");
$count = preg_match_all($re_orig, $data, $matches, PREG_SET_ORDER);
if ($count > 0) {
    printf("%d matches found:\n", $count);
    print_r($matches);
} else {
    printf("No match found.\n");
}
printf("Testing new regex:\n");
$count = preg_match_all($re_new, $data, $matches, PREG_SET_ORDER);
if ($count > 0) {
    printf("%d matches found:\n", $count);
    print_r($matches);
} else {
    printf("No match found.\n");
}
?>
And here is the output from the script:
Testing original regex:
1 matches found:
Array
(
    [0] => Array
        (
            [0] => [quote=This [i]is[/i] a test!]quote contents here[/quote]
            [1] => quote
            [2] => =This [i
            [3] => is[/i] a test!]quote contents here
        )

)
Testing new regex:
1 matches found:
Array
(
    [0] => Array
        (
            [0] => [quote=This [i]is[/i] a test!]quote contents here[/quote]
            [1] => quote
            [2] => =This [i]is[/i] a test!
            [3] => quote contents here
        )

)
Some comments:
  • The original regex uses a dot in the attribute value. This dot allows square brackets (and not in a good way), which almost (but not quite) allows nested BBCode tags. See the script output and look at capture group $2.
  • The original regex does not make use of atomic groups or possessive quantifiers. The new regex does, which allows it to fail more quickly when trying to match malformed text.
  • The new regex allows square brackets in the attribute value, but only if they appear in matching pairs and are not nested. This allows for BBCodes within the attribute. However, you will need to further process these separately in the code logic.
  • The new regex employs the "Unrolling the loop" efficiency technique described in Jeffrey Friedl's classic work: Mastering Regular Expressions (3rd Edition)
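
As a rough illustration of the "Unrolling the loop" point, the sketch below compares a plain alternation with the unrolled form (normal* followed by repeated special normal*) for the attribute value alone. Both patterns are anchored here only so they can be tested in isolation, and the sample strings are made up.

```php
<?php
// Plain alternation: one character or one bracket pair per iteration.
$re_alt = '%^=(?:[^\]\[]|\[[^\]\[]*\])*$%';

// Unrolled: normal* (special normal*)*, as in the attribute part of the new regex.
$re_unrolled = '%^=[^\]\[]*+(?:\[[^\]\[]*+\][^\]\[]*+)*+$%';

// Made-up attribute values, with and without balanced brackets.
$samples = [
    '=Bob',
    '=This [i]is[/i] a test!',
    '=unbalanced [i',
    '=nested [a[b]]',
];

foreach ($samples as $s) {
    printf("%-26s alt=%d unrolled=%d\n", $s, preg_match($re_alt, $s), preg_match($re_unrolled, $s));
}
```

Both forms accept and reject the same strings. The unrolled one simply consumes runs of ordinary characters in a single step instead of re-entering the alternation for every character.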
I recommend careful study of the new regex. Someone on your team should know this stuff (i.e. dig in and study MRE3).
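
On the point about processing BBCodes in the attribute separately, here is one way that second pass might look. It is a sketch only: render_attribute() is a hypothetical helper, the HTML output is made up, and only flat [b] and [i] pairs are handled inside the attribute.

```php
<?php
// The new regex from the script above, with the comments dropped.
$re_new = '%
    \[ (\w++|\*)
    (= [^\]\[]*+ (?: \[[^\]\[]*+\] [^\]\[]*+ )*+ )?+
    \]
    ( (?: [^[]++ | \[ (?!/?\1\b) | (?R) )++ )
    \[/\1\]
    %x';

// Hypothetical second pass for the attribute value.
function render_attribute(string $attr): string
{
    // Escape HTML first; htmlspecialchars() leaves square brackets alone.
    $safe = htmlspecialchars($attr, ENT_QUOTES, 'UTF-8');
    // Only flat [b]...[/b] and [i]...[/i] pairs are converted.
    return preg_replace('%\[(b|i)\]([^\]\[]*+)\[/\1\]%', '<$1>$2</$1>', $safe);
}

$data = '[quote=This [i]is[/i] a test!]quote contents here[/quote]';

$html = preg_replace_callback($re_new, function (array $m): string {
    if ($m[1] !== 'quote') {
        return $m[0]; // leave other tags untouched in this sketch
    }
    // $m[2] holds the attribute with its leading "=", or "" if absent.
    $cite = $m[2] === '' ? '' : '<cite>' . render_attribute(substr($m[2], 1)) . '</cite>';
    return '<blockquote>' . $cite . htmlspecialchars($m[3], ENT_QUOTES, 'UTF-8') . '</blockquote>';
}, $data);

echo $html, "\n";
```

Keeping the attribute pass as its own small regex means the main pattern never has to recurse inside the parameter.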

Hope this helps!
:)

Re: Need to Add Another Recursion

Posted: Sun Dec 19, 2010 12:32 pm
by Jonah Bron
ridgerunner wrote: Why in the world would you want to allow *nested* BBCodes within a BBCode attribute/parameter?
Here's an example:
Forum Rules wrote: 1. Select the correct board for your query. Take some time to read the guidelines in the sticky topic.
Notice the link in the quoter name.

I'll take a swing at commenting it...

Code: Select all

/
\[                                            # tag opening
(%s)                                          # tag name
(                                             # optional parameter
    =(                                        # either...
        [^[]
        |\[(?!
            \/?\\1(=.+?)?\]
        )
        |(?R)                                 # a repeat of this pattern
    )+                                        # occurring lots of times
)?                                            # parameter optional
\]                                            # tag closing
(
    (?:
        [^[]
        |\[(?!
            \/?\\1(=.+?)?\]
        )
        |(?R)
    )+
)
\[\/\\1\]                                     # end of tag
/six
The uncommented zones are parts I don't understand the logic behind. Ha, look how the flags came out.
ridgerunner wrote: Test your regex on badly formed BBCode (missing and/or extra square brackets in various locations). It's relatively easy to get a regex to perform well on well-formed input which matches. Be sure that it performs well with non-matches/near-matches as well (non-matches are typically where you can get into trouble with catastrophic backtracking).
Test your regex on large subject texts, both well-formed and badly formed (with certain classes of regex, PHP/PCRE blows up quickly with excessive recursion when applied to larger strings).
That's a good suggestion. I'll do that when it's working.
ridgerunner wrote: You are stretching the limits of a single regex (to put it mildly). I hesitate to use the phrase: "I think this is impossible" - but on first glance, I'm tempted to say just that...
It does seem like it's bordering on the impossible, but I'm only trying to make it do what it already does, just in a different spot.
ridgerunner wrote: That said, I have to say you are braver than I!
I thought you were the resident regular expression wizard! I'm no expert, I just try to look like I am :wink:

Re: Need to Add Another Recursion

Posted: Sun Dec 19, 2010 1:15 pm
by Jonah Bron
Whew! Got it working now.

Code: Select all

/
\[						# tag opening
(%s)						# tag name
(						# optional parameter
	=(
		(?:
			(?R)
			|[^[]
			|\[(?!
				\/?\\1(=.+?)?\]
			)
		)+?				# occurring lots of times
	)
)?						# parameter optional
\]						# tag closing
(
	(?:
		[^[]
		|\[(?!
			\/?\\1(=.+?)?\]
		)
		|(?R)
	)+
)
\[\/\\1\]					# end of tag
/six
Thanks
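
For reference, a quick check of the final pattern against the earlier examples might look like the sketch below. It assumes %s is filled in via sprintf() with a made-up tag list ("quote|i"), and that the space in "[i ]" in the earlier posts was only there to stop the forum from rendering the tag.

```php
<?php
// Final pattern from the post above. The tag list passed to sprintf() is hypothetical.
$pattern = sprintf('/
\[                      # tag opening
(%s)                    # tag name
(                       # optional parameter
    =(
        (?:
            (?R)
            |[^[]
            |\[(?!
                \/?\\1(=.+?)?\]
            )
        )+?
    )
)?
\]                      # tag closing
(
    (?:
        [^[]
        |\[(?!
            \/?\\1(=.+?)?\]
        )
        |(?R)
    )+
)
\[\/\\1\]               # end of tag
/six', 'quote|i');

// Tag inside the parameter, followed by more text.
preg_match($pattern, '[quote=[i]Bob[/i] Johnson]Hi[/quote]', $a);
echo $a[3], "\n"; // parameter value: [i]Bob[/i] Johnson
echo $a[5], "\n"; // contents: Hi

// Nested tag in the contents, no parameter.
preg_match($pattern, '[quote]a [quote]b[/quote] c[/quote]', $b);
echo $b[5], "\n"; // contents: a [quote]b[/quote] c
```

Group 3 is the parameter value and group 5 is the tag contents. Groups 4 and 6 belong to the lookaheads and stay empty.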