RegExp: Allow nested matches in a match

Posted: Mon Apr 11, 2005 10:25 am
by vigge89
I'm currently developing a parser for my own content management system. I'm using Perl-compatible regular expressions to parse strings, but I'm stuck right now. The problem is allowing nested matches inside a match, if you understand what I mean.

The current pattern looks like the following:

Code: Select all

#<\{if\((.*?)\)\}>(.*?)<\{endif\}>#s
What the string could look like:

Code: Select all

<{if(statement)}>Some text which is outputted if statement equals to true.<{if(statement2)}> This text will be outputted if both statement and statement2 equals to true.<{endif}><{endif}>
The current pattern just matches everything between <{if(statement)}> and the first <{endif}>, rather than everything up to the last <{endif}>, as it should.

I need help with making it aware of nested matches. Are there any RegExp Gurus out there who can help me with this?

Thanks in advance //vigge

Posted: Mon Apr 11, 2005 11:10 am
by Chris Corbyn
I'm not sure you'll be able to do that with just one regexp unless the ending tag has something unique about it.

I'll have a look at this for you :wink:

Posted: Mon Apr 11, 2005 11:38 am
by Chris Corbyn
OK, I used preg_match_all(), which returns a multidimensional array.
You just need to state explicitly that the tags can occur any number of times inside the pattern, and also that there must be one at the start and one at the end.

Need any more RegExp help (or a tweak on this) just ask :wink:

Code: Select all

<?php

$theString = '<{if(statement)}>Some text which is outputted if statement equals to true.<{if(statement2)}> This text will be outputted if both statement and statement2 equals to true.<{endif}><{endif}>';

preg_match_all('#<\{if\(.*?\)\}>((?:.*?(?:<\{if\(.*?\)\}>.*?<\{endif\}>.*?)?)*?)<\{endif\}>#s', $theString, $matches);

echo '<pre>';
print_r($matches);
echo '</pre>';

?>
Outputs

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <{if(statement)}>Some text which is outputted if statement equals to true.<{if(statement2)}> This text will be outputted if both statement and statement2 equals to true.<{endif}><{endif}>
        )

    [1] => Array
        (
            [0] => Some text which is outputted if statement equals to true.<{if(statement2)}> This text will be outputted if both statement and statement2 equals to true.<{endif}>
        )

)
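
As far as I can tell, the pattern above spells out a single level of nesting, so a block nested three deep would need the inner group repeated again. PCRE also supports recursive patterns, where (?R) re-enters the whole expression. The following is only a minimal sketch of that idea: render_ifs() and the $conditions array are hypothetical names, not part of the parser discussed here, and statement names are assumed to be simple array keys.

```php
<?php

// Hypothetical helper (not from the posts above): $conditions maps each
// statement name to TRUE or FALSE.
function render_ifs($text, array $conditions)
{
    // The body is either a character that does not start an opening or
    // closing tag, or (?R): a complete nested <{if(...)}>...<{endif}> block.
    $pattern = '#<\{if\((.*?)\)\}>((?:(?!<\{(?:if\(|endif\}>)).|(?R))*)<\{endif\}>#s';

    return preg_replace_callback($pattern, function ($m) use ($conditions) {
        // A false or unknown statement drops the whole block
        if (empty($conditions[$m[1]])) {
            return '';
        }
        // Otherwise keep the body, after resolving the blocks nested inside it
        return render_ifs($m[2], $conditions);
    }, $text);
}

$tpl = '<{if(a)}>A<{if(b)}>B<{endif}>C<{endif}>D';
echo render_ifs($tpl, array('a' => true, 'b' => false)), "\n"; // prints "ACD"
```

The regex only finds the outermost balanced blocks; the callback then calls the same function on the body, so the inner blocks are resolved one level at a time.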

Posted: Mon Apr 11, 2005 12:17 pm
by vigge89
Great, I'll do some debugging and break down the things you've added. I'll post my progress in this topic :)
If anyone's interested in the parser, just tell me and I could post some example code of it ;)

Posted: Mon Apr 11, 2005 1:20 pm
by Chris Corbyn
DOH.... looking at that, you could change preg_match_all() to just preg_match(). You won't have a multidimensional array that way.

Only use preg_match_all() if you'll be looking for multiple occurrences of this.

Sorry for any confusion.
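
To illustrate the difference in the returned arrays, here is a small sketch using the original non-nested pattern from the first post. The sample string and variable names are made up for the example.

```php
<?php

$pattern = '#<\{if\((.*?)\)\}>(.*?)<\{endif\}>#s';
$subject = '<{if(a)}>one<{endif}> and <{if(b)}>two<{endif}>';

// preg_match_all() collects every match: $all[1] lists each captured
// statement, $all[2] lists each captured body.
preg_match_all($pattern, $subject, $all);

// preg_match() stops at the first match and returns a flat array.
preg_match($pattern, $subject, $first);

print_r($all[1]);        // the statements of both blocks: a, b
echo $first[1], "\n";    // the statement of the first block only: a
```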

Posted: Mon Apr 11, 2005 1:30 pm
by vigge89
d11wtq wrote:DOH.... looking at that, you could change preg_match_all() to just preg_match(). You won't have a multidimensional array that way.

Only use preg_match_all() if you'll be looking for multiple occurrences of this.

Sorry for any confusion.
I'm using preg_replace_callback so I would have to edit it anyway ;)

Posted: Mon Apr 11, 2005 1:31 pm
by Chris Corbyn
Ah ha... that's cool. At least you get the gist of how it goes ;-)

Need any help, ask....

Posted: Mon Apr 11, 2005 1:50 pm
by Bennettman
I've just finished a similar regexp. It works in a slightly different way.

Usage:

Code: Select all

<if=test1>Content <if=test2>More content </if=test1> </if=test2>
In the script, $test['test1'], if TRUE, will display "Content", and if $test['test2'] is also true, it'll show it all. So, you have to specify the result inside the script, rather than using actual statements in the design.


Code:

Code: Select all

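<?php

// Flags that decide which <if=name> blocks are shown; set them before parsing.
// Here both are TRUE, i.e. show it all.
$test = array('test1' => TRUE, 'test2' => TRUE);

// $html is the main content
while (preg_match_all("/<if=([_a-z0-9]+)>(?s)(.*?)(?-s)<\/if=\\1>/i", $html, $temp_if)) {
	foreach ($temp_if[0] as $i => $find_if) {
		// Keep the block's content only if its flag exists and is TRUE
		$name = $temp_if[1][$i];
		$replace_if = (isset($test[$name]) && $test[$name] === TRUE) ? $temp_if[2][$i] : "";
		// Replace inside the loop so that every match of this pass is handled
		$html = str_replace($find_if, $replace_if, $html);
	}
}

?>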

Posted: Tue Apr 12, 2005 8:37 am
by vigge89
Bennettman wrote:I've just finished a similar regexp. It works in a slightly different way.

Just a question, what does the (?s) and (?-s) parts in the pattern mean? :)

Posted: Tue Apr 12, 2005 8:59 am
by Chris Corbyn
If my memory serves me correctly (which it probably doesn't)

(?s)pattern(?-s) means the dot also matches newlines in pattern

Similarly

(?i)pattern(?-i) means ignore case sensitivity in pattern

Posted: Tue Apr 12, 2005 9:06 am
by Chris Corbyn
You know, to be honest, I remember it more like

(?s:pattern)

I'm gonna have to revise this again :?
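
For reference, in PCRE the s modifier makes the dot match newlines too, and it is the x modifier that ignores whitespace in the pattern. Both inline forms, (?s)...(?-s) and (?s:...), are accepted. A quick sketch to check this, with a made-up test string:

```php
<?php

$text = "first line\nsecond line";

// Without s, the dot stops at the newline, so this does not match.
$plain  = preg_match('/first.*second/', $text);

// s switched on and off again inline, as in Bennettman's pattern.
$inline = preg_match('/first(?s).*(?-s)second/', $text);

// The same thing, scoped to a non-capturing group.
$scoped = preg_match('/first(?s:.*)second/', $text);

// x is the modifier that ignores literal whitespace in the pattern.
$spaced = preg_match('/(?x) first \s line/', $text);

var_dump($plain, $inline, $scoped, $spaced); // int(0), int(1), int(1), int(1)
```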

Posted: Thu Apr 14, 2005 10:17 am
by vigge89
Hmm, after testing the new pattern out (thanks d11wtq!), I find it too slow. I came up with a new idea for enabling nested matches: let the opening and ending tags contain an optional key. If the key is included in the opening tag, the ending tag must also contain it. I edited my old pattern to make it work like this:

Code: Select all

#<\{if\((.*?)\)(\:[a-z0-9_]*)?\}>(.*?)<\{endif(\:\\2)?\}>#s
However, this does not work at all, but I think you'll get the idea of what I'm trying to accomplish.

This one does work pretty well, but it requires the key:

Code: Select all

#<\{if\((.*?)\):([a-z0-9_]*)\}>(.*?)<\{endif:\\2\}>#s
The biggest problem is still that if the match contains a match, the match inside won't be matched.
Is there any way to make the preg_ functions start from the bottom? This would solve all the problems: neither keys nor looking for nested matches would be needed, since the first match would always be correct.

Thanks for all the replies btw :)
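
One likely reason the optional-key pattern fails is that its second group already contains the colon, so \:\\2 in the closing tag looks for two colons. A possible variation, sketched below under the assumption that keys consist of a-z, 0-9 and underscores, keeps the colon inside a group that always takes part in the match (it may be empty), then reuses it as a backreference. The function name parse_keyed_ifs() and the $conditions array are hypothetical. Running the callback on the body again also handles the "match inside a match" problem, at least as long as an unkeyed block does not directly contain another unkeyed block.

```php
<?php

// Hypothetical helper: $conditions maps statement names to TRUE or FALSE.
function parse_keyed_ifs($text, array $conditions)
{
    // Group 2 is ":key" or the empty string; \2 demands the same text
    // right after "endif" in the closing tag.
    $pattern = '#<\{if\((.*?)\)((?::[a-z0-9_]+)?)\}>(.*?)<\{endif\2\}>#s';

    return preg_replace_callback($pattern, function ($m) use ($conditions) {
        if (empty($conditions[$m[1]])) {
            return ''; // statement is false or unknown: drop the block
        }
        // Parse the body again so that blocks inside it are handled too
        return parse_keyed_ifs($m[3], $conditions);
    }, $text);
}

$tpl = '<{if(a):outer}>A<{if(b)}>B<{endif}>C<{endif:outer}>D';
echo parse_keyed_ifs($tpl, array('a' => true, 'b' => false)), "\n"; // prints "ACD"
```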

Posted: Thu Apr 14, 2005 10:59 am
by Chris Corbyn
vigge89 wrote:Is there any way to make the preg_ functions start from the bottom? This would solve all the problems: neither keys nor looking for nested matches would be needed, since the first match would always be correct.
I'm confused by this.... if you started matching from the bottom with no keys, it's just the same as matching from the top with no keys (isn't it?) - since the tags all pair up, it will still hit a match before it gets to the outermost one.

I think we're coming back to the concept of explicitly stating within the regex that these nests can occur.

If you wanna play around with regex going backwards (they don't do that), then use strrev() to reverse the string, then strrev() the match it finds once again... I'd guess this will be slower still.

Posted: Fri Apr 15, 2005 10:53 am
by vigge89
The idea I had with going backwards is that it would find the rightmost starting tags first, and those are always the ones closed first. Here's an example:
<{if(this_gets_parsed_last)}>
Some text which is shown if the leftmost statement equals true.
<{if(this_gets_parsed_first)}>
Text which will be shown if both statements equal true. This block will be tested first: either it is returned to the leftmost statement after being parsed (the tags for this block will just be stripped), or the parent block won't contain the text inside this block.
<{endif(this_gets_parsed_first)}>
<{endif(this_gets_parsed_last)}>
The this_gets_parsed_first and this_gets_parsed_last statements are only there to show which one gets parsed first.
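
The "innermost block first" order described above does not strictly need a backwards search. One way to sketch it is to match only blocks whose body contains no further opening tag, and to repeat until nothing is left to replace. Everything named here (resolve_innermost_first(), $conditions) is hypothetical, the closing tag is the plain <{endif}> from the first post, and statement names are assumed not to contain a closing parenthesis.

```php
<?php

// Hypothetical helper: $conditions maps statement names to TRUE or FALSE.
function resolve_innermost_first($text, array $conditions)
{
    // (?!<\{if\() keeps the body free of opening tags, so only blocks
    // with nothing nested inside them can match on a given pass.
    $pattern = '#<\{if\(([^)]*)\)\}>((?:(?!<\{if\().)*?)<\{endif\}>#s';

    do {
        $text = preg_replace_callback($pattern, function ($m) use ($conditions) {
            // Strip the tags and keep the body, or drop the block entirely
            return empty($conditions[$m[1]]) ? '' : $m[2];
        }, $text, -1, $count);
    } while ($count > 0); // stop once a pass replaces nothing

    return $text;
}

$tpl = '<{if(a)}>A<{if(b)}>B<{endif}>C<{endif}>D';
echo resolve_innermost_first($tpl, array('a' => true, 'b' => false)), "\n"; // prints "ACD"
```

Each pass peels off one level of nesting, so the number of passes grows with the nesting depth rather than with the number of blocks.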

Posted: Sat Apr 16, 2005 11:14 am
by vigge89
Any ideas or solutions? I still can't figure out either a pattern for the keying or a way of running the regexp backwards :(