Using recursive structure to handle encapsulated code

lwc
Forum Commoner
Posts: 35
Joined: Thu Jan 11, 2007 11:04 am

Post by lwc »

Example #3 of preg_replace_callback() has a way to modify internal code segments before modifying external ones. But it deals with just one tag. Can you help me expand it to multiple tags?

Also, I don't need BB code but actual <custom code>.

My example is:

Code: Select all

$input = "Hi, this is <custom_code1>just <custom_code2>my</custom_code2> <custom_code3>example</custom_code3></custom_code1>. How can it <custom_code4>get <custom_code5>done</custom_code5></custom_code4>?";
I have a function that does different things for what's inside each custom code. So basically codes 2 & 3 need to run before code 1 can. And code 5 needs to run before code 4 can.

Thanks!
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Using recursive structure to handle encapsulated code

Post by ridgerunner »

Be careful what you ask for!

Code: Select all

<?php
// regex to match outermost TAG (which may contain nested TAG having same name)
$re = '% # see: "Mastering Regular Expressions" for "unrolling-the-loop" details
<(\w++)>           # capture TAG name in group 1
(                  # capture TAG contents in group 2
  (?:              # non-capture group for alternation 
    [^<]*+         # begin unrolling-the-loop (normal*)
    (?:            # begin (special normal*)*
      (?!</?\1>)<  # (special)
      [^<]*+       # (normal*)
    )*+            # finish (special normal*)*
  |                # or...
    (?R)           # match a whole nested TAG element
  )*+              # as many as it takes until
)                  # end capture of TAG contents in group 2
</\1>              # match closing TAG
%x';
$input = "Hi, this is <custom_code1>just <custom_code2>my</custom_code2> <custom_code3>example</custom_code3></custom_code1>. How can it <custom_code4>get <custom_code5>done</custom_code5></custom_code4>?";
$counter = 0; // temp global variable used to indicate the processing order
function re_cb($matches) {
    global $re, $counter;
    $contents =& $matches[2]; // $matches[2] contains the contents of this TAG
    if (preg_match($re, $contents)) { // check if any nested tags in contents
        $contents = preg_replace_callback($re, 're_cb', $contents); // yes. recurse
    } // at this point all inner tags have been processed
    // process TAG contents
    return '<' . $matches[1] . '>' . ++$counter . $contents . '</' . $matches[1] . '>';
}
$input = preg_replace_callback($re, 're_cb', $input);
echo $input;
?>
Hope this helps! :)
lwc
Forum Commoner
Posts: 35
Joined: Thu Jan 11, 2007 11:04 am

Re: Using recursive structure to handle encapsulated code

Post by lwc »

Very impressive! While I'm testing my functions on the $matches, I've realized there are also <self closing tags/>.

I've tried adding this after the starting %:

Code: Select all

(<\w+[^>]*/>|
with this before the ending %x:

Code: Select all

)
but it seems to catch standard tags too.

How can your code catch either self closing tags or standard ones?

Thanks!
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Using recursive structure to handle encapsulated code

Post by ridgerunner »

You must be very careful when modifying (advanced) code of this sort. In particular, pay attention to the (?R) subexpression, because it matches the entire regex recursively. Changing the opening-tag subexpression will likely have detrimental effects, because the original regex is designed to match an opening tag, its closing tag, and the contents in between (including nested tags). This regex ignores self-closing tags as well as tags having attributes. A self-closing tag has no contents and no closing tag, and never has anything nested inside it. Thus, it would be best to handle them separately.
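
A quick way to see what the first regex does and does not match is to run it over a small sample. This is only a minimal sketch: the regex is the one from the first reply with its comments stripped, while the sample string and its tag names are made up for illustration.

```php
<?php
// The first regex from this thread, without comments or the x flag.
$re = '%<(\w++)>((?:[^<]*+(?:(?!</?\1>)<[^<]*+)*+|(?R))*+)</\1>%';

// Hypothetical sample: a plain element (containing a self-closing tag),
// an element with an attribute, and a top-level self-closing tag.
$sample = '<a>x<b/>y</a> <c att="1">z</c> <d/>';

$count = preg_match_all($re, $sample, $m);

echo $count, "\n";    // number of elements the regex matched
echo $m[0][0], "\n";  // the whole matched element
echo $m[2][0], "\n";  // its contents (group 2), self-closing tag left untouched
```

Only the plain `<a>...</a>` element is matched. The `<b/>` inside it is carried along as ordinary contents, and both `<c att="1">` and `<d/>` are skipped because `<(\w++)>` requires the `>` to follow the tag name directly.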

However, you may very well wish to match nested tags having attributes. If you modify the opening tag portion '<(\w++)>', you must also modify the negative lookahead assertion located inside the "unrolling-the-loop" section. Let's say you modify the regex to allow attributes in the opening tags like so: '<(\w++[^>]*+)>'. You can make sure that self-closing tags are not matched by adding a negative lookbehind just before the closing bracket, like so: '<(\w++[^>]*+)(?<!/)>'. In the unrolling-the-loop expression, you need to make a similar change (which makes it a bit more complicated). (Note that the negative lookbehind is not essential, but does improve efficiency for subject text that has self-closing tags.) We'll also need to capture the attributes in another group so that we can put it all back together again in the callback function. Here is a modified version of the script which handles tags having attributes:

Code: Select all

<?php
// regex to match outermost TAG (which may contain nested TAG having same name)
//  version 2010-03-06 11pm, which handles tags having optional attributes
$re = '% # see: "Mastering Regular Expressions" for "unrolling-the-loop" details
<(\w++)                   # capture TAG name in group 1
([^>]*+)                  # capture optional attributes in group 2
(?<!/)>                   # ensure that this is not a self-closing tag
(                         # capture TAG contents in group 3
  (?:                     # non-capture group for alternation 
    [^<]*+                # begin unrolling-the-loop (normal*)
    (?:                   # begin (special normal*)*
      (?!                 # match the < but only if not start of...
        <\1[^>]*+(?<!/)>  # a nested opening TAG of same species
      |                   # or...
        </\1>             # the closing TAG of the current one
      )<                  # Ok it is neither. match the < (special)
      [^<]*+              # (normal*)
    )*+                   # finish (special normal*)*
  |                       # or...
    (?R)                  # match a whole nested TAG element
  )*+                     # as many as it takes until
)                         # end capture of TAG contents in group 3
</\1>                     # match closing TAG
%x';
$input = 'Hi, this is <custom_code1 att="attribute">just <custom_code2>my</custom_code2> <custom_code3>example</custom_code3></custom_code1>. How can it <custom_code4>get <custom_code5>done</custom_code5></custom_code4>?';
$counter = 0; // temp global variable used to indicate the processing order
function re_cb($matches) {
    global $re, $counter;
    $contents =& $matches[3]; // $matches[3] contains the contents of this TAG
    if (preg_match($re, $contents)) { // check if any nested tags in contents
        $contents = preg_replace_callback($re, 're_cb', $contents); // yes. recurse
    } // at this point all inner tags have been processed
    // process TAG contents
    return '<' . $matches[1] . $matches[2] . '>' . ++$counter . // put it back together
            $contents . '</' . $matches[1] . '>';
}
$input = preg_replace_callback($re, 're_cb', $input);
echo $input;
?>
I would definitely recommend handling the self-closing tags separately. As you probably already know, you can match them like so: '<\w++[^>]*/>'. Note that this one cannot use the possessive + quantifier on the star because the regex engine needs to backtrack one char to match the slash.
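
The difference between the greedy and the possessive star can be checked with a small test. This is a minimal sketch; the sample text is invented for illustration and only the two patterns come from the discussion above.

```php
<?php
// Hypothetical sample text with two self-closing tags and one standard element.
$text = 'A <br/> and <img src="x.png"/> next to <b>bold</b>.';

// Greedy star: [^>]* first swallows the "/" too, then gives it back
// so that the literal "/>" can match.
$greedy = '%<\w++[^>]*/>%';

// Possessive star: [^>]*+ swallows the "/" and never gives it back,
// so the literal "/" that follows can no longer match.
$possessive = '%<\w++[^>]*+/>%';

$greedyCount     = preg_match_all($greedy, $text, $m);
$possessiveCount = preg_match_all($possessive, $text);

echo $greedyCount, "\n";      // 2
echo $possessiveCount, "\n";  // 0
print_r($m[0]);               // the two self-closing tags
```

The standard `<b>` element is ignored by both patterns, since nothing inside its opening tag can supply the required "/" before the ">".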

You can handle both the self-closing and standard tags in one regex, but things get a bit more complicated. Complex code snippets of this sort (recursive regexes, callback functions and advanced efficiency techniques) are covered in detail in "Mastering Regular Expressions (3rd Edition)" by Jeffrey Friedl (highly recommended).
:)
lwc
Forum Commoner
Posts: 35
Joined: Thu Jan 11, 2007 11:04 am

Re: Using recursive structure to handle encapsulated code

Post by lwc »

That's even more impressive. It's not that I mind running two separate replacements. But this means the standard <custom codes> can't be processed together with <self closing codes/> in the same pass.

I must replace self closing ones first in order to support:
<custom code1><custom code2/></custom code1>
But if <custom code1> puts its result in, say, $GLOBALS['x'], and <custom code2/> pulls data from $GLOBALS['x'], then I can't use:
<custom code1>test</custom code1> <custom code2/>
(because <custom code2/> would run before $GLOBALS['x'] is defined)
That's why I need to do the replacements at the same time (external tags, internal tags and self closing tags).
lwc
Forum Commoner
Posts: 35
Joined: Thu Jan 11, 2007 11:04 am

Re: Using recursive structure to handle encapsulated code

Post by lwc »

Alright, I think I've solved it by first replacing any <self-closing tag/> inside the callback function itself, and only afterwards replacing the remaining ones in a separate pass:

Code: Select all

<?php
// regex to match outermost TAG (which may contain nested TAG having same name)
//  version 2010-03-06 11pm, which handles tags having optional attributes
$re = '% # see: "Mastering Regular Expressions" for "unrolling-the-loop" details
<(\w++)                   # capture TAG name in group 1
([^>]*+)                  # capture optional attributes in group 2
(?<!/)>                   # ensure that this is not a self-closing tag
(                         # capture TAG contents in group 3
  (?:                     # non-capture group for alternation 
    [^<]*+                # begin unrolling-the-loop (normal*)
    (?:                   # begin (special normal*)*
      (?!                 # match the < but only if not start of...
        <\1[^>]*+(?<!/)>  # a nested opening TAG of same species
      |                   # or...
        </\1>             # the closing TAG of the current one
      )<                  # Ok it is neither. match the < (special)
      [^<]*+              # (normal*)
    )*+                   # finish (special normal*)*
  |                       # or...
    (?R)                  # match a whole nested TAG element
  )*+                     # as many as it takes until
)                         # end capture of TAG contents in group 3
</\1>                     # match closing TAG
%x';
$input = 'Hi, this is <custom_code1 att="attribute">just <custom_code2>my</custom_code2> <custom_code3>example</custom_code3></custom_code1>. How can it <custom_code4>get <custom_code5>done</custom_code5></custom_code4>?';
$counter = 0; // temp global variable used to indicate the processing order
function re_cb($matches) {
    global $re, $counter;
    if (is_array($matches)) { // called by preg_replace_callback() for a standard TAG
        $contents =& $matches[3]; // $matches[3] contains the contents of this TAG
        if (preg_match($re, $contents)) { // check if any nested tags in contents
            $contents = preg_replace_callback($re, 're_cb', $contents); // yes. recurse
        } // at this point all inner tags have been processed
        // now process the self-closing tags inside this TAG
        // (note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0)
        $contents = preg_replace('%<\w++[^>]*/>%e', 're_cb("$0")', $contents);
        // process TAG contents
        return '<' . $matches[1] . $matches[2] . '>' . ++$counter . // put it back together
                $contents . '</' . $matches[1] . '>';
    } else { // called through the /e replacement: $matches is the self-closing tag as a string
        $retval = $matches; // placeholder: do something with <something/>...
        return $retval;
    }
}
$input = preg_replace_callback($re, 're_cb', $input);
$input = preg_replace('%<\w++[^>]*/>%e', 're_cb("$0")', $input);
echo $input;
?>
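
The /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions the same idea has to go through preg_replace_callback() for both kinds of tags. The following is only a rough sketch of that, not the code from the thread: the closures, the `$handleSelfClosing` and `$handleElement` names, the bracketed placeholder output and the sample input are all assumptions made for illustration.

```php
<?php
// Same attribute-aware regex as in the posts above, comments stripped.
$re = '%
<(\w++)([^>]*+)(?<!/)>
(
  (?:
    [^<]*+ (?: (?! <\1[^>]*+(?<!/)> | </\1> ) < [^<]*+ )*+
  |
    (?R)
  )*+
)
</\1>
%x';
$selfClosing = '%<\w++[^>]*/>%';

$counter = 0; // shows the processing order, as in the thread

// Hypothetical handler for one self-closing tag ($m[0] is the whole tag).
// Replace the body with the real logic.
$handleSelfClosing = function (array $m) use (&$counter): string {
    return '[' . ++$counter . ']';
};

// Handler for one standard element: inner elements first, then the
// self-closing tags in its contents, then the element itself.
$handleElement = function (array $m) use (&$handleElement, $handleSelfClosing, $re, $selfClosing, &$counter): string {
    $contents = preg_replace_callback($re, $handleElement, $m[3]);
    $contents = preg_replace_callback($selfClosing, $handleSelfClosing, $contents);
    return '<' . $m[1] . $m[2] . '>' . ++$counter . $contents . '</' . $m[1] . '>';
};

// Hypothetical input: a self-closing tag inside an element, and one outside.
$input  = '<one><two/></one> <three/>';
$output = preg_replace_callback($re, $handleElement, $input);
$output = preg_replace_callback($selfClosing, $handleSelfClosing, $output); // leftover top-level tags
echo $output, "\n"; // <one>2[1]</one> [3]
```

As in the original, the final pass would also pick up any self-closing tag that a handler writes back into its output, so the handlers here return text that no longer looks like a tag.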