You must be very careful when modifying advanced code of this sort. In particular, pay attention to the (?R) subexpression, because it recursively matches the entire regex. Changing the opening-tag subexpression will likely have detrimental effects, because the original regex is designed to match an opening tag, its closing tag, and the contents in between (including nested tags). This regex ignores self-closing tags as well as tags having attributes. A self-closing tag has no contents and no closing tag, and never contains nested tags. Thus, it is best to handle them separately.
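To see why (?R) makes edits risky, here is a minimal sketch of recursion on a simpler problem (balanced parentheses, not the tag regex itself). The pattern and the sample string are illustrative assumptions, not part of the original script. Because (?R) re-enters the whole pattern, whatever you change at the start of the regex also applies at every nesting level.

```php
<?php
// Illustrative only: match outermost balanced parentheses with (?R).
$paren = '%\(          # opening parenthesis
    (?:                # contents: any mix of...
        [^()]++        #   non-parenthesis characters
      | (?R)           #   or a whole nested (...) group (recursion)
    )*+
    \)                 # matching closing parenthesis
%x';

$text = 'f(a(b)c) and (d)';   // hypothetical sample input
preg_match_all($paren, $text, $m);
print_r($m[0]);               // the two outermost groups: (a(b)c) and (d)
```

The inner "(b)" is not reported separately because it is consumed by the recursive call while matching the outer group, which is the same reason the tag regex returns only the outermost tag.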
However, you may well want to match nested tags that have attributes. If you modify the opening tag portion '<(\w++)>', you must also modify the negative lookahead subexpression located inside the "unrolling-the-loop" section. Let's say you modify the regex to allow attributes in the opening tags like so: '<(\w++[^>]*+)>'. You can make sure that self-closing tags are not matched by adding a negative lookbehind just before the closing bracket, like so: '<(\w++[^>]*+)(?<!/)>'. In the unrolling-the-loop expression, you need to make a similar change (which makes it a bit more complicated). (Note that the negative lookbehind is not essential, but it does improve efficiency for subject text that has self-closing tags.) We'll also need to capture the attributes in a group of their own, so that \1 still holds just the tag name for matching the closing tag and so that we can put it all back together again in the callback function. Here is a modified version of the script which handles tags having attributes:
Code:
<?php
// regex to match outermost TAG (which may contain nested TAG having same name)
// version 2010-03-06 11pm: handles tags having optional attributes
$re = '% # see: "Mastering Regular Expressions" for "unrolling-the-loop" details
    <(\w++)                      # capture TAG name in group 1
    ([^>]*+)                     # capture optional attributes in group 2
    (?<!/)>                      # ensure that this is not a self-closing tag
    (                            # capture TAG contents in group 3
      (?:                        # non-capture group for alternation
        [^<]*+                   # begin unrolling-the-loop (normal*)
        (?:                      # begin (special normal*)*
          (?!                    # match the < but only if not start of...
            <\1\b[^>]*+(?<!/)>   # a nested opening TAG of same species (\b: whole name only)
          |                      # or...
            </\1>                # the closing TAG of the current one
          )<                     # Ok it is neither. match the < (special)
          [^<]*+                 # (normal*)
        )*+                      # finish (special normal*)*
      |                          # or...
        (?R)                     # match a whole nested TAG element
      )*+                        # as many as it takes until
    )                            # end capture of TAG contents in group 3
    </\1>                        # match closing TAG
    %x';

$input = 'Hi, this is <custom_code1 att="attribute">just <custom_code2>my</custom_code2> <custom_code3>example</custom_code3></custom_code1>. How can it <custom_code4>get <custom_code5>done</custom_code5></custom_code4>?';
$counter = 0; // temp global variable used to indicate the processing order

function re_cb($matches) {
    global $re, $counter;
    $contents = $matches[3]; // $matches[3] contains the contents of this TAG
    if (preg_match($re, $contents)) { // check if any nested tags in contents
        $contents = preg_replace_callback($re, 're_cb', $contents); // yes. recurse
    } // at this point all inner tags have been processed
    // process TAG contents
    return '<' . $matches[1] . $matches[2] . '>' . ++$counter . // put it back together
        $contents . '</' . $matches[1] . '>';
}

$input = preg_replace_callback($re, 're_cb', $input);
echo $input;
?>
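If you want to convince yourself that the lookbehind does what is claimed, you can test the opening-tag portion in isolation. This is just a quick sketch: the sample tags and the variable names $open, $samples and $isOpening are made up for the test, and the leading ^ is added only to anchor each one-tag test string.

```php
<?php
// Opening-tag portion of the regex above, isolated for experimentation.
$open = '%^<(\w++)([^>]*+)(?<!/)>%';

// Hypothetical sample tags, chosen only to exercise each case.
$samples = array('<div>', '<div class="x">', '<br/>', '<img src="a.png" />');

$isOpening = array();
foreach ($samples as $tag) {
    // [^>]*+ swallows everything up to the >, including any trailing slash,
    // so (?<!/) fails exactly when the tag ends in "/>".
    $isOpening[$tag] = (bool) preg_match($open, $tag, $m);
    echo $tag, ' => ', ($isOpening[$tag] ? 'opening tag' : 'rejected'), "\n";
}
```

The first two samples should be accepted as opening tags and the last two rejected, since both possessive quantifiers refuse to give characters back once the lookbehind fails.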
I would definitely recommend handling the self-closing tags separately. As you probably already know, you can match them like so: '<\w++[^>]*/>'. Note that the star in this one cannot be made possessive ('*+'): [^>]* also consumes the slash, so the regex engine needs to backtrack one character to match the '/>'.
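A small sketch makes the backtracking point concrete by running the greedy and the possessive variants side by side. The subject string and variable names here are invented for the demonstration.

```php
<?php
// Hypothetical subject text containing two self-closing tags.
$subject = 'line one<br/>line two<img src="a.png" />';

// Greedy star: [^>]* first swallows the "/", then gives it back so "/>" can match.
$greedy = '%<\w++[^>]*/>%';
// Possessive star: [^>]*+ swallows the "/" and never gives it back.
$possessive = '%<\w++[^>]*+/>%';

$greedyCount     = preg_match_all($greedy, $subject, $found);
$possessiveCount = preg_match_all($possessive, $subject);

echo $greedyCount, "\n";      // 2
echo $possessiveCount, "\n";  // 0
print_r($found[0]);           // <br/> and <img src="a.png" />
```

The possessive version is not slower or subtly different, it simply can never match, which is why the plain star is the right choice in this one spot.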
You can handle both the self-closing and standard tags in one regex, but things get a bit more complicated. Complex code of this sort (recursive regexes, callback functions and advanced efficiency techniques) is covered in detail in:
"Mastering Regular Expressions (3rd Edition)" by Jeffrey Friedl (highly recommended)
