david64 wrote:Hi, I am using the following regular expression to match HTML tags with the preg (PCRE) functions in PHP:
Code: Select all
/<$element(.*)>(.*)<\/$element>/isU
Be very careful with those greedy DOT-STAR or DOT-PLUS monsters: before you know it, they take far more than you might think.
For example, take the string:
Code: Select all
"aaa { bbb } ccc { ddd } eee fff ggg"
and you want to match each pair of braces with everything in between ({ bbb } and { ddd }). If you use a greedy regex such as \{.*\},
then you will get just a single match, running from the first { to the last }: "{ bbb } ccc { ddd }".
To overcome this, you need to make the DOT-STAR reluctant instead of greedy. You can do that by adding a question mark after the STAR or PLUS, as in \{.*?\}.
Or even better, use a negated character class, as in \{[^}]*\}, which can never run past the first closing brace.
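The difference between the greedy, reluctant, and negated-class forms is easy to check with preg_match_all. This is only a small sketch: the three patterns are illustrative choices of mine, applied to the sample string above.

```php
<?php
$text = 'aaa { bbb } ccc { ddd } eee fff ggg';

// Greedy: .* runs on to the last closing brace in the string.
preg_match_all('/\{.*\}/', $text, $greedy);

// Reluctant: .*? stops at the first closing brace it can reach.
preg_match_all('/\{.*?\}/', $text, $reluctant);

// Negated character class: [^}]* cannot cross a closing brace at all.
preg_match_all('/\{[^}]*\}/', $text, $negated);

print_r($greedy[0]);    // one match:   "{ bbb } ccc { ddd }"
print_r($reluctant[0]); // two matches: "{ bbb }" and "{ ddd }"
print_r($negated[0]);   // two matches: "{ bbb }" and "{ ddd }"
```

The negated class tends to be the safer habit, because it does not rely on backtracking to find where the match should end.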
david64 wrote:This works fine for picking up the first part of the tag, but not always for the tag contents. For example, if you have the following HTML:
Code: Select all
<div id="first"><div id="second"></div></div>
The regular expression will return the following for the first div:
Code: Select all
<div id="first"><div id="second"></div>
Is there any way using regex that the full contents of the tag could be returned? Maybe conditions or lookaround?
No, not with lookarounds. With a plain (non-recursive) pattern, you can capture up to a fixed number of nested tags, but not an arbitrary number of them. You could do that like this:
Take the string:
Code: Select all
"aaa { bbb { ccc ddd } eee { fff } ggg } hhh"
and you want to match the substring "{ bbb { ccc ddd } eee { fff } ggg }". As you can see, it has two other pairs in it, but they are nested only one level deep. You could match such a substring with a regex that allows for exactly one level of inner braces, for example \{(?:[^{}]|\{[^{}]*\})*\}.
But when the nesting goes deeper than one level, e.g. in a string like this (nested twice):
Code: Select all
"aaa { bbb { ccc { ddd } } eee fff ggg } hhh"
That regex will fail on it, and you will have to make a more complicated regex for every extra level of nesting you want to account for.
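To make the fixed-depth limitation concrete, here is a hedged sketch. The pattern in $oneLevel is a hypothetical one of my own that accepts at most one level of inner braces; it is run against both sample strings from above.

```php
<?php
// Hypothetical pattern: a brace pair that may contain at most
// one level of inner brace pairs.
$oneLevel = '/\{(?:[^{}]|\{[^{}]*\})*\}/';

$nestedOnce  = 'aaa { bbb { ccc ddd } eee { fff } ggg } hhh';
$nestedTwice = 'aaa { bbb { ccc { ddd } } eee fff ggg } hhh';

preg_match($oneLevel, $nestedOnce, $m1);
preg_match($oneLevel, $nestedTwice, $m2);

echo $m1[0], "\n"; // { bbb { ccc ddd } eee { fff } ggg }
echo $m2[0], "\n"; // { ccc { ddd } }  (the outermost pair is missed)
```

On the twice-nested string the pattern does not fail outright; it quietly settles for an inner pair, which is arguably worse than no match at all.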
But PCRE, the engine behind PHP's preg functions, supports recursive patterns on top of classic regular expression syntax, and that will let you match an arbitrary number of nested tags (the (?R) part in my next example). Perhaps it's a bit mind-boggling, but here it is:
Code: Select all
$text = 'aaa <div id="1"> bbb <div id="2"> <div id="3"> ccc </div></div> ddd </div> fff';
preg_match_all('#<div[^>]*>(?:(?:(?!</?div).)*|(?R))*</div>#si', $text, $matches);
print_r($matches);
A (short) explanation:
Code: Select all
<div[^>]*> // match an opening div-tag
(?: // open non-capturing group 1
(?: // open non-capturing group 2
(?!</?div). // if there's not an opening- or closing div tag when looking ahead, match any character
) // close non-capturing group 2
* // non-capturing group 2 zero or more times
| // OR
(?R) // a recursive match of this entire regex pattern
) // close non-capturing group 1
* // non-capturing group 1 zero or more times
</div> // match a closing div-tag
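To tie this back to the $element placeholder in the original question, the recursive pattern can be built for any tag name. This is only a sketch: the function name match_nested_elements is hypothetical, and the \b word boundary is an addition of mine so that <div does not also match a longer tag name such as <divider>.

```php
<?php
// Hypothetical helper: returns every outermost <$element>...</$element> block.
function match_nested_elements(string $html, string $element): array
{
    // Escape the tag name so it is safe inside a pattern delimited by '#'.
    $name = preg_quote($element, '#');

    // Same structure as the div pattern explained above; \b is an added assumption.
    $pattern = '#<' . $name . '\b[^>]*>'
             . '(?:(?:(?!</?' . $name . '\b).)*|(?R))*'
             . '</' . $name . '>#si';

    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

$html = '<div id="first"><div id="second"></div></div>';
print_r(match_nested_elements($html, 'div'));
```

For the HTML from the question this should return the whole outer div, inner div included, as a single match.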
But regex isn't really suited to parsing (X)HTML: when there's an error in the markup, the pattern will most likely cough up invalid matches after that mistake, while a true HTML parser may (and in most cases will) recover from these errors.
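If a parser is available, getting the full contents of a tag needs no regex at all. Here is a minimal sketch, assuming PHP's bundled DOM extension is enabled; the variable names are illustrative only.

```php
<?php
$html = '<div id="first"><div id="second"></div></div>';

// Collect libxml's parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName returns the divs in document order, outer div first.
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div) {
    // saveHTML($node) serializes that node together with its children.
    echo $div->getAttribute('id'), ': ', $doc->saveHTML($div), "\n";
}
```

Each div is handed back with everything nested inside it, however deep the nesting goes.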
Good luck.