Advanced HTML Tag Regex

david64
Forum Commoner
Posts: 53
Joined: Sat May 02, 2009 8:12 am
Location: Wales

Advanced HTML Tag Regex

Post by david64 »

Hi, I am using the following regular expression to match HTML tags with the preg (PCRE) functions in PHP:

/<$element(.*)>(.*)<\/$element>/isU
This works fine for picking up the first part of the tag, but not always for the tag contents. For example, if you have the following HTML:

<div id="first"><div id="second"></div></div>
The regular expression will return the following for the first div:

<div id="first"><div id="second"></div>
Is there any way using regex that the full contents of the tag could be returned? Maybe conditions or lookaround?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Advanced HTML Tag Regex

Post by prometheuzz »

david64 wrote: Hi, I am using the following regular expression to match HTML tags with the preg (PCRE) functions in PHP:

/<$element(.*)>(.*)<\/$element>/isU
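Be very careful with those greedy DOT-STAR or DOT-PLUS monsters: before you know it, they take far more than you might think. (Your /U modifier inverts the default, so the quantifiers in your pattern are already reluctant, which is why your match stops at the first closing tag. The general warning still stands.)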
For example, take the string:

"aaa { bbb } ccc { ddd } eee fff ggg"
and suppose you want to match each pair of braces with everything between them ({ bbb } and { ddd }). If you use the regex:

"/{.*}/"
then you will get just a single match:

"{ bbb } ccc { ddd }"
To overcome this, you need to make the quantifier reluctant (lazy) instead of greedy. You can do that by adding a question mark after the STAR or PLUS:

"/{.*?}/"
Or even better, use a negated character class:

'/{[^{}]*}/'
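
To see the difference between the three patterns above, you can run them side by side. This is only a minimal sketch using preg_match_all; the variable names are my own:

```php
<?php
$text = 'aaa { bbb } ccc { ddd } eee fff ggg';

// Greedy: .* runs to the LAST closing brace on the line.
preg_match_all('/{.*}/', $text, $greedy);

// Reluctant: .*? stops at the FIRST closing brace it can.
preg_match_all('/{.*?}/', $text, $lazy);

// Negated class: the match can never cross another brace.
preg_match_all('/{[^{}]*}/', $text, $negated);

print_r($greedy[0]);  // one match:   "{ bbb } ccc { ddd }"
print_r($lazy[0]);    // two matches: "{ bbb }" and "{ ddd }"
print_r($negated[0]); // two matches: "{ bbb }" and "{ ddd }"
```

The negated class is usually the safer of the two fixes, because it states what may appear between the braces instead of relying on backtracking to stop in the right place.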
david64 wrote: This works fine for picking up the first part of the tag, but not always for the tag contents. For example, if you have the following HTML:

<div id="first"><div id="second"></div></div>
The regular expression will return the following for the first div:

<div id="first"><div id="second"></div>
Is there any way using regex that the full contents of the tag could be returned? Maybe conditions or lookaround?
No, not with lookarounds. With a plain (non-recursive) pattern you can match nesting up to a fixed depth, but not to an arbitrary depth. You could do that like this:

Take the string:

"aaa { bbb { ccc ddd } eee { fff } ggg } hhh"
and you want to match the substring "{ bbb { ccc ddd } eee { fff } ggg }". As you can see, it contains two other brace pairs, each nested only one level deep. You could match such a substring with the following regex:

'/{([^{}]*{[^{}]*})*[^{}]*}/'
But when the nesting goes more than one level deep, e.g. in a string like this (nested twice):

"aaa { bbb { ccc { ddd } } eee fff ggg } hhh"
it will fail to match the whole outer substring, and you will have to write a more complicated regex to account for the extra level.
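
Here is a small check of the one-level pattern against both strings. It is just a sketch; the variable names are mine:

```php
<?php
// The fixed-depth pattern from above: handles one level of nesting only.
$pattern = '/{([^{}]*{[^{}]*})*[^{}]*}/';

$once  = 'aaa { bbb { ccc ddd } eee { fff } ggg } hhh';
$twice = 'aaa { bbb { ccc { ddd } } eee fff ggg } hhh';

preg_match($pattern, $once, $m1);
preg_match($pattern, $twice, $m2);

echo $m1[0], "\n"; // { bbb { ccc ddd } eee { fff } ggg }
echo $m2[0], "\n"; // { ccc { ddd } }  (an inner pair only, not the outer one)
```

On the twice-nested string the engine cannot match from the first brace at all, so it moves on and matches from the second brace instead.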

But PCRE itself supports recursive patterns (so PHP's preg functions do too), which lets you match an arbitrary depth of nested tags (the (?R) part in my next example). Perhaps it's a bit mind-boggling, but here it is:

$text = 'aaa <div id="1"> bbb <div id="2"> <div id="3"> ccc </div></div> ddd </div> fff';
preg_match_all('#<div[^>]*>(?:(?:(?!</?div).)*|(?R))*</div>#si', $text, $matches);
print_r($matches);
A (short) explanation:

<div[^>]*>       // match an opening div-tag
(?:              // open non-capturing group 1
  (?:            //   open non-capturing group 2
    (?!</?div).  //   if there's not an opening- or closing div tag when looking ahead, match any character
  )              //   close non-capturing group 2
  *              //   non-capturing group 2 zero or more times
  |              //   OR
  (?R)           //   a recursive match of this entire regex pattern
)                // close non-capturing group 1
*                // non-capturing group 1 zero or more times
</div>           // match a closing div-tag
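
Since the original question used a variable $element rather than a hard-coded div, the same pattern could be wrapped up roughly like this. Treat it as a sketch: match_nested_element() is a made-up helper name, and the \b after the tag name is my own addition so that "div" does not also match a longer name such as "divider".

```php
<?php
// Hypothetical helper (not from the posts above): returns every outermost
// <$element>...</$element> block in $html, with its nested children intact.
function match_nested_element(string $element, string $html): array
{
    // Escape the tag name so it is safe inside the '#'-delimited pattern.
    $el = preg_quote($element, '#');

    // Same structure as the div pattern explained above, with the tag
    // name substituted in and \b added after it (an assumption of mine).
    $pattern = '#<' . $el . '\b[^>]*>'
             . '(?:(?:(?!</?' . $el . '\b).)*|(?R))*'
             . '</' . $el . '>#si';

    preg_match_all($pattern, $html, $matches);
    return $matches[0] ?? [];
}

$html = '<div id="first"><div id="second"></div></div>';
print_r(match_nested_element('div', $html));
```

With the HTML from the first post, this returns the complete outer div as a single match instead of stopping at the first closing tag.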
But regex isn't really suited to parsing (X)HTML: when there's an error in the markup, the pattern will most likely produce invalid matches from that point on, while a true HTML parser may (and in most cases will) recover from such errors.
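
For comparison, here is roughly what the parser route could look like with PHP's bundled DOM extension. This is only a sketch and assumes ext-dom is available. Note that loadHTML() wraps a fragment like this in html and body elements, so the code looks the div up by tag name:

```php
<?php
$html = '<div id="first"><div id="second"></div></div>';

// Collect libxml's complaints about imperfect markup internally instead of
// emitting PHP warnings; the parser still builds a tree where it can.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The first <div> in document order is the outermost one here.
$first = $doc->getElementsByTagName('div')->item(0);

// Serialising the node gives the element together with all nested children.
echo $doc->saveHTML($first), "\n";
```

No counting of opening and closing tags is needed here, because the nesting is already part of the tree.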

Good luck.
Last edited by prometheuzz on Sat May 16, 2009 3:13 pm, edited 1 time in total.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Re: Advanced HTML Tag Regex

Post by pickle »

Try http://htmlpurifier.org/ - it could save you a lot of time & work.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
david64
Forum Commoner
Posts: 53
Joined: Sat May 02, 2009 8:12 am
Location: Wales

Re: Advanced HTML Tag Regex

Post by david64 »

Thanks for the help. That HTML Purifier doesn't do what I want, but it looks useful nonetheless.