Nested tags a problem when pulling data from HTML [SOLVED]

Stryks
Forum Regular
Posts: 746
Joined: Wed Jan 14, 2004 5:06 pm

Post by Stryks »

Hi all,

I'm trying to lift some data out of an HTML file. Basically, I'm trying to get everything between <li class="free"> and its corresponding </li>.

The regex below works for the most part. In some cases, however, another <ul><li></li></ul> set is nested inside the <li> I am trying to pull, and the match then runs from my opening tag only as far as the first </li>.

The initial regex is:

Code:

preg_match_all('%<li\\sclass="free">(.*?)</li>%si', $html, $result, PREG_SET_ORDER);

But as I said ... that stops at the first </li> it finds (and rightly so, really).

I had high hopes for

Code:

preg_match_all('%<li\\sclass="free">(.*?(?:<li>.*?</li>)*?)</li>%si', $html, $result, PREG_SET_ORDER);

... but no love.

Any ideas on how to accomplish what I'm after?

Thanks
Last edited by Stryks on Sat Jun 16, 2007 9:22 am, edited 1 time in total.

Post by Stryks »

To clarify a bit ....

How could I take
Some <b> text <b> goes </b> here </b> but <b> shouldn't </b> go here
and pull back
<b> text <b> goes </b> here </b>
and
<b> shouldn't </b>
I can get the first, with

Code:

preg_match_all('%(<b>.*(?:<b>.*?</b>).*?</b>)*%si', $html, $result, PREG_SET_ORDER);

but it doesn't grab the second.

I could do an alternation I suppose, to pick up the other, but I'm never really going to know how many levels deep the nesting will go.

I don't quite understand why saying "Start at <b>, get everything, include any <b> something </b> you find however many times it happens to be there, and stop at </b>" wouldn't work. My interpretation of that doesn't work, though maybe I just haven't expressed it properly.

Code:

preg_match_all('%<b>.*(?:<b>.*?</b>).*</b>%si', $html, $result, PREG_SET_ORDER);

Still stumped ... :?
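
A likely reason this last attempt overshoots: with the s flag, the greedy .* runs to the end of the input and then backtracks only as far as the last </b>, so both bold runs come back as one match. A minimal sketch on the example string from this post (the variable names are just for illustration):

```php
<?php
// Example string from the post above.
$html = "Some <b> text <b> goes </b> here </b> but <b> shouldn't </b> go here";

// Greedy attempt: the trailing .* backtracks only as far as the LAST </b>.
preg_match_all('%<b>.*(?:<b>.*?</b>).*</b>%si', $html, $greedy, PREG_SET_ORDER);

// With PREG_SET_ORDER, index 0 of each entry holds the full match.
$greedyCount = count($greedy);
$greedyMatch = $greedy[0][0];

echo $greedyCount, "\n";   // 1
echo $greedyMatch, "\n";   // <b> text <b> goes </b> here </b> but <b> shouldn't </b>
```

So the pattern does match, but it returns a single span from the first <b> to the final </b> rather than two separate results.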
stereofrog
Forum Contributor
Posts: 386
Joined: Mon Dec 04, 2006 6:10 am

Post by stereofrog »

Regular expressions are not suitable for parsing nested structures like HTML. A recursive parser (e.g. HTML_Sax) is the proper tool for that.
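
As a rough illustration of the parser route, here is a minimal sketch that assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than HTML_Sax. The sample markup is made up, and the XPath test only matches a class attribute that is exactly "free".

```php
<?php
// Sample markup (hypothetical): one "free" item containing a nested list.
$html = '<ul>'
      . '<li class="free">Outer <ul><li>Inner</li></ul> tail</li>'
      . '<li class="paid">Skip me</li>'
      . '<li class="free">Plain</li>'
      . '</ul>';

$doc = new DOMDocument();
// Collect libxml warnings internally; real-world HTML is rarely well formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];
foreach ($xpath->query('//li[@class="free"]') as $li) {
    // Serialise each child node to get the inner HTML, nested tags included.
    $inner = '';
    foreach ($li->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $items[] = $inner;
}

print_r($items);
```

Because the parser tracks open and close tags itself, the nested <ul><li></li></ul> stays inside the item it belongs to, however deep the nesting goes.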

Post by Stryks »

Well .. that took a long time .... but here we are.

The solution (and it looks so damn obvious now) for the example is

Code:

preg_match_all('%<b>(<b>.*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

Maybe someone will find this useful.
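
For anyone who wants to try it, a small sketch that runs this pattern against the example string from earlier in the thread (the variable names are only for illustration):

```php
<?php
// Example string from earlier in the thread.
$html = "Some <b> text <b> goes </b> here </b> but <b> shouldn't </b> go here";

// At each step the group takes either a complete inner <b>...</b> pair
// or a single character, lazily, until the outer closing </b> is reached.
preg_match_all('%<b>(<b>.*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

// With PREG_SET_ORDER, index 0 of each entry holds the full match.
$matches = array_column($result, 0);
print_r($matches);
// [0] => <b> text <b> goes </b> here </b>
// [1] => <b> shouldn't </b>
```

The key difference from the earlier attempts is that the inner pair is consumed as a unit, so its </b> can no longer be mistaken for the outer one.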

Post by stereofrog »

This handles only two levels of nesting. What if you ever have three?
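
A quick sketch of what that looks like in practice, using a made-up input with three levels of nesting:

```php
<?php
// Hypothetical input with three levels of nesting.
$html = '<b>a<b>b<b>c</b>d</b>e</b>';

preg_match_all('%<b>(<b>.*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

// The inner alternative pairs the second <b> with the first </b> it sees,
// so the overall match closes one </b> too early and drops "e</b>".
$truncated = $result[0][0];
echo $truncated, "\n";   // <b>a<b>b<b>c</b>d</b>
```

The pattern still reports a match, which makes this kind of failure easy to miss: the result is simply cut short.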

Post by Stryks »

Well ... it's true that at some point you would have to decide on the maximum level of nesting you want to handle.

But if you wanted to accept 3 levels, then extending the original would produce something like

Code:

preg_match_all('%<b>(?:<b>(?:<b>.*?</b>|.)*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

I haven't tested that though. Likewise untested, the following should handle nesting 5 levels deep (six <b> levels in total).

Code:

preg_match_all('%<b>(?:<b>(?:<b>(?:<b>(?:<b>(?:<b>.*?</b>|.)*?</b>|.)*?</b>|.)*?</b>|.)*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

Anyhow ... it gets me out of my bind for the time being. I have multiple nests, but none more than one level deep, so the first option should do the trick.

Cheers all
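
For readers who would rather not pick a maximum depth up front: PCRE, which backs the preg_* functions, supports recursive patterns through (?R). The following is a sketch rather than a tested production pattern. The input string is made up, and it assumes balanced tags; very large or malformed inputs may hit the pcre.backtrack_limit or pcre.recursion_limit settings.

```php
<?php
// Hypothetical input: three levels of nesting, then a separate bold run.
$html = 'x <b>a<b>b<b>c</b>d</b>e</b> y <b>z</b>';

// (?R) re-enters the whole pattern, so each inner <b> must find its own
// closing </b> before the outer one is allowed to close.
preg_match_all('%<b>(?:(?R)|.)*?</b>%si', $html, $result, PREG_SET_ORDER);

// With PREG_SET_ORDER, index 0 of each entry holds the full match.
$matches = array_column($result, 0);
print_r($matches);
// [0] => <b>a<b>b<b>c</b>d</b>e</b>
// [1] => <b>z</b>
```

This replaces the hand-nested alternations with a single self-referencing group, so the same pattern covers one level or many.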