Nested tags a problem when pulling data from HTML [SOLVED]
Posted: Sat Jun 16, 2007 3:37 am
by Stryks
Hi all,
I'm trying to lift some data out of an HTML file. Basically, I'm trying to get everything between <li class="free"> and its corresponding </li>.
The following works for the most part; however, under certain circumstances there is another <ul><li></li></ul> set nested inside the <li> set I am trying to pull.
It grabs from my opening tag to the first </li>.
The initial regex is:
Code:
preg_match_all('%<li\\sclass="free">(.*?)</li>%si', $html, $result, PREG_SET_ORDER);
But as I said ... that breaks on the first </li> it finds (and rightly so, really).
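For anyone wanting to reproduce the problem, here is a minimal sketch. The sample markup in $html is made up for illustration and is not from the real page.

```php
<?php
// Hypothetical sample markup: one "free" item containing a nested list.
$html = '<ul><li class="free">Outer <ul><li>Inner</li></ul> tail</li></ul>';

preg_match_all('%<li\\sclass="free">(.*?)</li>%si', $html, $result, PREG_SET_ORDER);

// The lazy .*? stops at the first </li>, which closes the nested item,
// so " tail" and the rest of the outer item are lost.
echo $result[0][1], "\n"; // Outer <ul><li>Inner
```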
I had high hopes for
Code:
preg_match_all('%<li\\sclass="free">(.*?(?:<li>.*?</li>)*?)</li>%si', $html, $result, PREG_SET_ORDER);
... but no love.
Any ideas on how to accomplish what I'm after?
Thanks
Posted: Sat Jun 16, 2007 4:21 am
by Stryks
To clarify a bit ....
How could I take
Some <b> text <b> goes </b> here </b> but <b> shouldn't </b> go here
and pull back
<b> text <b> goes </b> here </b>
and
<b> shouldn't </b>
I can get the first, with
Code:
preg_match_all('%(<b>.*(?:<b>.*?</b>).*?</b>)*%si', $html, $result, PREG_SET_ORDER);
but it doesn't grab the second.
I could do an alternation, I suppose, to pick up the other, but I'm never really going to know how many levels deep the nesting will go.
I don't quite understand why saying "Start at <b> and get everything, including any <b> something </b> you find, however many times it happens to be there, and stop at </b>" wouldn't work. My interpretation of that doesn't work, though maybe I just haven't expressed it properly.
Code:
preg_match_all('%<b>.*(?:<b>.*?</b>).*</b>%si', $html, $result, PREG_SET_ORDER);
Still stumped ...

Posted: Sat Jun 16, 2007 9:21 am
by stereofrog
Regular expressions are not suitable for parsing nested structures like HTML. A recursive parser (e.g. HTML_Sax) is the proper tool for that.
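As a rough sketch of the parser route: the example below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than HTML_Sax, purely because it needs no extra install. The sample markup is hypothetical, and the XPath test only matches a class attribute that is exactly "free".

```php
<?php
// Hypothetical sample markup with a nested list inside the target item.
$html = '<ul><li class="free">Outer <ul><li>Inner</li></ul> tail</li>'
      . '<li class="paid">Skip me</li></ul>';

$doc = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = array();
foreach ($xpath->query('//li[@class="free"]') as $li) {
    // saveXML($node) serialises the element together with all its children,
    // so the nested <ul> comes along however deep it goes.
    $items[] = $doc->saveXML($li);
}
print_r($items);
```

Because the parser tracks open and close tags itself, the depth of nesting does not matter here.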
Posted: Sat Jun 16, 2007 9:22 am
by Stryks
Well .. that took a long time .... but here we are.
The solution (and it looks so damn obvious now) for the example is
Code:
preg_match_all('%<b>(<b>.*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);
Maybe someone will find this useful.
Posted: Sat Jun 16, 2007 2:02 pm
by stereofrog
This handles only two levels of nesting. What if you ever have three?
Posted: Sat Jun 16, 2007 7:52 pm
by Stryks
Well ... it's true that you would have to decide at some point on the maximum level of nesting you would handle.
But if you wanted to accept 3 levels, then extending the original would produce something like
Code:
preg_match_all('%<b>(?:<b>(?:<b>.*?</b>|.)*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);
I haven't tested that, though. Likewise untested, the following should handle six levels (five nested inside the outermost <b>).
Code:
preg_match_all('%<b>(?:<b>(?:<b>(?:<b>(?:<b>(?:<b>.*?</b>|.)*?</b>|.)*?</b>|.)*?</b>|.)*?</b>|.)*?</b>%si', $html, $result, PREG_SET_ORDER);
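An alternative that avoids spelling out each level by hand is PCRE's recursive pattern syntax, where (?R) re-enters the whole pattern. The sketch below is not from the thread's tested code. It assumes well-formed, balanced <b> tags, and the $deep string is a made-up three-level case. On very long input the recursion may run into PCRE's backtrack or recursion limits.

```php
<?php
// Sample text from earlier in the thread, plus a hypothetical three-level case.
$html = "Some <b> text <b> goes </b> here </b> but <b> shouldn't </b> go here";
$deep = 'x <b>1 <b>2 <b>3</b> 2</b> 1</b> y';

// (?R) recurses into the whole pattern wherever a nested <b> opens;
// any other character is consumed by the plain dot.
$pattern = '%<b>(?:(?R)|.)*?</b>%si';

preg_match_all($pattern, $html, $result, PREG_SET_ORDER);
preg_match_all($pattern, $deep, $deepResult, PREG_SET_ORDER);

echo $result[0][0], "\n";     // <b> text <b> goes </b> here </b>
echo $result[1][0], "\n";     // <b> shouldn't </b>
echo $deepResult[0][0], "\n"; // <b>1 <b>2 <b>3</b> 2</b> 1</b>
```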
Anyhow ... it gets me out of my bind for the time being. I have multiple nests, but not more than one level deep so the first option should do the trick.
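For reference, a sketch of how the two-level pattern might be adapted to the original <li class="free"> case. The sample markup is hypothetical, and the pattern assumes nested <li> items go only one level deep, as described above.

```php
<?php
// Hypothetical markup: one "free" item with a nested list, one without.
$html = '<ul><li class="free">Outer <ul><li>A</li><li>B</li></ul> tail</li>'
      . '<li class="free">Plain</li></ul>';

// Inside the outer item, consume each complete nested <li ...>...</li> as a
// unit; otherwise consume one character at a time until the closing </li>.
$pattern = '%<li\\sclass="free">((?:<li\\b[^>]*>.*?</li>|.)*?)</li>%si';

preg_match_all($pattern, $html, $result, PREG_SET_ORDER);

echo $result[0][1], "\n"; // Outer <ul><li>A</li><li>B</li></ul> tail
echo $result[1][1], "\n"; // Plain
```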
Cheers all