HTML - Recursive patterns

Posted: Thu Nov 02, 2006 3:25 am
by []InTeR[]
I have a problem with an HTML->object program that I need to fix in PHP.

First I find the first tag with this preg_match:

Code: Select all

$num = preg_match("/^(.*?)<([a-zA-Z\/][^ ]*?)( [^>]+)?>(.*)$/s", $content, $startTag);
Then I try to find the closing tag with:

Code: Select all

$num = preg_match("/^(.*?)<.".$this->__preg_quote($startTag[2]).">(.*)$/s", $startTag[4], $endTag);
$startTag[4] holds the rest of the document after the opening tag.
$startTag[2] holds the name of the opening tag.


But when you have a <table> inside a <table>, or any other nested tag with the same name, it picks the wrong closing tag.

I have looked into recursive patterns, but I really have no clue how to use them to fix this.

I have searched already, and the only solutions I found use a recursive function. I think this can be done with recursive patterns as well.
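
For reference, here is a rough sketch of what a PCRE recursive pattern for this could look like. The helper name matchBalancedTag is made up for illustration and is not part of the program above. It assumes reasonably well-formed markup, a PHP version with type declarations (7.1+), and it does not handle a ">" inside attribute values or comments.

```php
<?php
// Hypothetical helper (not from the original program): returns the inner
// content of the first balanced <$tag>...</$tag> element, or null if no
// balanced element is found (or if PCRE gives up on a backtrack limit).
function matchBalancedTag(string $html, string $tag): ?string
{
    $t = preg_quote($tag, '~');
    $pattern = '~<' . $t . '\b[^>]*>'   // opening tag, attributes allowed
        . '((?:'
        .   '[^<]++'                    // plain text between tags
        .   '|(?R)'                     // a nested element with the same name
        .   '|<(?!/?' . $t . '\b)'      // the "<" of any other tag
        . ')*)'
        . '</' . $t . '>~is';           // the closing tag that balances the first
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html  = "<div>Div 1<div id='2'>div 2</div><div id=3>nr 3<div id=4>\tnr 4</div></div></div>";
$inner = matchBalancedTag($html, 'div');
// $inner is everything between the outermost <div> and its own </div>.
```

The (?R) item re-enters the whole pattern, so every nested opening tag has to consume its own closing tag before the outer one can match. On large documents this kind of pattern can run into PCRE's backtracking or recursion limits, in which case the sketch simply returns null.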

Posted: Thu Nov 02, 2006 4:12 am
by volka

Posted: Thu Nov 02, 2006 4:29 am
by []InTeR[]
I have read the page.

But I still have no clue what to do.

I think I need this bit, but I still have no idea what to change in the regex.
A further problem with the scanning approach is that one cannot count on returned markup items to be sufficiently well structured to support extraction of attribute value lists. This represents an impediment to element tag processing in the "normal case," that is, that element tags are correctly formed. In order to remove this impediment, the following expression may be used as the basis for shallow parsing.

ElemTagRE = '<' Name (S Name S? '=' S? AttValSE)* S? '/'? '>'
S = [ \n\t\r]+
And yes, I don't know regex. That probably adds to the problem here.
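
As a rough illustration, the ElemTagRE rule quoted above could be written as a PHP pattern like this. The quoted excerpt does not define Name or AttValSE, so the two definitions below are simplified assumptions (an ASCII-only name, and a quoted attribute value), and isElemTag is a made-up helper name.

```php
<?php
// Building blocks. $S comes from the quoted rule; $Name and $AttValSE are
// assumed, simplified stand-ins because the excerpt does not define them.
$S        = '[ \n\t\r]+';
$Name     = '[A-Za-z_:][A-Za-z0-9_:.-]*';
$AttValSE = '"[^<"]*"|\'[^<\']*\'';

// ElemTagRE = '<' Name (S Name S? '=' S? AttValSE)* S? '/'? '>'
$ElemTagRE = '<' . $Name
    . '(?:' . $S . $Name . '(?:' . $S . ')?=(?:' . $S . ')?(?:' . $AttValSE . '))*'
    . '(?:' . $S . ')?/?>';

// Hypothetical helper: true if $markup is exactly one well-formed element tag.
function isElemTag(string $markup): bool
{
    global $ElemTagRE;
    return preg_match('~^' . $ElemTagRE . '\z~', $markup) === 1;
}
```

Under these assumptions the rule only accepts quoted attribute values, so a tag such as <div id=3> would not match, and closing tags such as </div> are not covered by it either.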

Posted: Mon Nov 06, 2006 3:44 am
by []InTeR[]
I'm now trying to solve this with a 'find-offset' function that returns an offset after which the closing tag must appear.

Example:
This is the HTML I pass to the first preg_match, and what is left after it:

Code: Select all

$content = "<div>Div 1<div id='2'>div 2</div><div id=3>nr 3<div id=4>	nr 4</div></div></div>";
$num = preg_match("/^(.*?)<([a-zA-Z\/][^ ]*?)( [^>]+)?>(.*)$/s", $content, $startTag);

// $startTag[4] now is: "Div 1<div id='2'>div 2</div><div id=3>nr 3<div id=4>	nr 4</div></div></div>"
// $startTag[2] now is "div"

$offset = findOffset($startTag[4],$startTag[2]);

// $offset now is 67, the / before the last </div>

$preg = "/^(.*?)<.".$this->__preg_quote($startTag[2]).">(.*)$/s";
// $preg now is "/^(.*?)<.div>(.*)$/s"
$num = preg_match($preg, $startTag[4], $endTag, 0, $offset);
This code fails: the last preg_match does not find my last </div>.

I'm going to pull some hair out atm.

Posted: Mon Nov 06, 2006 6:33 pm
by Ambush Commander
Why not use one of the XML parsing libraries, such as http://us2.php.net/manual/en/ref.xml.php or http://us2.php.net/manual/en/ref.dom.php ? Doing recursive reg-exps is nasty business, so I really don't recommend it.

If you really must parse HTML, I'd recommend forgoing the regexps completely. Here's a parser that was implemented without regexps: http://hp.jpsband.org/svnroot/htmlpurif ... ectLex.php (note that it has a few dependencies on other classes, so I recommend you download the entire pack: http://hp.jpsband.org/ )
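
A minimal sketch of the DOM route, assuming the DOM extension is available on the server. The sample string is the one from the earlier post; the variable names are only for illustration.

```php
<?php
$html = "<div>Div 1<div id='2'>div 2</div><div id=3>nr 3<div id=4>\tnr 4</div></div></div>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// The parser has already paired every <div> with its own closing tag,
// so nesting needs no special handling here.
$divs = $doc->getElementsByTagName('div');
$ids  = [];
foreach ($divs as $div) {
    $ids[] = $div->getAttribute('id'); // empty string when there is no id
}
```

Each element in $divs is a DOMElement, so child nodes, attributes and text are available from it without any regexps.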

Posted: Tue Nov 07, 2006 3:11 am
by []InTeR[]
Well, those two aren't working on our server for some reason.

I will look into the other script.

I got it working for the moment with two substrings: one to strip the part before the offset ahead of the preg_match, and one to put the pieces back together afterwards.

It's not the fastest but it's working for now.
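
For anyone reading later, the two-substring workaround could look roughly like the sketch below. The subject string and offset are made-up sample values, not the ones from the program above. The background is that the offset argument of preg_match is not the same as cutting the string: ^ still refers to the start of the whole subject, so a pattern anchored with ^ cannot match when the offset is greater than 0.

```php
<?php
$subject = "a</div>b</div>c";          // made-up sample input
$pattern = '~^(.*?)</div>(.*)$~s';
$offset  = 7;                          // hypothetical: just past the first </div>

// Passing the offset directly: ^ cannot match at position 7, so no match.
$direct = preg_match($pattern, $subject, $m1, 0, $offset);

// Workaround: strip the prefix, match on the remainder, then put it back.
$head      = substr($subject, 0, $offset);
$tail      = substr($subject, $offset);
$viaSubstr = preg_match($pattern, $tail, $m2);
$before    = $head . $m2[1];           // everything before the closing tag found
$after     = $m2[2];                   // everything after it

// Alternative: \G anchors at the offset itself, so no substr is needed.
$viaG = preg_match('~\G(.*?)</div>(.*)$~s', $subject, $m3, 0, $offset);
```

With \G the captured group only holds the text from the offset onwards, so the prefix still has to be prepended if the full "before" part is needed.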