
I will look more in this forum wait before answer

Posted: Fri Nov 27, 2009 5:33 am
by moussa854
Happy Thanksgiving to all:
I know this might sound simple to some of you, but I looked for an answer and did not find one. How do I extract data from a web page by looking for a certain tag, <a class="gs-title">, inside a div with the ID <div id="123">?

Thanks

Re: I will look more in this forum wait before answer

Posted: Fri Nov 27, 2009 12:32 pm
by ridgerunner
This will require a two-step process:
  1. Match and save the contents of the <div id="123"> tag.
  2. Search the DIV tag contents and match the <a class="gs-title"> tag.
1. The first step is kind of tricky because a DIV tag can contain nested DIV tags. In the recent "Preg_match_all to get <div> tag contents" thread, I posted a recursive regular expression solution that is very similar to the one we need here. Here is a regex (in both long and short versions) that matches the contents of a <div id="123"> tag (which may contain nested DIV tags):

Code: Select all

$pattern1_long = '%  # recursive regex to capture contents of id="123" DIV
<div\s+[^>]*?id="123"[^>]*>           # match the id="123" DIV opening tag
  (                                   # capture DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                                # match the id="123" DIV closing tag
%six';  // single-line (dot matches all), ignore case and free spacing modes ON
 
$pattern1_short = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
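
To see what group 1 captures when the DIV contains a nested DIV, here is a small sketch. The sample markup in $html is made up for illustration and is not from any real page; only the pattern comes from the post above.

```php
<?php
// Hypothetical sample markup: an id="123" DIV holding a nested DIV,
// followed by an unrelated sibling DIV that must not be captured.
$html = '<div id="123">outer <div>inner</div> <a class="gs-title" href="#">x</a></div><div>other</div>';

// Short version of the recursive pattern given above.
$pattern1_short = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';

if (preg_match($pattern1_short, $html, $m)) {
    // $m[0] is the whole id="123" DIV, $m[1] is only its contents.
    echo $m[1], "\n";
} else {
    echo "No id=\"123\" DIV found\n";
}
```

The nested <div>inner</div> stays inside the captured contents, and the match stops at the closing tag that balances the id="123" opening tag rather than at the first </div> it meets.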
2. If the first regex is successful, then we need to search the contents of the DIV tag (captured in group 1) for the <a class="gs-title"> sequence with the following regex:

Code: Select all

$pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
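
As a rough illustration of the second step, the sketch below runs $pattern2 with preg_match_all() to collect every matching opening tag. The $div_contents string is invented sample data, and $pattern2_text is a hypothetical variation (not part of the original answer) that also captures the link text up to the closing </a>.

```php
<?php
// Hypothetical DIV contents: two gs-title links and one link without the class.
$div_contents = 'outer <a class="gs-title" href="/one">First</a> and '
              . '<a href="/skip">no</a> <a href="/two" class="gs-title">Second</a>';

// Pattern from the post: matches only the opening <a ...> tag.
$pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
preg_match_all($pattern2, $div_contents, $tags);

// Assumed variation: group 1 lazily captures the text before the closing </a>.
$pattern2_text = '%<a\s+[^>]*?class="gs-title"[^>]*>(.*?)</a>%si';
preg_match_all($pattern2_text, $div_contents, $links);

print_r($tags[0]);   // the matching opening tags
print_r($links[1]);  // the link text of each matching tag
```

Because [^>]*? cannot cross a ">" character, the class attribute has to sit inside the same opening tag, so the /skip link is not picked up.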
You didn't say what you wanted to do with the match. Assuming you just want to test whether both conditions are true for a given text string, here is a PHP function that should do the trick:

Code: Select all

<?php // File: MatchAllDiv123.php
 
function MatchID123($data) {
    $pattern1 = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
    $pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
    $matchcount = preg_match_all($pattern1, $data, $matches);
    if ($matchcount > 0) {
        for ($i = 0; $i < $matchcount; $i++) {
            $div_contents = $matches[1][$i];
            if (preg_match($pattern2, $div_contents)) return TRUE;
        }
    }
    return FALSE;
}
 
// Read html file to be processed into $data variable
$data = file_get_contents('MatchAllDiv123_TestData.html');
 
if (MatchID123($data)) {
    echo("Match found");
} else {
    echo('No match');
}
?>
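
If the goal is to extract the matching tags rather than only test for them, the same two patterns can be combined along these lines. This is a sketch under assumptions: ExtractGsTitleTags() is a hypothetical name, and $sample is made-up test data.

```php
<?php
// Hypothetical helper (not from the original post): returns every
// <a class="gs-title"> opening tag found inside any id="123" DIV.
function ExtractGsTitleTags($data) {
    $pattern1 = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
    $pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
    $found = array();
    if (preg_match_all($pattern1, $data, $matches)) {
        // $matches[1] holds the contents of each id="123" DIV.
        foreach ($matches[1] as $div_contents) {
            if (preg_match_all($pattern2, $div_contents, $tags)) {
                $found = array_merge($found, $tags[0]);
            }
        }
    }
    return $found;
}

// Made-up sample: one gs-title link outside the DIV, one nested inside it.
$sample = '<a class="gs-title" href="/outside">x</a>'
        . '<div id="123"><div><a class="gs-title" href="/in">y</a></div></div>';
print_r(ExtractGsTitleTags($sample));
```

Only the link inside the id="123" DIV is returned; the one outside it is ignored because the second pattern is run against the captured DIV contents only.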
Hope this helps!

(Note that PCRE recursive regexes like the first one above can quickly use up the available stack memory and fail if the test subject is large, especially on Windows boxes. This is because recursive regexes need *LOTS* of stack space.)
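
One way to make such failures visible instead of silent is to cap the recursion depth and check preg_last_error() when preg_match() returns FALSE. The sketch below assumes a PHP version where the pcre.recursion_limit setting and preg_last_error() are available; the value 10000 is an arbitrary illustration, not a recommendation, and the setting may behave differently when PCRE's JIT is in use. The $data string is made-up sample input.

```php
<?php
// Assumption: 10000 is only an illustrative cap on PCRE's recursion depth.
ini_set('pcre.recursion_limit', '10000');

$pattern1 = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
$data = '<div id="123"><div>nested</div> text</div>';  // small made-up sample

$result = preg_match($pattern1, $data, $m);
if ($result === false) {
    // preg_match() returns FALSE on error; preg_last_error() says why.
    switch (preg_last_error()) {
        case PREG_RECURSION_LIMIT_ERROR:
            echo "PCRE recursion limit reached\n";
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            echo "PCRE backtrack limit reached\n";
            break;
        default:
            echo "Other PCRE error\n";
    }
} elseif ($result === 0) {
    echo "No match\n";
} else {
    echo "Matched\n";
}
```

Distinguishing FALSE from 0 matters here: a plain if (preg_match(...)) treats an aborted match the same as "no match".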

Re: I will look more in this forum wait before answer

Posted: Fri Nov 27, 2009 2:39 pm
by califdon
Ridgerunner: please follow forum guidelines about meaningful Subject lines!

[Edit:] Whoops! My apologies to Ridgerunner, that comment was intended for the original poster, moussa854! Sorry!

Re: I will look more in this forum wait before answer

Posted: Fri Nov 27, 2009 5:00 pm
by moussa854
Thanks for your reply. I searched the internet (Google) and then decided to come here, as this forum has been useful to me. Then I saw the previous post and wanted to look more in this forum. Thank you for your reply.

Happy holidays to all