Happy Thanksgiving to all:
I know this might sound simple to some of you, but I looked for an answer and did not find one. How do I extract data from a web page by looking for a certain tag, <a class="gs-title">, inside a div with a given ID, <div id="123">?
Thanks
I will look more in this forum wait before anwer
- ridgerunner
Re: I will look more in this forum wait before anwer
This will require a two-step process:
- Match and save the contents of the <div id="123"> tag.
- Search the DIV tag contents and match the <a class="gs-title"> tag.
1. First, we need to capture the contents of the id="123" DIV (into group 1) with the following recursive regex, shown in a commented long form and an equivalent short form:
Code:
$pattern1_long = '% # recursive regex to capture contents of id="123" DIV
    <div\s+[^>]*?id="123"[^>]*>          # match the id="123" DIV opening tag
    (                                    # capture DIV contents into $1
      (?:                                # non-cap group for nesting * quantifier
        (?: (?!<div[^>]*>|</div>). )++   # possessively match all non-DIV tag chars
      |                                  # or
        <div[^>]*>(?1)</div>             # recursively match nested <div>xyz</div>
      )*                                 # loop however deep as necessary
    )                                    # end group 1 capture
    </div>                               # match the id="123" DIV closing tag
    %six'; // single-line (dot matches all), ignore case and free spacing modes ON
$pattern1_short = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
2. If the first regex is successful, then we need to search the contents of the DIV tag (captured in group 1) for the <a class="gs-title"> sequence with the following regex:
Code:
$pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
You didn't say what you wanted to do with the match. Assuming you just want to test whether both conditions are true for a given text string, here is a PHP function that should do the trick:
Code:
<?php // File: MatchAllDiv123.php
// Return TRUE if any <div id="123"> contains an <a class="gs-title"> tag.
function MatchID123($data) {
    $pattern1 = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
    $pattern2 = '%<a\s+[^>]*?class="gs-title"[^>]*>%i';
    // Step 1: capture the contents of every id="123" DIV into $matches[1].
    $matchcount = preg_match_all($pattern1, $data, $matches);
    if ($matchcount > 0) {
        for ($i = 0; $i < $matchcount; $i++) {
            $div_contents = $matches[1][$i];
            // Step 2: look for the anchor tag inside this DIV's contents.
            if (preg_match($pattern2, $div_contents)) return TRUE;
        }
    }
    return FALSE;
}
// Read html file to be processed into $data variable
$data = file_get_contents('MatchAllDiv123_TestData.html');
if (MatchID123($data)) {
    echo('Match found');
} else {
    echo('No match');
}
?>
Hope this helps!
(Note that PCRE recursive regexes (like the first one above) can quickly use up all available memory and fail if the test subject is large, especially on Windows boxes. This is because recursive regexes need *LOTS* of stack space.)
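If the goal is to pull the matching links out rather than just test for them, the same two-step idea can be extended. What follows is only a minimal sketch, not a definitive implementation: ExtractGsTitleLinks() and $link_pattern are hypothetical names that are not part of the function above, and the sketch assumes each link is a simple <a ...>text</a> pair whose href attribute is double-quoted.

```php
<?php
// Sketch: collect href/text pairs for every <a class="gs-title"> link found
// inside a <div id="123">. The function name and $link_pattern are
// illustrative assumptions, not part of the original MatchID123() code.
function ExtractGsTitleLinks($data) {
    // Same recursive DIV pattern as in MatchID123(); group 1 = DIV contents.
    $pattern1 = '%<div\s+[^>]*?id="123"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>%si';
    // Group 1 = opening <a> tag, group 2 = everything up to the next </a>.
    $link_pattern = '%(<a\s+[^>]*?class="gs-title"[^>]*>)(.*?)</a>%si';

    $count = preg_match_all($pattern1, $data, $divs);
    if ($count === false) {
        // PCRE reported an error (for example a recursion or backtrack limit).
        return false;
    }
    $links = array();
    foreach ($divs[1] as $div_contents) {
        if (!preg_match_all($link_pattern, $div_contents, $hits, PREG_SET_ORDER)) {
            continue; // no gs-title links in this DIV
        }
        foreach ($hits as $hit) {
            // Assumption: href is double-quoted; otherwise it is reported as null.
            $href = preg_match('%href="([^"]*)"%i', $hit[1], $h) ? $h[1] : null;
            $links[] = array(
                'href' => $href,
                'text' => trim(strip_tags($hit[2])), // drop inner markup such as <b>
            );
        }
    }
    return $links;
}

// Usage with a small inline sample instead of a file:
$sample = '<div id="123"><div class="inner">'
        . '<a class="gs-title" href="/first">First <b>hit</b></a></div>'
        . '<a class="other" href="/skip">ignored</a></div>';
print_r(ExtractGsTitleLinks($sample));
```

Returning false separately from an empty array lets the caller tell "PCRE gave up" apart from "no links found", which matters on large inputs where the recursive pattern may hit its limits.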
Re: I will look more in this forum wait before anwer
Ridgerunner: please follow forum guidelines about meaningful Subject lines!
[Edit:] Whoops! My apologies to Ridgerunner, that comment was intended for the original poster, mousa854! Sorry!
Last edited by califdon on Fri Nov 27, 2009 7:58 pm, edited 1 time in total.
Reason: To correct my error in users.
Re: I will look more in this forum wait before anwer
Thanks for your reply. I searched the Internet with Google and then decided to come here, as this forum has been useful to me before. Then I saw the previous post and wanted to look around the forum some more. Thank you for your reply.
Happy holidays to all