Hey folks,
I am trying to build a small application that monitors specific URLs and notifies me when their content has changed.
However, some of the URLs the system needs to monitor go through one or more redirects, so the final page URL is different from the original URL.
I need to be able to read and analyze the HTML of the *FINAL* page, after all redirects have been processed.
I tried the following code, but it only picks up the source of the original URL:
$contents = file_get_contents($MonitorUrl);
if (strpos($contents, $SearchText) !== false) {
    echo 'found<br/><br/>';
} else {
    echo 'not found<br/><br/>';
}
Can anyone offer any advice?
Need to analyze page HTML *after* all redirects
Re: Need to analyze page HTML *after* all redirects
Use cURL instead. It's much more powerful than file_get_contents.
Re: Need to analyze page HTML *after* all redirects
Thanks for the reply, tasairis.
I'm not at all familiar with cURL, but I've done some digging around on the net and come up with the following code:
$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return the page as a string instead of printing it
    CURLOPT_HEADER         => false,    // don't include response headers in the output
    CURLOPT_FOLLOWLOCATION => true,     // follow HTTP (Location header) redirects
    CURLOPT_ENCODING       => "",       // accept every encoding cURL supports
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // seconds to wait while connecting
    CURLOPT_TIMEOUT        => 120,      // seconds allowed for the whole transfer
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
);
$ch = curl_init($MonitorUrl);
curl_setopt_array($ch, $options);
$contents = curl_exec($ch);   // page body, or false on failure
$err      = curl_errno($ch);  // cURL error code (0 means no error)
$errmsg   = curl_error($ch);  // human-readable error message
$header   = curl_getinfo($ch); // transfer info, including the final URL under 'url'
curl_close($ch);
But I still seem to have the same issue. On inspecting the page I am using for testing, I found that the redirection is done with JavaScript rather than an HTTP redirect, as follows:
<script language="javascript">window.location.replace("http://tracker02.com/trkv2.asp?C=xxxx&D ... ID=xxxxxxx");</script>
Is there any way to process JavaScript like this using cURL?
As I say, my objective is simply to arrive at the final page, after any redirects, and get the HTML of that page for analysis.
Is this going to be possible?
Re: Need to analyze page HTML *after* all redirects
trumaj wrote: Is there any way to process JavaScript like this using cURL?
Not really.
Are the redirects predictable?
Re: Need to analyze page HTML *after* all redirects
No, not at all.
At the moment I'm trying to parse the <head> tag for redirecting JavaScript, pick up the redirection URL, and then navigate to it directly.
It's not very elegant, and I'm not sure it will work in 100% of cases...
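For illustration, the parsing approach described above might look roughly like the sketch below. It is only a sketch under stated assumptions: the function names extractJsRedirectUrl and fetchFinalHtml are hypothetical, the regular expressions only recognize simple redirects where the target is a quoted string literal (window.location.replace("..."), location.assign("..."), window.location = "...", location.href = "..."), and relative URLs are not resolved. Anything that builds the URL dynamically in JavaScript will be missed, so this will not cover 100% of cases either.

```php
<?php
// Hypothetical helper: returns the target of a simple JavaScript redirect
// found in $html, or null when no recognized pattern is present.
function extractJsRedirectUrl(string $html): ?string
{
    $patterns = array(
        // window.location.replace("...") or location.assign('...')
        '/\blocation\.(?:replace|assign)\(\s*(["\'])(.+?)\1\s*\)/i',
        // window.location = "..." or location.href = '...'
        '/\blocation(?:\.href)?\s*=\s*(["\'])(.+?)\1/i',
    );
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m)) {
            return $m[2]; // the quoted URL, exactly as written in the page
        }
    }
    return null;
}

// Hypothetical helper: fetches $url with the supplied $fetch callable and
// keeps following recognized JavaScript redirects, up to $maxHops times.
// $fetch is assumed to take a URL and return the page HTML as a string,
// for example a thin wrapper around the cURL code posted earlier.
function fetchFinalHtml(string $url, callable $fetch, int $maxHops = 5): string
{
    $html = $fetch($url);
    for ($hop = 0; $hop < $maxHops; $hop++) {
        $next = extractJsRedirectUrl($html);
        if ($next === null) {
            break; // no further redirect recognized: treat this as the final page
        }
        $html = $fetch($next);
    }
    return $html;
}
```

Passing the fetcher in as a callable keeps the redirect logic separate from the HTTP code, and the hop limit plays the same role as CURLOPT_MAXREDIRS by guarding against redirect loops.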