Need to analyze page HTML *after* all redirects

Posted: Fri Feb 12, 2010 4:58 pm
by trumaj
Hey folks,

I am trying to build a small application that monitors specific URLs and notifies me when their content has changed.

However, some of the URLs the system needs to monitor have 1 or more redirects, so the final page URL is different to the original URL.

I need to be able to read and analyze the HTML in the *FINAL* page, after all redirects have been processed.

I tried the following code, but this only picks up the source code of the original URL.

$contents = file_get_contents($MonitorUrl);
if (strpos($contents, $SearchText) !== false)
{
    echo 'found<br/><br/>';
}
else
{
    echo 'not found<br/><br/>';
}

Can anyone offer any advice?

Re: Need to analyze page HTML *after* all redirects

Posted: Fri Feb 12, 2010 6:00 pm
by requinix
Use cURL instead. It's much more powerful than file_get_contents.

Re: Need to analyze page HTML *after* all redirects

Posted: Fri Feb 12, 2010 9:10 pm
by trumaj
Thanks for the reply, tasairis.

I'm not at all familiar with cURL, but I've done some digging around on the net and come up with the following code:

$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
);

$ch = curl_init($MonitorUrl);
curl_setopt_array($ch, $options);
$contents = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close($ch);

But I still seem to have the same issue. On inspecting the page I am using for testing, I found that the redirection is done by means of some JavaScript, as follows:

<script language="javascript">window.location.replace("http://tracker02.com/trkv2.asp?C=xxxx&D ... ID=xxxxxxx");</script>

Is there any way to process JavaScript like this using cURL?

As I say, my objective is simply to arrive at the final page, after any redirects, and get the HTML of that page for analysis.

Is this going to be possible?
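
For what it's worth, one way to check whether cURL followed any HTTP-level redirects at all is to look at the array that curl_getinfo() returns (stored in $header in the code above). This is only a minimal sketch: summarizeTransfer() is a made-up helper name, and it assumes the url, redirect_count and http_code keys that curl_getinfo() reports when called without a second argument.

```php
<?php
// Hypothetical helper: summarise the array returned by curl_getinfo($ch).
function summarizeTransfer(array $info): string
{
    // Fall back to placeholder values if a key is missing.
    $finalUrl  = $info['url'] ?? '(unknown)';
    $redirects = $info['redirect_count'] ?? 0;
    $status    = $info['http_code'] ?? 0;

    return sprintf(
        'final URL: %s, HTTP redirects followed: %d, status: %d',
        $finalUrl,
        $redirects,
        $status
    );
}

// Usage with the variables from the post above:
// echo summarizeTransfer($header);
```

If this reports zero redirects and a 200 status, the server never sent an HTTP redirect, which fits a redirect that is done in JavaScript on the page itself.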

Re: Need to analyze page HTML *after* all redirects

Posted: Sat Feb 13, 2010 12:56 am
by requinix
trumaj wrote: Is there any way to process JavaScript like this using cURL?
Not really. cURL only speaks HTTP; it does not execute JavaScript.

Are the redirects predictable?

Re: Need to analyze page HTML *after* all redirects

Posted: Sat Feb 13, 2010 12:59 am
by trumaj
No, not at all.

At the moment I'm trying to parse the <head> tag for redirecting JavaScript, pick up the redirection URL, and then navigate to it directly.

Not very elegant, and I'm not sure it will work in 100% of cases...
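
Roughly, that approach could look like the sketch below. It is not a general solution: findClientRedirect() and fetchFinalHtml() are names I made up, the regular expressions only cover a few common string-literal forms (location.replace, location.assign, location = and location.href =), and relative URLs or URLs built up at runtime are not handled. The $fetch argument is assumed to be any function that takes a URL and returns its HTML, for example a wrapper around the cURL code above.

```php
<?php
// Look for a JavaScript redirect target in an HTML document.
// Returns the target URL, or null when none of the known patterns match.
function findClientRedirect(string $html): ?string
{
    $patterns = array(
        // window.location.replace("...") or location.assign('...')
        '/\blocation\.(?:replace|assign)\(\s*(["\'])(.*?)\1\s*\)/i',
        // window.location = "..." or location.href = '...'
        '/\blocation(?:\.href)?\s*=\s*(["\'])(.*?)\1/i',
    );

    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m)) {
            return trim($m[2]);
        }
    }

    return null;
}

// Fetch $url, then keep following JavaScript redirects until a page has
// none, or until $maxHops redirects have been followed.
// Returns array(final URL, HTML of that URL).
function fetchFinalHtml(string $url, callable $fetch, int $maxHops = 5): array
{
    $html = $fetch($url);

    for ($hop = 0; $hop < $maxHops; $hop++) {
        $next = findClientRedirect($html);
        if ($next === null) {
            break; // no further client-side redirect found
        }
        $url  = $next;
        $html = $fetch($url);
    }

    return array($url, $html);
}

// Usage, assuming a hypothetical fetchWithCurl($url) that wraps the cURL calls:
// list($finalUrl, $contents) = fetchFinalHtml($MonitorUrl, 'fetchWithCurl');
```

The hop limit plays the same role as CURLOPT_MAXREDIRS: it stops two pages that redirect to each other from looping forever.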