Need to analyze page HTML *after* all redirects

trumaj
Forum Newbie
Posts: 3
Joined: Fri Feb 12, 2010 4:53 pm

Need to analyze page HTML *after* all redirects

Post by trumaj »

Hey folks,

I am building a small application that monitors specific URLs and sends a notification when their content changes.

However, some of the URLs the system needs to monitor go through one or more redirects, so the final page URL is different to the original URL.

I need to be able to read and analyze the HTML in the *FINAL* page, after all redirects have been processed.

I tried the following code, but this only picks up the source code of the original URL.

$contents = file_get_contents($MonitorUrl);
if (strpos($contents, $SearchText) !== false)
{
    echo 'found<br/><br/>';
}
else
{
    echo 'not found<br/><br/>';
}

Can anyone offer any advice?
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Need to analyze page HTML *after* all redirects

Post by requinix »

Use cURL instead. It's much more powerful than file_get_contents.
trumaj
Forum Newbie
Posts: 3
Joined: Fri Feb 12, 2010 4:53 pm

Re: Need to analyze page HTML *after* all redirects

Post by trumaj »

Thanks for the reply, tasairis.

I'm not at all familiar with cURL, but I've done some digging around on the net and come up with the following code:

$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
);

$ch = curl_init($MonitorUrl);
curl_setopt_array($ch, $options);
$contents = curl_exec($ch);
$err      = curl_errno($ch);
$errmsg   = curl_error($ch);
$header   = curl_getinfo($ch);
curl_close($ch);

But I still seem to have the same issue. On inspecting the page I am using for testing, I found that the redirection is done by some JavaScript, as follows:

<script language="javascript">window.location.replace("http://tracker02.com/trkv2.asp?C=xxxx&D ... ID=xxxxxxx");</script>

Is there any way to process JavaScript like this using cURL?

As I say, my objective is simply to arrive at the final page, after any redirects, and get the HTML of that page for analysis.

Is this going to be possible?
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Need to analyze page HTML *after* all redirects

Post by requinix »

trumaj wrote: Is there any way to process JavaScript like this using cURL?
Not really.

Are the redirects predictable?
trumaj
Forum Newbie
Posts: 3
Joined: Fri Feb 12, 2010 4:53 pm

Re: Need to analyze page HTML *after* all redirects

Post by trumaj »

No, not at all.

At the moment I'm trying to parse the <head> tag for redirecting JavaScript, pick up the redirection URL, and then navigate to it directly.

Not very elegant, and I'm not sure it will work in 100% of cases...
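
A minimal sketch of that parsing approach, for illustration only. The names findClientRedirect() and fetchFinalHtml() are hypothetical, and the $fetch callback is assumed to wrap whatever download code is in use (for example the cURL options shown earlier) and to return the page HTML or false. It only recognises simple string-literal redirects such as window.location.replace("...") or location.href = '...', and it assumes the target is an absolute URL; anything built up at runtime would need a real JavaScript engine.

```php
<?php
// Hypothetical helper: returns the target of a simple JavaScript redirect
// found in $html, or null when no recognisable redirect is present.
function findClientRedirect(string $html): ?string
{
    $patterns = array(
        // window.location.replace("...") or location.assign('...')
        '/\blocation\.(?:replace|assign)\(\s*(["\'])(.+?)\1\s*\)/i',
        // window.location = "..." or location.href = '...'
        '/\blocation(?:\.href)?\s*=\s*(["\'])(.+?)\1/i',
    );

    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m)) {
            return trim($m[2]); // group 2 is the quoted URL
        }
    }
    return null;
}

// Hypothetical helper: follows up to $maxHops script redirects, starting
// from $url. $fetch is any callable that takes a URL and returns the HTML
// or false. Returns the final HTML, or null on failure or too many hops.
function fetchFinalHtml(string $url, callable $fetch, int $maxHops = 5): ?string
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $html = $fetch($url);
        if (!is_string($html)) {
            return null; // download failed
        }

        $next = findClientRedirect($html);
        if ($next === null) {
            return $html; // no further redirect: this is the final page
        }
        $url = $next; // assumption: the redirect target is an absolute URL
    }
    return null; // gave up: more than $maxHops redirects
}
```

The result of fetchFinalHtml() can then go through the same strpos() check as before. The hop limit is there so that two pages redirecting to each other cannot loop forever.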