
analysing the contents of another web page from my own

Posted: Wed Jul 02, 2008 3:01 pm
by sd3782
Hi

I'm wondering if anyone knows anything about analysing the contents of another web page from my own, just by using its URL. For example (to make myself clear): one might give the full URL of a website's search results, and my website could trawl the HTML to find a picture and possibly some text related to the picture.

On Facebook this is used when you paste a URL into a message or a wall post - it shows a picture and some text from the page it refers to.

Does anyone even know what this method is called?!

I realise this may be beyond the scope of this forum, but any help would be greatly appreciated, even a pointer to another forum or website.

I'm basically just looking for a simple method - perhaps using the cURL library - or a hint at an algorithm that could be used to scan the HTML of the desired URL.

I'm relatively new to PHP but quite familiar with JavaScript. Thank you in advance for any guidance!

Sam

Re: analysing the contents of another web page from my own

Posted: Wed Jul 02, 2008 11:28 pm
by kilermedia
You were very close when you mentioned cURL - it's one of the pieces you'll need to do what you're describing.

What you're specifically looking for is PHP's DOMDocument class and DOMXPath class, which you can combine with cURL to achieve what you need.

Here's something which should get you started on the right path:

Code:

 
<?php
// Fetch the remote page with cURL
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'My Fake Browser Name');
curl_setopt($ch, CURLOPT_URL, 'http://www.thesite.com/you/are/reading/from/');
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    $msg  = "<br />cURL error number: " . curl_errno($ch);
    $msg .= "<br />cURL error: " . curl_error($ch);
    die($msg);
}
curl_close($ch);

// Parse the HTML; the @ suppresses warnings about malformed markup
$dom = new DOMDocument();
@$dom->loadHTML($html);

// Use XPath to grab every <img> inside the body
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query("/html/body//img");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url  = $href->getAttribute('src');
    echo 'Here, have an image: ' . $url . '<br />';
}
?>
Really, really basic but should get you going. :D
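Since the original question mentioned Facebook-style previews (a picture plus some text), here's a sketch of how the same DOMDocument/DOMXPath approach can pull out the page title and first image. The HTML is hard-coded here so the snippet runs on its own - in practice you'd pass in the `$html` fetched with cURL above.

```php
<?php
// Sketch: grab the <title> and the first <img> src from a page -
// roughly the two things a link preview needs. An inline HTML
// string stands in for the cURL result so this is self-contained.
$html = '<html><head><title>Demo page</title></head>'
      . '<body><img src="/pics/cat.jpg"><p>Hello</p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// wrapping the path in string(...) makes evaluate() return text directly
$title = $xpath->evaluate('string(/html/head/title)');

// first <img> in the body, if there is one
$first = $xpath->query('/html/body//img')->item(0);
$src   = $first ? $first->getAttribute('src') : '';

echo $title . ' | ' . $src; // Demo page | /pics/cat.jpg
?>
```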

Re: analysing the contents of another web page from my own

Posted: Thu Jul 03, 2008 6:55 pm
by sd3782
Hi

Thank you so much for your quick reply! I'm already expanding on the solution you provided.

Just one example of an improvement for other readers:

Code:

 
    for ($i = 0; $i < $hrefs->length; $i++)
    {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('src');

        // $weburl holds the URL of the page that was fetched with cURL
        if ( preg_match( "/^https?:\/\//", $url ) )
        {
            // src is already an absolute URL
            echo '<img src="'.$url.'">';
        }
        else if ( preg_match( "/^\//", $url ) )
        {
            // domain-relative path (begins with /): prepend scheme and host
            preg_match( "/^https?:\/\/[^\/]*/", $weburl, $match );
            echo '<img src="'.$match[0].$url.'">';
        }
        else
        {
            // path-relative (e.g. images/foo.jpg or ./foo.jpg):
            // prepend everything up to and including the page's last /
            preg_match( "/^https?:\/\/.*\//", $weburl, $match );
            echo '<img src="'.$match[0].$url.'">';
        }
    }
 
This ensures all images are displayed even when the image path is not fully qualified.
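For other readers: PHP's parse_url() can do the same base-URL splitting without hand-rolled regexes. Here's a hypothetical helper sketching that approach - resolve_src, $weburl, and the example URLs are my own names, not part of the snippet above:

```php
<?php
// Sketch: resolve an image src against the page URL using parse_url().
function resolve_src($weburl, $src)
{
    // already absolute: leave it alone
    if (preg_match('/^https?:\/\//', $src)) {
        return $src;
    }

    $p    = parse_url($weburl);
    $base = $p['scheme'] . '://' . $p['host'];

    // domain-relative (begins with /): scheme + host is enough
    if ($src[0] === '/') {
        return $base . $src;
    }

    // path-relative (images/foo.jpg, ./foo.jpg): append to the
    // directory part of the page URL
    $dir = isset($p['path']) ? rtrim(dirname($p['path']), '/') : '';
    return $base . $dir . '/' . $src;
}

echo resolve_src('http://www.example.com/gallery/page.html', '/img/a.png');
// http://www.example.com/img/a.png
?>
```

For example, resolve_src('http://www.example.com/gallery/page.html', 'images/b.png') gives http://www.example.com/gallery/images/b.png.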

I'm going to expand on this further to extract only images 'of interest' by looking at each image's title or nearby text.
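One way to do that filtering is with an XPath predicate on the alt or title attribute. A minimal sketch - the HTML and the keyword 'cat' are just made-up examples:

```php
<?php
// Sketch: keep only images whose alt or title mentions a keyword,
// using an XPath contains() predicate instead of post-filtering in PHP.
$html = '<html><body>'
      . '<img src="a.jpg" alt="a cat sleeping">'
      . '<img src="b.jpg" alt="a dog">'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// only <img> tags whose alt or title contains 'cat'
$hits = $xpath->query("//img[contains(@alt,'cat') or contains(@title,'cat')]");

$matches = array();
foreach ($hits as $img) {
    $matches[] = $img->getAttribute('src');
}
print_r($matches); // only a.jpg matches
?>
```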

Thank you again kilermedia, and any other advice or suggested extensions will be greatly appreciated!