analysing the contents of another web page from my own

sd3782
Forum Newbie
Posts: 2
Joined: Wed Jul 02, 2008 2:31 pm

analysing the contents of another web page from my own

Post by sd3782 »

Hi

I'm wondering if anyone knows anything about analysing the contents of another web page from my own, using just its URL. For example, someone might give the full URL of a website's search results, and my website could trawl the HTML to find a picture and possibly some text related to the picture.

On Facebook this happens when you paste a URL into a message or a wall post: it shows a picture and some text from the website the URL refers to.

Does anyone know what this technique is called?

I realise that this may be beyond the scope of this forum, but any help would be greatly appreciated, even a pointer to another forum or website.

I'm basically just looking for a simple method, perhaps using the cURL library, or a hint at an algorithm that could be used to scan the HTML of the desired URL.

I'm relatively new to PHP but I'm quite familiar with JavaScript. Thank you in advance for any guidance!

Sam
kilermedia
Forum Newbie
Posts: 7
Joined: Wed Jul 02, 2008 11:00 pm
Location: California, USA

Re: analysing the contents of another web page from my own

Post by kilermedia »

You were very close when you mentioned cURL, which is one of the technologies you'll need.

What you're specifically looking for are the DOMDocument and DOMXPath PHP classes, which you can use together with cURL: cURL fetches the page, DOMDocument parses the HTML, and DOMXPath queries the parsed document.

Here's something which should get you started on the right path:

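<?php
// Fetch the remote page with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'My Fake Browser Name');
curl_setopt($ch, CURLOPT_URL, 'http://www.thesite.com/you/are/reading/from/');
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status >= 400 as a failure
curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // only has an effect when redirects are followed
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
$html = curl_exec($ch);
if ($html === false || $html === '') {
    // curl_exec() returns false on failure; an empty body cannot be parsed either
    $msg = "<br />cURL error number:" . curl_errno($ch);
    $msg .= "<br />cURL error:" . curl_error($ch);
    die($msg);
}

// Parse the HTML. Real-world markup is rarely valid, so collect the parser
// warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//img"); // every <img> element inside <body>

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('src');
    // escape the value, since it comes from a page you do not control
    echo 'Here, have an image: ' . htmlspecialchars($url) . '<br />';
}
?>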
Really, really basic but should get you going. :D
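
If you also want some text to go with the picture, the same DOMXPath approach can pull the page title and the meta description. This is only a rough sketch: `preview_text` and the sample markup are made-up names for illustration, the markup is assumed to be in a string already (as returned by `curl_exec()`), and the match on `name="description"` is case-sensitive, so pages that write it differently will come back empty.

```php
<?php
// Hypothetical helper (not part of the listing above): returns the page
// title and meta description found in an HTML string.
function preview_text($html)
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // The XPath string() function yields '' when the node is missing.
    $title = trim($xpath->evaluate('string(//title)'));
    $description = trim($xpath->evaluate('string(//meta[@name="description"]/@content)'));

    return array('title' => $title, 'description' => $description);
}

// Sample markup standing in for the $html returned by curl_exec().
$sample = '<html><head><title>Example page</title>'
        . '<meta name="description" content="A short summary.">'
        . '</head><body><img src="/a.png"></body></html>';

$preview = preview_text($sample);
echo $preview['title'], "\n";
echo $preview['description'], "\n";
```

Pages without a description tag simply give an empty string, so you may want to fall back to other text on the page in that case.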
sd3782
Forum Newbie
Posts: 2
Joined: Wed Jul 02, 2008 2:31 pm

Re: analysing the contents of another web page from my own

Post by sd3782 »

Hi

Thank you so much for your quick reply! I'm already expanding upon the solution you provided.

Just one example of an improvement for other readers:

 
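     // $weburl is assumed to hold the http(s) URL of the page that was fetched
     // (the same value passed to CURLOPT_URL).
     preg_match( '~^https?://[^/?#]+~i', $weburl, $match );
     $site = $match[0]; // scheme and host, e.g. http://www.thesite.com
     // directory of the page, with a trailing slash
     $dir = preg_match( '~^https?://[^/?#]+(?:/[^?#]*)?/~i', $weburl, $match ) ? $match[0] : $site.'/';

     for($i = 0; $i < $hrefs->length; $i++)
     {
         $href = $hrefs->item($i);
         $url = $href->getAttribute('src');

         if( preg_match( '~^/(?!/)~', $url ) )
         {
           // domain-relative path (begins with a single /)
           $url = $site.$url;
         }
         else if( !preg_match( '~^([a-z][a-z0-9+.-]*:|//)~i', $url ) )
         {
           // relative path (begins with . or with a file or directory name);
           // the browser resolves any ./ and ../ segments
           $url = $dir.$url;
         }
         // otherwise the url is already absolute (http://, https:// or //)

         echo '<img src="'.htmlspecialchars($url).'">';
     }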
 
This makes sure images still display when the src is not a full URL, by resolving domain-relative and relative paths against the address of the page.
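
For anyone who would rather have this as a reusable function, here is a rough sketch that builds the full URL with `parse_url()` and also collapses `./` and `../` segments itself. `resolve_url` is a name I made up, and it assumes the page URL is a well-formed absolute http(s) URL; it drops a trailing slash and ignores things like a `<base>` tag, so treat it as a starting point and not a complete resolver.

```php
<?php
// Hypothetical helper: resolve an image src against the URL of the page
// it was found on. Assumes $pageUrl is a well-formed absolute http(s) URL.
function resolve_url($pageUrl, $src)
{
    // Already absolute (has a scheme such as http: or https:).
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $src)) {
        return $src;
    }

    $page = parse_url($pageUrl);

    // Protocol-relative (//cdn.example.com/a.png): reuse the page's scheme.
    if (substr($src, 0, 2) === '//') {
        return $page['scheme'] . ':' . $src;
    }

    $origin = $page['scheme'] . '://' . $page['host']
            . (isset($page['port']) ? ':' . $page['port'] : '');

    // Set any query string or fragment aside so it is not split on "/".
    $cut = strcspn($src, '?#');
    $suffix = (string) substr($src, $cut);
    $src = (string) substr($src, 0, $cut);

    if (substr($src, 0, 1) === '/') {
        $path = $src; // domain-relative
    } else {
        // Relative to the directory of the page.
        $pagePath = isset($page['path']) ? $page['path'] : '/';
        $path = substr($pagePath, 0, strrpos($pagePath, '/') + 1) . $src;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $out) . $suffix;
}

$page = 'http://www.thesite.com/you/are/reading/from/';
echo resolve_url($page, '/img/a.png'), "\n";
echo resolve_url($page, './b.png'), "\n";
echo resolve_url($page, '../c.png'), "\n";
```

Inside the loop you would then call `resolve_url($weburl, $url)` in place of the three branches.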

I'm going to expand on this further to extract only images 'of interest' by looking at the image's title or nearby text.
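
A first rough sketch of that idea, looking only at the `alt` and `title` attributes of each image and not yet at the surrounding text. `images_of_interest` and the sample markup are names I made up for illustration, and the keyword match is a plain case-insensitive substring search:

```php
<?php
// Hypothetical helper: return the src of every <img> whose alt or title
// attribute contains $keyword (case-insensitive substring match).
function images_of_interest($html, $keyword)
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $matches = array();
    foreach ($xpath->query('//img[@src]') as $img) {
        // getAttribute() returns '' for a missing attribute.
        $text = $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
        if (stripos($text, $keyword) !== false) {
            $matches[] = $img->getAttribute('src');
        }
    }
    return $matches;
}

// Sample markup standing in for a fetched page.
$sample = '<html><body>'
        . '<img src="/logo.png" alt="Site logo">'
        . '<img src="/cat.jpg" alt="A sleeping cat">'
        . '<img src="/spacer.gif">'
        . '<img src="/kitten.jpg" title="Cat and kitten">'
        . '</body></html>';

print_r(images_of_interest($sample, 'cat'));
```

With the sample markup this picks out `/cat.jpg` and `/kitten.jpg` and skips the logo and the spacer.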

Thank you again, kilermedia. Any other advice or suggestions for extending this would be greatly appreciated!