Scraping Search Results with cURL and PHP


Post by StabNSprint »

Hi there, I'm relatively new to PHP and was wondering if you guys could help me out.

I'm trying to write some PHP code that performs a search on Google given certain keywords and returns all of the links on the search result page. Right now, I'm using cURL to query the site and then DOM and XPath to parse the HTML and give me the links. Here is the code:


 
<?php
 
class scraper_google extends scraper_base
{
    public $dom;
    public $hrefs;
    
    public function init($keywords)
    {
        $this->keywords = $keywords;
        
        $this->target_url = 'http://www.google.com/#hl=en&q='
                                .$keywords[0].'&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';
        echo $this->target_url;
        $this->search_engine = 'www.google.com';
        $this->userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    }
    public function parse_results()
    {
        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        curl_setopt($ch, CURLOPT_URL,$this->target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html= curl_exec($ch);
        if ($html === false) // curl_exec() returns false on failure; an empty body is not an error
        {
            echo "<br />cURL error number:" .curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }
 
        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
 
        // grab all the links (<a> elements) on the page
        $xpath = new DOMXPath($dom);
        $this->hrefs = $xpath->evaluate("/html//a");
    }
    public function display_results()
    {
        for ($i = 0; $i < $this->hrefs->length; $i++) 
        {
            $href = $this->hrefs->item($i);
            $url = $href->getAttribute('href');
            echo "<br />Link stored: $url";
        }
    }
 
}
 
?>
 
And this is the script that uses it:

<?php

require_once('__root.inc.php');


$scraper = new scraper_google();
$scraper->keywords[0] = "keyword";
$scraper->init($scraper->keywords);
$scraper->parse_results();
$scraper->display_results();

?>
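
To check the DOM and XPath step separately from the network request, the parsing can be run against a fixed HTML string. This is only a minimal sketch: the sample markup and the `extract_links` helper are hypothetical and not part of the class above, and it uses `libxml_use_internal_errors()` in place of the `@` operator to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body>'
      . '<a href="http://example.com/one">One</a>'
      . '<a href="/two">Two</a>'
      . '<a name="anchor-without-href">Three</a>'
      . '</body></html>';

/**
 * Hypothetical helper: return the href of every <a> element that has one.
 */
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only anchors that actually carry an href attribute.
    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}

$links = extract_links($html);
foreach ($links as $url) {
    echo "Link stored: $url\n";
}
```

If this prints the expected links for a known string but the live request does not, the difference most likely lies in what the server returned rather than in the parsing.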


Feel free to try it out yourself. The problem I'm having is that the request reaches the page, but I can only read the header of the results page (the Google bar at the top, along with the image, video, and blog search links). My guess is that Google loads the search results via AJAX after the page loads. So my question is: is there any way to access and parse the page after the search results have been displayed?
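
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: in the target URL, the search parameters come after a `#`. That part of a URL is the fragment, which is handled on the client side and is not sent to the server in the HTTP request, so the request may only be fetching the home page. The sketch below uses `parse_url()` to show where the parameters end up for two URL shapes. The `/search?hl=en&q=` form is an assumption added for comparison only; nothing here confirms that it returns parseable results.

```php
<?php
$keyword = 'keyword';

// URL shape used in init(): the parameters come after '#'.
$fragmentUrl = 'http://www.google.com/#hl=en&q=' . urlencode($keyword);

// Assumed alternative for comparison: the parameters come after '?'.
$queryUrl = 'http://www.google.com/search?hl=en&q=' . urlencode($keyword);

foreach ([$fragmentUrl, $queryUrl] as $url) {
    // parse_url() splits a URL into scheme, host, path, query, fragment, etc.
    $parts = parse_url($url);
    echo $url, "\n";
    echo '  path:     ', $parts['path'] ?? '(none)', "\n";
    echo '  query:    ', $parts['query'] ?? '(none)', "\n";
    echo '  fragment: ', $parts['fragment'] ?? '(none)', "\n";
}
```

For the first URL the parameters show up under `fragment` and the query is empty; for the second they show up under `query`, which is the part a server actually receives.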

Thank you.