I'm trying to write some PHP code that performs a search on Google given certain keywords and returns all of the links on the search result page. Right now, I'm using cURL to query the site and then DOM and XPath to parse the HTML and give me the links. Here is the code:
<?php
class scraper_google extends scraper_base
{
    public $dom;
    public $hrefs;

    public function init($keywords)
    {
        $this->keywords = $keywords;
        $this->target_url = 'http://www.google.com/#hl=en&q='
            .$keywords[0].'&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';
        echo $this->target_url;
        $this->search_engine = 'www.google.com';
        $this->userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    }

    public function parse_results()
    {
        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        curl_setopt($ch, CURLOPT_URL, $this->target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        if (!$html)
        {
            echo "<br />cURL error number:" . curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }
        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        // grab all the <a> elements on the page
        $xpath = new DOMXPath($dom);
        $this->hrefs = $xpath->evaluate("/html//a");
    }

    public function display_results()
    {
        for ($i = 0; $i < $this->hrefs->length; $i++)
        {
            $href = $this->hrefs->item($i);
            $url = $href->getAttribute('href');
            echo "<br />Link stored: $url";
        }
    }
}
?>
<?php
require_once('__root.inc.php');
$scraper = new scraper_google();
$scraper->keywords[0] = "keyword";
$scraper->init($scraper->keywords);
$scraper->parse_results();
$scraper->display_results();
?>
Feel free to try it out yourself. The problem I'm having is that it reaches the page but can only read the header of the results page (the Google bar at the top, along with the image, video, and blog search links). My guess is that Google loads the search results via AJAX after the page loads. So my question is: is there any way to access and parse the page after the search results are displayed?
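One thing that may be worth ruling out alongside the AJAX guess: in the target URL above, the search parameters sit after a `#`, and the fragment part of a URL is handled by the browser and is not sent in the HTTP request. The sketch below only illustrates that point using PHP's `parse_url()`. The helpers `request_target()` and `build_search_url()` are hypothetical names introduced for illustration, and the `/search?q=` form of the URL is an assumption about where the results are served, not something confirmed here.

```php
<?php
// Hypothetical helper (not part of the scraper above): returns the part of
// a URL that an HTTP client such as cURL would put in the request line.
function request_target($url)
{
    $parts = parse_url($url);
    $target = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }
    // The fragment (everything after '#') is intentionally left out:
    // it stays on the client and is never transmitted to the server.
    return $target;
}

// Hypothetical helper: puts the keyword in the query string instead of the
// fragment. ASSUMPTION: results are served from /search?q=...
function build_search_url($keyword)
{
    return 'http://www.google.com/search?hl=en&q=' . urlencode($keyword);
}

$original = 'http://www.google.com/#hl=en&q=keyword&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';

echo request_target($original), "\n";                        // prints "/"
echo request_target(build_search_url('php scraper')), "\n";  // prints "/search?hl=en&q=php+scraper"
```

If this is what is happening, the request made by `parse_results()` would only ever ask for the home page, which would match seeing just the header links. Passing the keyword through `urlencode()` also keeps spaces and special characters from breaking the URL.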
Thank you.