I am trying to parse the results from a Google search results page.
After analysing the page source, I found that each result URL is located within this markup:
<li class=g><h3 class="r"><a href="URL HERE"
Code:
//Google Search
$ch = curl_init();
$user_agent='Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19';
curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/search?q=apple+pie&num=100&hl=en&lr=&as_qdr=all&prmd=ivn&ei=o8GbTKAFhJyWB6ul-csK&start=0&sa=N');
curl_setopt($ch, CURLOPT_POST, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$source = curl_exec($ch);
//Extract result from Search
preg_match_all('/<li class=g><h3 class="r"><a href="(.*)"/', $source , $result_array, PREG_SET_ORDER);
// Show first 10 results
for ($x = 0; $x < 10; $x++) {
    echo $result_array[$x][1].'<br>';
}
The problem is that it extracts the whole HTML page each time instead of only the URL.
What is wrong with my regular expression?
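My own guess, which I have not confirmed against the live page, is that `(.*)` is greedy: it runs on to the last double quote on the line rather than stopping at the end of the href. Here is a minimal sketch of that suspicion on a made-up sample string. The `example.com` URLs and the `$sample`, `$greedy`, `$fixed` and `$urls` names are only for illustration, and `[^"]*` is just one possible replacement (a lazy `(.*?)` might work too):

```php
<?php
// Made-up sample (assumption): two results on one line, roughly how the
// minified Google source looked to me. Not real Google output.
$sample = '<li class=g><h3 class="r"><a href="http://example.com/one" class=l>One</a></h3>'
        . '<li class=g><h3 class="r"><a href="http://example.com/two" class=l>Two</a></h3>';

// Original pattern: greedy (.*) keeps going to the LAST double quote on the line.
preg_match_all('/<li class=g><h3 class="r"><a href="(.*)"/', $sample, $greedy, PREG_SET_ORDER);

// Candidate fix: ([^"]*) cannot cross a double quote, so it stops where the href ends.
preg_match_all('/<li class=g><h3 class="r"><a href="([^"]*)"/', $sample, $fixed, PREG_SET_ORDER);

// Collect only the captured URLs from the candidate pattern.
$urls = array();
foreach ($fixed as $match) {
    $urls[] = $match[1];
}
print_r($urls);
```

On this sample the greedy pattern gives a single match that swallows both results, while the second pattern gives one match per URL. Whether the real page behaves the same way is something I still need to check.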
Thanks guys!