how to extract data in the url

Posted: Tue Feb 02, 2010 4:03 am
by manojsemwal1
Hi, I have a task where I am given a link and have to write a script that extracts all the data specified from the "All Listings" section of the given URL, including subsequent pages where applicable (i.e. page 2, 3, etc.).
The URL is http://yellowpages.superpages.com/listi ... ch=Find+It.

Re: how to extract data in the url

Posted: Tue Feb 02, 2010 6:38 am
by klevis miho
Use PHP's cURL functions to download the page.

For example:

$url = 'http://yellowpages.superpages.com/listings.jsp?SRC=&STYPE=%20S&PG=L&R=N&L=NY&C=culinary%20schools&N=&T=&S=&search=Find+It';
$ch = curl_init();
$user_agent = 'Mozilla/4.0';
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_TIMEOUT, 120);
//the contents of this url are stored into this variable
$contents = curl_exec($ch);
curl_close($ch);
 
echo $contents; 
 

Re: how to extract data in the url

Posted: Tue Feb 02, 2010 6:40 am
by klevis miho
To cover the subsequent pages, you can store all the page links in an array, loop through that array, and run the same cURL request for each link.
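
A minimal sketch of that loop, assuming the cURL extension is available. The names fetch_page() and fetch_all() are made up for this illustration, and the second argument exists only so a different download function can be swapped in:

```php
<?php
// Fetch one URL with cURL and return the body, or false on failure.
function fetch_page($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}

// Call $fetch for every URL and collect the page bodies, keyed by URL.
// Pages that fail to download (false) are skipped.
function fetch_all(array $urls, $fetch = 'fetch_page')
{
    $pages = array();
    foreach ($urls as $url) {
        $contents = call_user_func($fetch, $url);
        if ($contents !== false) {
            $pages[$url] = $contents;
        }
    }
    return $pages;
}
```

You would call it as $pages = fetch_all($links); where $links is your own array with the URL of page 1, page 2, and so on.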

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 12:19 am
by manojsemwal1
Thanks, klevis miho.
I tried to run it, but it is not displaying any data. Please explain it to me; I am new to PHP, so I need some help.

Thanks.

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 3:52 am
by klevis miho
Try copying and pasting the code I gave you, and run it.
I just tried it, and it displays the page from the link you posted here.

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 4:10 am
by manojsemwal1
Sorry to disturb you.

I copied your code and pasted it into a file, like this:
<?php
error_reporting(E_ALL);
ini_set('display_errors', true);
ini_set('html_errors', false);


$url = 'http://yellowpages.superpages.com/listi ... ch=Find+It';
$ch = curl_init();
$user_agent = 'Mozilla/4.0';
echo "maoj";

curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_TIMEOUT, 120);

//the contents of this url are stored into this variable
$contents = curl_exec($ch);
curl_close($ch);

echo $contents;
echo "manoj";
?>

But it gives this error:
Fatal error: Call to undefined function curl_init() in D:\Apache2.2\htdocs\Test\culnary.php on line 9
Please help me run this program.

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 4:17 am
by klevis miho
I assume that you do not have the cURL extension installed or enabled in PHP.

You can use fsockopen instead, but I don't know how to use it, sorry :( .

Or you can ask the hosting provider to install cURL.
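
A small sketch for checking what the server supports before running the script. The function name pick_fetch_method() is made up here. On a Windows setup like the one in the error message, the extension is usually enabled by uncommenting extension=php_curl.dll in php.ini and restarting Apache, but that depends on the installation. If cURL stays unavailable, file_get_contents($url) can often download the page instead, provided allow_url_fopen is on:

```php
<?php
// Report which download method this PHP installation can use.
function pick_fetch_method()
{
    // curl_init() only exists when the cURL extension is loaded.
    if (function_exists('curl_init')) {
        return 'curl';
    }
    // file_get_contents() can open URLs only if allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        return 'file_get_contents';
    }
    return 'none';
}

echo 'Available method: ' . pick_fetch_method() . "\n";
```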

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 7:53 am
by manojsemwal1
Thanks for your timely reply.
I tried it on another server and it is working fine.
But how can I store this data in a text file,
e.g. name, web link, phone number, etc.?

Re: how to extract data in the url

Posted: Wed Feb 03, 2010 8:03 am
by klevis miho
The page is now in the variable $contents, so you can run a regular expression on it like this:

preg_match_all('#STRING1(.+?)STRING2#s', $contents, $matches);
$matches = $matches[1];

print_r($matches);

Look at the HTML source in $contents and find the markup around the data you want, for example:
<div class="test"> WHAT YOU NEED </div>

Then, in the preg_match_all call, replace STRING1 and STRING2 with <div class="test"> and </div>:
preg_match_all('#<div class="test">(.+?)</div>#s', $contents, $matches);

You will then get all the data that sits between STRING1 and STRING2.
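
To make the pattern concrete, here is a small self-contained run. The $contents string is made-up sample markup, not the real page:

```php
<?php
// Sample markup standing in for the downloaded page (made up for illustration).
$contents = '<div class="test">First item</div>' . "\n"
          . '<p>ignored</p>' . "\n"
          . '<div class="test">Second' . "\n" . 'item</div>';

// (.+?) is non-greedy, so each match stops at the nearest </div>.
// The s modifier lets . match newlines, so multi-line blocks are captured too.
preg_match_all('#<div class="test">(.+?)</div>#s', $contents, $matches);

// $matches[0] holds the full matches, $matches[1] only the captured text.
$items = $matches[1];
print_r($items);
```

The # characters are just the pattern delimiters; they are convenient here because the HTML itself contains / characters.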

I hope I was clear lol

Re: how to extract data in the url

Posted: Fri Feb 05, 2010 12:07 am
by manojsemwal1
Hi klevis miho,
I tried to write the text file, but it writes the full source of the page into the file,
and I only need particular words to be written.
When I use print_r($matches);
it shows only array().
Please give me some useful explanation.

echo $contents;
preg_match_all('Culnary', $contents, $matches);
preg_match_all('/(\d+:\d+)\s*(Culnary)/', $contents, $matches);
$matches = $matches[1];

print_r($matches);
$myFile = "textFile.txt";
$fh = fopen($myFile, 'w') or die("can't open file");
//$stringData = "Bobby Bopper\n";
fwrite($fh, $contents);
fclose($fh);
?>

thanks.....

Re: how to extract data in the url

Posted: Fri Feb 05, 2010 3:32 am
by klevis miho
It shows only array() because the preg_match_all didn't find anything.

Try this:

preg_match_all('#<h3 class="nmwehtszdf nmwehtclrdf ">(.+?)</h3>#s', $contents, $matches);

print_r($matches);

This will get you (from the link you posted here):

1.Art Institute Online
2.Culinary Schools
3.Culinary Schools
4.The Art Institutes

That is because this preg_match_all captures everything between <h3 class="nmwehtszdf nmwehtclrdf "> and </h3>.

Try it out exactly as I posted it here.
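
As for why the earlier attempt printed an empty array, here is a sketch on a made-up one-line sample (not the real page) that reproduces both problems:

```php
<?php
$contents = '<h3>Culinary Schools</h3>'; // made-up sample

// Valid pattern, but the text has no "digits:digits Culnary" sequence,
// so there are zero matches and every group array is empty.
$count = preg_match_all('/(\d+:\d+)\s*(Culnary)/', $contents, $matches);
var_dump($count);      // number of matches found
print_r($matches[1]);  // empty array

// A pattern without delimiters is invalid: preg_match_all() returns false
// and raises a warning (suppressed here with @).
$result = @preg_match_all('Culnary', $contents, $ignored);
var_dump($result);
```

Note also that the page spells the word "Culinary", so a pattern containing "Culnary" would not match it in any case.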

Re: how to extract data in the url

Posted: Mon Feb 08, 2010 1:47 am
by manojsemwal1
I did the same and it prints, but I am unable to write it to the text file:
preg_match_all('#<h3 class="nmwehtszdf nmwehtclrdf ">(.+?)</h3>#s', $contents, $matches);

print_r($matches);
$myFile = "textFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");
fwrite($fh, $matches);
fclose($fh);

The result in the text file is:

Array

If I use fwrite($fh, $contents);
it writes the whole page source into the text file.

So how can we write only the relevant data?

Re: how to extract data in the url

Posted: Mon Feb 08, 2010 3:13 am
by klevis miho
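OK. fwrite() expects a string, but $matches is an array, which is why the file only contains the word "Array". After the

preg_match_all('#<h3 class="nmwehtszdf nmwehtclrdf ">(.+?)</h3>#s', $contents, $matches);

insert
$matches = $matches[1];
and then write the values one by one: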

foreach($matches as $values) {
$myFile = "textFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");
fwrite($fh, $values);
fclose($fh);
}
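
The loop above reopens the file for every value and writes the values back to back. A possible variation, with the made-up helper name append_lines(), joins the values with line breaks and appends them in one call:

```php
<?php
// Append each value to $file on its own line.
// Returns the number of bytes written, or false on failure.
function append_lines($file, array $lines)
{
    if (count($lines) === 0) {
        return 0; // nothing to write
    }
    // FILE_APPEND behaves like fopen() mode 'a': existing content is kept.
    return file_put_contents($file, implode(PHP_EOL, $lines) . PHP_EOL, FILE_APPEND);
}
```

In the script it would be called as append_lines('textFile.txt', $matches); right after $matches = $matches[1];.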

Re: how to extract data in the url

Posted: Mon Feb 08, 2010 7:31 am
by manojsemwal1
Thanks, sir.
It is writing now, but it also writes unnecessary things like:
<a href="http://www.superpages.com/bp/Massapequa ... 761070.htm" onClick='setLSBCookie12();clickTrackLRAction("busName",";2157152125","FF","3","15","Advertised listing","","yp_listings"); this.href = "http://clicks.superpages.com/ct/clickTh ... ofile&LOC=" + "http://www.superpages.com/bp/Massapequa ... '">Kitchen Time's Party Place</a>
<a href="http://www.superpages.com/bp/Denver-CO/ ... 867136.htm" onClick='setLSBCookie13();clickTrackLRAction("busName",";0002761070","FF","4","15","Advertised listing","","yp_listings"); this.href = "http://clicks.superpages.com/ct/clickTh ... ofile&LOC=" + "http://www.superpages.com/bp/Denver-CO/ ... 12"'">Cook Street School of Fine Cooking</a>
<a href="http://www.superpages.com/bp/New-York-N ... 395260.htm" onClick='setLSBCookie14();clickTrackLRAction("busName",";0016867136","FF","5","15","Advertised listing","","yp_listings"); this.href = "http://clicks.superpages.com/ct/clickTh ... ofile&LOC=" + "http://www.superpages.com/bp/New-York-N ... 612"'">New York Restaurant School</a>

How can I avoid this?
I also want to write the web address and phone number. Do we have to use

preg_match_all('#<h3 class="nmwehtszdf nmwehtclrdf ">(.+?)</h3>#s', $contents, $matches);

again for each field, like address and phone number, or can we reuse the same one?

Re: how to extract data in the url

Posted: Mon Feb 08, 2010 9:12 am
by klevis miho
OK, after the $matches = $matches[1]; and before the foreach, add this:

echo '<pre>';
print_r($matches);
echo '</pre>';

$matches is an array that holds every string found between <h3 class="nmwehtszdf nmwehtclrdf "> and </h3>.

A possible fix to avoid writing all this extra markup is to run each value through strip_tags() inside the foreach loop.
Modify the foreach like this:

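foreach($matches as $val) {
//remove the HTML tags and keep only the visible text
$values = strip_tags($val);
$myFile = "textFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");
//write the cleaned value, not the original $val
fwrite($fh, $values);
fclose($fh);
}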

This should fix it.
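
Here is a small sketch of what strip_tags() does to one captured value, plus one possible way to keep the web address. The $val string is a simplified, made-up listing; the real markup has a much longer onClick attribute with nested quotes, so the href pattern is an assumption that may need adjusting:

```php
<?php
// Made-up listing in the same shape as the captured <h3> contents.
$val = '<a href="http://www.example.com/bp/listing.htm" onclick="track();">Kitchen Time\'s Party Place</a>';

// strip_tags() drops the <a ...> markup and keeps only the visible text.
$name = trim(strip_tags($val));

// The link target can be captured from the href attribute before stripping.
$link = preg_match('#<a\s[^>]*href="([^"]+)"#i', $val, $m) ? $m[1] : '';

echo $name . ' | ' . $link . "\n";
```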