PHP Crawler -> Fetching addresses
Posted: Wed Jul 28, 2004 3:42 pm
by visionmaster
Hello,
I have to program a crawler that spiders websites to fetch, e.g., addresses. I found a script at
http://phpcrawl.cuab.de/
which can be used as a basis.
But my question is how to actually parse each crawled webpage for an address. There is no overall rule, I'll call it a meta-rule, for finding addresses. I would say that's certain, or isn't it?
O.k., you can search for pages titled "contact", for example. There is a high probability of finding an address there. Great, but HOW do I find an address there?
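There is indeed no universal format for an address, but structured fragments of one can be matched with regular expressions. A minimal sketch, assuming the "addresses" include things like email addresses or a five-digit postal code followed by a city name; both patterns are rough illustrative heuristics, not a complete solution:

```php
<?php
// Match email addresses (simplified pattern, good enough for a sketch).
function find_emails($html) {
    preg_match_all('/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i', $html, $m);
    return $m[0];
}

// Match a five-digit postal code followed by a capitalized word
// (e.g. German-style "50667 Koeln") -- a rough heuristic only.
function find_postal_lines($html) {
    preg_match_all('/\b\d{5}\s+[A-Z][a-z]+/', $html, $m);
    return $m[0];
}

$page = 'Contact: info@example.com, Musterstrasse 1, 50667 Koeln';
print_r(find_emails($page));       // ["info@example.com"]
print_r(find_postal_lines($page)); // ["50667 Koeln"]
```

Patterns like these will produce false positives and misses; searching only pages whose title or URL contains "contact" narrows things down, as suggested above.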
Thanks!
Posted: Wed Jul 28, 2004 3:47 pm
by ol4pr0
I do not think you will get much help on this one, not because it's too hard but because of the app itself (although you might be very lucky).
Posted: Wed Jul 28, 2004 4:26 pm
by visionmaster
ol4pr0 wrote:I do not think you will get much help on this one, not because it's too hard but because of the app itself (although you might be very lucky).
You mean because the application PHPCrawl is unknown?
It was just an example; I just wanted some general info.
Not too hard, are you really sure, or am I just overcomplicating it?
Posted: Wed Jul 28, 2004 4:42 pm
by brandan
probably because you want to use it for spamming purposes!
Posted: Wed Jul 28, 2004 7:27 pm
by kettle_drum
You just use a regex and search for things like href="" and you will find most things.
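That suggestion can be sketched in a few lines: pull the value out of every href="..." attribute with one regex. This handles double or single quotes; a real parser would also cope with unquoted values and whitespace around the equals sign.

```php
<?php
// Extract the target of every href attribute from a chunk of HTML.
function extract_hrefs($html) {
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
    return $m[1]; // first capture group: just the URL, quotes stripped
}

$html = '<a href="http://example.com/">home</a> <a href=\'/contact.php\'>contact</a>';
print_r(extract_hrefs($html)); // ["http://example.com/", "/contact.php"]
```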
Posted: Wed Jul 28, 2004 9:02 pm
by evilmonkey
brandan wrote:probably because you want to use it for spamming purposes!
Why would you say that? Believe it or not, crawlers and spiders can be used for useful and legitimate purposes...
Posted: Wed Jul 28, 2004 11:21 pm
by ol4pr0
They could be, but mostly aren't...
Posted: Thu Jul 29, 2004 3:09 am
by visionmaster
kettle_drum wrote:You just use a regex and search for things like href="" and you will find most things.
O.k., I know one can do powerful things with regexps. But crawling means not just parsing the first page, but following links into the depths of a site. That's the part I don't understand: parse the first page, search for links, open up those links, and so on...
To make it clear, the spider will NOT be used for spamming purposes!
Posted: Thu Jul 29, 2004 3:11 am
by John Cartwright
Maybe check for links and save the path of each link until the page is finished processing, then follow the first saved link, and so on, until there are no more links; then check whether any saved links were skipped and start over from a saved link?
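That save-links-then-follow idea is essentially a queue: finish the current page, append every link it contains, then take the oldest saved link next (breadth-first order). A minimal sketch; here an in-memory array of pages stands in for the web so the example is self-contained, and link extraction is deliberately simplified:

```php
<?php
// Breadth-first crawl sketch. $pages maps URL => HTML; a real crawler
// would fetch with file_get_contents() or cURL instead of an array lookup.
function crawl($start, $pages) {
    $queue   = array($start);
    $visited = array();
    while (count($queue) > 0) {
        $url = array_shift($queue);          // take the oldest saved link
        if (isset($visited[$url]) || !isset($pages[$url])) {
            continue;                        // already processed, or unknown page
        }
        $visited[$url] = true;
        preg_match_all('/href="([^"]+)"/i', $pages[$url], $m);
        foreach ($m[1] as $link) {
            $queue[] = $link;                // save for later, keep parsing
        }
    }
    return array_keys($visited);             // order the pages were processed in
}

$pages = array(
    '/'  => '<a href="/a">a</a> <a href="/b">b</a>',
    '/a' => '<a href="/b">b</a>',
    '/b' => '<a href="/">home</a>',
);
print_r(crawl('/', $pages)); // ["/", "/a", "/b"]
```

The $visited list is what keeps pages that link back to each other from being crawled forever.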
Posted: Thu Jul 29, 2004 3:13 am
by kettle_drum
Yes, but you create one function to parse a page, and then you keep calling that function again with the results the first page gives you. So, one function like:
function parse($address) {
    // Fetch the page; bail out if the request fails.
    $page = file_get_contents($address);
    if ($page === false) {
        return;
    }
    // Match absolute URLs (http/https/ftp or bare www...).
    $urlpattern = '/((http|https|ftp):\/\/|www)'
                . '[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*'
                . '[a-z0-9\/]{1}/si';
    preg_match_all($urlpattern, $page, $matches);
    foreach ($matches[0] as $q) {
        // Pull the host name out of each matched URL.
        if (preg_match("/^(http:\/\/)?([^\/]+)/i", $q, $mat)) {
            $host = $mat[2];
            echo "Link: $q for Host: $host\n";
        }
    }
}
And instead of printing the link or host, you then search those addresses and/or save them to a database, depending on what you're doing.
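One caution about "keep calling the function again with the results": on a site whose pages link back to each other, that recursion never terminates. A hedged sketch of the missing pieces, a visited list and a depth limit; the $fetch callback here is a stand-in so the example runs without a network, where a real version would just use file_get_contents($url):

```php
<?php
// Recursive crawl with loop protection. $fetch is any callable that maps
// a URL to its HTML (a stand-in for file_get_contents in this sketch).
function crawl_from($url, $fetch, $depth, &$visited) {
    if ($depth <= 0 || isset($visited[$url])) {
        return;                                 // too deep, or seen already
    }
    $visited[$url] = true;
    $page = $fetch($url);
    preg_match_all('/href="([^"]+)"/i', $page, $m);
    foreach ($m[1] as $link) {
        crawl_from($link, $fetch, $depth - 1, $visited);
    }
}

// Usage with a fake fetcher standing in for the network:
$fake = function ($url) {
    $site = array('/' => '<a href="/contact">c</a>', '/contact' => '<a href="/">h</a>');
    return isset($site[$url]) ? $site[$url] : '';
};
$visited = array();
crawl_from('/', $fake, 3, $visited);
print_r(array_keys($visited)); // ["/", "/contact"]
```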