PHP Crawler->Fetching addresses
Moderator: General Moderators
-
visionmaster
- Forum Contributor
- Posts: 139
- Joined: Wed Jul 14, 2004 4:06 am
PHP Crawler->Fetching addresses
Hello,
I have to program a crawler that spiders websites to fetch, e.g., addresses. I found a script at
http://phpcrawl.cuab.de/
which can be used as a basis.
But my question is how to actually parse each crawled webpage for an address. There is no universal rule, I'll call it a meta-rule, for fetching the addresses. I would say that is certain, or not?
O.k., you could search for pages titled "contact", for example. There is a high probability of finding an address there. Great, but HOW do I find an address there?
Thanks!
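Assuming "addresses" here means e-mail addresses (the thread never says which kind), a minimal sketch of both ideas from the post above — a regex for pulling addresses out of a page, and the "contact page" heuristic — might look like this. `extract_emails` and `looks_like_contact_page` are illustrative names, not part of PHPCrawl:

```php
<?php
// Minimal sketch, NOT part of PHPCrawl. Assumes "addresses" means
// e-mail addresses; postal addresses have no reliable universal
// pattern and would need per-country heuristics instead.
function extract_emails($html) {
    // local part, "@", domain with at least one dot
    $pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i';
    preg_match_all($pattern, $html, $matches);
    // deduplicate while keeping the order of first appearance
    return array_values(array_unique($matches[0]));
}

// The "contact page" heuristic: check the URL and the <title>
// for likely keywords before bothering to parse the page at all.
function looks_like_contact_page($url, $html) {
    return preg_match('/contact|kontakt|impressum/i', $url) === 1
        || preg_match('/<title>[^<]*(contact|kontakt|impressum)/i', $html) === 1;
}
```

For example, `extract_emails('<p>Mail sales@example.org</p>')` yields an array containing `sales@example.org`.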
-
visionmaster
- Forum Contributor
- Posts: 139
- Joined: Wed Jul 14, 2004 4:06 am
ol4pr0 wrote: I do not think you will get much help on this one, not because it's too hard but because of the app itself (although you might be very lucky).
You mean because the application PHPCrawl is unknown?
It was just an example; I just wanted some general info.
Not too hard, are you really sure, or am I just overcomplicating it?
-
kettle_drum
- DevNet Resident
- Posts: 1150
- Joined: Sun Jul 20, 2003 9:25 pm
- Location: West Yorkshire, England
- evilmonkey
- Forum Regular
- Posts: 823
- Joined: Sun Oct 06, 2002 1:24 pm
- Location: Toronto, Canada
-
visionmaster
- Forum Contributor
- Posts: 139
- Joined: Wed Jul 14, 2004 4:06 am
kettle_drum wrote: You just use regex and search for other things like href="" and you will find most things.
O.k., I know one can do powerful things with regexes. But crawling means not just parsing the first page, but following links into the depths. That's the part I don't understand: parse the first page, search for links, open up the links, and so on...
To make it clear, the spider will NOT be used for spamming purposes!
- John Cartwright
- Site Admin
- Posts: 11470
- Joined: Tue Dec 23, 2003 2:10 am
- Location: Toronto
- Contact:
-
kettle_drum
- DevNet Resident
- Posts: 1150
- Joined: Sun Jul 20, 2003 9:25 pm
- Location: West Yorkshire, England
Yes, but you create one function to parse a page, and then you keep calling that function again with the results the first page gives you. So, one function like this:
And instead of printing the link or host, you then search those addresses and/or save them to a database, depending on what you're doing.
Code:
function parse($address) {
    $page = file_get_contents($address);
    // match full URLs (http/https/ftp schemes or bare www hosts)
    $urlpattern = '/((http|https|ftp):\/\/|www)[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}/si';
    preg_match_all($urlpattern, $page, $matches);
    foreach ($matches[0] as $q) {
        // strip an optional scheme prefix to isolate the host part
        preg_match("/^((http|https|ftp):\/\/)?([^\/]+)/i", $q, $mat);
        $host = $mat[3];
        echo "Link: $q for Host: $host\n";
    }
}
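The function above can then be driven recursively, as described: collect the links a page yields and call the same routine on each of them. Below is a hedged sketch of that loop under two added assumptions — a depth limit and a visited set so the crawl terminates and never fetches the same URL twice. `extract_links` and `crawl` are illustrative names, not from the thread:

```php
<?php
// Sketch of the recursive driver described above, with two added
// assumptions: a depth limit and a $visited set for termination.
function extract_links($html) {
    // simplified URL pattern: scheme, host, optional path
    $pattern = '/(?:https?|ftp):\/\/[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]/i';
    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[0]));
}

function crawl($url, $maxDepth, array &$visited) {
    if ($maxDepth < 0 || isset($visited[$url])) {
        return;                                // stop: too deep or already seen
    }
    $visited[$url] = true;
    $html = @file_get_contents($url);          // @: ignore dead links quietly
    if ($html === false) {
        return;
    }
    // here you would scan $html for addresses and save them to a database
    foreach (extract_links($html) as $link) {
        crawl($link, $maxDepth - 1, $visited); // recurse on the results
    }
}
```

A caller would start it with something like `$seen = []; crawl('http://example.com/', 2, $seen);` — the depth limit is what keeps "open up the links and so on" from running forever.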