PHP Crawler->Fetching addresses

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

visionmaster
Forum Contributor
Posts: 139
Joined: Wed Jul 14, 2004 4:06 am

PHP Crawler->Fetching addresses

Post by visionmaster »

Hello,

I have to program a crawler that spiders websites to fetch, e.g., addresses. I found a script under

http://phpcrawl.cuab.de/

which can be used as a basis.

But my question is how to actually parse each crawled webpage for an address. There is no overall rule (I'll call it a meta-rule) for fetching the addresses. I would say that is certain, or is it not?

O.k., you can, for example, search pages that are titled "contact". There is a high probability of finding an address there. Great, but HOW do I find an address there?

Thanks!
ol4pr0
Forum Regular
Posts: 926
Joined: Thu Jan 08, 2004 11:22 am
Location: ecuador

Post by ol4pr0 »

I do not think you will get much help on this one, not because it's too hard but because of the app itself (although you might be very lucky).
visionmaster
Forum Contributor
Posts: 139
Joined: Wed Jul 14, 2004 4:06 am

Post by visionmaster »

ol4pr0 wrote:I do not think you will get much help on this one, not because it's too hard but because of the app itself (although you might be very lucky).
You mean because the application PHPCrawl is unknown?

It was just an example; I just wanted some general information.

Not too hard; are you really sure, or am I just overcomplicating it?
brandan
Forum Commoner
Posts: 37
Joined: Sat Jul 24, 2004 6:39 pm
Location: fort smith, ar

Post by brandan »

Probably because you want to use it for spamming purposes!
kettle_drum
DevNet Resident
Posts: 1150
Joined: Sun Jul 20, 2003 9:25 pm
Location: West Yorkshire, England

Post by kettle_drum »

You just use regex to search for things like href="" and you will find most things.
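As a minimal sketch of that regex approach (the sample HTML and the exact pattern here are just illustrative, not a robust HTML parser):

```php
<?php
// Extract href="" targets from a chunk of HTML with a regex.
// The HTML below is made up purely for illustration.
$html = '<a href="http://example.com/contact">Contact</a> '
      . '<a href="about.html">About</a>';

// Capture whatever sits between href=" and the closing quote.
preg_match_all('/href="([^"]*)"/i', $html, $matches);

// $matches[1] now holds the link targets; print them out.
print_r($matches[1]);
```

Note this misses unquoted or single-quoted attributes; for messy real-world pages a proper HTML parser is more reliable than a regex.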
evilmonkey
Forum Regular
Posts: 823
Joined: Sun Oct 06, 2002 1:24 pm
Location: Toronto, Canada

Post by evilmonkey »

brandan wrote:Probably because you want to use it for spamming purposes!
Why would you say that? Believe it or not, crawlers and spiders can be used for useful and legitimate purposes...
ol4pr0
Forum Regular
Posts: 926
Joined: Thu Jan 08, 2004 11:22 am
Location: ecuador

Post by ol4pr0 »

They could be, but mostly aren't...
visionmaster
Forum Contributor
Posts: 139
Joined: Wed Jul 14, 2004 4:06 am

Post by visionmaster »

kettle_drum wrote:You just use regex to search for things like href="" and you will find most things.
O.k., I know one can do powerful things with regexes. But crawling means not just parsing the first page, but also following the links it contains. That's the thing I don't understand: parse the first page, search for links, open up those links, and so on...


To make it clear, the spider will NOT be used for spamming purposes!
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Maybe check for links and save the path of each link until the page is finished processing, then follow the first link, and so on, until there are no more links; then check whether there are any saved links you skipped, and start over from a saved link?
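That save-the-links-then-follow-them idea is essentially a breadth-first crawl with a queue. A minimal sketch, where extract_links() is a hypothetical helper (stubbed out here with fake data so the sketch is self-contained; in a real crawler it would fetch the page and regex out its links):

```php
<?php
// Hypothetical helper: returns the URLs found on a page.
// Stubbed with a fake three-page "web" for illustration.
function extract_links($url) {
    $fake_web = array(
        'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
        'http://example.com/a' => array('http://example.com/b'),
        'http://example.com/b' => array(),
    );
    return isset($fake_web[$url]) ? $fake_web[$url] : array();
}

// Breadth-first crawl: finish the current page, saving every
// discovered link in a queue, then follow the saved links in order.
function crawl($start, $max_pages = 100) {
    $queue   = array($start);   // links saved for later
    $visited = array();         // pages already processed

    while (!empty($queue) && count($visited) < $max_pages) {
        $url = array_shift($queue);            // oldest saved link first
        if (isset($visited[$url])) continue;   // skip pages already done
        $visited[$url] = true;                 // process the page here

        foreach (extract_links($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;              // save for later
            }
        }
    }
    return array_keys($visited);               // every page reached
}

print_r(crawl('http://example.com/'));
```

The $visited array is what keeps the crawler from looping forever when pages link back to each other, and $max_pages puts a hard cap on the crawl.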
kettle_drum
DevNet Resident
Posts: 1150
Joined: Sun Jul 20, 2003 9:25 pm
Location: West Yorkshire, England

Post by kettle_drum »

Yes, but you create one function to parse a page, and then you keep calling that function with the results the first page gives you. So, one function like:

Code: Select all

function parse($address) {
    // Fetch the raw HTML of the page
    $page = file_get_contents($address);
    if ($page === false) {
        return; // could not fetch the page
    }

    // Match absolute URLs (http/https/ftp) and bare www. addresses
    $urlpattern = '/((http|https|ftp):\/\/|www)[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]/si';

    preg_match_all($urlpattern, $page, $matches);
    foreach ($matches[0] as $q) {
        // Pull the host part out of each matched URL
        preg_match('/^(http:\/\/)?([^\/]+)/i', $q, $mat);
        $host = $mat[2];
        echo "Link: $q for Host: $host\n";
    }
}
And instead of printing the link or host, you then search those addresses and/or save them to a database, depending on what you're doing.
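One way to turn that into a recursive crawl: have the parsing function return its links instead of echoing them, and wrap it with a depth limit and a visited list. In this sketch, find_links() is a hypothetical stand-in for such a modified parse() (stubbed with fake data so it runs on its own):

```php
<?php
// Hypothetical stand-in for a parse() that RETURNS its links
// instead of echoing them; stubbed with a fake two-page site.
function find_links($address) {
    $pages = array(
        'http://example.com/'        => array('http://example.com/contact'),
        'http://example.com/contact' => array('http://example.com/'),
    );
    return isset($pages[$address]) ? $pages[$address] : array();
}

// Recursive crawl with a depth limit and a shared visited list.
function crawl_from($address, $depth, &$visited) {
    if ($depth < 0 || isset($visited[$address])) {
        return; // too deep, or this page was already processed
    }
    $visited[$address] = true; // process/save the page here (e.g. to a database)

    foreach (find_links($address) as $link) {
        crawl_from($link, $depth - 1, $visited);
    }
}

$visited = array();
crawl_from('http://example.com/', 2, $visited);
print_r(array_keys($visited));
```

The depth parameter controls how many link-hops deep the spider goes, and the visited list stops it from bouncing between pages that link to each other, as the two stub pages here do.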