spider idea


itsmani1
Forum Regular
Posts: 791
Joined: Mon Sep 29, 2003 2:26 am
Location: Islamabad Pakistan

Post by itsmani1 »

Hi

I want to write a spider that goes out and collects information from sites, the way Google's spider does. Any ideas?


Thanks
mrkite
Forum Contributor
Posts: 104
Joined: Tue Sep 11, 2007 4:19 am

Post by mrkite »

Step 1. Get a lot of disk space.
Step 2. Write the spider :)


Seriously though, the first step would be to come up with a way of indexing the content of a single web page.
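
To give a rough idea of what indexing one page could look like, here is a minimal sketch. The function names index_page() and add_to_index() are made up for this example, and the tokenizer is deliberately naive (ASCII letters and digits only, no stemming, no stop words), so treat it as a starting point and not as how a real search engine does it.

```php
<?php
// Hypothetical helper: turn one page's HTML into a term => count map.
function index_page(string $html): array
{
    // Drop script and style blocks; strip_tags() alone would keep their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Add a space before each tag so words in adjacent elements don't fuse together.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = strtolower($text);
    // Split on anything that is not an ASCII letter or digit (an assumption of this sketch).
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Hypothetical helper: merge one page into an inverted index of the form
// term => [url => number of occurrences on that page].
function add_to_index(array &$index, string $url, string $html): void
{
    foreach (index_page($html) as $term => $count) {
        $index[$term][$url] = $count;
    }
}
```

Looking up a word is then just a matter of reading $index['spider'] to get the URLs that contain it. In practice you would keep this in a database instead of a PHP array, which is where the disk space comes in.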

Following every link on a page gets difficult too, because there are plenty of endless loops to run into. A good example is the Recent Changes page of MediaWiki.

http://en.wikipedia.org/wiki/Special:Recentchanges

Even Googlebot will get caught in an endless loop there. There are too many links, with unique URLs that return basically the same content over and over again. You can't even md5sum the page to spot the duplicates, because the links inside the page change depending on the URL you requested.
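
One partial workaround is to hash only the visible text and ignore the markup, so that differing href values no longer change the checksum. The sketch below assumes that; content_fingerprint() and is_duplicate() are names invented for this example. It only catches pages whose text is identical, so a page whose text also changes between requests (new timestamps, new entries) would still get through.

```php
<?php
// Hypothetical helper: md5 of the visible text only, so tag attributes
// such as href do not affect the result.
function content_fingerprint(string $html): string
{
    // Remove script and style blocks, which strip_tags() would otherwise leave in.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Add a space before each tag so neighbouring words stay separate.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Collapse whitespace so formatting differences don't change the hash.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    return md5(strtolower($text));
}

// Hypothetical helper: returns true if a page with the same text was seen before,
// and records the fingerprint otherwise.
function is_duplicate(array &$seen, string $html): bool
{
    $hash = content_fingerprint($html);
    if (isset($seen[$hash])) {
        return true;
    }
    $seen[$hash] = true;
    return false;
}
```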

So you'll want to come up with a way to handle that. It's a pretty big undertaking.
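
As one possible way to handle it, here is a sketch of a crawl loop that cannot run forever: it normalizes URLs, remembers what it has visited, and stops at a depth limit and a per-host page cap. Everything here is an assumption for illustration: normalize_url() and crawl() are made-up names, the limits are arbitrary, and $fetchLinks stands in for whatever code downloads a page and returns the absolute URLs found on it.

```php
<?php
// Hypothetical helper: reduce a URL to a canonical form so trivial variants compare equal.
// Returns null for anything that is not an absolute http(s) URL.
function normalize_url(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || empty($p['scheme']) || empty($p['host'])) {
        return null;
    }
    $scheme = strtolower($p['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }
    $host = strtolower($p['host']);
    $port = isset($p['port']) ? ':' . $p['port'] : '';
    $path = $p['path'] ?? '/';
    $query = '';
    if (isset($p['query'])) {
        // Sort the parameters so ?a=1&b=2 and ?b=2&a=1 are treated as the same URL.
        parse_str($p['query'], $params);
        ksort($params);
        $query = $params ? '?' . http_build_query($params) : '';
    }
    // The #fragment is dropped on purpose; it never reaches the server.
    return $scheme . '://' . $host . $port . $path . $query;
}

// Breadth-first crawl with a visited set, a depth limit and a per-host page cap.
// $fetchLinks is a placeholder callable: given a URL, it returns the absolute links on that page.
function crawl(string $start, callable $fetchLinks, int $maxDepth = 3, int $maxPerHost = 100): array
{
    $visited = [];
    $perHost = [];
    $queue = new SplQueue();
    $first = normalize_url($start);
    if ($first !== null) {
        $queue->enqueue([$first, 0]);
    }
    while (!$queue->isEmpty()) {
        [$url, $depth] = $queue->dequeue();
        if (isset($visited[$url])) {
            continue;
        }
        $host = parse_url($url, PHP_URL_HOST);
        if (($perHost[$host] ?? 0) >= $maxPerHost) {
            continue; // this host has used up its budget
        }
        $visited[$url] = true;
        $perHost[$host] = ($perHost[$host] ?? 0) + 1;
        if ($depth >= $maxDepth) {
            continue; // record the page, but don't follow its links
        }
        foreach ($fetchLinks($url) as $link) {
            $next = normalize_url($link);
            if ($next !== null && !isset($visited[$next])) {
                $queue->enqueue([$next, $depth + 1]);
            }
        }
    }
    return array_keys($visited);
}
```

The per-host cap is the blunt part that actually saves you on something like Recent Changes: even if every page hands back brand new URLs, the spider gives up on that site after $maxPerHost pages. A real spider would also want to respect robots.txt and throttle its requests.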