
spider idea

Posted: Tue Oct 09, 2007 4:18 pm
by itsmani1
Hi

I want to make a spider that can go out and gather information from sites, like Google's spider does. Any ideas?


thanks

Posted: Tue Oct 09, 2007 9:07 pm
by mrkite
Step 1. Get a lot of disk space
Step 2. Write spider :)


Seriously though, the first step would be to come up with a way of indexing the content of a single web page.
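
To make the single-page indexing step concrete, here is a minimal sketch in Python using only the standard library. It strips the markup, splits the visible text into words, and records how often each word occurs on the page. The names TextExtractor and index_page, and the dict-of-dicts index layout, are made up for illustration; a real index would also need stemming, stop words, and on-disk storage.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def index_page(url, html, index):
    """Add one page to a word -> {url: count} mapping (hypothetical layout)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    for word, count in Counter(words).items():
        index.setdefault(word, {})[url] = count
    return index
```

Looking up index["spider"] then gives the pages containing that word along with a count per page, which is roughly the smallest thing you could call a search index.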

Spidering through all the links of a webpage gets difficult too, because there are plenty of endless loops to run into. A good example is the Recent Changes page of MediaWiki.

http://en.wikipedia.org/wiki/Special:Recentchanges

Even Googlebot will get caught in an endless loop there. There are too many links with unique URLs that return basically the same content over and over again. You can't even md5sum the page, because the links inside the page change based on the URL.
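
One partial workaround for the md5sum problem is to hash only the visible text and ignore the markup, so that differing href values don't change the fingerprint. This is a sketch under that assumption, not a description of what any real crawler does; VisibleText and content_fingerprint are made-up names. It only catches pages whose text is identical after whitespace and case are normalized, so pages that differ by even a timestamp would still look unique.

```python
import hashlib
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes only; tag attributes such as href are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def content_fingerprint(html):
    """MD5 of the page's text with whitespace collapsed and case folded."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

A spider could keep a set of fingerprints it has already seen and skip following links from any page whose fingerprint is already in the set.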

So you'll want to come up with a way to handle that. It's a pretty big undertaking.
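
As a starting point for handling loops, here is a rough sketch of a bounded breadth-first crawl: it remembers every URL it has queued, drops #fragments, stays on the starting host, and gives up after a fixed depth and page count. The crawl function, its max_pages and max_depth limits, and the fetch callback (which takes a URL and returns HTML or None) are all assumptions for illustration. It doesn't solve the "unique URLs, same content" case by itself; it just guarantees the spider stops.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100, max_depth=3):
    """Breadth-first crawl of one host; returns URLs in the order fetched."""
    start = urldefrag(start_url).url  # drop any #fragment
    host = urlsplit(start).netloc
    seen = {start}                    # every URL ever queued
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:              # fetch failed, skip this page
            continue
        visited.append(url)
        if depth >= max_depth:        # too deep, don't follow its links
            continue
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing fetch in as a parameter keeps the loop testable without touching the network; in a real spider it would wrap an HTTP client, respect robots.txt, and rate-limit requests.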