Hi,
I want to write a spider that goes out and collects information from sites, the way Google's spider does. Any ideas?
Thanks
spider idea
Step 1. get a lot of diskspace
Step 2. write spider
Seriously though, the first step would be to come up with a way of indexing the content of a single web page.
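As a rough idea of what indexing a single page could look like, here is a minimal sketch in Python using only the standard library. The names `PageParser` and `index_page`, the sample HTML, and the choice of a plain word-to-URL inverted index are my own assumptions for illustration, not a recommendation of how a real search engine does it.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and href values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies, keep everything else as text.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def index_page(url, html, index):
    """Add the words of one page to an inverted index (word -> set of URLs).

    Returns the raw href values found on the page so a crawler can follow them.
    """
    parser = PageParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    for word in re.findall(r"[a-z0-9]+", text):
        index[word].add(url)
    return parser.links


# Hypothetical sample page, no network access needed.
index = defaultdict(set)
sample = (
    "<html><body><h1>Spider idea</h1>"
    '<a href="/next">Next page</a>'
    "<script>var x = 1;</script></body></html>"
)
links = index_page("http://example.com/", sample, index)
```

After this runs, `index` maps each word on the page to the set of URLs containing it, and `links` holds the hrefs to visit next. A real index would also need stemming, ranking and on-disk storage, which is where step 1 comes in.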
Spidering through all the links of a web page gets difficult too, because there are plenty of endless loops to run into. A good example is the Recent Changes page of MediaWiki.
http://en.wikipedia.org/wiki/Special:Recentchanges
Even Googlebot will get caught in an endless loop there. There are too many links with unique URLs that return basically the same content over and over again. You can't even md5sum the page to spot the duplicates, because the links inside the page change depending on the URL you requested.
So you'll want to come up with a way to handle that. It's a pretty big undertaking.
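One possible way to handle it, sketched in Python: normalize every URL before queueing it, remember what you have already seen, and put hard limits on depth and page count. Everything here is an assumption for illustration: `normalize`, `crawl`, the `fetch_links` callable (which you would have to write yourself to download a page and return its hrefs), and the default limits. Dropping the whole query string is a blunt rule; it also collapses pages that really are different (like `?id=5` and `?id=6`), so a real spider would need per-site rules.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to scheme, host and path so query-string variants collapse."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def crawl(start_url, fetch_links, max_depth=3, max_pages=100):
    """Breadth-first crawl that visits each normalized URL at most once.

    fetch_links is a hypothetical callable: it takes a URL and returns the
    href values found on that page.
    """
    start = normalize(start_url)
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper than this
        for href in fetch_links(url):
            target = normalize(urljoin(url, href))
            if target.startswith(("http://", "https://")) and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


def fake_fetch(url):
    """Stand-in for a real downloader: every page links back to the same
    page under a new query string each time, plus one real page."""
    fake_fetch.calls += 1
    return ["/recent?offset=%d" % fake_fetch.calls, "/about"]


fake_fetch.calls = 0
pages = crawl("http://example.com/recent", fake_fetch)
```

With the fake site above, the ever-changing `?offset=` links all normalize to the same URL, so the crawl stops after two pages instead of looping. The `max_depth` and `max_pages` limits are the safety net for loops that normalization does not catch.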