
how can I spider my website and visit EVERY single link INT

Posted: Wed May 17, 2006 11:50 pm
by Deseree
Hi,
Does anyone have a PHP script, or know of one, that I can use to spider my website? It should visit every page, grab all the INTERNAL links, and follow each of them.

Thanks ...

Posted: Thu May 18, 2006 12:32 am
by Burrito
look at file_get_contents() and preg_match_all() you should.

Need to create a regular expression pattern to parse out the links you will.

Once the links you have, create a recursive function to drill into them you can.
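
For the link-parsing step, here is a minimal sketch of one way it could look. The function name extract_internal_links() and the $host parameter are made up for illustration, and the regular expression only handles quoted href attributes, so treat it as a starting point rather than a complete HTML parser.

```php
<?php
// Hypothetical helper: returns the unique links in $html that stay on $host.
function extract_internal_links(string $html, string $host): array
{
    // Capture the value of every single- or double-quoted href in an <a> tag.
    preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

    $links = [];
    foreach ($matches[2] as $href) {
        $parts = parse_url($href);
        if ($parts === false) {
            continue; // malformed URL
        }
        // Skip mailto:, javascript:, ftp: and other non-web schemes.
        if (isset($parts['scheme'])
            && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
            continue;
        }
        // Skip absolute links that point at a different host.
        if (isset($parts['host']) && strcasecmp($parts['host'], $host) !== 0) {
            continue;
        }
        // Skip pure fragment links such as "#top".
        if (!isset($parts['host']) && !isset($parts['path']) && !isset($parts['query'])) {
            continue;
        }
        $links[] = $href;
    }
    // Drop duplicates found on the same page and reindex from 0.
    return array_values(array_unique($links));
}
```

In a real run the $html argument would come from file_get_contents($url), and relative links would still need to be resolved against the page URL before they are fetched.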

Posted: Thu May 18, 2006 2:15 am
by Deseree
Burrito wrote: look at file_get_contents() and preg_match_all() you should.

Need to create a regular expression pattern to parse out the links you will.

Once the links you have, create a recursive function to drill into them you can.

OK, but preg_match_all() only handles one file. I can grab all the internal links on one web page, but how would I recurse through my website, which is 45,000 pages, EFFICIENTLY (not calling the same link twice)? Storing the links in a MySQL DB is a bit eccentric and would take more coding than I plan to put into this...

Posted: Thu May 18, 2006 3:37 am
by onion2k
Not calling the same link twice is easy, just put the ones you've already visited into an array and use array_search() to check if you've been to the page already. 45,000 array elements is small enough for PHP to cope with if you've increased the memory limit a bit.
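
A rough sketch of that bookkeeping follows, with two swaps worth naming. It stores visited URLs as array keys and checks them with isset() instead of array_search(), because array_search() scans the whole array on every lookup. It also uses a queue (SplQueue) instead of a recursive function, so a deep site does not pile up nested calls. The crawl() function and the $fetchLinks callback are hypothetical names; the callback is assumed to return the internal URLs found on a page, already made absolute.

```php
<?php
// Hypothetical crawl loop. $fetchLinks is any callable that takes a URL and
// returns an array of internal URLs found on that page.
function crawl(string $startUrl, callable $fetchLinks): array
{
    $visited = [];               // URL => true, so lookups are a cheap isset()
    $queue   = new SplQueue();   // pages still waiting to be fetched
    $queue->enqueue($startUrl);

    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        if (isset($visited[$url])) {
            continue;            // already fetched, do not request it again
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue->enqueue($link);
            }
        }
    }
    // The keys are the URLs, in the order they were first visited.
    return array_keys($visited);
}
```

The callback is where file_get_contents() and the link-parsing regular expression would go. Keeping it separate means the loop can be tried against a small hard-coded map of pages before pointing it at the live site.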

Question though.. why are you doing this? Surely you must know the structure of your own website, which means you should be able to write a script that generates the page names without actually having to go to the pages and parse them for anchors..

Posted: Thu May 18, 2006 6:45 am
by Deseree
onion2k wrote: Not calling the same link twice is easy, just put the ones you've already visited into an array and use array_search() to check if you've been to the page already. 45,000 array elements is small enough for PHP to cope with if you've increased the memory limit a bit.

Question though.. why are you doing this? Surely you must know the structure of your own website, which means you should be able to write a script that generates the page names without actually having to go to the pages and parse them for anchors..

Sadly, no. I don't have the site layout for the sites in question; I was only smart enough to keep that for my most recent stuff.