how can I spider my website and visit EVERY single link INT

Deseree
Forum Commoner
Posts: 84
Joined: Mon Feb 13, 2006 11:35 pm

Post by Deseree »

Hi,
Does anyone have a PHP script, or know of one, that I can use to spider my website: visit every page, grab all the INTERNAL links on it, and follow each of them in turn?

Thanks ...
Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

look at file_get_contents() and preg_match_all() you should.

Need to create a regular expression pattern to parse out the links you will.

Once the links you have, create a recursive function to drill into them you can.
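
A minimal sketch of the approach described above, not a finished crawler. The function names `extract_internal_links()` and `spider()`, the `$base` and `$fetch` parameters, and the `example.com` URLs are all made up for illustration. The regex only catches plain quoted `href` attributes, and document-relative links such as `page.html` are skipped here.

```php
<?php
// Sketch only: function and parameter names are illustrative assumptions.

/**
 * Return the absolute internal links found in $html.
 * $base is the site root without a trailing slash, e.g. "http://example.com".
 */
function extract_internal_links(string $html, string $base): array
{
    $host = parse_url($base, PHP_URL_HOST);

    // Capture the href value of each <a> tag, dropping any #fragment.
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\'#]+)/i', $html, $matches);

    $links = [];
    foreach ($matches[1] as $href) {
        // Site-relative path ("/about"): prefix it with the site root.
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $href = $base . $href;
        }
        // Keep only URLs on our own host; keys de-duplicate within the page.
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true;
        }
    }
    return array_keys($links);
}

/**
 * Recursively visit $url and every internal link reachable from it.
 * $fetch is any callable that returns the page body or false on failure;
 * $seen collects visited URLs so link cycles do not recurse forever.
 */
function spider(string $url, string $base, callable $fetch, array &$seen): void
{
    if (isset($seen[$url])) {
        return; // already visited
    }
    $seen[$url] = true;

    $html = $fetch($url);
    if ($html === false) {
        return; // fetch failed, nothing to parse
    }
    foreach (extract_internal_links($html, $base) as $link) {
        spider($link, $base, $fetch, $seen);
    }
}
```

To run it against a live site, something like `$seen = []; spider('http://example.com/', 'http://example.com', 'file_get_contents', $seen);` should work, after which `array_keys($seen)` holds every URL visited. On a large site the recursion can get deep, so a loop with an explicit list of pending URLs may be the safer shape.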
Deseree
Forum Commoner
Posts: 84
Joined: Mon Feb 13, 2006 11:35 pm

Post by Deseree »

OK, but preg_match_all() only handles one file. I can grab all the internal links on one web page, but how would I recurse through my whole website, which has 45,000 pages, EFFICIENTLY (not calling the same link twice)? Storing the links in a MySQL DB seems a bit eccentric and would take more coding than I plan to put into this...
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

Not calling the same link twice is easy: just put the ones you've already visited into an array and use array_search() to check whether you've been to the page already. 45,000 array elements is small enough for PHP to cope with if you've increased the memory limit a bit.
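
As a rough sketch of the "remember what you've visited" idea: the version below swaps array_search() for a URL-keyed array checked with isset(), because array_search() scans the array on every call while a key lookup does not. The function name `crawl_once()` and the `$getLinks` callable are hypothetical. `$getLinks` stands in for whatever code fetches a page and returns its internal links.

```php
<?php
// Sketch only: crawl_once() and $getLinks are illustrative assumptions.

/**
 * Visit every URL reachable from $start exactly once and return them all.
 * $getLinks takes a URL and returns an array of internal links on that page.
 */
function crawl_once(string $start, callable $getLinks): array
{
    $visited = [$start => true]; // URL => true, so isset() is a key lookup
    $pending = [$start];         // URLs discovered but not yet fetched

    while ($pending) {
        $url = array_pop($pending);
        foreach ($getLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true; // mark on discovery, so it is queued once
                $pending[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Marking a URL as visited when it is first discovered, rather than when it is fetched, keeps the same link from being queued by several pages at once. Using a loop with a pending list instead of recursion also avoids deep call stacks on a site of this size.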

Question though.. why are you doing this? Surely you must know the structure of your own website, which means you should be able to write a script that generates the page names without actually having to go to the pages and parse them for anchors..
Deseree
Forum Commoner
Posts: 84
Joined: Mon Feb 13, 2006 11:35 pm

Post by Deseree »

Sadly, no. I don't have the site layout for the sites in question. Only with my most recent stuff was I smart enough to keep that.