Hi,
Does anyone have a PHP script, or know of one, that I can use to spider my website: visit every page, grab all the INTERNAL links on it, and visit those too?
Thanks ...
how can I spider my website and visit EVERY single link INT
look at file_get_contents() and preg_match_all() you should.
Need to create a regular expression pattern to parse out the links you will.
Once the links you have, create a recursive function to drill into them you can.
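A rough sketch of the link-parsing step described above, for a single page. The function name extractInternalLinks() and the $baseUrl parameter are made up for illustration, and the regular expression only handles quoted href attributes, so treat it as a starting point rather than a complete HTML parser:

```php
<?php
// Hypothetical helper: returns the internal links found in one page's HTML.
// $baseUrl is assumed to be scheme + host with no trailing slash,
// e.g. "http://www.example.com".
function extractInternalLinks(string $html, string $baseUrl): array
{
    $links = [];
    // Capture the value of every single- or double-quoted href attribute.
    if (preg_match_all('/href\s*=\s*(["\'])(.*?)\1/i', $html, $matches)) {
        foreach ($matches[2] as $href) {
            // Drop any #fragment so the same page is not listed twice.
            $href = preg_replace('/#.*$/', '', trim($href));
            if ($href === '') {
                continue;
            }
            if ($href === $baseUrl || strpos($href, $baseUrl . '/') === 0) {
                $links[] = $href;             // absolute URL on the same site
            } elseif ($href[0] === '/' && substr($href, 0, 2) !== '//') {
                $links[] = $baseUrl . $href;  // root-relative path
            }
            // Everything else (other hosts, mailto:, page-relative paths)
            // is skipped in this sketch.
        }
    }
    return array_values(array_unique($links));
}
```

For a live page you would pass in the result of file_get_contents($url) as $html. Page-relative links such as "page2.html" would need resolving against the current URL, which is left out here.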
OK, but preg_match_all() only handles one file. I can grab all the internal links on one web page, but how would I recurse through my website, which is 45,000 pages, EFFICIENTLY (not calling the same link twice)? Storing the links in a MySQL DB is a bit eccentric and would take a bit more coding than I plan to do for this...
Not calling the same link twice is easy, just put the ones you've already visited into an array and use array_search() to check if you've been to the page already. 45,000 array elements is small enough for PHP to cope with if you've increased the memory limit a bit.
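A minimal sketch of the "remember what you've visited" idea, with two deliberate swaps from the description above: it stores visited URLs as array keys and checks them with isset() rather than array_search(), because a key lookup does not scan the whole array, and it uses a loop over a list of pending URLs rather than a recursive function, so 45,000 pages cannot run into deep call nesting. The names crawlSite() and $getLinks are hypothetical; $getLinks stands for whatever code fetches a page and returns its internal links:

```php
<?php
// Hypothetical crawl loop. $getLinks is any callable that takes a URL and
// returns an array of the internal URLs found on that page.
function crawlSite(string $startUrl, callable $getLinks): array
{
    $visited = [];            // URL => true, so isset() is a direct key lookup
    $pending = [$startUrl];   // pages still waiting to be visited

    while ($pending) {
        $url = array_pop($pending);
        if (isset($visited[$url])) {
            continue;         // already fetched, so never call it twice
        }
        $visited[$url] = true;

        foreach ($getLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $pending[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

On a real site $getLinks would call file_get_contents() on the URL and run preg_match_all() over the result. Each page is fetched once, and the returned array is the full list of URLs reached from the start page.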
Question though.. why are you doing this? Surely you must know the structure of your own website, which means you should be able to write a script that generates the page names without actually having to go to the pages and parse them for anchors..
Sadly, no. I don't have the site layout for the sites in question; it's only with my most recent stuff that I've been smart enough to keep that.