Crawler

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

thatsme
Forum Commoner
Posts: 87
Joined: Sat Apr 07, 2007 2:18 am

Crawler

Post by thatsme »

Hi,

I am trying to index a website and I am stuck. Can someone help me get this right?

Code:

 
   $url = 'http://localhost/my_search/';
   $first_page_links = Links($url);
 
   foreach ($first_page_links as $first_page_link)
   {
       $all_links[] = Links($first_page_link);
       return $all_links;
   }
 
 
I am not getting all the links.

Code:

 
   function Links($url)
   {
       // Fetch the page and capture the value of every href attribute
       $fc = file_get_contents($url);
       preg_match_all('/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims', $fc, $links);
 
       return $links[1];
   }
 
Thanks
thatsme

Re: Crawler

Post by thatsme »

This forum is dead!
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Crawler

Post by onion2k »

No it's not, it's just quiet on weekends.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Crawler

Post by requinix »

There's no recursion in there at all. In fact, all it does is get the links from the first page linked to in /my_search.

Let me say right now, it'd probably be better for you to use an existing tool for this. I don't know any off the top of my head, but that's why things like Google exist.

1) You need to keep track of which pages you've scanned already. Otherwise you'll get stuck in an infinite loop by scanning the same pages over and over again.
2) There needs to be some sort of restriction. Like to stay inside that one site, or to only go X links in. I assume you want the former.
3) Like I said, you also need recursion. Make a function to return a list of all links in a page (you have that already). Then make another function that takes a page and collects links from pages it links to - this function will call itself.
4) Remember that links have three (common) basic forms: with protocol (http://example.com/path/file.php), without and absolute (/path/file.php), and without and relative (../path/file.php). Each form needs to be handled a little differently.
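
Putting those four points together, here's a rough sketch of how a single-site crawler could look. This is an illustration, not tested production code: the function names (crawl, extractLinks, absolutize) and the localhost start URL are made up for the example, and the link regex is deliberately simple.

```php
<?php
// Step 4: turn a link found on page $base into an absolute URL.
// Handles the three common forms: with protocol, root-absolute, and relative.
function absolutize($base, $link)
{
    if (preg_match('#^https?://#i', $link)) {
        return $link;                           // already has a protocol
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($link[0] === '/') {
        return $root . $link;                   // absolute path from site root
    }
    // Relative path: resolve against the directory of $base
    $dir = preg_replace('#/[^/]*$#', '/', isset($parts['path']) ? $parts['path'] : '/');
    return $root . $dir . $link;
}

// Pull every quoted href value out of a chunk of HTML.
function extractLinks($html)
{
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
    return $m[1];
}

// Steps 1-3: recurse into linked pages, skipping pages we've already
// visited (step 1) and links that leave the site (step 2).
function crawl($url, $host, array &$visited)
{
    if (isset($visited[$url])) {
        return;                                 // step 1: already scanned
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        return;                                 // fetch failed; skip this page
    }
    foreach (extractLinks($html) as $link) {
        $abs = absolutize($url, $link);
        if (parse_url($abs, PHP_URL_HOST) === $host) {  // step 2: same site only
            crawl($abs, $host, $visited);       // step 3: recursion
        }
    }
}

// Usage:
// $visited = [];
// crawl('http://localhost/my_search/', 'localhost', $visited);
// print_r(array_keys($visited));
```

On a large site you'd want a queue instead of recursion to avoid deep call stacks, and a polite crawler should also respect robots.txt and rate-limit its requests.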