
Crawler

Posted: Sat Oct 25, 2008 9:38 pm
by thatsme
Hi,

I am trying to index a website, but I am stuck. Can someone help me get this right?

Code:

 
   $url = 'http://localhost/my_search/';
   $first_page_links = Links($url);
 
   foreach ($first_page_links as $first_page_link)
   {
       $all_links[] = Links($first_page_link);
       return $all_links;
   }
 
 
I am not getting all the links.

Code:

 
   function Links($url)
   {
       $fc = file_get_contents($url);
       preg_match_all('/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims', $fc, $links);
 
       return $links[1];
   }
 
Thanks

Re: Crawler

Posted: Sun Oct 26, 2008 5:28 am
by thatsme
This forum is dead!

Re: Crawler

Posted: Sun Oct 26, 2008 5:39 am
by onion2k
No it's not, it's just quiet on weekends.

Re: Crawler

Posted: Sun Oct 26, 2008 6:16 am
by requinix
There's no recursion in there. In fact, all it does is get the links from the first page linked to in /my_search.

Let me say right now: it'd probably be better for you to use an existing tool for this. I don't know any off the top of my head, but that's why things like Google exist.

1) You need to keep track of which pages you've already scanned. Otherwise you'll get stuck in an infinite loop, scanning the same pages over and over.
2) There needs to be some sort of restriction, like staying inside that one site or only going X links deep. I assume you want the former.
3) Like I said, you also need recursion. Make a function that returns a list of all links on a page (you have that already). Then make another function that takes a page and collects the links from the pages it links to - this function will call itself.
4) Remember that links have three common forms: with a protocol (http://example.com/path/file.php), absolute without one (/path/file.php), and relative (../path/file.php). Each form needs to be handled a little differently.
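Here's a rough sketch of those four steps in PHP. The names (crawl, resolve_link, extract_links) are my own, not anything standard, and the page fetcher is passed in as a callable so you can try the logic without hitting a real server - in real use you'd pass something like `fn($url) => file_get_contents($url)`. Note the relative-path handling is simplistic (it doesn't collapse `..` segments), so treat it as a starting point, not a finished crawler.

```php
<?php

// Step 3a: extract raw href values from an HTML string (same idea as your Links()).
function extract_links(string $html): array
{
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
    return $m[1];
}

// Step 4: resolve a link against the page it appeared on.
// Handles the three common forms: full URL, absolute path, relative path.
function resolve_link(string $base, string $link): string
{
    if (preg_match('#^https?://#i', $link)) {
        return $link;                          // already has a protocol
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($link[0] === '/') {
        return $root . $link;                  // absolute path from site root
    }
    // Relative path: drop the file part of the base path, then append.
    // (Does NOT collapse ".." segments - good enough for a sketch.)
    $dir = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    return $root . $dir . $link;
}

// Step 3b: recursive crawl, with a visited set (step 1)
// and a same-host restriction (step 2).
function crawl(string $url, callable $fetch, array &$visited = []): array
{
    if (isset($visited[$url])) {
        return [];                             // step 1: already scanned this page
    }
    $visited[$url] = true;

    $host  = parse_url($url, PHP_URL_HOST);
    $found = [$url];
    foreach (extract_links($fetch($url)) as $link) {
        $abs = resolve_link($url, $link);
        if (parse_url($abs, PHP_URL_HOST) !== $host) {
            continue;                          // step 2: stay inside the one site
        }
        $found = array_merge($found, crawl($abs, $fetch, $visited));
    }
    return $found;
}
```

You can exercise it with a fake "site" held in an array, which is also a handy way to see the visited set preventing the infinite loop when two pages link to each other.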