
site crawler for search script using cURL

Posted: Fri Oct 08, 2004 9:40 am
by mudkicker
Hey people,
I am developing a search script which crawls the sites in your web directory and inserts their content into a database for searching.

At first I used plain PHP: I read the file contents into a string, stripped the tags, and inserted the result into the database, but it isn't efficient, and I don't think it gives the right output..

Then I thought of using cURL to open and crawl the files...

What do you think about this, what are your suggestions?

The hardest part of this is crawling the sites/pages in a logical and efficient way.. :roll: :?: :idea:

Posted: Fri Oct 08, 2004 10:17 am
by Weirdan
cURL is easy to use but isn't very efficient. Still, it's the right tool for a simple little web spider.

Posted: Fri Oct 08, 2004 10:21 am
by mudkicker
Well, for now I am writing this script for use on your own site only, so it doesn't need to crawl files across the whole web.

Posted: Fri Oct 08, 2004 10:42 am
by feyd
file_get_contents should work just fine when you need to crawl local files, unless you need the header information sent back to the browser.. It depends on what level of crawling you really want to do. Some sites will need a properly set session cookie or other things to index properly, so cURL may be the only option, outside of writing your own version of it (with some logic changes), which in the end may be a good idea, if only for the learning.
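A minimal fetch helper along these lines might look like this (fetch_page is my own name, not from any library; the cURL path is for the cookie/header cases mentioned above):

```php
<?php
// Sketch: fetch a page either with file_get_contents (simple local case)
// or with cURL (when redirects or session cookies matter).
function fetch_page($url, $use_curl = false)
{
    if (!$use_curl) {
        $body = file_get_contents($url);
        return $body === false ? null : $body;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');       // enable in-memory cookie handling
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : $body;
}
```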

As for the searching itself... I wouldn't store the content wholesale. Instead, store the individual words (a new record for each) in the database. You'll need a filter to remove common and useless words for each language, of course. You can then have a second table with the URLs and timestamps of when you last crawled them. A third table would provide the content linkage between the two. Tables A and C will be HUGE, but searches will probably be faster for it..

When recrawling a page, I'd delete/replace its entries in the two big tables.
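The three-table layout described above could be sketched like this (all table and column names are my own guesses, not a fixed schema):

```php
<?php
// Sketch of the three-table index (names invented for illustration).
// words:     one row per distinct word                 (table A -- huge)
// pages:     one row per URL plus last-crawl time      (table B)
// word_page: links words to the pages they occur on    (table C -- huge)
$schema = "
CREATE TABLE words (
    word_id INT AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(64) NOT NULL,
    UNIQUE KEY (word)
);
CREATE TABLE pages (
    page_id      INT AUTO_INCREMENT PRIMARY KEY,
    url          VARCHAR(255) NOT NULL,
    last_crawled DATETIME NOT NULL,
    UNIQUE KEY (url)
);
CREATE TABLE word_page (
    word_id INT NOT NULL,
    page_id INT NOT NULL,
    PRIMARY KEY (word_id, page_id)
);
";
```

Recrawling a page then means deleting its word_page rows (and any words that no longer occur anywhere, if you care) before inserting the fresh ones.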

Posted: Fri Oct 08, 2004 11:34 am
by mudkicker
Well, I think file_get_contents is better for local use..
But another problem is showing up:
without stripping tags, it searches everything, even inside the PHP tags. I want my script to search what the HTML outputs, like we see it with a browser...
How can I do this?

Posted: Fri Oct 08, 2004 11:35 am
by feyd
That's basically what strip_tags is for; however, you may need to create your own version that separates all the stripped tags out so you can index their words correctly.
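One common trick for the separation problem: strip_tags removes tags outright, so words separated only by markup (e.g. table cells) get glued together. Padding each tag with a space first keeps them apart (a sketch, not the only way; the function name is mine):

```php
<?php
// Sketch: strip tags without fusing words that were separated only by markup.
function strip_tags_spaced($html)
{
    // Insert a space before every tag so "foo</td><td>bar" doesn't become "foobar".
    $html = str_replace('<', ' <', $html);
    $text = strip_tags($html);
    // Collapse the extra whitespace back down to single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```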

Posted: Sat Oct 09, 2004 5:38 am
by twigletmac
You would need to run a file to see exactly what its output is - so you may want to crawl via URL rather than via the filesystem and let the webserver do all the processing work.

Mac

Posted: Sat Oct 09, 2004 5:53 am
by mudkicker
twigletmac wrote:You would need to run a file to see exactly what its output is - so you may want to crawl via URL rather than via the filesystem and let the webserver do all the processing work.

Mac
That's the idea, but how? I can't imagine how to do this, twiglet; that's the problem :?: :roll: :(

Posted: Sat Oct 09, 2004 5:58 am
by twigletmac
You would need to get the contents of the homepage, scan it for links, and then follow each of them, repeating the same procedure. There are probably quite a few projects like this in PHP on the web, so a search would probably help you get started in the right direction.
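The get-page, scan-for-links, repeat loop might be sketched like this (extract_links and crawl are my own names; a real spider would also resolve relative URLs, stay on-site, and set a depth limit):

```php
<?php
// Sketch: pull href values out of a fetched page so they can be queued.
function extract_links($html)
{
    // Rough: grabs href="..." and href='...' values, skipping fragments.
    preg_match_all('/href\s*=\s*["\']([^"\'#]+)/i', $html, $m);
    return array_unique($m[1]);
}

// Sketch: breadth-first crawl starting from the homepage.
function crawl($start_url)
{
    $queue = array($start_url);
    $seen  = array();
    while (!empty($queue)) {
        $url = array_shift($queue);
        if (isset($seen[$url])) {
            continue;               // already crawled this page
        }
        $seen[$url] = true;
        $html = file_get_contents($url);
        // ... strip tags and index $html into the database here ...
        foreach (extract_links($html) as $link) {
            $queue[] = $link;       // real code: resolve relative links first
        }
    }
}
```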

Mac

Posted: Sat Oct 09, 2004 6:13 am
by mudkicker
Yes, I found some, like phpMySearch.
That's not the problem..
I can make a spider robot which scans and crawls links, but the main problem is how I should crawl the pages and add them to the database (adding the content of the pages in a sensible way)..
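Tying this back to the word-table idea earlier in the thread, the "sensible way" might start with something like this sketch (the stopword list is just a tiny sample, and index_words is my own name):

```php
<?php
// Sketch: turn stripped page text into index-able words.
function index_words($text)
{
    // A real stopword list would be much longer and per-language.
    $stopwords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is');
    $words = preg_split('/[^a-z0-9]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $keep = array();
    foreach ($words as $w) {
        // Drop stopwords and very short tokens before they hit the database.
        if (strlen($w) > 2 && !in_array($w, $stopwords)) {
            $keep[] = $w;
        }
    }
    // Each surviving word would get a row in the words table (if new)
    // and a row in the linkage table pointing at this page's record.
    return array_unique($keep);
}
```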