site crawler for search script using cURL

This forum is not for 'how-to' coding questions but for PHP theory; it is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

mudkicker
Forum Contributor
Posts: 479
Joined: Wed Jul 09, 2003 6:11 pm
Location: Istanbul, TR
Contact:

site crawler for search script using cURL

Post by mudkicker »

Hey people,
I am developing a search script which crawls the sites in your web directory and inserts their content into a database for later searching.

At first I tried plain PHP: I read the file contents into a string, stripped the tags, and inserted the result into the database. But that isn't efficient, and I don't think it gives the right output..

Then I thought of using cURL to open and crawl the files...

What do you think about this, what are your suggestions?

The hardest part of this is crawling the sites/pages in a logical and efficient way.. :roll: :?: :idea:
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

curl is easy to use but isn't very efficient. It's the right tool for a simple little web spider.

Post by mudkicker »

Well, at first I am writing this script for use on your own site only, so it's not important to crawl other files across the whole web.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

file_get_contents should work just fine for crawling local files, unless you need the header information sent back to the browser. It depends on what level of crawling you really want to do: some sites will need a properly set session cookie or other state before they can be indexed properly. In those cases cURL may be the only option, short of writing your own version of it (with some logic changes), which in the end may be a good idea, if only for the learning.
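As a rough sketch of the cURL side of that, here is a fetch helper that keeps session cookies between requests (the cookie-jar path and function name are assumptions, not part of any discussed code):

```php
<?php
// Sketch: fetch a page with cURL, keeping session cookies between requests
// so that sites which require a session cookie can still be indexed.
function fetch_page(string $url, string $cookieJar = '/tmp/spider_cookies.txt')
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_COOKIEJAR      => $cookieJar, // save cookies after the request
        CURLOPT_COOKIEFILE     => $cookieJar, // send saved cookies with the request
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);                   // false on failure
    curl_close($ch);
    return $body;
}
```

With file_get_contents the equivalent is a one-liner, but you lose the cookie and redirect control shown above.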

As for the searching itself... I wouldn't store the content completely, per se, but instead store the individual words (a new record for each) in the database. You'll need a filter to remove common and useless words for each language, of course. You can then have a second table with the URLs and timestamps of when you last crawled them. A third table would be needed to provide content linkage between the two tables. Tables A and C will be HUGE, but searches will probably be faster for it..

When recrawling a page, I'd delete/replace its entries in the two big tables.
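A sketch of the three tables described above; the names and column types are assumptions, not a prescribed schema:

```sql
-- Table B: one row per crawled URL, with the last-crawl timestamp
CREATE TABLE pages (
    page_id    INT AUTO_INCREMENT PRIMARY KEY,
    url        VARCHAR(255) NOT NULL UNIQUE,
    crawled_at TIMESTAMP NOT NULL
);

-- Table A: one row per distinct word (after stop-word filtering)
CREATE TABLE words (
    word_id INT AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(64) NOT NULL UNIQUE
);

-- Table C: linkage between words and pages
CREATE TABLE word_page (
    word_id INT NOT NULL,
    page_id INT NOT NULL,
    PRIMARY KEY (word_id, page_id)
);
```

On a recrawl, `DELETE FROM word_page WHERE page_id = ?` before re-inserting the page's words, which matches the delete/replace step described above.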

Post by mudkicker »

Well, I think file_get_contents is better for local use..
However, another problem is showing up:
without stripping tags it searches everything, even inside PHP tags. But I want my script to search the HTML output, the way we see it in a browser...
How can I do this?

Post by feyd »

That's basically what strip_tags is for; however, you may need to create your own version that separates out the stripped tags so the words around them can be indexed correctly.
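The separation problem is that strip_tags joins words that were divided only by markup (e.g. `<td>one</td><td>two</td>` becomes `onetwo`). A minimal sketch of a wrapper that avoids this; the function name is an assumption:

```php
<?php
// Sketch: strip tags but keep a space where each tag was, so words that
// were separated only by markup don't run together, then collapse whitespace.
function html_to_words(string $html): string
{
    $spaced   = str_replace('<', ' <', $html); // keep tag boundaries as spaces
    $stripped = strip_tags($spaced);
    return trim(preg_replace('/\s+/', ' ', $stripped));
}
```

strip_tags also drops PHP tags (`<?php ... ?>`), which covers the "searches inside PHP tags" problem when indexing source files directly.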
twigletmac
Her Royal Site Adminness
Posts: 5371
Joined: Tue Apr 23, 2002 2:21 am
Location: Essex, UK

Post by twigletmac »

You would need to run a file to see exactly what its output is - so you may want to crawl via URL rather than via the filesystem, and let the webserver do all the processing work.

Mac

Post by mudkicker »

twigletmac wrote:You would need to run a file to see exactly what its output is - so you may want to crawl via URL rather than via the filesystem, and let the webserver do all the processing work.

Mac
That's the idea, but how? I can't imagine how to do this, twiglet; that's the problem :?: :roll: :(

Post by twigletmac »

You would need to get the contents of the homepage, scan it for links, and then follow each of those, repeating the same procedure. There are probably quite a few projects like this in PHP on the web, so a search would probably help you get started in the right direction.

Mac
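The "scan it for links" step above can be sketched like this; DOMDocument copes with messy real-world HTML better than a regex, and the very rough relative-URL handling here is an assumption, not a full resolver:

```php
<?php
// Sketch: pull the <a href> links out of a fetched page so the spider
// can queue them for the next round of crawling.
function extract_links(string $html, string $base): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // @ silences warnings on bad markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') {
            continue;                    // skip empty and same-page anchors
        }
        // Very rough resolution: keep absolute URLs, prefix relative ones.
        if (!preg_match('#^https?://#', $href)) {
            $href = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}
```

The spider then fetches each returned URL it hasn't seen before and repeats the procedure.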

Post by mudkicker »

Yes, I found some, like phpMySearch.
That's not the problem..
I can make a spider robot which scans and crawls links. The main problem is how I should crawl the pages and add them to the database (adding the content of the pages in a sensible way)..
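Following feyd's one-record-per-word suggestion earlier in the thread, the "sensible way" could look like this sketch: tokenize the page text, drop common/useless words, and produce one row per surviving word (the function, table, and column names are assumptions):

```php
<?php
// Sketch: turn a page's visible text into one-row-per-word index records,
// filtering out stop words and very short tokens before they hit the database.
function index_words(string $text, int $pageId, array $stopWords): array
{
    $words = preg_split('/[^a-z0-9]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $rows  = [];
    foreach ($words as $w) {
        if (strlen($w) < 3 || in_array($w, $stopWords, true)) {
            continue; // drop short and common/useless words
        }
        $rows[] = ['page_id' => $pageId, 'word' => $w];
    }
    return $rows;
}
```

Each returned row would become an INSERT into the word/linkage tables, and on a recrawl you would first delete the page's old rows, as suggested above.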