About a Crawling Project

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

miniramen
Forum Newbie
Posts: 5
Joined: Wed May 26, 2010 11:45 am

About a Crawling Project

Post by miniramen »

Hello!!
I want to keep this as short as possible. Basically, work has asked me to do something that is obviously out of my league, but I'm ambitious and I want to get it done.

Basically, I've been given a list of business directories to crawl. I have to extract their business names, phone numbers, etc. Also, the crawler has to crawl each entire website, not just a single page inside it, without following external links.

So fine... I'm trying to approach this problem step by step.
First, I need a way to crawl an entire website, then extract information from each site. The extracted information will be inserted into a MySQL database.

So my thought is that the crawler will do the same thing for every page within a website.
For each page, it will try to find the business name through some keywords, or through labels that categorize the data.
Then it stores the result in the database; if a field is null, it just skips that field. That's really all it has to do.
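A minimal sketch of that "skip null fields" insert step, assuming a `businesses` table whose columns match the array keys (the table name, column names, and connection details are all hypothetical):

```php
<?php
// Build an INSERT statement that only includes the non-null fields,
// so missing data is simply never written.
function buildInsert(string $table, array $fields): array
{
    // Drop null fields entirely.
    $fields = array_filter($fields, function ($v) { return $v !== null; });

    $columns      = array_keys($fields);
    $placeholders = array_map(function ($c) { return ':' . $c; }, $columns);

    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES (%s)',
        $table,
        implode(', ', $columns),
        implode(', ', $placeholders)
    );

    return array($sql, $fields);
}

// Usage with PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=crawl', 'user', 'pass');
// list($sql, $params) = buildInsert('businesses', array(
//     'name'  => 'Acme Plumbing',
//     'phone' => null,            // skipped, not stored
// ));
// $pdo->prepare($sql)->execute($params);
```

Using prepared-statement placeholders here also protects against SQL injection from scraped content.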

Unfortunately, the pattern is not easy to work out, and neither is crawling an entire website.

So in the technical aspect.

I'll create a DOM document for each site's content, take only the important tags using XPath, then use regex to pull data out of them and insert the results into a multidimensional array.
I really wonder if it would work, but it would be amazing if I could pull this off!!!
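A rough sketch of that DOM + XPath + regex idea against a static snippet. The `div.listing` container, the `<h2>` holding the name, and the US-style phone regex are all assumptions; real directory markup will differ per site:

```php
<?php
// Parse a page, grab listing blocks with XPath, then regex out a phone
// number and read the name from an assumed <h2> tag.
function extractListings(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings from the messy HTML found on real sites.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $listings = array();
    // Assumption: each business lives in a <div class="listing"> block.
    foreach ($xpath->query('//div[@class="listing"]') as $node) {
        $entry = array('name' => null, 'phone' => null);

        // Pull a US-style phone number out of the block's text.
        if (preg_match('/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/',
                       $node->textContent, $m)) {
            $entry['phone'] = $m[0];
        }
        // Assumption: the business name is in the first <h2>.
        $h2 = $xpath->query('.//h2', $node);
        if ($h2->length > 0) {
            $entry['name'] = trim($h2->item(0)->textContent);
        }
        $listings[] = $entry;
    }
    return $listings;
}
```

Each array in the result maps straight onto the database fields, with `null` for anything the page didn't provide.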

If anyone has done anything similar in the past, please give me some tips so I can learn from it.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: About a Crawling Project

Post by Eran »

A couple of pointers -
1. The Zend_Dom_Query component from the Zend Framework is very useful for crawling. It can be used as a standalone, and uses CSS syntax to find elements (much like jQuery if you are familiar with it). http://framework.zend.com/manual/en/zend.dom.query.html
2. If you intend to scrape a lot of pages, you should consider using the curl_multi functions which can fetch several pages at the same time. This will significantly reduce the time it takes to complete the crawl. http://www.php.net/manual/en/function.c ... i-exec.php
3. Having said that, you should consider throttling your own rate of crawling as some sites will block your IP if you fetch pages too fast, as protection against DOS attacks.
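A sketch combining points 2 and 3: fetch pages in small batches with the curl_multi functions, with a pause between batches as a crude rate limit. The batch size and delay are arbitrary choices, not recommendations:

```php
<?php
// Split the URL list into small batches.
function batchUrls(array $urls, int $size): array
{
    return array_chunk($urls, $size);
}

// Fetch one batch of URLs in parallel with curl_multi.
function fetchBatch(array $urls): array
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Run all transfers until every handle is finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $pages;
}

// Throttled crawl loop: pause between batches so the site isn't hammered.
// foreach (batchUrls($allUrls, 5) as $batch) {
//     $pages = fetchBatch($batch);
//     sleep(2); // crude rate limit per point 3
// }
```

A longer sleep (or a per-domain delay) is safer against the IP blocking mentioned above.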
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: About a Crawling Project

Post by phdatabase »

I make a living doing just this. The book that set me on this path is Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL. It has working (kind of) examples. $19.98 used on Amazon.
miniramen
Forum Newbie
Posts: 5
Joined: Wed May 26, 2010 11:45 am

Re: About a Crawling Project

Post by miniramen »

Thanks phdatabase, I got the book and it has nice resources, but since this project of mine is so complicated, it feels like I'm still extremely far away from getting anything done.

As for Zend_Dom_Query, I'm having trouble with installation. We don't need to download the whole Zend Framework for this, right?

So let me add another question: does anyone know an existing way to collect all the URLs and page content of a website made up of many pages? I have something I found via Wikipedia, but it's so complicated that I had trouble understanding it. http://syntax.cwarn23.net/PHP/Making_a_search_engine
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: About a Crawling Project

Post by phdatabase »

It's pretty straightforward and builds surprisingly fast.
1. Set up your HTTP structure; this will depend on how stealthy you need to be. (Already done if you have the book: LIB_http.php.)
2. Write a function to strip out all the anchor and area links from the page, and throw out any links that don't point to the current domain. If you throw the links into an array, you can easily get rid of duplicates of any given link.
3. Make that function recursive to follow all the links you collect, and you're done.
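A sketch of the link-collection step described above, using DOMDocument rather than the book's LIB_http helpers. The recursive crawl is outlined in comments because the fetch function (`http_get` here) is a placeholder for whatever HTTP layer you set up:

```php
<?php
// Pull anchor and area hrefs out of a page and keep only links that
// stay on the given host; array keys remove duplicates automatically.
function extractLocalLinks(string $html, string $host): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $links = array();
    foreach ($xpath->query('//a[@href] | //area[@href]') as $node) {
        $href = $node->getAttribute('href');
        $h = parse_url($href, PHP_URL_HOST);
        // Keep relative links (no host) and links on the same host.
        if ($h === null || $h === $host) {
            $links[$href] = true; // duplicate hrefs collapse here
        }
    }
    return array_keys($links);
}

// Recursive crawl outline (http_get is a placeholder fetch function):
// function crawl($url, $host, array &$seen) {
//     if (isset($seen[$url])) return;   // already visited
//     $seen[$url] = true;
//     $html = http_get($url);
//     foreach (extractLocalLinks($html, $host) as $link) {
//         crawl($link, $host, $seen);
//     }
// }
```

The `$seen` array doubles as the visited-set, so the recursion terminates even when pages link back to each other.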