
About a Crawling Project

Posted: Fri May 28, 2010 1:33 pm
by miniramen
hello!!
I want to keep this as short as possible, but basically work has asked me to do something that is obviously out of my league. Still, I'm ambitious and
I want to get it done.

Basically, I've been given a list of business directories to crawl. I have to collect their business names, phone numbers, etc. The crawler also has to
cover each entire website, not just a single page within it, without following external links.

So fine... I'm trying to tackle this problem step by step.
First, I need a way to crawl an entire website, then extract the information from each site. The extracted information will be inserted into a MySQL database.

So my thought is that the crawler will do the same thing for every website:
for each one, it will try to find the business name through some keywords, and check whether there are labels that categorize the data.
Then it stores the result in the database; if a field is null, it just skips that field. That's really all it has to do.
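For what it's worth, the "store it, skip nulls" step could be sketched like this. The table name, column names, and the `build_insert` helper are all made up for illustration:

```php
<?php
// Hypothetical helper: keep only the fields that were actually found,
// then build a prepared INSERT statement for just those columns.
function build_insert(string $table, array $fields): array {
    $present = array_filter($fields, fn($v) => $v !== null); // skip null fields
    $cols = array_keys($present);
    $placeholders = array_map(fn($c) => ':' . $c, $cols);
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES (%s)',
        $table,
        implode(', ', $cols),
        implode(', ', $placeholders)
    );
    return [$sql, $present];
}

// Usage with PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=directory', 'user', 'pass');
// [$sql, $params] = build_insert('businesses', [
//     'name'  => 'Acme Plumbing',
//     'phone' => '555-0101',
//     'fax'   => null, // null, so this column is skipped entirely
// ]);
// $pdo->prepare($sql)->execute($params);
```

Using prepared statements also keeps you safe from SQL injection via whatever text you scrape.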

Unfortunately, the pattern is not easy to work out, and neither is a way to crawl an entire website.

So, on the technical side:

I'll create a DOM document for each site's content, take only the important tags using XPath, then use regex to query it and insert the data into a multidimensional array.
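A minimal sketch of that DOM + XPath + regex idea, assuming each listing sits in a `<div class="listing">` with the name in an `<h2>` (both assumptions; every directory site will differ):

```php
<?php
// Sketch: parse HTML with DOMDocument, select listing blocks with XPath,
// then pull the phone number out of the block's text with a regex.
function extract_listings(string $html): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);            // suppress warnings from messy real-world markup
    $xpath = new DOMXPath($doc);

    $rows = [];
    // Assumed structure: each listing lives in <div class="listing">.
    foreach ($xpath->query('//div[@class="listing"]') as $node) {
        $name = $xpath->evaluate('string(.//h2)', $node);
        $text = $node->textContent;
        // Rough North American phone pattern; adjust per directory.
        $phone = preg_match('/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/', $text, $m)
            ? $m[0] : null;
        $rows[] = ['name' => trim($name) ?: null, 'phone' => $phone];
    }
    return $rows; // multidimensional array, one row per listing
}
```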
I really wonder if it'd work, but it would be amazing if I can do this!!!

If anyone has done anything similar in the past, please give me some tips, so I can better understand from it.

Re: About a Crawling Project

Posted: Fri May 28, 2010 1:48 pm
by Eran
A couple of pointers:
1. The Zend_Dom_Query component from the Zend Framework is very useful for crawling. It can be used standalone, and it uses CSS selector syntax to find elements (much like jQuery, if you are familiar with it). http://framework.zend.com/manual/en/zend.dom.query.html
2. If you intend to scrape a lot of pages, consider using the curl_multi functions, which can fetch several pages at the same time. This will significantly reduce the time it takes to complete the crawl. http://www.php.net/manual/en/function.c ... i-exec.php
3. Having said that, you should throttle your own rate of crawling, as some sites will block your IP if you fetch pages too fast, as protection against DoS attacks.
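Points 2 and 3 could be combined roughly like this; `fetch_batch` and the pause parameter are my own invented names, not part of any library:

```php
<?php
// Hypothetical batch fetcher built on the curl_multi functions: fetch a
// batch of URLs in parallel, then optionally pause before the next batch.
function fetch_batch(array $urls, int $pauseMicros = 0): array {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    // Drive all transfers until none are still active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // block until there is socket activity
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    if ($pauseMicros > 0) {
        usleep($pauseMicros); // crude throttle between batches (point 3)
    }
    return $results;
}
```

Keeping the batches small (say, 5 to 10 handles) plus a pause between batches is a simple way to stay under a site's rate-limiting radar.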

Re: About a Crawling Project

Posted: Fri May 28, 2010 1:49 pm
by phdatabase
I make a living doing just this. The book that set me on this path is Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL. It has working (kind of) examples. $19.98 used on Amazon.

Re: About a Crawling Project

Posted: Mon May 31, 2010 12:39 pm
by miniramen
Thanks phdatabase, I got the book and it has nice resources, but since this project of mine is so complicated, I still feel extremely far away from getting anything done.

As for Zend_Dom_Query, I'm having trouble with the installation; we don't need to download the whole Zend Framework just for this, right?

So let me add another question: does anyone know an existing way to collect all the URLs, and the content at each URL, of a website built from many webpages? I have something I found via Wikipedia, but it's so complicated that I had trouble understanding it. http://syntax.cwarn23.net/PHP/Making_a_search_engine

Re: About a Crawling Project

Posted: Mon May 31, 2010 4:14 pm
by phdatabase
It's pretty straightforward and builds surprisingly fast. First, set up your HTTP fetch layer; how you do this depends on how stealthy you need to be (it's already done for you if you have the book's LIB_http.php). Then write a function that strips out all the anchor and area links from a page and throws away any links that don't point to the current domain. If you put the links into an array, you can easily get rid of duplicates of any given link. Finally, make the function recursive so it follows all the links you collect, and you're done.
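The link-collection and recursion steps above could be sketched like this. Function names are mine, and note that this sketch only keeps absolute same-domain links; relative hrefs would need to be resolved against the page URL first:

```php
<?php
// Pull every <a> and <area> href out of a page and keep only those on
// the target domain. Using the URL as the array key collapses duplicates.
function same_domain_links(string $html, string $baseDomain): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $links = [];
    foreach (['a', 'area'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $el) {
            $href = $el->getAttribute('href');
            if (parse_url($href, PHP_URL_HOST) === $baseDomain) {
                $links[$href] = true;  // key = URL, so duplicates collapse
            }
        }
    }
    return array_keys($links);
}

// Recursive walk (untested sketch): fetch a page, collect its links,
// then visit any link not already seen.
function crawl(string $url, string $domain, array &$seen = []): void {
    if (isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;
    $html = @file_get_contents($url);  // or the book's LIB_http fetcher
    if ($html === false) {
        return;
    }
    foreach (same_domain_links($html, $domain) as $link) {
        crawl($link, $domain, $seen);
    }
}
```

The `$seen` array doubles as both the dedupe set and the final list of crawled URLs; `array_keys($seen)` after the walk gives you the whole site map.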