About a Crawling Project
Posted: Fri May 28, 2010 1:33 pm
Hello!
I want to keep this as short as possible. Work has asked me to do something that is obviously out of my league, but I'm ambitious and
I want to get it done.
Basically, I've been given a list of business directories to crawl. I have to collect each business's name, phone number, etc. Also, the crawler has to
crawl the entire website, not just a single page inside it, without following external links.
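
A minimal sketch of the "whole site, no external links" requirement, assuming Python and only its standard library. The names `crawl_site`, `fetch`, `LinkCollector` and the `max_pages` limit are hypothetical, chosen for illustration. The idea is a breadth-first walk that only queues links whose host matches the start URL's host:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch_page=fetch, max_pages=500):
    """Breadth-first crawl that never leaves the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # url -> html
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(link)
            same_site = parts.scheme in ("http", "https") and parts.netloc == host
            if same_site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl_site("http://example.com/")` would return a dictionary mapping each visited URL to its HTML. A real crawler would probably also want to respect robots.txt and pause between requests, which this sketch leaves out.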
So fine... I'm trying to tackle this problem step by step.
First, I need a way to crawl an entire website, then extract information from it for each site. The information I collect will be inserted into a MySQL database.
So my thinking is that the crawler will do the same thing for each webpage within a website.
For each one, it will try to find the business name through some keywords and check whether there are labels that categorize the data.
Then it stores the data in the database; if a field is null, it just skips that field. That's really all it has to do.
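
A rough sketch of the "store it, skip null fields" step. To keep it runnable without a database server, Python's built-in `sqlite3` stands in for MySQL here; the `businesses` table, its columns and the `save_business` helper are all hypothetical. Only fields that actually have a value are put into the INSERT, so the skipped columns stay NULL:

```python
import sqlite3

# Hypothetical column list; it also serves as a whitelist so that
# only known column names ever reach the SQL string.
COLUMNS = ("name", "phone", "address")


def save_business(conn, record):
    """Insert one record, leaving out fields that are None or empty."""
    fields = {col: record[col] for col in COLUMNS if record.get(col)}
    if not fields:
        return False  # nothing worth storing
    column_list = ", ".join(fields)
    placeholders = ", ".join("?" for _ in fields)
    conn.execute(
        f"INSERT INTO businesses ({column_list}) VALUES ({placeholders})",
        tuple(fields.values()),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE businesses (name TEXT, phone TEXT, address TEXT)")
save_business(conn, {"name": "Acme Plumbing", "phone": None, "address": ""})
```

With a MySQL driver the same pattern should apply, although the placeholder style differs between drivers (many use `%s` instead of `?`).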
Unfortunately, the pattern is not easy to work out, and neither is the way to crawl an entire website.
On the technical side:
I'll create a DOM document for each site_content, take only the important tags using XPath, then use regex to query it and insert the data into a multidimensional array.
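
A small sketch of the DOM + XPath + regex idea, assuming Python's standard library. The sample markup, the `listing` class, the `extract_listings` helper and the phone pattern are made up for illustration; real directories will each need their own selectors. Note that `xml.etree.ElementTree` only supports a limited XPath subset and requires well-formed markup, so messy real-world HTML would likely need a more forgiving parser:

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for North American style numbers such as (555) 123-4567.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

SAMPLE = """
<html><body>
  <div class="listing"><h2>Acme Plumbing</h2><p>Call (555) 123-4567 today</p></div>
  <div class="listing"><h2>Bolt Electric</h2><p>No phone listed</p></div>
  <div class="advert"><p>Unrelated banner 555 999 0000</p></div>
</body></html>
"""


def extract_listings(markup):
    """Return one dict per listing block; missing fields are None."""
    root = ET.fromstring(markup)
    listings = []
    # XPath: keep only the tags that matter, here every listing <div>.
    for node in root.findall(".//div[@class='listing']"):
        heading = node.find(".//h2")
        name = heading.text.strip() if heading is not None and heading.text else None
        # Regex: search the block's visible text for a phone number.
        text = " ".join(node.itertext())
        match = PHONE_RE.search(text)
        listings.append({"name": name, "phone": match.group(0) if match else None})
    return listings


rows = extract_listings(SAMPLE)
```

Here `rows` is a list of dictionaries, which plays the role of the multidimensional array: one entry per business, one key per field.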
I really wonder if it will work, but it would be amazing if I can do this!
If anyone has done anything similar in the past, please give me some tips so I can understand it better.