It turns out that I'll be doing some work to create a website scraper for a Realtor friend of mine. The goal is to benefit fellow realtors by increasing awareness of their listings through wider exposure.
Scraper -
1.) Upon initial load, the application will visit the specified URLs (variable / array).
2.) It then uses a regex to scan for patterns matching pertinent markup that indicates a realty listing.
3.) Upon a successful match, the data is stored (first in an array, then in a database).
4.) The harvested information will then reside in the database following this generic format:
 ---------------------------------------------------------
| ID | PRICE | STATE | COUNTY | CITY | ADDRESS | ETC... |
 ---------------------------------------------------------
5.) When needed, the appropriate queries will display the information according to whatever search is specified in the query string; otherwise, the page will simply default to a standard view listing "recent postings" or something similar.
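The steps above can be sketched in a few lines. This is only an illustration, not a working scraper: the sample markup, the regex, and the column names are all hypothetical, and any real site will need its own pattern (Python here purely as an example, since the thread doesn't settle on a language).

```python
import re
import sqlite3

# Hypothetical listing markup (step 1 would fetch this from each URL in the
# array, e.g. with cURL/urllib). Real sites will differ, so this regex is
# only an illustration of step 2, not a pattern that works everywhere.
SAMPLE_HTML = """
<div class="listing">
  <span class="price">$250,000</span>
  <span class="address">123 Main St, Springfield, Greene County, MO</span>
</div>
<div class="listing">
  <span class="price">$410,000</span>
  <span class="address">9 Oak Ave, Columbia, Boone County, MO</span>
</div>
"""

LISTING_RE = re.compile(
    r'<span class="price">\$([\d,]+)</span>\s*'
    r'<span class="address">([^,]+), ([^,]+), ([^,]+) County, (\w+)</span>'
)

def parse_listings(html):
    """Steps 2-3: scan the markup and collect rows ready for the database."""
    rows = []
    for price, address, city, county, state in LISTING_RE.findall(html):
        rows.append((int(price.replace(",", "")), state, county, city, address))
    return rows

def store_listings(conn, rows):
    """Step 4: persist the harvested rows in the generic table layout."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "price INTEGER, state TEXT, county TEXT, city TEXT, address TEXT)"
    )
    conn.executemany(
        "INSERT INTO listings (price, state, county, city, address) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
rows = parse_listings(SAMPLE_HTML)
store_listings(conn, rows)
```

In practice most people reach for an HTML parser rather than raw regex once the markup gets messy, but the basic fetch/match/store loop is the same either way.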
So far, is my approach generally accurate? I'm not sure how long a project like this should take, but with a minor CMS backend for adding websites to the first array (which obviously raises security concerns, etc.), I'm thinking it should take no less than a month, since I'm fairly new to some of this.
If you've done things like this before, I would appreciate any professional input. I won't go as far as saying I'm completely lost, but I do feel a wee bit intimidated--mainly because this will be my first "from the ground up" project.
Website "scraper" question...
Re: Website "scraper" question...
Hello there.
I have written a similar script to pull information from a site, though I did it in Perl.
Judging by your description, all it needs is a cURL fetch, content parsing, and database insertion.
To be frank, my script took only half a day to code and about another half a day to run (there were about 1000 URLs to parse).
So relax: it doesn't take much time once you have the URLs ready to parse (or at least the URL pattern).
As for the next part, searching the database you already have and using the fields is just a matter of constructing the correct query from the inputs and displaying the results (with pagination, I might add).
If you start from scratch, I'd estimate it will take about a week to completely deliver your project.
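The "correct query with pagination" idea mentioned above can be sketched as follows. This is a hedged example, not the poster's actual code: the `listings` table, its columns, and the sample rows are all assumptions carried over from the original post's generic layout.

```python
import sqlite3

# Assumed schema from the original post's generic table layout; the
# sample rows below are made up purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, "
    "price INTEGER, state TEXT, county TEXT, city TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO listings (price, state, county, city, address) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (250000, "MO", "Greene", "Springfield", "123 Main St"),
        (410000, "MO", "Boone", "Columbia", "9 Oak Ave"),
        (320000, "KS", "Johnson", "Olathe", "77 Elm Rd"),
    ],
)

def search_listings(conn, filters=None, page=1, per_page=10):
    """Build a parameterised query from the query-string inputs.

    With no filters it falls back to the 'recent postings' default
    (newest rows first). Filter columns are whitelisted so user input
    never reaches the SQL text itself.
    """
    filters = filters or {}
    allowed = {"state", "county", "city"}
    clauses, params = [], []
    for col, value in filters.items():
        if col in allowed:
            clauses.append(f"{col} = ?")
            params.append(value)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    params += [per_page, (page - 1) * per_page]
    sql = (
        f"SELECT id, price, city FROM listings {where} "
        "ORDER BY id DESC LIMIT ? OFFSET ?"
    )
    return conn.execute(sql, params).fetchall()

recent = search_listings(conn)            # default "recent postings" view
mo_only = search_listings(conn, {"state": "MO"})
```

The `LIMIT ? OFFSET ?` pair is the pagination the reply alludes to; always bind user-supplied values as parameters rather than interpolating them into the SQL string.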
Re: Website "scraper" question...
susrisha wrote: Well to be frank, my script took only half a day to code
You wrote something that was capable of accurately scraping 1000 different websites in half a day?
Or do you mean "1000 URLs" in the sense that it was 1000 pages of the same site? In which case it wouldn't be much use for aggregating many different websites' content into a single page.
Re: Website "scraper" question...
I wrote it for a single site. It had 1000 URLs on the same site; that site was holding data belonging to 1000 other, different sites.