Website "scraper" question...
Posted: Tue Jun 16, 2009 9:03 pm
It turns out that I'll be doing some work to create a website scraper for a Realtor friend of mine. It will be used to help other realtors by increasing awareness of their listings through exposure.
Scraper -
1.) Upon initial load, the application will visit the specified URLs (variable / array).
2.) Uses regex to scan for a certain pattern that would match some sort of pertinent markup indicating realty.
3.) Upon successful match, the data is then stored (into an array and then into a database).
4.) The harvested information will then reside in the database following this generic format:
+----+-------+-------+--------+------+---------+--------+
| ID | PRICE | STATE | COUNTY | CITY | ADDRESS | ETC... |
+----+-------+-------+--------+------+---------+--------+
5.) When needed, the appropriate queries will display the information according to whatever search is specified in the query string; otherwise, the page will simply default to a normal "standard page" that lists "recent postings" or something...
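To make the five steps above a bit more concrete, here is a rough sketch in Python (the language is my choice for illustration only; the same flow works in PHP or anything else). Everything specific in it is an assumption: the markup that LISTING_RE matches is made up, the function names (fetch, extract_listings, open_db, store, search) are hypothetical, and SQLite stands in for whatever database ends up being used. Real listing pages will need their own patterns, and regex tends to break when a site changes its markup.

```python
import re
import sqlite3
from urllib.request import urlopen

# ASSUMPTION: invented markup. Each real site needs its own pattern.
LISTING_RE = re.compile(
    r'<div class="listing">\s*'
    r'<span class="price">\$(?P<price>[\d,]+)</span>\s*'
    r'<span class="address">(?P<address>[^<]+)</span>\s*'
    r'<span class="city">(?P<city>[^<]+)</span>\s*'
    r'<span class="county">(?P<county>[^<]+)</span>\s*'
    r'<span class="state">(?P<state>[A-Z]{2})</span>\s*'
    r'</div>'
)

def fetch(url):
    """Step 1: download one page from the list of URLs."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_listings(html):
    """Step 2: scan the page with the regex and build one dict per match."""
    rows = []
    for m in LISTING_RE.finditer(html):
        rows.append({
            "price": int(m.group("price").replace(",", "")),
            "state": m.group("state"),
            "county": m.group("county").strip(),
            "city": m.group("city").strip(),
            "address": m.group("address").strip(),
        })
    return rows

def open_db(path=":memory:"):
    """Step 4: create the table in the generic ID/PRICE/STATE/... format."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "price INTEGER, state TEXT, county TEXT, city TEXT, address TEXT)"
    )
    return conn

def store(conn, rows):
    """Step 3: move the in-memory rows into the database."""
    conn.executemany(
        "INSERT INTO listings (price, state, county, city, address) "
        "VALUES (:price, :state, :county, :city, :address)",
        rows,
    )
    conn.commit()

def search(conn, city=None, limit=10):
    """Step 5: filter by a query-string value, else show recent postings."""
    cols = "id, price, state, county, city, address"
    if city:
        # Placeholders keep user input out of the SQL text itself.
        cur = conn.execute(
            f"SELECT {cols} FROM listings WHERE city = ? "
            "ORDER BY id DESC LIMIT ?",
            (city, limit),
        )
    else:
        cur = conn.execute(
            f"SELECT {cols} FROM listings ORDER BY id DESC LIMIT ?",
            (limit,),
        )
    return cur.fetchall()
```

A run would look roughly like: for each URL, call store(conn, extract_listings(fetch(url))), then call search(conn, city=...) with whatever came in on the query string. Passing the search value as a placeholder instead of pasting it into the SQL string covers at least one of the security concerns that come with user-supplied input.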
So far, do I have this generally accurate in terms of approach? I'm not sure how long a project like this should take, but with a minor CMS backend to add websites into the first array (which obviously includes possible security concerns, etc.), I'm thinking that it should be no less than a month since I'm pretty new to some of this.
If you've done things like this before, I would appreciate any professional input. I won't go as far as thinking I'm completely lost, but I do feel a wee bit intimidated--mainly because this will be my first "from the ground, up" project.