post-processor for screen scraper
Posted: Wed May 14, 2008 8:33 am
I am working on a generic framework for a screen scraper.
The idea is to have a database of various sites, with a few attributes for each site, which then allows some PHP code to grab the site and scrape it. We are currently applying the idea to harvest movie ratings from a large variety of places and to parse names from online directories (using 2 completely different types of data to test and prove the design).
What I need is a generic way to insert a "post-processor". In other words:
Step 1: retrieve all vars from the db
Step 2: get the web page
Step 3: parse the records (preg_match_all) and place in array
Step 4: convert each record to plain text
Step 5: parse each record for fields we want
Step 6: insert into db
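To make the steps concrete, here is a rough sketch of steps 3 to 5 with a slot for the post-processor. Everything named here (the `scrape` function, the `$site` array keys, the sample HTML, the two regexes) is a made-up illustration rather than the actual framework, and the page fetch and the database insert are left out.

```php
<?php
// Hypothetical per-site attributes, as they might come back from the db (Step 1).
$site = [
    'record_regex' => '/<li>(.*?)<\/li>/s',                   // Step 3: one match per record
    'field_regex'  => '/^(?<name>.+)\n(?<phone>[0-9-]+)$/',   // Step 5: fields we want
];

// Step 2 would fetch the page; a literal string stands in for it here.
$html = "<ul><li>Peter Carlson<br>111-222-3333</li></ul>";

function scrape(string $html, array $site, ?callable $postProcess = null): array
{
    $rows = [];
    preg_match_all($site['record_regex'], $html, $matches);          // Step 3
    foreach ($matches[1] as $raw) {
        // Step 4: turn <br> into newlines, then strip the remaining markup.
        $text = trim(strip_tags(preg_replace('/<br\s*\/?>/i', "\n", $raw)));
        if ($postProcess !== null) {
            $text = $postProcess($text);                             // the extra "5a" slot
        }
        if (preg_match($site['field_regex'], $text, $fields)) {      // Step 5
            $rows[] = ['name' => $fields['name'], 'phone' => $fields['phone']];
        }
    }
    return $rows; // Step 6 (insert into db) is not part of this sketch.
}

$rows = scrape($html, $site);
print_r($rows);
```

The point of the optional `$postProcess` argument is that a site whose records already arrive in the expected order can skip it, while other sites can pass in whatever rearranging they need.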
I need to insert a command as Step 5a that is more generic in nature and should really be a regex statement. Its main purpose is to rearrange the data into a format that the parser can more easily recognize. For example, consider the following 3 records:
Peter Carlson       Peter Carlson     Peter Carlson
12345 Main St       111-222-3333      Extra Stuff Here
Mytown, AA, 00000   12345 Main St     12345 Main St
111-222-3333        Mytown, AA 0000   Mytown, AA 0000
In the first case I need to do nothing. In the second case I need to rearrange the rows (plain text separated by newlines) so that the phone number comes last, and in the third case I need to delete the second line.
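As a sketch of what the second and third cases could look like with PHP's `preg_replace`: the pattern below assumes each record is handled one at a time and always has exactly four newline-separated lines, and the variable names are only for illustration.

```php
<?php
// Assumption: each record is a string of exactly four lines separated by "\n".
$fourLines = '/\A([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)\z/';

// Case 2: move line 2 (the phone number) to the end.
$record2 = "Peter Carlson\n111-222-3333\n12345 Main St\nMytown, AA 0000";
$moved = preg_replace($fourLines, "$1\n$3\n$4\n$2", $record2);

// Case 3: drop line 2 entirely.
$record3 = "Peter Carlson\nExtra Stuff Here\n12345 Main St\nMytown, AA 0000";
$dropped = preg_replace($fourLines, "$1\n$3\n$4", $record3);

echo $moved, "\n\n", $dropped, "\n";
```

Using `[^\n]*` for each line instead of `.*` keeps every group on its own line regardless of modifiers, and the `\A` and `\z` anchors make the rule a no-op on records that do not have exactly four lines.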
So, with all that, my 2 questions:
1. I am not sure what the regular expressions should look like to do that. I think they need to be something like:
/(.*+)\n(.*+)\n(.*+)\n(.*+\n)/$1\n$3\n$4\n$2\n
/(.*+)\n(.*+)\n(.*+)\n(.*+\n)/$1\n$3\n$4\n
2. What PHP function should I be using? preg_match... is not appropriate, and neither is preg_replace, as it requires the search and replace in different variables. I think I need a more generic regex statement, something like regex('s/search/replace/gsi').
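As far as I know PHP has no built-in that takes a single sed-style `s/search/replace/flags` string, but a small wrapper around `preg_replace` can split one. The sketch below is only an illustration: `applyRule` is a made-up name, and it assumes the `/` delimiter never appears inside the search or replace part and that backreferences are written as `$1`, not `\1`.

```php
<?php
/**
 * Hypothetical helper: apply one sed-style rule such as 's/search/replace/gi'
 * by splitting it into the pattern and replacement that preg_replace expects.
 * Assumption: "/" never appears (even escaped) inside the rule's parts.
 */
function applyRule(string $rule, string $text): string
{
    $parts = explode('/', $rule);
    if (count($parts) !== 4 || $parts[0] !== 's') {
        throw new InvalidArgumentException("Unsupported rule: $rule");
    }
    [, $search, $replace, $flags] = $parts;

    // preg_replace already replaces every match, so "g" is simply dropped;
    // the remaining letters are passed through as PCRE modifiers.
    $modifiers = str_replace('g', '', $flags);

    // preg_replace does not expand a literal backslash-n in the replacement,
    // so convert it to a real newline first.
    $replace = str_replace('\n', "\n", $replace);

    $result = preg_replace('/' . $search . '/' . $modifiers, $replace, $text);
    if ($result === null) {
        throw new RuntimeException("Regex failed for rule: $rule");
    }
    return $result;
}

// A rule as it might be stored in a single database column.
$rule = 's/\A([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)\z/$1\n$3\n$4\n$2/';
echo applyRule($rule, "Peter Carlson\n111-222-3333\n12345 Main St\nMytown, AA 0000"), "\n";
```

This keeps one string per site in the database while still using `preg_replace` underneath; a sturdier version would need to cope with escaped delimiters.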
Thanks!
Peter