post-processor for screen scraper

howudodat
Forum Newbie
Posts: 4
Joined: Wed May 14, 2008 7:54 am

post-processor for screen scraper

Post by howudodat »

I am working on a generic framework for a screen scraper.
The idea is to have a database of various sites, with a few attributes for each site, which then allows some PHP code to grab the site and scrape it. We are currently using the idea to harvest movie ratings from a large variety of places and to parse names from online directories (using 2 completely different types of data to test / prove the design).

What I need is a generic way to insert a "post-processor". In other words:
Step 1: retrieve all vars from the db
Step 2: get the web page
Step 3: parse the records (preg_match_all) and place in array
Step 4: convert each record to plain text
Step 5: parse each record for fields we want
Step 6: insert into db

I need to insert a command in Step 5a which is more generic in nature and should really be a regex statement. Its main purpose is to "re-arrange" the data into a format that the parse can more easily recognize. For example consider the following 3 records

Code: Select all

 
Peter Carlson           Peter Carlson           Peter Carlson
12345 Main St           111-222-3333            Extra Stuff Here
Mytown, AA, 00000       12345 Main St           12345 Main St
111-222-3333            Mytown, AA 0000         Mytown, AA 0000  
 
In the first case I need to do nothing. In the 2nd case I need to re-arrange the rows (plain text separated by newlines) so the phone # is last, and in the 3rd case I need to delete the 2nd line.

So with all that my 2 questions:
1. I have no idea what the regex expressions could look like to do that. I know it needs to be something like
/(.*+)\n(.*+)\n(.*+)\n(.*+\n)/$1\n$3\n$4\n$2\n
/(.*+)\n(.*+)\n(.*+)\n(.*+\n)/$1\n$3\n$4\n

2. What PHP function should I be using? preg_match... is not appropriate, and neither is preg_replace, as it requires the search and replace in different variables. I think I need a more generic regex statement, something like regex('s/search/replace/gsi')

Thanks!
Peter
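
One way to express the two re-arrangements from the question is with preg_replace(), keeping each pattern/replacement pair as data. This is only a minimal sketch, assuming every record is exactly four newline-separated lines; the $rules array, its keys and the apply_rule() helper are made up for illustration (in the framework described above the pairs would presumably come from the db).

```php
<?php
// Hypothetical per-site rules: each entry is a pattern plus a replacement template.
// "$1" inside double quotes is not a PHP variable, so it reaches preg_replace as-is.
$rules = array(
    // Move line 2 (the phone number) to the end.
    'phone_last' => array('/\A(.*)\n(.*)\n(.*)\n(.*)\z/', "$1\n$3\n$4\n$2"),
    // Drop line 2 entirely.
    'drop_line2' => array('/\A(.*)\n(.*)\n(.*)\n(.*)\z/', "$1\n$3\n$4"),
);

// Hypothetical helper: apply one (pattern, replacement) pair to a record.
function apply_rule(array $rule, $record) {
    list($pattern, $replacement) = $rule;
    return preg_replace($pattern, $replacement, $record);
}

$second = "Peter Carlson\n111-222-3333\n12345 Main St\nMytown, AA 0000";
$third  = "Peter Carlson\nExtra Stuff Here\n12345 Main St\nMytown, AA 0000";

$fixed_second = apply_rule($rules['phone_last'], $second);
$fixed_third  = apply_rule($rules['drop_line2'], $third);

print $fixed_second . "\n\n" . $fixed_third . "\n";
```

Because "." does not match a newline by default, each (.*) captures exactly one line, so the replacement template only has to list the groups in the wanted order.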
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: post-processor for screen scraper

Post by prometheuzz »

Couple of questions:

How do you get the data? Like you posted it: different records sharing the same line? (that can get rather messy!) Or record per record on 4 different lines?

Are the records always more or less the same? Four lines whose position may vary?

How do you make a distinction between the lines: "John Paul Howard" and "Extra Stuff Here"? They look "regex-wise" the same: both are three words starting with a capital letter followed by lower case letters.
howudodat
Forum Newbie
Posts: 4
Joined: Wed May 14, 2008 7:54 am

Re: post-processor for screen scraper

Post by howudodat »

I placed the 3 records horizontally to provide a cleaner visual separation. The records are always:
Line1
Line2
Line3
Line4
And are formed with a preg_match_all

Code: Select all

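        preg_match_all($search_info->s_records, $all, $temp_array);
        // $temp_array[0] holds the full matches, $temp_array[1] the first capture group.
        // Both entries exist (as empty arrays) even when nothing matched, so test for emptiness.
        if (empty($temp_array[1])) { $results .= "Unable to parse records with " . $search_info->lasturl . "\n"; return; }
        $records = $temp_array[1];
        foreach ($records as $record) {
           // convert html to text
           // parse text
        }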
 
Peter
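
For readers following along, Steps 3 and 4 might be sketched like this. The HTML fragment, the "rec" class name and the record pattern are invented for the example; the real pattern would be whatever is stored in $search_info->s_records.

```php
<?php
// Hypothetical page fragment and record pattern, for illustration only.
$all = '<div class="rec"><b>Peter Carlson</b><br>12345 Main St<br>Mytown, AA 00000<br>111-222-3333</div>'
     . '<div class="rec"><b>Peter Carlson</b><br>12345 Main St, Suite 5 &amp; 6<br>Mytown, AA 00000<br>(111) 222-3333</div>';
$pattern = '!<div class="rec">(.*?)</div>!s';

preg_match_all($pattern, $all, $temp_array);
$records = $temp_array[1];   // contents of capture group 1, one entry per match

$plain = array();
foreach ($records as $html) {
    $text = preg_replace('!<br\s*/?>!i', "\n", $html);  // line breaks become newlines
    $text = strip_tags($text);                          // drop the remaining tags
    $text = html_entity_decode($text, ENT_QUOTES);      // e.g. &amp; becomes &
    $plain[] = trim($text);
}

print_r($plain);
```

Each element of $plain is then one record as newline-separated plain text, which is the form the later steps work on.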
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: post-processor for screen scraper

Post by prometheuzz »

Instead of matching all at once, try matching each field in a separate match:

Code: Select all

#!/usr/bin/php
<?php
$tests = array(
"Peter Carlson        
12345 Main St                       
Mytown, AA, 00000                 
111-222-3333"
,
"Peter Carlson
111-222-3333
12345 Main St 
Mytown, AA 0000"
,
"Peter Carlson
Extra Stuff Here
12345 Main St
Mytown, AA 0000"
);
 
foreach($tests as $t) {
    $name = ""; $street = ""; $city = ""; $telephone = "";
    if(preg_match('/^([^\n]+)/', $t, $match)) $name = $match[1];
    if(preg_match('/(\d+\s+[a-zA-Z]+(?:[^\n]+[a-zA-Z]+)+)/', $t, $match)) $street = $match[1];
    if(preg_match('/([a-zA-Z]+,\s+[A-Z]{2},?\s+\d+)/', $t, $match)) $city = $match[1];
    if(preg_match('/(\d+-\d+-\d+)/', $t, $match)) $telephone = $match[1];
    print "\n$name\n$street\n$city\n$telephone\n";
}
/* output:
            Peter Carlson        
            12345 Main St
            Mytown, AA, 00000
            111-222-3333
 
            Peter Carlson
            12345 Main St
            Mytown, AA 0000
            111-222-3333
 
            Peter Carlson
            12345 Main St
            Mytown, AA 0000
*/
?>
howudodat
Forum Newbie
Posts: 4
Joined: Wed May 14, 2008 7:54 am

Re: post-processor for screen scraper

Post by howudodat »

I see what you are doing and it makes sense; however, the phone # may not always be NPA-NXX-NNNN, it might be (NPA) NXX-NNNN.
The city, state, zip line is sometimes missing one of the 3 variables, and address lines may or may not contain suite #'s.

So if I can place the rows in a fixed order, I can then parse each row with multiple regex statements to match the various formats peculiar to that expected line.

Peter
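
A rough sketch of the "fixed order" idea: classify each line loosely, drop it into a named slot, and emit the slots in one order, leaving the detailed field parsing for later. The normalize_record() helper and its three classification patterns are assumptions made for this example (including the assumption that the first line is always the name), not something taken from the thread.

```php
<?php
// Hypothetical helper: put the lines of one record into a fixed order
// (name, street, city, phone). Lines that fit no slot are dropped.
function normalize_record($record) {
    $slots = array('name' => '', 'street' => '', 'city' => '', 'phone' => '');
    $lines = preg_split('/\r?\n/', trim($record));
    // Assumption: the first line is always the name.
    $slots['name'] = trim(array_shift($lines));
    foreach ($lines as $line) {
        $line = trim($line);
        if (preg_match('/^(?:\d{3}-|\(\d{3}\)\s)\d{3}-\d{4}$/', $line)) {
            $slots['phone'] = $line;    // NPA-NXX-NNNN or (NPA) NXX-NNNN
        } elseif (preg_match('/^\d+\s+\S/', $line)) {
            $slots['street'] = $line;   // starts with a house number
        } elseif (preg_match('/,\s*[A-Z]{2}\b/', $line)) {
            $slots['city'] = $line;     // contains ", ST"
        }
    }
    return implode("\n", $slots);
}

$second = "Peter Carlson\n111-222-3333\n12345 Main St\nMytown, AA 0000";
$third  = "Peter Carlson\nExtra Stuff Here\n12345 Main St\nMytown, AA 0000";

$norm_second = normalize_record($second);
$norm_third  = normalize_record($third);

print $norm_second . "\n\n" . $norm_third . "\n";
```

A missing field leaves an empty line in its slot, so every normalized record still has four lines in the same order.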
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: post-processor for screen scraper

Post by prometheuzz »

howudodat wrote:I see what you are doing and it makes sense, however the phone # may not always be NPA-NXX-NNNN it might be (NPA) NXX-NNNN
...
That can be overcome by extending the example I posted:

Code: Select all

'/\d{3}-\d{3}-\d{4}/'      // matches: NPA-NXX-NNNN
'/\(\d{3}\)\s\d{3}-\d{4}/' // matches: (NPA) NXX-NNNN
Observe that both end in \d{3}-\d{4}, so your telephone regex might look like:

Code: Select all

'/(\d{3}-|\(\d{3}\)\s)\d{3}-\d{4}/' // matches: NPA-NXX-NNNN or (NPA) NXX-NNNN
In normal English, the final regex would read as:

Code: Select all

(   
  \d{3}-        // three digits followed by a hyphen
  |             // OR
  \(\d{3}\)\s   // a '(' then three digits and a ')' followed by a white space character
)
\d{3}-\d{4}     // and ending with three digits, a hyphen and 4 digits
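
A quick way to check the combined pattern is to run it against both formats plus one string it should reject. The sample strings are made up for the test; the regex is the one from the post.

```php
<?php
$phone_re = '/(\d{3}-|\(\d{3}\)\s)\d{3}-\d{4}/';

// Two formats that should match and one (space-separated) that should not.
$samples = array('111-222-3333', '(111) 222-3333', '111 222 3333');

$results = array();
foreach ($samples as $s) {
    // Keep the matched text, or null when the pattern does not match.
    $results[$s] = preg_match($phone_re, $s, $m) ? $m[0] : null;
}

var_export($results);
```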
howudodat wrote:So if I can place the rows in a fixed order, I can then parse a row with multiple regex statements to match various formats peculiar to that expected line.

Peter
I presume you can make some progress now. If you run into problems, feel free to ask a specific question about it.

Good luck.
howudodat
Forum Newbie
Posts: 4
Joined: Wed May 14, 2008 7:54 am

Re: post-processor for screen scraper

Post by howudodat »

ok I will play with what you have given.
Just as a final question on this:
in perl I can do something like
$str =~ s/<a[ ]+.*?>(.+)<\/a>/$1/ig;
(please note I don't need something that deletes hyperlinks, I'm just using it as an example)
Is it possible to do the same thing in PHP using only one variable to store the entire regex?

Peter
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: post-processor for screen scraper

Post by prometheuzz »

howudodat wrote:ok I will play with what you have given.
Just as a final question on this:
in perl I can do something like
$str =~ s/<a[ ]+.*?>(.+)<\/a>/$1/ig;
(please note I don't need something that deletes hyperlinks, I'm just using it as an example)
Is it possible to do the same thing in PHP using only one variable to store the entire regex?

Peter
PHP's equivalent would be:

Code: Select all

$str = preg_replace('/<a[ ]+.*?>(.+)<\/a>/i', "$1", $str);
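Note that PHP has no g (global) modifier and does not need one: preg_replace replaces every match by default (its optional fourth argument, $limit, defaults to -1, meaning no limit).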
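
If the whole rule really has to live in one variable (one db column, say), one option is to split a Perl-style string into its parts and hand them to preg_replace. This is only a sketch: apply_substitution() is a hypothetical helper, it assumes "/" is the delimiter with any literal "/" written as "\/", it accepts only the g, i, m, s and x flags, and it should only ever be fed rules you trust.

```php
<?php
// Hypothetical helper: apply a Perl-style "s/search/replace/flags" rule held in one string.
// Assumes "/" is the delimiter and that a literal "/" inside the rule is written "\/".
function apply_substitution($rule, $subject) {
    $part = '((?:\\\\.|[^\\\\/])*)';   // escaped pairs, or anything but "\" and "/"
    if (!preg_match('~^s/' . $part . '/' . $part . '/([gimsx]*)$~s', $rule, $m)) {
        return null;                    // not a rule this sketch understands
    }
    list(, $search, $replace, $flags) = $m;
    $flags   = str_replace('g', '', $flags);                       // PHP has no g modifier
    $replace = strtr($replace, array('\/' => '/', '\n' => "\n"));  // unescape the template
    return preg_replace('/' . $search . '/' . $flags, $replace, $subject);
}

$rule = 's/<a[ ]+.*?>(.+)<\/a>/$1/ig';   // the Perl example, stored as one string
$out  = apply_substitution($rule, 'See <a href="x.html">this page</a> now');

print $out . "\n";
```

The search part is passed through unchanged, so it keeps its "\/" escapes and can be wrapped in "/" delimiters directly; only the replacement template needs unescaping.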

I know it was just an example, but I would write the above as:

Code: Select all

$str = preg_replace('!<a\s+[^>]+>([^<]+)</a>!i', "$1", $str);
Using '!' instead of '/' as the delimiter means I do NOT have to escape the '/' (HTML-like data has a lot of '/'s in it).
Also, be careful with greedy dot matches (.+ or .*): you can get strange results, and on large strings they can make your regex slow. In this case, negated character classes are fine instead of .* or .+: [^>]+ for the tag's attributes and [^<]+ for the link text (a class that could also consume the "<" of "</a>" would stop working once made possessive). Making them possessive ([^>]++ and [^<]++) is even better for performance; a possessive quantifier is practically the same as "Atomic Grouping":
http://www.regular-expressions.info/atomic.html
Although that last part might be a bit too much info...
; )
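
To see the "strange results" from a greedy dot, compare the two patterns on a string that contains two links. The sample HTML is made up for the demonstration, and the second pattern uses the possessive negated classes [^>]++ and [^<]++.

```php
<?php
$html = '<a href="a.html">one</a> and <a href="b.html">two</a>';

// Greedy dot: (.+) runs on to the LAST </a>, so the text between the links is swallowed
// into a single match.
$greedy = preg_replace('/<a[ ]+.*?>(.+)<\/a>/i', '$1', $html);

// Possessive negated classes: each match stays inside one link.
$tight = preg_replace('!<a\s+[^>]++>([^<]++)</a>!i', '$1', $html);

print $greedy . "\n";   // one</a> and <a href="b.html">two
print $tight . "\n";    // one and two
```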