preg_match_all stops in the middle of the string

Postby bigzero » Tue Jun 30, 2009 2:46 pm

Hello everyone. I was working on this project where I had to extract company names and addresses from a large text file and everything seemed to be working fine. But then I noticed that preg_match_all was "stopping" after it encountered a certain section in the file. If I remove that section, it continues to match about 600 more companies.

This is my regex:
Syntax:
%^\\s{2}(?#company name)(\\S.*?)[ ]{3,}.*\\n\\s*(?#optional street address)(?:(\\S.*)\\s+/\\s+)?(?#city)([^/]+?),\\s*(?#state)(\\w+)\\s+(?#5 digit zip)(\\d+(?:-\\d+)?)\\s*$%m


Shorter, without comments:
Syntax:
%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m


Without double slashes for easier readability:
Syntax:
%^\s{2}(\S.*?)[ ]{3,}.*\n\s*(?:(\S.*)\s+/\s+)?([^/]+?),\s*(\w+)\s+(\d+(?:-\d+)?)\s*$%m


I wrote a php script on my server to show what happens:
Syntax:
<?php
$data = file_get_contents('test1.txt');
 
$regex = '%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m';
 
echo htmlentities($regex);
echo '<br /><br />';
echo htmlentities($data);
echo '<br /><br />';
 
preg_match_all($regex,$data,$matches);
 
$matches[0] = null;
 
print_r($matches);
?>


The first script tries to parse the text file with the section of text that causes preg_match_all to stop (lines 3 through 28). The second script parses the text file without that section.

The way I understand regex and preg_match_all is that even if there is something in that section of text that does not match, or matches from there to the end of file for instance, it should still be matching the text after. (I hope that made sense). That's why I don't think there's anything wrong with the regex... but I could be wrong.
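One thing that can be ruled out here is the engine giving up rather than running out of matches. preg_match_all returns false on failure, and preg_last_error() reports why. The following is only a minimal sketch: the subject and pattern are hypothetical and chosen because they backtrack heavily, and pcre.backtrack_limit is lowered artificially so the failure is easy to reproduce. Nothing in this thread confirms that this is what happens with test1.txt.

```php
<?php
// Hypothetical input and pattern, chosen only because they backtrack heavily.
$subject = 'foobar foobar foobar';
$pattern = '/(?:\D+|<\d+>)*[!?]/';

// Assumption: the limit is lowered artificially to make the failure reproducible.
ini_set('pcre.backtrack_limit', '100');
ini_set('pcre.jit', '0'); // use the interpreter path; has no effect if JIT is unavailable

$result = preg_match_all($pattern, $subject, $matches);

if ($result === false) {
    // The engine gave up; this is different from "no (more) matches".
    $reason = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
        ? 'backtrack limit exhausted'
        : 'PCRE error code ' . preg_last_error();
    echo "preg_match_all failed: $reason\n";
} else {
    echo "preg_match_all found $result match(es)\n";
}
```

Running the same check right after the preg_match_all call in the script above would show whether the function finished normally or stopped with an error.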

Can anyone explain this behavior?

Thank you in advance!
Attachments
regex_test.zip
text files and php scripts
bigzero
Forum Newbie
 
Posts: 6
Joined: Tue Jun 30, 2009 2:26 pm

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Tue Jun 30, 2009 2:56 pm

bigzero wrote:... That's why I don't think there's anything wrong with the regex... but I could be wrong.


Well, there might not be something wrong with your regex, but if the output is unexpected, there's definitely something wrong: most probably with your understanding of regex. ; )

Anyway, given your example text:

Syntax:
 PROMETHEAN BOOKS                                                                                                                    01/24/08
         NEW ORLEANS, LA  70119
 
         FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED
 
  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08
         DOMICILE:  TEXAS
  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08
         DOMICILE: TEXAS
  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08
         DOMICILE:  TEXAS
 
         AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON
 
  IMTT-VIRGINIA                                                                                                                       01/22/08
         DOMICILE:  DELAWARE
         FROM: IMTT-CHESAPEAKE
  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08
         DOMICILE:  DELAWARE
         FROM: MTL OPEN SYSTEM TECHNOLOGIES LP
 
         TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED
 
  A2D LP                                                                                                                              01/22/08
         DOMICILE:  TEXAS
 
         DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED
 
  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
         8224 LEE ST. / SORRENTO, LA 70778
  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
         584 BARBARA PLACE / MANDEVILLE, LA 70448
  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
         908 SOUTH 15TH STREET / MONROE, LA  71202


what part(s) of that text should be captured? And what parts get captured instead?
prometheuzz
Forum Regular
 
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: preg_match_all stops in the middle of the string

Postby bigzero » Tue Jun 30, 2009 3:04 pm

The regex should capture lines 1-2; 29-30; 31-32; 33-34

In the test1.txt it only captures 1-2

In the test2.txt where lines 3-28 are removed, it captures everything as expected.

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Tue Jun 30, 2009 3:36 pm

bigzero wrote:The regex should capture lines 1-2; 29-30; 31-32; 33-34

In the test1.txt it only captures 1-2

In the test2.txt where lines 3-28 are removed, it captures everything as expected.


Okay.
I'm sorry to say but your regex is pretty much unreadable (incomprehensible) to me.
But, looking at the differences between the lines you want to match and the ones to ignore, the following occurred to me:

You're interested in two successive lines where the first starts with two spaces and the second ends with five digits.

The above observation holds for the sample text you provided as you can see from the SSCCE:

Syntax:
$text = "  PROMETHEAN BOOKS                                                                                                                    01/24/08
         NEW ORLEANS, LA  70119
 
         FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED
 
  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08
         DOMICILE:  TEXAS
  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08
         DOMICILE: TEXAS
  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08
         DOMICILE:  TEXAS
 
         AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON
 
  IMTT-VIRGINIA                                                                                                                       01/22/08
         DOMICILE:  DELAWARE
         FROM: IMTT-CHESAPEAKE
  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08
         DOMICILE:  DELAWARE
         FROM: MTL OPEN SYSTEM TECHNOLOGIES LP
 
         TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED
 
  A2D LP                                                                                                                              01/22/08
         DOMICILE:  TEXAS
 
         DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED
 
  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
         8224 LEE ST. / SORRENTO, LA 70778
  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
         584 BARBARA PLACE / MANDEVILLE, LA 70448
  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
         908 SOUTH 15TH STREET / MONROE, LA  71202"
;
         
preg_match_all('/^[ ]{2}.*\r?\n.*\d{5}$/m', $text, $matches);
 
print_r($matches);
 
/* matches the lines 1-2, 29-30, 31-32 and 33-34
 
Array
(
    [0] => Array
        (
            [0] =>   PROMETHEAN BOOKS                                                                                                                    01/24/08
         NEW ORLEANS, LA  70119
            [1] =>   A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
         8224 LEE ST. / SORRENTO, LA 70778
            [2] =>   A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
         584 BARBARA PLACE / MANDEVILLE, LA 70448
            [3] =>   A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
         908 SOUTH 15TH STREET / MONROE, LA  71202
        )
 
)
*/


If my observation is incorrect (i.e. you over-simplified your example text), then please post a proper explanation (in plain English!) of what exactly it is you want to match, since I am unable to untangle your regex. ; )

Best of luck!

Re: preg_match_all stops in the middle of the string

Postby bigzero » Tue Jun 30, 2009 3:55 pm

Alright, here's the breakdown:

The regex needs to match the following format

Syntax:
^\s{2}(\S.*?)[ ]{3,}.*\n

Two spaces at the beginning of the line. Capture everything until at least 3 spaces are encountered. Skip everything until new line.

Syntax:
\s*(?:(\S.*)\s+/\s+)?

Ignore white spaces at the beginning of the line. Then capture a street address if it is present. Street address ends with a forward slash ( / ), but there may be forward slashes inside the street address. It is, however, the last forward slash on that line.

Syntax:
([^/]+?),

Capture the city, which ends with a comma ( , )

Syntax:
\s*(\w+)\s+

Capture the state (there may be a variable number of spaces before and after the state)

Syntax:
(\d+(?:-\d+)?)

Capture the zip code, which may have the -XXXX extension

Syntax:
\s*$

Anchor to the end of line. Could have spaces after zip.
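To illustrate the street-address step in the breakdown above, here is a small sketch using only the street and city pieces of the pattern. The address line is made up (it is not from test1.txt) and has a slash inside the street part, to show that the greedy (\S.*) runs up to the last " / " on the line.

```php
<?php
// Hypothetical address line with a slash inside the street part.
$line = '         12 A/B ROAD / SORRENTO, LA  70778';

// Street and city pieces of the pattern described above, with % as delimiter.
preg_match('%^\s*(?:(\S.*)\s+/\s+)?([^/]+?),%', $line, $m);

echo $m[1], "\n"; // street: 12 A/B ROAD
echo $m[2], "\n"; // city:   SORRENTO
```

Because .* is greedy and \s also matches newlines, this convenience comes at a price: on multi-line input the pieces around it can reach further than a single line.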


You are right, my example text did not include all possible cases. Here's one that shows something different:
Syntax:
 ALLIANCE FLOORING, INC                                                                                               36646125D  01/23/08
         4711 WEST METAIRIE AVENUE / METAIRIE, LA  70001-0000


Regardless of all possible cases, however, the regex captures all the expected cases, BUT something makes the preg_match_all stop matching when it encounters that chunk of text that I mentioned before. Here is my understanding of how preg_match_all works... it's probably more optimized than this though.

Go through text one character at a time. Try to match it from there trying various lengths of matches for + and * operators. Store all matches and captures in an array. Then step to the next character and repeat.

If this is in fact roughly how preg_match_all works, then it should match the lines at the end of file, no matter what I put before them. This obviously breaks.
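The scanning model described above can be checked with a small sketch (made-up input, not the test1.txt data). As far as I know, preg_match_all does not restart at every character: after a successful match it resumes where that match ended, so matches never overlap. Text that simply fails to match does not hide later matches, but a match that runs too far does consume text that a later match would have needed.

```php
<?php
// Matches never overlap: after "aa" at offset 0 the search resumes at offset 2.
preg_match_all('/aa/', 'aaaa', $pairs);
print_r($pairs[0]); // two matches, not three

// Hypothetical sample: non-matching lines before a valid line do not hide it.
$subject = "junk\n  ok 1\nmore junk ###\n  ok 2\n";
$count = preg_match_all('/^[ ]{2}ok[ ]\d$/m', $subject, $found);
print_r($found[0]); // both "  ok 1" and "  ok 2" are found
```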

What's wrong with my thinking here?

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Wed Jul 01, 2009 5:03 am

Thanks for the explanation!

The pattern you describe does indeed match each of the four entries you want when they are tested separately. But the fact that it doesn't grab the last three when matching the entire text means that your regex, at a certain point, matches things it shouldn't, which prevents the three later valid entries from matching. Where exactly this happens can only be found by carefully debugging your pattern (which I'm not eager to do... sorry). But I wouldn't spend too much time on that since it isn't a very good pattern: it matches too greedily in certain cases* and you match a character like '/' as part of the address while that character also occurs in date strings in your text. I'm not saying that your fault lies there, but those are the places where things might start to go wrong. Fixing your current regex may very well lead to finding some other bug at a later time.

* \s not only matches white spaces, but also new line characters, while it is important (in your text) that these are to be treated differently.
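A minimal sketch of that difference, using a made-up two-line string rather than the real data: with the m flag, ^\s{2} can be satisfied by a newline plus a single space, while ^[ ]{2} insists on two literal spaces on the same line.

```php
<?php
// Hypothetical sample: an empty first line, then " A" with ONE leading space.
$subject = "\n A";

// \s{2} is satisfied by the newline plus the single space, so this matches.
$loose = preg_match('/^\s{2}A/m', $subject, $m1);

// [ ]{2} requires two literal spaces on the same line, so this does not match.
$strict = preg_match('/^[ ]{2}A/m', $subject, $m2);

var_dump($loose, $strict); // int(1) and int(0)
```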

If I were you, I'd simply use a more "strict" regex where "match-overflows" will not occur instead of trying to fix the leak in your current one:

What you're looking for is this:

Syntax:
^[ ]{2}                 // Lines that start with 2 spaces;
 
(\S(?:(?![ ]{3}).)+)    // After those two spaces, match a non-white-space character
                        // followed by one or more characters, where each of those
                        // characters is consumed only if the position in front of it
                        // is not the start of 3 consecutive spaces; group this match;
 
.*$                     // Consume all characters of the line;
 
\r?\n                   // Match a *nix/MacOS or Windows new line character;
 
^[ ]*                   // Consume the white spaces that are at the start of the next line;
 
(?:([^/\r\n]+)[ ]/)?    // Optionally match, and group, one or more characters other than a forward
                        // slash and new line characters. This should be followed by a space and a
                        // forward slash.
 
[ ]                     // Match a white space
 
([^,\r\n]+),            // Match one or more characters other than a comma and new line
                        // characters followed by a comma.
 
[ ]*                    // Match zero or more white spaces
 
([A-Z]+)                // State
 
[ ]+                    // Match one or more white spaces
 
(\d+(?:-\d+)?)          // ZIP code
 
[ ]*$                   // Match zero or more white spaces followed by the end of the line.


Which will look like this in code:

Syntax:
$text = "  PROMETHEAN BOOKS                                                                                                                    01/24/08
         NEW ORLEANS, LA  70119
 
         FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED
 
  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08
         DOMICILE:  TEXAS
  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08
         DOMICILE: TEXAS
  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08
         DOMICILE:  TEXAS
 
         AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON
 
  IMTT-VIRGINIA                                                                                                                       01/22/08
         DOMICILE:  DELAWARE
         FROM: IMTT-CHESAPEAKE
  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08
         DOMICILE:  DELAWARE
         FROM: MTL OPEN SYSTEM TECHNOLOGIES LP
 
         TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED
 
  A2D LP                                                                                                                              01/22/08
         DOMICILE:  TEXAS
 
         DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED
 
  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
         8224 LEE ST. / SORRENTO, LA 70778
  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
         584 BARBARA PLACE / MANDEVILLE, LA 70448
  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
         908 SOUTH 15TH STREET / MONROE, LA  71202"
;
 
preg_match_all('%
  ^[ ]{2}
  (\S(?:(?![ ]{3}).)+)
  .*$
  \r?\n
  ^[ ]*
  (?:([^/\r\n]+)[ ]/)?
  [ ]
  ([^,\r\n]+),
  [ ]*([A-Z]+)[ ]+
  (\d+(?:-\d+)?)
  [ ]*$
%mx'
, $text, $matches);
 
echo '<pre>';
print_r($matches);
echo '</pre>';
 
/* output:
 
Array
(
    [0] => Array
        (
            [0] =>   PROMETHEAN BOOKS                                                                                                                    01/24/08
         NEW ORLEANS, LA  70119
            [1] =>   A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
         8224 LEE ST. / SORRENTO, LA 70778
            [2] =>   A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
         584 BARBARA PLACE / MANDEVILLE, LA 70448
            [3] =>   A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
         908 SOUTH 15TH STREET / MONROE, LA  71202
        )
 
    [1] => Array
        (
            [0] => PROMETHEAN BOOKS
            [1] => A AND A'S REAL ESTATE, LLC
            [2] => A BARKING DOG PRODUCTION, LLC
            [3] => A R BROOKS PAINT CONTRACTING, L.L.C.
        )
 
    [2] => Array
        (
            [0] =>
            [1] => 8224 LEE ST.
            [2] => 584 BARBARA PLACE
            [3] => 908 SOUTH 15TH STREET
        )
 
    [3] => Array
        (
            [0] => NEW ORLEANS
            [1] => SORRENTO
            [2] => MANDEVILLE
            [3] => MONROE
        )
 
    [4] => Array
        (
            [0] => LA
            [1] => LA
            [2] => LA
            [3] => LA
        )
 
    [5] => Array
        (
            [0] => 70119
            [1] => 70778
            [2] => 70448
            [3] => 71202
        )
 
)
 
*/

Re: preg_match_all stops in the middle of the string

Postby bigzero » Wed Jul 01, 2009 12:22 pm

Thanks for trying to figure this out!

I copy pasted what you had into my php script and it's not matching anything for me. See here and here.

The regex looks correct to me, which makes me even more confused. I did find a tiny mistake, but it should not make any difference anyway.

Syntax:
(\S(?:(?![ ]{3}).)+)

This would capture 2 spaces at the end in addition to the company name. I know how to fix that, so it's not a problem, but I am still confused why your regex does not match anything and my original regex stops in the middle of the file.

Thanks again

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Wed Jul 01, 2009 1:21 pm

bigzero wrote:Thanks for trying to figure this out!


No problem.

bigzero wrote:I copy pasted what you had into my php script and it's not matching anything for me. See here and here.

The regex looks correct to me, which makes me even more confused. I did find a tiny mistake, but it should not make any difference anyway.


The forum software somehow transformed many of the double spaces at the start of the line into single spaces... Let me try again:

Syntax:
$text = "  PROMETHEAN BOOKS                                                                                                                    01/24/08
        NEW ORLEANS, LA  70119
 
        FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED
 
  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08
        DOMICILE:  TEXAS
  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08
        DOMICILE: TEXAS
  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08
        DOMICILE:  TEXAS
 
        AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON
 
  IMTT-VIRGINIA                                                                                                                       01/22/08
        DOMICILE:  DELAWARE
        FROM: IMTT-CHESAPEAKE
  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08
        DOMICILE:  DELAWARE
        FROM: MTL OPEN SYSTEM TECHNOLOGIES LP
 
        TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED
 
  A2D LP                                                                                                                              01/22/08
        DOMICILE:  TEXAS
 
        DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED
 
  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
        8224 LEE ST. / SORRENTO, LA 70778
  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
        584 BARBARA PLACE / MANDEVILLE, LA 70448
  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
        908 SOUTH 15TH STREET / MONROE, LA  71202"
;
 
 
preg_match_all('%
  ^[ ]{2}
  (\S(?:(?![ ]{3}).)+)
  .*$
  \r?\n
  ^[ ]*
  (?:([^/\r\n]+)[ ]/)?
  [ ]
  ([^,\r\n]+),
  [ ]*([A-Z]+)[ ]+
  (\d+(?:-\d+)?)
  [ ]*$
%mx'
, $text, $matches);
 
echo '<pre>';
print_r($matches);
echo '</pre>';


Which produces the following output on both my web server and my PHP CLI:

Syntax:
Array
(
    [0] => Array
        (
            [0] =>   PROMETHEAN BOOKS                                                                                                                    01/24/08
        NEW ORLEANS, LA  70119
            [1] =>   A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
        8224 LEE ST. / SORRENTO, LA 70778
            [2] =>   A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
        584 BARBARA PLACE / MANDEVILLE, LA 70448
            [3] =>   A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
        908 SOUTH 15TH STREET / MONROE, LA  71202
        )
 
    [1] => Array
        (
            [0] => PROMETHEAN BOOKS
            [1] => A AND A'S REAL ESTATE, LLC
            [2] => A BARKING DOG PRODUCTION, LLC
            [3] => A R BROOKS PAINT CONTRACTING, L.L.C.
        )
 
    [2] => Array
        (
            [0] =>
            [1] => 8224 LEE ST.
            [2] => 584 BARBARA PLACE
            [3] => 908 SOUTH 15TH STREET
        )
 
    [3] => Array
        (
            [0] => NEW ORLEANS
            [1] => SORRENTO
            [2] => MANDEVILLE
            [3] => MONROE
        )
 
    [4] => Array
        (
            [0] => LA
            [1] => LA
            [2] => LA
            [3] => LA
        )
 
    [5] => Array
        (
            [0] => 70119
            [1] => 70778
            [2] => 70448
            [3] => 71202
        )
 
)


bigzero wrote:
Syntax:
(\S(?:(?![ ]{3}).)+)

This would capture 2 spaces at the end in addition to the company name. ...


No, that look-ahead will not be a part of the match. See this snippet:

Syntax:
$text = 'ABC DEF  GHI    JKL';
if(preg_match('/(\S(?:(?![ ]{3}).)+)/', $text, $match)) {
  echo '>' . $match[0] . '<';
}
 
// output:
//             >ABC DEF  GHI<

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Wed Jul 01, 2009 1:35 pm

bigzero wrote:Thanks for trying to figure this out!

I copy pasted what you had into my php script and it's not matching anything for me. See here and here ...


I tried your pages and indeed THEY don't work.
But like I said, my web server produces the correct output, my command-line PHP interpreter produces the same, and the online tool I often use also produces the exact same output... So I am fairly sure you're doing something incorrectly...
This is the tool I used:

http://regex.larsolavtorvik.com/

Note that it uses the '/' as a delimiter so you'll have to escape those characters in your regex. Here is my regex with the '/'-s already escaped and the multi-line-flag also enabled:

Syntax:
(?m)^[ ]{2}(\S(?:(?![ ]{3}).)+).*$\r?\n^[ ]*(?:([^\/\r\n]+)[ ]\/)?[ ]([^,\r\n]+),[ ]*([A-Z]+)[ ]+(\d+(?:-\d+)?)[ ]*$


You only have to copy and paste this regex and the example text and you'll see it really works.
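The point about delimiters can be seen in a short sketch. It uses one address line from the sample text and only the street part of the pattern: with '/' as the delimiter every literal slash has to be escaped, while with '%' the slashes can be written as-is, and both forms produce the same result.

```php
<?php
$line = '8224 LEE ST. / SORRENTO, LA 70778'; // address line from the sample text

// "/" as delimiter: every literal slash in the pattern must be escaped.
preg_match('/^([^\/]+)[ ]\/[ ]/', $line, $a);

// "%" as delimiter: slashes can be written as-is.
preg_match('%^([^/]+)[ ]/[ ]%', $line, $b);

var_dump($a === $b);  // bool(true)
echo $a[1], "\n";     // 8224 LEE ST.
```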

Good luck.

Re: preg_match_all stops in the middle of the string

Postby bigzero » Wed Jul 01, 2009 2:18 pm

Does the following regex produce matches on your web server or your PHP-CLI?

Syntax:
/(?m)^[ ]{2}(\S.*?)[ ]{3}.*\r?\n[ ]*(?:(\S.*)[ ]+\/[ ]+)?([^\/\r\n]+?)[ ]*,[ ]*(\w+)[ ]+(\d+(?:-\d+)?)(?: 8B)?[ ]*$/


I am checking it against test1.txt

The regex interpreter you linked to is very nice, but I had been using a different one in the past, so I decided to check the regex with the other one as well.

The regex above works fine on http://regex.larsolavtorvik.com/ but does not work on http://www.solmetra.com/scripts/regex/index.php

It doesn't work on my server either.

My phpinfo can be found here

Thanks for everything!
Sorry to keep bothering you, but this is really annoying me :(

EDIT:
P.S. I have all the flags turned off when testing on regex.larsolavtorvik.com

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Wed Jul 01, 2009 2:51 pm

I have used many such online regex testers and found bugs in almost all of them, so my guess is that the tool you linked to is just broken.
I have just tested my regex with Java 1.6's java.util.regex package and with Perl 5.10.0, and both produce exactly what my PHP CLI and web server produce: all four matches are found (with the same regex).

You can test it yourself:

Java:

Syntax:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
   
    public static void main(String[] args) {
   
        String text = "  PROMETHEAN BOOKS                                                                                                                    01/24/08"+"\n"+
                "        NEW ORLEANS, LA  70119"+"\n"+
                " "+"\n"+
                "        FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED"+"\n"+
                " "+"\n"+
                "  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08"+"\n"+
                "        DOMICILE:  TEXAS"+"\n"+
                "  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08"+"\n"+
                "        DOMICILE: TEXAS"+"\n"+
                "  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08"+"\n"+
                "        DOMICILE:  TEXAS"+"\n"+
                " "+"\n"+
                "        AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON"+"\n"+
                " "+"\n"+
                "  IMTT-VIRGINIA                                                                                                                       01/22/08"+"\n"+
                "        DOMICILE:  DELAWARE"+"\n"+
                "        FROM: IMTT-CHESAPEAKE"+"\n"+
                "  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08"+"\n"+
                "        DOMICILE:  DELAWARE"+"\n"+
                "        FROM: MTL OPEN SYSTEM TECHNOLOGIES LP"+"\n"+
                " "+"\n"+
                "        TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED"+"\n"+
                " "+"\n"+
                "  A2D LP                                                                                                                              01/22/08"+"\n"+
                "        DOMICILE:  TEXAS"+"\n"+
                " "+"\n"+
                "        DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED"+"\n"+
                " "+"\n"+
                "  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08"+"\n"+
                "        8224 LEE ST. / SORRENTO, LA 70778"+"\n"+
                "  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08"+"\n"+
                "        584 BARBARA PLACE / MANDEVILLE, LA 70448"+"\n"+
                "  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08"+"\n"+
                "        908 SOUTH 15TH STREET / MONROE, LA  71202";
       
        String regex = "(?m)^[ ]{2}(\\S(?:(?![ ]{3}).)+).*$\\r?\\n^[ ]*(?:([^/\\r\\n]+)[ ]/)?[ ]([^,\\r\\n]+),[ ]*([A-Z]+)[ ]+(\\d+(?:-\\d+)?)[ ]*$";
       
        Matcher m = Pattern.compile(regex).matcher(text);
       
        while(m.find()) {
            System.out.println(m.group(1));
            System.out.println(m.group(2));
            System.out.println(m.group(3));
            System.out.println(m.group(4));
            System.out.println();
        }
    }
}
 
/* command line output:
PROMETHEAN BOOKS
null
NEW ORLEANS
LA
 
A AND A'S REAL ESTATE, LLC
8224 LEE ST.
SORRENTO
LA
 
A BARKING DOG PRODUCTION, LLC
584 BARBARA PLACE
MANDEVILLE
LA
 
A R BROOKS PAINT CONTRACTING, L.L.C.
908 SOUTH 15TH STREET
MONROE
LA
*/
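
The Java loop above prints only groups 1 to 4, and the optional street group shows up as null when it does not take part in the match. A minimal sketch, assuming you also want the zip code (group 5) and an empty string instead of null. The pattern is the same one as above; the class name OptionalGroupDemo, the extract helper, the SAMPLE constant, and the shortened sample text (column padding reduced) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupDemo {

    // Same pattern as in the example above; group 2 (street) is optional.
    static final Pattern COMPANY = Pattern.compile(
        "(?m)^[ ]{2}(\\S(?:(?![ ]{3}).)+).*$\\r?\\n^[ ]*(?:([^/\\r\\n]+)[ ]/)?[ ]([^,\\r\\n]+),[ ]*([A-Z]+)[ ]+(\\d+(?:-\\d+)?)[ ]*$");

    // Abbreviated sample (assumption: fewer padding spaces than the real file).
    static final String SAMPLE =
        "  PROMETHEAN BOOKS                  01/24/08\n" +
        "        NEW ORLEANS, LA  70119\n" +
        "  A2D LP                            01/22/08\n" +
        "        DOMICILE:  TEXAS\n" +
        "  A AND A'S REAL ESTATE, LLC        36645875K  01/24/08\n" +
        "        8224 LEE ST. / SORRENTO, LA 70778";

    // Returns one "name|street|city|state|zip" row per matched company.
    static List<String> extract(String text) {
        List<String> rows = new ArrayList<>();
        Matcher m = COMPANY.matcher(text);
        while (m.find()) {
            // group(2) is null when the optional street part did not match.
            String street = m.group(2) == null ? "" : m.group(2);
            rows.add(String.join("|",
                m.group(1), street, m.group(3), m.group(4), m.group(5)));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : extract(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

With the sample above, the A2D LP entry is skipped because its second line has no "city, state zip" part, which is the same behaviour as in the full example.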


Perl:

#!/usr/bin/perl
 
$text = "  PROMETHEAN BOOKS                                                                                                                    01/24/08
        NEW ORLEANS, LA  70119
 
        FOREIGN PARTNERSHIP REGISTRATIONS:                                                                                     REGISTERED
 
  CRAIN II OIL & GAS, LTD.                                                                                             36643740L  01/22/08
        DOMICILE:  TEXAS
  RS - HAMMOND, LA-1, L.P.                                                                                             36643720L  01/22/08
        DOMICILE: TEXAS
  WASKOM GAS PROCESSING COMPANY                                                                                        36641638L  01/18/08
        DOMICILE:  TEXAS
 
        AMENDMENTS TO FOREIGN PARTNERSHIPS:                                                                                         CHANGED ON
 
  IMTT-VIRGINIA                                                                                                                       01/22/08
        DOMICILE:  DELAWARE
        FROM: IMTT-CHESAPEAKE
  STANDARD AUTOMATION & CONTROL LP                                                                                                    01/18/08
        DOMICILE:  DELAWARE
        FROM: MTL OPEN SYSTEM TECHNOLOGIES LP
 
        TERMINATION OF FOREIGN PARTNERSHIPS:                                                                                           FILED
 
  A2D LP                                                                                                                              01/22/08
        DOMICILE:  TEXAS
 
        DOMESTIC LIMITED LIABILITY COMPANIES:                                                                                     FILED
 
  A AND A'S REAL ESTATE, LLC                                                                                           36645875K  01/24/08
        8224 LEE ST. / SORRENTO, LA 70778
  A BARKING DOG PRODUCTION, LLC                                                                                        36643781K  01/22/08
        584 BARBARA PLACE / MANDEVILLE, LA 70448
  A R BROOKS PAINT CONTRACTING, L.L.C.                                                                                 36647552K  01/24/08
        908 SOUTH 15TH STREET / MONROE, LA  71202"
;
 
while ($text =~ m/^[ ]{2}(\S(?:(?![ ]{3}).)+).*$\r?\n^[ ]*(?:([^\/\r\n]+)[ ]\/)?[ ]([^,\r\n]+),[ ]*([A-Z]+)[ ]+(\d+(?:-\d+)?)[ ]*$/gm) {
  print "$1\n$2\n$3\n$4\n$5\n\n";
}
 
# command line output
#
# PROMETHEAN BOOKS
#
# NEW ORLEANS
# LA
# 70119
#
# A AND A'S REAL ESTATE, LLC
# 8224 LEE ST.
# SORRENTO
# LA
# 70778
#
# A BARKING DOG PRODUCTION, LLC
# 584 BARBARA PLACE
# MANDEVILLE
# LA
# 70448
#
# A R BROOKS PAINT CONTRACTING, L.L.C.
# 908 SOUTH 15TH STREET
# MONROE
# LA
# 71202
prometheuzz
Forum Regular
 
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: preg_match_all stops in the middle of the string

Postby bigzero » Wed Jul 01, 2009 3:09 pm

Any chance you could post what version of PHP and PCRE you're using?
bigzero
Forum Newbie
 
Posts: 6
Joined: Tue Jun 30, 2009 2:26 pm

Re: preg_match_all stops in the middle of the string

Postby prometheuzz » Wed Jul 01, 2009 3:20 pm

bigzero wrote: Any chance you could post what version of PHP and PCRE you're using?


Sure, PHP-CLI:

$ php -i | grep PCRE
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 7.8 2008-09-05


Web server: http://iruimte.nl/php/
PCRE (Perl Compatible Regular Expressions) Support  enabled
PCRE Library Version    7.8 2008-09-05

