preg_match_all stops in the middle of the string
Posted: Tue Jun 30, 2009 2:46 pm
Hello everyone. I was working on this project where I had to extract company names and addresses from a large text file and everything seemed to be working fine. But then I noticed that preg_match_all was "stopping" after it encountered a certain section in the file. If I remove that section, it continues to match about 600 more companies.
This is my regex:
Code:
%^\\s{2}(?#company name)(\\S.*?)[ ]{3,}.*\\n\\s*(?#optional street address)(?:(\\S.*)\\s+/\\s+)?(?#city)([^/]+?),\\s*(?#state)(\\w+)\\s+(?#5 digit zip)(\\d+(?:-\\d+)?)\\s*$%m
Shorter, without comments:
Code:
%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m
Without double slashes for easier readability:
Code:
%^\s{2}(\S.*?)[ ]{3,}.*\n\s*(?:(\S.*)\s+/\s+)?([^/]+?),\s*(\w+)\s+(\d+(?:-\d+)?)\s*$%m
I wrote a PHP script on my server to show what happens:
Code:
<?php
// Load the text file to parse.
$data = file_get_contents('test1.txt');
$regex = '%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m';
// Show the pattern and the input, escaped for HTML output.
echo htmlentities($regex);
echo '<br /><br />';
echo htmlentities($data);
echo '<br /><br />';
preg_match_all($regex, $data, $matches);
// Drop the full-match entries so only the captured groups are printed.
$matches[0] = null;
print_r($matches);
?>
The first script tries to parse the text file with the section of text that causes preg_match_all to stop (lines 3 through 28). The second script parses the text file without that section.
The way I understand regex and preg_match_all is that even if something in that section of text does not match, or matches from there to the end of the file for instance, it should still match the text that comes after it (I hope that makes sense). That's why I don't think there's anything wrong with the regex... but I could be wrong.
Can anyone explain this behavior?
Thank you in advance!
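One check that might help narrow this down, offered as a guess rather than a confirmed diagnosis: if PCRE gives up partway through the file (for example because the `pcre.backtrack_limit` setting is reached), preg_match_all can stop early instead of skipping ahead, and `preg_last_error()` reports the reason. The sketch below only shows how to read that error code. The helper name `describe_preg_error` and the sample entry in `$data` are invented for illustration and are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original script): turn the integer
// returned by preg_last_error() into the name of the matching constant.
function describe_preg_error($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown error code ' . $code;
}

// Same pattern as in the script from the post.
$regex = '%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m';

// Invented sample entry (assumption about the file layout); replace with
// file_get_contents('test1.txt') to test the real data.
$data = "  Acme Corp     555-1234\n    123 Main St / Springfield, IL 62704\n";

$count = preg_match_all($regex, $data, $matches);

// Current PHP versions return false when PCRE reports an error, but the
// error code is printed in every case so nothing is missed.
echo 'matches: ', var_export($count, true),
     ', last error: ', describe_preg_error(preg_last_error()), "\n";
```

If the reported error on the real file turns out to be PREG_BACKTRACK_LIMIT_ERROR, raising the limit with `ini_set('pcre.backtrack_limit', ...)` or tightening the pattern (the `[^/]+?` group can run across line breaks) would be the next things to try. Any other result would point away from this guess.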