This is my regex:
%^\\s{2}(?#company name)(\\S.*?)[ ]{3,}.*\\n\\s*(?#optional street address)(?:(\\S.*)\\s+/\\s+)?(?#city)([^/]+?),\\s*(?#state)(\\w+)\\s+(?#5 digit zip)(\\d+(?:-\\d+)?)\\s*$%m
Shorter, without comments:
%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m
Without the doubled backslashes, for easier readability:
%^\s{2}(\S.*?)[ ]{3,}.*\n\s*(?:(\S.*)\s+/\s+)?([^/]+?),\s*(\w+)\s+(\d+(?:-\d+)?)\s*$%m
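For reference, a minimal sketch showing that the doubled-backslash form and the single-backslash form are the same pattern once PHP has parsed the string literal. It assumes both are written in single-quoted strings, as in the script below; the variable names `$doubled` and `$single` are only for illustration.

```php
<?php
// In a single-quoted PHP string, '\\' collapses to one backslash,
// while '\s', '\S', '\n', '\w' and '\d' are left untouched.
// Both literals therefore hold exactly the same regex.
$doubled = '%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m';
$single  = '%^\s{2}(\S.*?)[ ]{3,}.*\n\s*(?:(\S.*)\s+/\s+)?([^/]+?),\s*(\w+)\s+(\d+(?:-\d+)?)\s*$%m';

var_dump($doubled === $single); // bool(true)
```

This would not hold in a double-quoted string, where `\n` is turned into a real newline character before the regex engine sees it.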
I wrote a PHP script on my server to show what happens:
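<?php
// Load the text to parse.
$data = file_get_contents('test1.txt');
$regex = '%^\\s{2}(\\S.*?)[ ]{3,}.*\\n\\s*(?:(\\S.*)\\s+/\\s+)?([^/]+?),\\s*(\\w+)\\s+(\\d+(?:-\\d+)?)\\s*$%m';
// Show the pattern and the raw input, HTML-escaped for the browser.
echo htmlentities($regex);
echo '<br /><br />';
echo htmlentities($data);
echo '<br /><br />';
preg_match_all($regex,$data,$matches);
// Drop the full-match entries so only the capture groups are printed.
$matches[0] = null;
print_r($matches);
?>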
The first script parses the text file including the section of text that causes preg_match_all to stop (lines 3 through 28 of the file). The second script parses the same text file without that section.
As I understand regex and preg_match_all, even if something in that section of text does not match, or a match runs from there toward the end of the file, for instance, the function should still find the matches in the text after it. (I hope that makes sense.) That's why I don't think there's anything wrong with the regex, but I could be wrong.
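A minimal sketch of the behaviour described above, using made-up sample data and a much simpler, hypothetical pattern (not the real file or regex): a line that cannot match sits between two lines that can, and both are still found. It also checks the return value, since preg_match_all() returns false on failure and preg_last_error() reports the reason; whether that is what happens with the real file is not established here.

```php
<?php
// Hypothetical sample text: a non-matching line between two matching ones.
$data = "id=1\nthis line does not match\nid=2\n";

$count = preg_match_all('%^id=(\d+)$%m', $data, $matches);

if ($count === false) {
    // On failure, preg_last_error() returns a reason code,
    // e.g. PREG_BACKTRACK_LIMIT_ERROR when pcre.backtrack_limit is exceeded.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    // Matching resumes after the non-matching line, so both ids are captured.
    print_r($matches[1]);
}
```

Running the same check around the real call would distinguish "no further matches" from "the engine gave up with an error".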
Can anyone explain this behavior?
Thank you in advance!
