
Regexp Crisis

Posted: Fri Jan 25, 2008 7:49 pm
by bdawg
So, for some reason a cron script I use to crawl for some data is choking, and for the life of me I can't figure out why. In the meantime, clients have already noticed the issue and are calling in a panic. Any help you could offer would be much appreciated.

Here's my code snippet:

Code:

 
$futureURL = "http://www1.leg.wa.gov/legislature/showagenda.aspx?chamber=house&start=2008-01-30";
$html = file_get_contents($futureURL);
if (!$html) {
    die("Error retrieving URL");
}
 
preg_match_all('#([^>]+)</h1>(.+?)(<h1|</html>)#is', $html, $outer_matches, PREG_SET_ORDER);
 
$outer_matches has nothing in it when this runs for this date. However, if I use another date in the $futureURL string (2008-02-01), it works fine. I know $html has the data in it -- I checked that.

Any idea what might be causing this to fail? Those two pages look the same to me. Any thoughts would be most helpful. Thanks.

Re: Regexp Crisis

Posted: Fri Jan 25, 2008 9:22 pm
by Chris Corbyn
Is the page you're matching against longer than usual? preg_match and friends have a hard limit on the length of a string they can match. It's compiled into the PCRE library so not easily changed.

Re: Regexp Crisis

Posted: Fri Jan 25, 2008 10:22 pm
by bdawg
I don't think it's significantly larger than usual -- 50KB vs. 30KB maybe.

Re: Regexp Crisis

Posted: Sat Jan 26, 2008 12:26 am
by Chris Corbyn
i.e. roughly 67% longer ;) Perhaps try splitting the page into chunks, then matching against each chunk?
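
A rough sketch of what chunking could look like, assuming every section of the page starts at an `<h1` tag. The helper name `split_sections` and the returned array keys are made up for illustration and are not from the original script. The idea is to cut the page at each heading first, so no single match has to scan lazily across the whole document.

```php
<?php
// Hypothetical helper (not from the thread): split the page at each <h1
// and pull the heading text and the following body out of every chunk.
function split_sections($html)
{
    $sections = array();

    // Zero-width split: every chunk after the first begins with "<h1".
    $chunks = preg_split('#(?=<h1\b)#i', $html, -1, PREG_SPLIT_NO_EMPTY);
    if ($chunks === false) {
        return $sections; // preg_split itself failed
    }

    foreach ($chunks as $chunk) {
        // The lazy part only covers the short heading; the greedy (.*)
        // runs straight to the end of the chunk.
        if (preg_match('#^<h1[^>]*>(.*?)</h1>(.*)#is', $chunk, $m)) {
            $sections[] = array(
                'heading' => trim($m[1]),
                'body'    => $m[2],
            );
        }
    }

    return $sections;
}
```

Anything before the first heading is skipped, and the last section's body still carries the closing `</body></html>` tags, so trim those if they matter.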

Re: Regexp Crisis

Posted: Sat Jan 26, 2008 5:02 am
by JAM
Something else to play around with is using strip_tags to remove everything you don't need in your search, for example:

Code:

   $html = strip_tags(file_get_contents($futureURL), '<h1><html>');

Re: Regexp Crisis

Posted: Sat Jan 26, 2008 11:27 am
by bdawg
I found a solution ... pcre.backtrack_limit was the culprit. That is "100000" by default. I increased that by putting an entry in php.ini and that fixed the problem.

thx.
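
For anyone hitting the same symptom: when the backtrack limit is exceeded, the preg functions fail quietly and the result looks just like "no matches". Below is a minimal sketch of how that could be surfaced with `preg_last_error()`. The helper name `match_all_checked` is hypothetical, and the php.ini value in the comment is only an example, not the value used in the fix above.

```php
<?php
// php.ini equivalent of the fix (the number is an arbitrary example):
//   pcre.backtrack_limit = 1000000

// Hypothetical helper (not from the thread): run preg_match_all and throw
// if PCRE gave up, instead of returning an empty-looking result.
function match_all_checked($pattern, $subject)
{
    $matches = array();
    $result = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $error = preg_last_error();

    if ($result === false || $error !== PREG_NO_ERROR) {
        if ($error === PREG_BACKTRACK_LIMIT_ERROR) {
            throw new RuntimeException('pcre.backtrack_limit exceeded');
        }
        throw new RuntimeException('preg_match_all failed, PCRE error code ' . $error);
    }

    return $matches;
}
```

Calling this in place of the bare `preg_match_all` would have turned the silent empty `$outer_matches` into an explicit error pointing at the limit.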

Re: Regexp Crisis

Posted: Sat Jan 26, 2008 10:52 pm
by Chris Corbyn
bdawg wrote: I found a solution ... pcre.backtrack_limit was the culprit. That is "100000" by default. I increased that by putting an entry in php.ini and that fixed the problem.

thx.
Useful tip, thanks :) I actually never knew that ini directive existed.

Just looking at the manual, it appears that pcre.backtrack_limit is settable with ini_set() too. If you wanted to make the code more portable, you could adjust it in the code itself :)
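
Something along these lines, as a minimal sketch. The limit of 1000000 is an arbitrary example value, and the `$html` string is a small stand-in for the downloaded page, not real data from the site.

```php
<?php
// Raise the limit for this script only (example value, tune as needed).
$previous = ini_set('pcre.backtrack_limit', '1000000');
if ($previous === false) {
    // ini_set() returns false when the directive could not be changed.
    error_log('Could not raise pcre.backtrack_limit');
}

// Stand-in for the page fetched with file_get_contents().
$html = '<h1>House Agenda</h1><p>Item</p></html>';

// Same pattern as in the first post.
$count = preg_match_all(
    '#([^>]+)</h1>(.+?)(<h1|</html>)#is',
    $html,
    $outer_matches,
    PREG_SET_ORDER
);
```

The setting only lasts for the current request, so it needs to run before the regex does.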