Regexp Crisis

bdawg
Forum Newbie
Posts: 3
Joined: Fri Jan 25, 2008 7:40 pm

Regexp Crisis

Post by bdawg »

So, for some reason a cron script I use to crawl for some data is choking, and for the life of me I can't figure out why. In the meantime, clients have already noticed the issue and are calling in a panic. Any help would be much appreciated.

Here's my code snippet:

Code:

// Agenda page for the date that fails.
$futureURL = "http://www1.leg.wa.gov/legislature/showagenda.aspx?chamber=house&start=2008-01-30";
$html = file_get_contents($futureURL);
if (!$html) {
    die("Error retrieving URL");
}

// Capture each heading, then everything up to the next <h1 or the closing </html>.
preg_match_all('#([^>]+)</h1>(.+?)(<h1|</html>)#is', $html, $outer_matches, PREG_SET_ORDER);
 
$outer_matches has nothing in it when this runs on this date. However, if I use another date in the $futureURL string (2008-02-01), it works. I know $html has the data in it -- I checked that.

Any idea what might be causing this to fail? Those two pages look the same to me. Any thoughts would be most helpful. Thanks.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Regexp Crisis

Post by Chris Corbyn »

Is the page you're matching against longer than usual? preg_match and friends have a hard limit on the length of a string they can match. It's compiled into the PCRE library so not easily changed.
bdawg
Forum Newbie
Posts: 3
Joined: Fri Jan 25, 2008 7:40 pm

Re: Regexp Crisis

Post by bdawg »

I don't think it's significantly larger than usual -- 50KB vs. 30KB maybe.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Regexp Crisis

Post by Chris Corbyn »

i.e. about 67% longer ;) Perhaps you could try splitting the page into chunks, then matching against each chunk separately?
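
A rough sketch of the chunking idea, assuming the page can be cut at every `<h1` and that each piece is small enough to match on its own. The helper name `match_sections()` and the sample markup are made up for illustration, and the pattern is a simplified stand-in, not the one from the original post.

```php
<?php
// Hypothetical helper: cut the page at every "<h1" and match each piece
// separately, so no single preg call has to work through the whole document.
function match_sections($html)
{
    // The zero-width lookahead keeps "<h1" at the start of each chunk.
    $chunks = preg_split('#(?=<h1)#i', $html, -1, PREG_SPLIT_NO_EMPTY);
    if ($chunks === false) {
        return array();
    }

    $sections = array();
    foreach ($chunks as $chunk) {
        // Simplified pattern: the heading text, then the rest of the chunk.
        if (preg_match('#<h1[^>]*>([^<]+)</h1>(.*)#is', $chunk, $m)) {
            $sections[] = array('title' => trim($m[1]), 'body' => trim($m[2]));
        }
    }
    return $sections;
}

// Made-up sample markup, not the real agenda page.
$sample = '<html><h1>Monday</h1><p>Committee A</p><h1>Tuesday</h1><p>Committee B</p></html>';
$sections = match_sections($sample);
```

Each chunk starts at a heading, so the lazy `(.+?)` scan toward the next `<h1` is no longer needed. Whether that is enough to stay under the limits on the real page is untested.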
JAM
DevNet Resident
Posts: 2101
Joined: Fri Aug 08, 2003 6:53 pm
Location: Sweden

Re: Regexp Crisis

Post by JAM »

Something else to play around with is using strip_tags() to drop everything you don't need in your search, for example:

Code:

$html = strip_tags(file_get_contents($futureURL), '<h1><html>');
bdawg
Forum Newbie
Posts: 3
Joined: Fri Jan 25, 2008 7:40 pm

Re: Regexp Crisis

Post by bdawg »

I found a solution ... pcre.backtrack_limit was the culprit. That is "100000" by default. I increased that by putting an entry in php.ini and that fixed the problem.
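
For anyone who hits the same thing: the match failed without any visible error, so all I saw was an empty result. Below is a minimal sketch of how the failure could be caught with preg_last_error(). The wrapper name `checked_match_all()` is made up for illustration; the preg functions and constants are standard PHP.

```php
<?php
// Hypothetical wrapper: run preg_match_all() and fail loudly if PCRE gave up,
// instead of returning what looks like "no matches".
function checked_match_all($pattern, $subject)
{
    $matches = array();
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

    // preg_last_error() reports why the most recent preg call stopped.
    $error = preg_last_error();
    if ($error === PREG_BACKTRACK_LIMIT_ERROR) {
        throw new RuntimeException('pcre.backtrack_limit was exhausted');
    }
    if ($count === false || $error !== PREG_NO_ERROR) {
        throw new RuntimeException('preg_match_all() failed, error code ' . $error);
    }
    return $matches;
}
```

Calling this in place of the bare preg_match_all() in the cron script should turn a silent empty result into an exception that can be logged.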

thx.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Regexp Crisis

Post by Chris Corbyn »

bdawg wrote: I found a solution ... pcre.backtrack_limit was the culprit. That is "100000" by default. I increased that by putting an entry in php.ini and that fixed the problem.

thx.
Useful tip, thanks :) I never knew that ini directive existed.

Just looking at the manual, it appears that pcre.backtrack_limit is settable with ini_set() too, so if you wanted to make the code more portable you could adjust it in the code itself :)
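
Something along these lines, as a sketch only. The value 1000000 is an arbitrary example, not a recommendation, and restoring the old value afterwards is optional.

```php
<?php
// Raise the limit for this script only; 1000000 is an arbitrary example value.
// ini_set() returns the old value as a string, or false on failure.
$previous = ini_set('pcre.backtrack_limit', '1000000');
if ($previous === false) {
    trigger_error('Could not change pcre.backtrack_limit', E_USER_WARNING);
}

// ... run the preg_match_all() call here ...

// Optionally put the old value back once the heavy match is done.
if ($previous !== false) {
    ini_set('pcre.backtrack_limit', $previous);
}
```

That keeps the change local to the cron script, so it does not depend on whatever php.ini the server happens to have.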