
Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Mon May 05, 2008 5:46 pm
by Ollie Saunders
I don't get why this is happening; perhaps someone can explain.

This shouldn't match anything but does:

Code:

$str = "foo  bar";
preg_match("/^foo\s+(?!bar)/i", $str, $matches);
print_r($matches);

Code:

Array
(
    [0] => foo 
)
Replace the \s+ with exactly two \s (no +) and it works as expected:

Code:

$str = "foo  bar";
preg_match("/^foo\s\s(?!bar)/i", $str, $matches);
print_r($matches);

Code:

Array
(
)
The workaround was to move \s+ into the lookahead, but I still wonder why the first one behaved as it did:

Code:

$str = "foo  bar";
preg_match("/^foo(?!\s+bar)/i", $str, $matches);
print_r($matches);

Code:

Array
(
)

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Tue May 06, 2008 1:30 am
by prometheuzz
ole wrote: ...
The workaround was to move \s+ into the lookahead but I still wonder why the first was behaving as it was:

Code:

$str = "foo  bar";
...

Although \s+ is greedy and at first matches both spaces after "foo", the (?!bar) then fails because "bar" comes next. Since a match is favoured, the engine backtracks: \s+ gives up one space. PHP's regex engine will always try to match the complete regex, so (?!bar) is tried again right after the first space. There the next characters are " bar" (a space followed by "bar"), which is not "bar", so the lookahead succeeds without consuming anything and the overall match is "foo ".
If you don't want the regex engine to give up part of what \s+ matched in favour of a complete match, you need to make the quantifier possessive as well as greedy: \s++. This translates to: "match one or more white space characters, and once they are matched, never give any of them up". So there's no backtracking into the spaces.
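
A minimal sketch of the difference, assuming a PHP build with the standard PCRE extension. The variable names are only illustrative:

```php
<?php
// Compare the greedy and the possessive quantifier on the string from the first post.
$str = "foo  bar"; // two spaces between "foo" and "bar"

// Greedy: \s+ may give one space back, so (?!bar) can succeed one position earlier.
$greedy = preg_match('/^foo\s+(?!bar)/i', $str, $m1);

// Possessive: \s++ keeps both spaces, so (?!bar) is only ever tested right before "bar".
$possessive = preg_match('/^foo\s++(?!bar)/i', $str, $m2);

var_dump($greedy, $m1);     // int(1) and a match of "foo " (one trailing space)
var_dump($possessive, $m2); // int(0) and an empty array
```

The only change between the two patterns is the extra +, which is why the possessive form is usually the smallest fix for this kind of surprise.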

Hope that makes it clear to you, if not, feel free to post back.

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Wed May 07, 2008 4:20 pm
by Ollie Saunders
prometheuzz wrote:
Although the \s+ is greedy, and thus matches all the spaces after "foo", when it hits the "bar", it is backtracking one \s since a match is favoured.
How is it backtracking, and which of the three variants I posted are you referring to?
prometheuzz wrote:
PHP's regex engine will always try to match the complete regex, so the second \s will be matched by the (?!bar).
Gah? \s cannot match b

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Wed May 07, 2008 4:41 pm
by prometheuzz
ole wrote:
Although the \s+ is greedy, and thus matches all the spaces after "foo", when it hits the "bar", it is backtracking one \s since a match is favoured.
How is it backtracking, and which of the three variants I posted are you referring to?
While matching the string, PHP's regex engine keeps track of the states it has passed through so far (as backtracking, NFA-based engines do).
Whenever the next part of the regex fails to match, it "backtracks" to the most recent state that still has an untried option and tries to match the rest of the regex from there.
This is what I mean:

Code:

regex = /^foo\s+(?!bar)/
 
text = "foo  bar"
 
state  matched string
    1             "f"
    2            "fo"
    3           "foo"
    4          "foo "
    5         "foo  "
now, at this moment the next string in the text is "bar", but since (?!bar)
"forbids" this match, the regex engine backtracks to state 4. There the next
characters are " bar", which does not start with "bar", so (?!bar) succeeds
without consuming anything and the overall match is "foo ".
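
The trace above can be imitated by hand in PHP. This is only an illustration of the idea, not how PCRE is implemented. The loop and its variable names are made up for this sketch, and the whitespace set passed to strspn() is an approximation of \s:

```php
<?php
// Hand-rolled illustration: try each number of spaces that \s+ could keep,
// longest first, and test the negative lookahead at the resulting position.
$str    = "foo  bar";
$start  = 3;                                      // position right after "foo"
$spaces = strspn($str, " \t\n\r\f\v", $start);    // length of the whitespace run

$matched = null;
for ($keep = $spaces; $keep >= 1; $keep--) {
    $pos = $start + $keep;
    // (?!bar) succeeds when the text at $pos does NOT start with "bar" (case-insensitive, like /i)
    $lookaheadOk = strncasecmp(substr($str, $pos), "bar", 3) !== 0;
    printf("keep %d space(s): lookahead %s\n", $keep, $lookaheadOk ? "succeeds" : "fails");
    if ($lookaheadOk) {
        $matched = substr($str, 0, $pos);         // everything consumed so far
        break;
    }
}
var_dump($matched); // string(4) "foo "
```

Keeping two spaces fails and keeping one succeeds, which corresponds to falling back from state 5 to state 4.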

ole wrote:
PHP's regex engine will always try to match the complete regex, so the second \s will be matched by the (?!bar).
Gah? \s cannot match b
It is a negative look-ahead: it consumes nothing, and it succeeds as long as the text at the current position does not start with "bar".
Run this:

Code:

if(preg_match('/(?!bar)/',' ')) {
  echo "Match!\n";
} else {
  echo "No match...\n";  
}
// output will be: "Match!"
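
To see that the lookahead succeeds without consuming the space, one can ask preg_match for offsets. A small sketch using the PREG_OFFSET_CAPTURE flag; the variable names $m and $n are arbitrary:

```php
<?php
// PREG_OFFSET_CAPTURE returns each match as [matched text, byte offset].
preg_match('/(?!bar)/', ' ', $m, PREG_OFFSET_CAPTURE);
var_dump($m[0][0]); // string(0) "" : the lookahead consumed nothing
var_dump($m[0][1]); // int(0)       : it succeeded at the very first position

// On "bar" itself the lookahead fails at offset 0, but the engine moves on
// and it succeeds at offset 1, where the remaining text is "ar".
preg_match('/(?!bar)/', 'bar', $n, PREG_OFFSET_CAPTURE);
var_dump($n[0][1]); // int(1)
```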

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Wed May 07, 2008 5:08 pm
by Ollie Saunders
Thanks prometheuzz! I was having a dense moment and it took an explanation as awesome as yours to pull me out of it.

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Thu May 08, 2008 1:19 am
by prometheuzz
ole wrote:
Thanks prometheuzz! I was having a dense moment and it took an explanation as awesome as yours to pull me out of it.
You're most welcome.
Just for the record: whenever you use a possessive +, as in:

Code:

$str = "foo  bar";
preg_match("/^foo\s++(?!bar)/i", $str, $matches);
the two white spaces will never be given up by the regex engine, i.e., the regex engine will never backtrack into them to reach a "previous state". Especially when working with large strings, not backtracking can be a significant performance improvement.
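
For comparison, PCRE also accepts an atomic group, (?>\s+), which behaves like \s++ here. A short sketch, assuming the same test string plus a second, made-up input "foo  baz" to show that the possessive form still matches when something other than "bar" follows:

```php
<?php
$str = "foo  bar";

// (?>...) is an atomic group: once \s+ inside it has matched,
// the engine cannot backtrack into it to release a space.
$atomic = preg_match('/^foo(?>\s+)(?!bar)/i', $str, $m);
var_dump($atomic, $m);    // int(0) and an empty array

// The possessive pattern still matches when the spaces are not followed by "bar".
$other = preg_match('/^foo\s++(?!bar)/i', "foo  baz", $k);
var_dump($other, $k[0]);  // int(1) and "foo  " (both spaces kept)
```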

Re: Negative Lookahead Ceases to Function When Preceded by \s+

Posted: Thu May 08, 2008 2:04 am
by Ollie Saunders
Noted, thanks