Negative Lookahead Ceases to Function When Preceded by \s+

Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

I don't understand why this is happening; perhaps someone can explain.

This shouldn't match anything but does:

Code: Select all

$str = "foo  bar";
preg_match("/^foo\s+(?!bar)/i", $str, $matches);
print_r($matches);

Code: Select all

Array
(
    [0] => foo 
)
Replace the \s+ with a fixed \s\s and it works as expected:

Code: Select all

$str = "foo  bar";
preg_match("/^foo\s\s(?!bar)/i", $str, $matches);
print_r($matches);

Code: Select all

Array
(
)
The workaround was to move \s+ into the lookahead, but I still wonder why the first version behaved as it did:

Code: Select all

$str = "foo  bar";
preg_match("/^foo(?!\s+bar)/i", $str, $matches);
print_r($matches);

Code: Select all

Array
(
)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

ole wrote: ...
The workaround was to move \s+ into the lookahead but I still wonder why the first was behaving as it was:

Code: Select all

$str = "foo  bar";
...

Although the \s+ is greedy, and thus matches all the spaces after "foo", when it hits the "bar", it is backtracking one \s since a match is favoured. PHP's regex engine will always try to match the complete regex, so the second \s will be matched by the (?!bar).
If you don't want the regex engine to give up a (partial) match in favour of a complete match, you will need to make the \s+ possessive as well as greedy: \s++. This translates to: "match one or more white space characters, and once they are matched, never give them up". So there's no backtracking into them.
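The difference between the greedy and the possessive quantifier can be checked side by side. This is a minimal sketch; the helper first_match() is a hypothetical name introduced only for this illustration, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): returns the matched text,
// or null when the pattern does not match at all.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$str = 'foo  bar'; // "foo", two spaces, "bar", as in the original post

// Greedy \s+ may give back one space so that (?!bar) can succeed.
$greedy = first_match('/^foo\s+(?!bar)/i', $str);

// Possessive \s++ keeps both spaces, so (?!bar) fails and nothing matches.
$possessive = first_match('/^foo\s++(?!bar)/i', $str);

var_dump($greedy);     // string(4) "foo "
var_dump($possessive); // NULL
```

Single-quoted pattern strings are used here so that PHP does not try to interpret anything inside the pattern before PCRE sees it.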

Hope that makes it clear to you, if not, feel free to post back.

Post by Ollie Saunders »

Although the \s+ is greedy, and thus matches all the spaces after "foo", when it hits the "bar", it is backtracking one \s since a match is favoured.
How is it backtracking, and which of the three variants I posted are you referring to?
PHP's regex engine will always try to match the complete regex, so the second \s will be matched by the (?!bar).
Gah? \s cannot match b

Post by prometheuzz »

ole wrote:
Although the \s+ is greedy, and thus matches all the spaces after "foo", when it hits the "bar", it is backtracking one \s since a match is favoured.
How is it backtracking, and which of the three variants I posted are you referring to?
While matching the string, PHP's regex engine keeps track of all states that have been matched so far (as all backtracking, NFA-based engines do).
Whenever it reaches a point where the rest of the regex cannot match, it "backtracks" to the last state it did match and then tries to match the rest of the regex again from there.
This is what I mean:

Code: Select all

regex = /^foo\s+(?!bar)/
 
text = "foo  bar"
 
state  matched string
    1             "f"
    2            "fo"
    3           "foo"
    4          "foo "
    5         "foo  "
now, at this moment the next string in the text is "bar", but since (?!bar)
"forbids" this, the regex engine backtracks to state 4 and tests (?!bar) there.
The text ahead is now " bar", which is not "bar", so the look-ahead succeeds
without consuming anything, and the overall match is "foo ".
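The two positions in the trace above can be probed directly. This is a rough sketch that assumes the same subject string; it uses \G together with the offset argument of preg_match so that each call tests the lookahead at exactly one position.

```php
<?php
$str = 'foo  bar'; // "foo", two spaces, "bar"

// \G only matches at the starting offset (fifth argument), so each call
// checks the lookahead at one fixed position in the subject.
$atSecondSpace = preg_match('/\G(?!bar)/', $str, $m, 0, 4); // text ahead: " bar"
$atBar         = preg_match('/\G(?!bar)/', $str, $m, 0, 5); // text ahead: "bar"

var_dump($atSecondSpace); // int(1): lookahead succeeds after one space
var_dump($atBar);         // int(0): lookahead fails after both spaces

// The full pattern therefore settles on "foo" plus a single space.
preg_match('/^foo\s+(?!bar)/i', $str, $matches);
var_dump(strlen($matches[0])); // int(4)
```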

ole wrote:
PHP's regex engine will always try to match the complete regex, so the second \s will be matched by the (?!bar).
Gah? \s cannot match b
It is a negative look-ahead: it consumes no characters and succeeds at any position where the text that follows is not "bar".
Run this:

Code: Select all

if(preg_match('/(?!bar)/',' ')) {
  echo "Match!\n";
} else {
  echo "No match...\n";  
}
// output will be: "Match!"
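A slightly extended sketch of the same test shows that the lookahead matches an empty string, and that an unanchored pattern can still succeed on "bar" itself by starting one character later. The variable names are chosen only for this illustration.

```php
<?php
// On a single space the lookahead succeeds and consumes nothing.
preg_match('/(?!bar)/', ' ', $onSpace);
var_dump($onSpace[0]); // string(0) ""

// Unanchored on "bar": fails at offset 0, then succeeds at offset 1,
// because the text ahead there is "ar", not "bar".
preg_match('/(?!bar)/', 'bar', $onBar, PREG_OFFSET_CAPTURE);
var_dump($onBar[0][0]); // string(0) ""
var_dump($onBar[0][1]); // int(1)

// Anchored with ^, the lookahead is only tried at offset 0.
$anchoredOnBar   = preg_match('/^(?!bar)/', 'bar');
$anchoredOnSpace = preg_match('/^(?!bar)/', ' ');
var_dump($anchoredOnBar);   // int(0)
var_dump($anchoredOnSpace); // int(1)
```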

Post by Ollie Saunders »

Thanks prometheuzz! I was having a dense moment and it took an explanation as awesome as yours to pull me out of it.

Post by prometheuzz »

ole wrote: Thanks prometheuzz! I was having a dense moment and it took an explanation as awesome as yours to pull me out of it.
You're most welcome.
Just for the record: whenever you use a possessive +, as in:

Code: Select all

$str = "foo  bar";
preg_match("/^foo\s++(?!bar)/i", $str, $matches);
the two white spaces will never be given up by the regex engine, i.e., the regex engine will never backtrack into them to reach a "previous state", so this pattern does not match "foo  bar". Especially when working with large strings, avoiding that backtracking can be a significant performance improvement.
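For comparison, PCRE also accepts an atomic group, (?>\s+), which behaves like the possessive \s++ here. The sketch below assumes the same subject string and adds a second, hypothetical subject ("foo  baz") to show that input without "bar" still matches.

```php
<?php
$str = 'foo  bar';

// Possessive quantifier and atomic group: neither gives a space back.
$possessive = preg_match('/^foo\s++(?!bar)/i', $str, $m1);
$atomic     = preg_match('/^foo(?>\s+)(?!bar)/i', $str, $m2);

var_dump($possessive); // int(0)
var_dump($atomic);     // int(0)
var_dump($m1 === [] && $m2 === []); // bool(true): no match data

// When "bar" does not follow, both spaces are kept in the match.
$other = preg_match('/^foo\s++(?!bar)/i', 'foo  baz', $m3);
var_dump($m3[0]); // string(5) "foo  "
```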

Post by Ollie Saunders »

Noted, thanks