
matches unless ...

Posted: Thu Dec 04, 2008 4:06 pm
by javinto
I'd like to match keywords in HTML text. A keyword may already be wrapped in an HTML link -> <a href=""></a>. In that case it should not match.

Examples when looking for the word "money":

"I lost my money belt" -> match
"I lost my <a href='search_id'>blue money belt<\a>" -> no match
"I lost my money, but fortunately found a <a href='search_id'>bank<\a>" -> match

Finding the word money isn't the problem, using word boundaries: /\bmoney\b/. But preventing it from being found within the link... I just cannot work it out.

Any suggestions?

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:14 am
by prometheuzz
You probably meant to close your anchor tags with </a> instead of <\a>.

If the tags are properly closed using a slash instead of a backslash, then this will work:

Code:

if(preg_match('~\bmoney\b(?![^<]*</)~i', $text)) {
  echo 'match';
} else {
  echo 'no match';
}
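
To check the pattern against the three sample sentences from the first post, a small test loop along these lines can be used. This is only a sketch: the helper name matchesOutsideTag and the $samples array are made up for the illustration, and the anchors are closed with </a>.

```php
<?php
// Hypothetical helper (not from the thread): returns true when $word occurs
// somewhere in $text without a closing tag ahead of it before the next '<'.
function matchesOutsideTag(string $word, string $text): bool
{
    // preg_quote() escapes regex metacharacters in the keyword,
    // including the '~' delimiter used here.
    $pattern = '~\b' . preg_quote($word, '~') . '\b(?![^<]*</)~i';
    return preg_match($pattern, $text) === 1;
}

$samples = [
    "I lost my money belt",
    "I lost my <a href='search_id'>blue money belt</a>",
    "I lost my money, but fortunately found a <a href='search_id'>bank</a>",
];

foreach ($samples as $text) {
    // Prints: match, no match, match (one per line)
    echo matchesOutsideTag('money', $text) ? 'match' : 'no match', "\n";
}
```

Keep in mind that the look ahead rejects a keyword followed by any closing tag before the next '<', so "money" in <b>money</b> is skipped as well, not only inside links.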

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:34 am
by javinto
Thanks! Just did a little test and it works! The <\a> was a typo indeed.

I tried (?:) however before. What's the difference with (?!) you are using here?

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:39 am
by mintedjo
?: defines a non-capturing group, used when you only need the grouping and not the matched text, which is slightly more efficient
?! is a negative look ahead

Code:

x(?!abc)
means: match 'x', but only if it isn't immediately followed by 'abc'

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:46 am
by prometheuzz
javinto wrote:Thanks! Just did a little test and it works! The <\a> was a typo indeed.

I tried (?:) however before. What's the difference with (?!) you are using here?
(?:) is called a non-capturing group. When you put parentheses around some characters, like "a(bc)+d", the regex engine "remembers" (captures) what is matched between those parentheses. Remembering it costs the regex engine some extra time and memory while matching your text. So, when you don't use what is matched between the parentheses, you might as well tell the regex engine to "forget" it immediately, which saves a little time. You do that by making it a non-capturing group, like this: "a(?:bc)+d".
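
A quick way to see the difference is to look at what preg_match() stores in its third argument. The variable names and the subject string "abcbcd" below are just picked for this illustration.

```php
<?php
// Capturing group: index 1 holds what the group matched on its last repetition.
preg_match('/a(bc)+d/', 'abcbcd', $capturing);
print_r($capturing);    // [0] => abcbcd, [1] => bc

// Non-capturing group: same overall match, but nothing is stored for the group.
preg_match('/a(?:bc)+d/', 'abcbcd', $nonCapturing);
print_r($nonCapturing); // [0] => abcbcd
```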

(?!) is called negative look ahead. For example, if you write "a(?!b)", you will match an 'a' only if there's no 'b' directly after it. You might think, 'well, what's the difference between "a(?!b)" and "a[^b]"?' With a look around (there are negative and positive look behinds and look aheads, which together are called look arounds), the part inside the look around is not "consumed" by the regex engine. Here's an example: if you have the string "zzazz" and you match it against the pattern "a(?!b)", then only the 'a' is matched, while matching it against "a[^b]" matches the substring "az".
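
The "zzazz" example can be tried out with a few lines of PHP. The second subject, "zza", is an extra case added here to show that the two patterns also differ at the end of a string, where "a[^b]" still needs one more character and the look ahead does not.

```php
<?php
$subject = 'zzazz';

// Look ahead: the 'z' after the 'a' is checked but not consumed.
preg_match('/a(?!b)/', $subject, $lookAhead);
echo $lookAhead[0], "\n"; // a

// Character class: the 'z' after the 'a' becomes part of the match.
preg_match('/a[^b]/', $subject, $charClass);
echo $charClass[0], "\n"; // az

// At the end of the string only the look ahead version still matches.
var_dump(preg_match('/a(?!b)/', 'zza')); // int(1)
var_dump(preg_match('/a[^b]/', 'zza'));  // int(0)
```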

Hope that clears things up.

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:53 am
by javinto
Wow, thanks guys for the explanations. I did not realize how the regexp grouping actually works.
I suspect I will need those look-ahead functions more often.

Thanks

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:54 am
by prometheuzz
javinto wrote:Thanks! Just did a little test and it works! ...
Of course it works!
; )

In case you need it, a short explanation:

Code:

\bmoney\b    // match 'money' surrounded by word boundaries
(?!          // start negative look ahead
  [^<]*      //   matches zero or more characters of any type except '<'
  </         //   matches the string '</'
)            // stop look ahead
So, in plain English, the regex reads like this: "match the word 'money' surrounded by word boundaries, but only if it isn't followed (ahead of it) by the substring '</' with zero or more characters of any type except '<' in between".
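
If you want to keep such an explanation inside the PHP source itself, one option is the x (extended) modifier, which ignores whitespace in the pattern and treats # as the start of a comment. The following is a sketch of the same pattern written that way; the two test strings are made up for the illustration.

```php
<?php
// Same pattern, spread over several lines with the x modifier.
// The comments avoid the delimiter character and single quotes, which
// would otherwise end the pattern or the PHP string early.
$pattern = '~
    \bmoney\b   # the word money between word boundaries
    (?!         # start negative look ahead
      [^<]*     #   zero or more characters other than <
      </        #   the string </
    )           # end look ahead
~ix';

var_dump(preg_match($pattern, 'I lost my money belt'));            // int(1)
var_dump(preg_match($pattern, "<a href='x'>blue money belt</a>")); // int(0)
```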

Re: matches unless ...

Posted: Fri Dec 05, 2008 4:57 am
by prometheuzz
javinto wrote:Wow, thanks guys for the explanations. I did not realize how the regexp grouping actually works.
I suspect I will need those look-ahead functions more often.

Thanks
All about look arounds: http://www.regular-expressions.info/lookaround.html

Note that the entire site is an excellent online resource!