matches unless ...

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

javinto
Forum Newbie
Posts: 3
Joined: Thu Dec 04, 2008 3:54 pm

matches unless ...

Post by javinto »

I'd like to match keywords within HTML text. These keywords may already be tagged inside an HTML link, i.e. between <a href=""> and </a>, and in that case they should not match.

Examples when looking for the word "money":

"I lost my money belt" -> match
"I lost my <a href='search_id'>blue money belt<\a>" -> no match
"I lost my money, but fortunately found a <a href='search_id'>bank<\a>" -> match

Finding the word money isn't the problem: the word-boundary pattern /\bmoney\b/ does that. But preventing it from being found within the link... I just cannot work it out.

Any suggestions?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: matches unless ...

Post by prometheuzz »

You probably meant to close your anchor tags with </a> instead of <\a>.

If the tags are properly closed using a slash instead of a backslash, then this will work:

if(preg_match('~\bmoney\b(?![^<]*</)~i', $text)) {
  echo 'match';
} else {
  echo 'no match';
}
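
For anyone who wants to check this, here is a minimal sketch that runs the pattern against the three examples from the first post, with the anchors closed as </a>. The `$examples` and `$results` names are only for illustration, and the fourth string is an extra case that was not part of the original question:

```php
<?php
// The pattern from the post above, unchanged.
$pattern = '~\bmoney\b(?![^<]*</)~i';

$examples = [
    // The three examples from the first post, anchors closed with </a>.
    "I lost my money belt",
    "I lost my <a href='search_id'>blue money belt</a>",
    "I lost my money, but fortunately found a <a href='search_id'>bank</a>",
    // Extra case (not from the thread): 'money' inside a non-anchor element.
    "<p>I lost my money</p>",
];

$results = [];
foreach ($examples as $text) {
    // preg_match returns 1 when the pattern is found, 0 when it is not.
    $results[] = preg_match($pattern, $text) === 1 ? 'match' : 'no match';
}

foreach ($examples as $i => $text) {
    echo $results[$i], ': ', $text, PHP_EOL;
}
```

The last string gives 'no match' as well, because the lookahead rejects 'money' whenever the next tag is a closing tag of any kind, not just </a>. So the pattern is only suitable when anchors are the only tags around the text being searched.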
javinto
Forum Newbie
Posts: 3
Joined: Thu Dec 04, 2008 3:54 pm

Re: matches unless ...

Post by javinto »

Thanks! Just did a little test and it works! The <\a> was a typo indeed.

I tried (?:) however before. What's the difference with (?!) you are using here?
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: matches unless ...

Post by mintedjo »

?: defines a non-capturing group: it groups like ordinary parentheses but doesn't store what it matched, which saves a little work
?! is a negative lookahead

x(?!abc)
means match 'x', but only if it isn't immediately followed by 'abc'
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: matches unless ...

Post by prometheuzz »

javinto wrote:Thanks! Just did a little test and it works! The <\a> was a typo indeed.

I tried (?:) however before. What's the difference with (?!) you are using here?
(?:) is called a non-capturing group. When you put parentheses around some characters, like "a(bc)+d", the regex engine "remembers" (captures) what is matched between those parentheses. That remembering costs the engine some extra time and memory. So when you don't need what was matched between the parentheses, you might as well tell the regex engine to "forget" it immediately by making the group non-capturing, like this: "a(?:bc)+d".
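
A small sketch of the difference, using the two patterns above. The test string and the variable names are made up for illustration:

```php
<?php
$text = 'xabcbcdx';

// Capturing group: the last repetition of (bc) is stored as group 1.
preg_match('~a(bc)+d~', $text, $capturing);

// Non-capturing group: same overall match, but nothing is stored for the group.
preg_match('~a(?:bc)+d~', $text, $nonCapturing);

print_r($capturing);     // two entries: the full match and group 1
print_r($nonCapturing);  // one entry: the full match only
```

Both patterns match the same substring 'abcbcd'. The visible difference is that only the first one fills in index 1 of the matches array.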

(?!) is called a negative lookahead. For example, "a(?!b)" matches an 'a' only if it is not immediately followed by a 'b'. You might think: what's the difference between "a(?!b)" and "a[^b]"? With a lookaround (there are negative and positive versions of both lookahead and lookbehind, collectively called lookarounds), the part inside the lookaround is not "consumed" by the regex engine. Here's an example: if you match the string "zzazz" against the pattern "a(?!b)", only the 'a' is matched, while matching it against "a[^b]" matches the substring "az". Also, "a[^b]" needs some character after the 'a', so it can't match an 'a' at the very end of the string, whereas "a(?!b)" can.
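
The "zzazz" example can be tried like this. This is only a sketch, and the variable names and the extra 'zza' string are illustrative:

```php
<?php
// Lookahead does not consume: only the 'a' is part of the match.
preg_match('~a(?!b)~', 'zzazz', $lookahead);

// Negated class consumes one character: the match is 'az'.
preg_match('~a[^b]~', 'zzazz', $negatedClass);

// With the 'a' at the very end of the string the two differ:
// the lookahead succeeds (nothing follows, so no 'b' follows),
// while [^b] fails because it needs a character to consume.
$endLookahead = preg_match('~a(?!b)~', 'zza');
$endNegatedClass = preg_match('~a[^b]~', 'zza');

echo $lookahead[0], PHP_EOL;                        // a
echo $negatedClass[0], PHP_EOL;                     // az
echo $endLookahead, ' ', $endNegatedClass, PHP_EOL; // 1 0
```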

Hope that clears things up.
javinto
Forum Newbie
Posts: 3
Joined: Thu Dec 04, 2008 3:54 pm

Re: matches unless ...

Post by javinto »

Wow, thanks guys for the explanations. I did not realize how the regexp grouping actually works.
I suspect I will need those look-ahead functions more often.

Thanks
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: matches unless ...

Post by prometheuzz »

javinto wrote:Thanks! Just did a little test and it works! ...
Of course it works!
; )

In case you need it, a short explanation:

\bmoney\b    // match 'money' surrounded by word boundaries
(?!          // start negative look ahead
  [^<]*      //   matches zero or more characters of any type except '<'
  </         //   matches the string '</'
)            // stop look ahead
So, in plain English, the regex reads like this: "match the word 'money' surrounded by word boundaries, but only if it is not followed by the substring '</' with zero or more characters other than '<' in between". In other words, the match is rejected when the next tag after 'money' is a closing tag.
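
If you'd like to keep that explanation inside the pattern itself, PCRE's x modifier ignores whitespace in the pattern and allows # comments. A sketch of the same regex written that way (the `$inside` and `$outside` names are illustrative):

```php
<?php
// Same pattern as before, spread out with the x modifier so the
// explanation can live inside the pattern as # comments.
$pattern = '~
    \bmoney\b    # the word money between word boundaries
    (?!          # negative lookahead: reject if what follows is
      [^<]*      #   any run of characters other than "<"
      </         #   and then the start of a closing tag
    )
~ix';

$inside  = "I lost my <a href='search_id'>blue money belt</a>";
$outside = "I lost my money, but fortunately found a <a href='search_id'>bank</a>";

var_dump(preg_match($pattern, $inside));   // int(0)
var_dump(preg_match($pattern, $outside));  // int(1)
```

One thing to watch: the delimiter character (here ~) must not appear unescaped anywhere in the pattern, including inside the comments.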
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: matches unless ...

Post by prometheuzz »

javinto wrote:Wow, thanks guys for the explanations. I did not realize how the regexp grouping actually works.
I suspect I will need those look-ahead functions more often.

Thanks
All about look arounds: http://www.regular-expressions.info/lookaround.html

Note that the entire site is an excellent online resource!