Regex Expression to match characters not inside of links

Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Post by Benjamin »

I'm currently searching Google for a regex that will match text that is NOT inside of anchor tags. I'm not familiar with the negation operators. If anyone can post the expression for me, that would be great.

EDIT: This kind of works:

Code:

 
<\s{0,2}[^a/].*?>(foo[\'s]{0,2})<\s{0,2}[^>].*?>|[^>]\s{0,2}(foo[\'s]{0,2})\s{0,2}[^<]\s{0,2}[^/]
 
The problem is that it matches (foo)'s instead of (foo's). It needs to match words with 's at the end as well.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

It is unclear to me what it is you're trying to match. Can you give a couple of examples for clarity?

Post by Benjamin »

I would like to match:

<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's

But NOT match

<a href="">foo</a>

And it would be great if it went further and didn't match:

<a href=""><i>foo</i></a>

So, essentially it needs to match anything that isn't in an anchor tag.

Post by Benjamin »

arborint wrote:I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.
Possibly, though it may also be possible without it.

Post by prometheuzz »

This regex matches all your examples including strings like "<b><i>foo's</i></b>":

Code:

(?:<[^a/][^>]*>)*foo(?:'s)?(</[^a]>)*(?!</)

Post by Benjamin »

That's perfect. Thank you very much. Can you explain how it works? I.e., what do ?: and ?! do?

Post by prometheuzz »

Benjamin wrote:That's perfect. Thank you very much. Can you explain how it works? ie what does ?: and ?! do?
(?:...) is a non-capturing group. The regex engine will not store what it matches in $1 (or \1) or any other backreference, which can make your regex a bit faster. If your strings are not large, you can use plain parentheses instead, in favour of readability.

(?!...) is a negative look-ahead. Example: "a(?!b)" will match an 'a' only if it is not followed by a 'b'.
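
As a minimal sketch of those two constructs in PHP (the sample strings and variable names are made up for illustration):

```php
<?php
// Capturing vs. non-capturing: only plain (...) fills a numbered slot.
preg_match("/(?:foo)(bar)/", "foobar", $m);
// $m[0] holds the whole match and $m[1] holds "bar"; (?:foo) gets no slot.
print_r($m);

// Negative look-ahead: match "a" only when it is not followed by "b".
preg_match_all("/a(?!b)/", "ab ac ad", $n, PREG_OFFSET_CAPTURE);
// The "a" at offset 0 is skipped; the ones at offsets 3 and 6 match.
print_r($n[0]);
```

The look-ahead does not consume anything, so the character after the 'a' is never part of the match itself.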

A (short) explanation of the entire regex:

Code:

(?:               // open non-capturing group 1
  <[^a/][^>]*>    //   match any opening tag whose name does not start with "a"
)                 // close non-capturing group 1
*                 // group 1, zero or more times
foo               // match "foo"
(?:               // open non-capturing group 2
  's              //   match "'s"
)                 // close non-capturing group 2
?                 // group 2, zero or one time
(                 // open group 3 (capturing as written; (?: works here too)
  </[^a]>         //   match a one-letter closing tag other than </a>, e.g. </i>
)                 // close group 3
*                 // group 3, zero or more times
(?!               // start negative look ahead
  </              //   match "</"
)                 // stop negative look ahead
So, in plain English this would be:

Match as many opening tags (other than anchor tags) as possible, followed by
either "foo" or "foo's", followed by as many closing tags (other than anchor
tags) as possible. When the regex is done matching, the end of the match
must NOT be followed by the string "</" (the negative look-ahead).
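
The same breakdown can be kept next to the pattern itself with the PCRE x (extended) modifier, which ignores unescaped whitespace and treats # as the start of a comment. This is only a sketch: it writes the third group with (?: as well, and $pattern and the sample strings are illustrative.

```php
<?php
// The thread's pattern, spread over several lines with the "x" modifier.
// A nowdoc keeps the quote and backslashes literal.
$pattern = <<<'REGEX'
~
(?: <[^a/][^>]*> )*   # opening tags whose name does not start with "a"
foo                   # the literal word
(?: 's )?             # optional possessive
(?: </[^a]> )*        # one-letter closing tags other than the anchor's
(?! </ )              # must not be followed by another closing tag
~x
REGEX;

// Outside a link: the tags around the word are part of the match.
var_dump(preg_match($pattern, "<i>foo's</i>", $m), $m[0]);

// Inside a link: the look-ahead sees the closing anchor and rejects it.
var_dump(preg_match($pattern, '<a href="">foo</a>'));
```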


HTH

Post by Benjamin »

Ok, I need to make some minor changes to the expression, but everything I try either doesn't work or breaks it.

The regex will match foo when it's inside of a tag, and also when it's part of a word. I need to modify it so that it will not match the following:

<a href="http://www.domain.com/foo">some text</a>

and

xfoox

Post by Benjamin »

I'm not sure regex is the best solution for this. It might be better to use preg_split to filter out the links and then have some simple RegEx to do the replacements. Otherwise, only a RegEx guru will be able to modify the code in the future.

Post by prometheuzz »

Benjamin wrote:I'm not sure regex is the best solution for this.
...
I agree.

But, in case you were curious, here's how to account for the other two cases:

Code:

(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(</[^a]>)*(?!</|[^<>]*>)

Post by Benjamin »

That's doing the same thing it was doing for me: when you add the \b, it starts matching text inside of links again.

<a href="...">foo</a>

So what I did was use preg_split to split the text on the links, then run preg_replace on each chunk that wasn't a link. I think that should work well.
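
A rough sketch of that split-then-replace approach might look like the following. The function name replaceOutsideLinks, the splitting pattern, and the sample input are assumptions for illustration, not the code that was actually used:

```php
<?php
// Hypothetical helper: replace $word everywhere except inside <a>...</a>.
function replaceOutsideLinks(string $html, string $word, string $replacement): string
{
    // The capture group plus PREG_SPLIT_DELIM_CAPTURE keeps every link in
    // the result, so links land on odd indexes and other text on even ones.
    $chunks = preg_split('~(<a\b[^>]*>.*?</a>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // \b...\b leaves "xfoox" alone; the look-ahead skips text inside a tag.
    $pattern = '~\b' . preg_quote($word, '~') . '\b(?![^<>]*>)~';

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) { // even index: text outside of links
            $chunks[$i] = preg_replace($pattern, $replacement, $chunk);
        }
    }
    return implode('', $chunks);
}

echo replaceOutsideLinks('<i>foo</i> <a href="/foo">foo</a> xfoox foo\'s', 'foo', 'bar');
```

Because the links never reach preg_replace, the replacement pattern stays short enough to modify later without touching the link handling.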

Post by prometheuzz »

Benjamin wrote:That's doing the same thing it was doing for me, ...
It works fine as far as I can tell:

Code:

$text = "<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's
<b><i>foo's</i></b>
<a href=\"\">foo</a>
<a href=\"\"><i>foo</i></a>
<a href=\"http://www.domain.com/foo\">some text</a>
xfoox";
 
preg_match_all("@(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(?:</[^a]>)*(?!</|[^<>]*>)@", $text, $matches);
 
print_r($matches);
 
/* output:
Array
(
    [0] => Array
        (
            [0] => <i>foo</i>
            [1] => foo
            [2] => <b>foo</b>
            [3] => <i>foo's</i>
            [4] => foo's
            [5] => <b><i>foo's</i></b>
        )
 
)
*/

Post by Benjamin »

Hmm, I was testing it in Kiki... Maybe it has a bug. :(

Post by Benjamin »

Ha, didn't think I would end up back here.

I'm building a new system that requires very similar functionality. All I need to know is how to modify this so that I can either include OR exclude strings inside of double quotes.

Code:

'#(?:<[^a/][^>]*>)*\bMATCH STRING\b(?:\'s)?(</[^a]>)*(?!</|[^<>]*>)#i'
So essentially I need two versions. One that will match the following:

[...]"MATCH STRING"[...] but not [...]<a href="">"MATCH STRING"</a>[...]

And another that will NOT match the following

[...]"MATCH STRING"[...]
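
One possible reading of the two versions, sketched with look-arounds. This assumes "inside of double quotes" means a quote directly touches the phrase, and it leaves out the tag handling entirely; link exclusion is assumed to be done separately, for example with the preg_split approach mentioned earlier in the thread. The variable names are illustrative.

```php
<?php
// Escape the phrase so it is safe inside a pattern delimited by #.
$phrase = preg_quote('MATCH STRING', '#');

// Version 1: surrounding quotes are optional and become part of the match.
$withQuotes = '#"?\b' . $phrase . '\b"?#i';

// Version 2: reject the phrase when a double quote touches either side.
// (?<!") is a negative look-behind, (?!") a negative look-ahead.
$withoutQuotes = '#(?<!")\b' . $phrase . '\b(?!")#i';

var_dump(preg_match($withQuotes, 'say "match string" now', $m), $m[0]);
var_dump(preg_match($withoutQuotes, 'say "match string" now'));
var_dump(preg_match($withoutQuotes, 'say match string now'));
```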