Regex Expression to match characters not inside of links
Posted: Wed Apr 22, 2009 4:37 pm
by Benjamin
I'm currently searching Google for a regex that will match text that is NOT inside of anchor tags. I'm not familiar with the negation operators. If anyone can post the expression for me, that would be great.
EDIT: This kind of works:
Code:
<\s{0,2}[^a/].*?>(foo[\'s]{0,2})<\s{0,2}[^>].*?>|[^>]\s{0,2}(foo[\'s]{0,2})\s{0,2}[^<]\s{0,2}[^/]
The problem is that it matches (foo)'s instead of (foo's). It needs to match words with 's at the end as well.
Re: Regex Expression to match characters not inside of links
Posted: Wed Apr 22, 2009 5:30 pm
by Christopher
I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.
Re: Regex Expression to match characters not inside of links
Posted: Wed Apr 22, 2009 11:56 pm
by prometheuzz
Benjamin wrote:I'm currently searching Google for a regex that will match text that is NOT inside of anchor tags. I'm not familiar with the negation operators. If anyone can post the expression for me, that would be great.
EDIT: This kind of works:
Code:
<\s{0,2}[^a/].*?>(foo[\'s]{0,2})<\s{0,2}[^>].*?>|[^>]\s{0,2}(foo[\'s]{0,2})\s{0,2}[^<]\s{0,2}[^/]
The problem is that it matches (foo)'s instead of (foo's). It needs to match words with 's at the end as well.
It is unclear to me what it is you're trying to match. Can you give a couple of examples for clarity?
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 12:48 am
by Benjamin
I would like to match:
<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's
But NOT match
<a href="">foo</a>
And it would be great if it went further and didn't match:
<a href=""><i>foo</i></a>
So, essentially it needs to match anything that isn't in an anchor tag.
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 12:49 am
by Benjamin
arborint wrote:I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.
Possibly, but it may be doable without it.
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 1:21 am
by prometheuzz
Benjamin wrote:I would like to match:
<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's
But NOT match
<a href="">foo</a>
And it would be great if it went further and didn't match:
<a href=""><i>foo</i></a>
So, essentially it needs to match anything that isn't in an anchor tag.
This regex matches all your examples including strings like "<b><i>foo's</i></b>":
Code:
(?:<[^a/][^>]*>)*foo(?:'s)?(</[^a]>)*(?!</)
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 1:59 am
by Benjamin
That's perfect. Thank you very much. Can you explain how it works? I.e., what do ?: and ?! do?
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 2:17 am
by prometheuzz
Benjamin wrote:That's perfect. Thank you very much. Can you explain how it works? I.e., what do ?: and ?! do?
(?:...) is a non-capturing group. The regex engine will not store what it matches in $1 (or \1) or any other group variable, which makes your regex a bit faster. But if your strings are not large, you can use plain (...) instead in favour of readability.
(?!...) is negative look ahead. Example: "a(?!b)" will match an 'a' only if not followed by a 'b'.
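To make the difference between the two constructs concrete, here is a minimal PHP sketch. The sample strings and variable names are made up for illustration and are not from the thread:

```php
<?php
// Capturing group: the parenthesised part is stored at index 1.
preg_match('/(foo)bar/', 'foobar', $m);
// $m is now ['foobar', 'foo']

// Non-capturing group: groups "foo" for the + quantifier, but stores nothing extra.
preg_match('/(?:foo)+bar/', 'foofoobar', $n);
// $n is now ['foofoobar']

// Negative lookahead: match an 'a' only if the next character is not 'b'.
preg_match_all('/a(?!b)/', 'ab ac ad', $k);
// $k[0] is now ['a', 'a'] (the ones in "ac" and "ad")

print_r([$m, $n, $k[0]]);
```

Note that the lookahead only inspects the following text; the character after the 'a' is never part of the match itself.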
A (short) explanation of the entire regex:
Code:
(?: // open non-capturing group 1
<[^a/][^>]*> // match an opening tag whose name does not start with "a"
) // close non-capturing group 1
* // group 1, zero or more times
foo // match "foo"
(?: // open non-capturing group 2
's // match "'s"
) // close non-capturing group 2
? // group 2, zero or one time
( // open capturing group 3
</[^a]> // match a one-letter closing tag other than "</a>", e.g. "</i>" or "</b>"
) // close capturing group 3
* // group 3, zero or more times
(?! // start negative look ahead
</ // match "</"
) // stop negative look ahead
So, in plain English this would be:
Match as many opening tags (other than anchor tags) as possible, followed by
either "foo" or "foo's", followed by as many closing tags (other than anchor
tags) as possible. When the regex is done matching, the end of the match
should NOT be followed by the string "</" (the negative look-ahead).
HTH
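For anyone who wants to try the explanation above, the same regex can be written with PCRE's x (extended) modifier, which ignores whitespace in the pattern and treats "#" as the start of a comment. This is only a sketch: the `~` delimiter, the variable names and the sample list are choices made for this example, and the third group is written as non-capturing here.

```php
<?php
$pattern = "~
    (?: <[^a/][^>]*> )*   # opening tags whose name does not start with 'a'
    foo (?:'s)?           # 'foo', optionally followed by 's
    (?: </[^a]> )*        # one-letter closing tags other than the anchor's
    (?! </ )              # the match must not be followed by '</'
~x";

// Benjamin's examples: the first five should match, the last two should not.
$samples = [
    '<i>foo</i>',
    'foo',
    '<b>foo</b>',
    "<i>foo's</i>",
    "foo's",
    '<a href="">foo</a>',
    '<a href=""><i>foo</i></a>',
];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample);
}
print_r($results);
```

The first five samples give 1 and the two anchor samples give 0, which agrees with the plain-English description.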
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 2:44 pm
by Benjamin
Ok, I need to make some minor changes to the expression, but whatever I try either doesn't work or breaks it.
The regex will match foo when it's inside of a tag's attribute, and also when it's part of a longer word. I need to modify it so that it will not match the following:
<a href="http://www.domain.com/foo">some text</a>
and
xfoox
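The two problem cases can be reproduced with a short PHP sketch like the one below. The `@` delimiter and the variable names are assumptions made for this example; the pattern is the one posted earlier in the thread.

```php
<?php
// The earlier regex, without word boundaries or an attribute check.
$pattern = "@(?:<[^a/][^>]*>)*foo(?:'s)?(</[^a]>)*(?!</)@";

$cases = [
    '<a href="http://www.domain.com/foo">some text</a>', // foo inside an attribute
    'xfoox',                                             // foo inside a longer word
];

$found = [];
foreach ($cases as $case) {
    $m = [];
    preg_match($pattern, $case, $m);
    // $m[0] holds the matched text, or is unset when nothing matched.
    $found[] = $m[0] ?? null;
    echo $case, ' => ', $m[0] ?? '(no match)', "\n";
}
```

Both cases report a match on "foo", which is the unwanted behaviour.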
Re: Regex Expression to match characters not inside of links
Posted: Thu Apr 23, 2009 8:11 pm
by Benjamin
I'm not sure regex is the best solution for this. It might be better to use preg_split to filter out the links and then have some simple RegEx to do the replacements. Otherwise, only a RegEx guru will be able to modify the code in the future.
Re: Regex Expression to match characters not inside of links
Posted: Fri Apr 24, 2009 1:04 am
by prometheuzz
Benjamin wrote:I'm not sure regex is the best solution for this.
...
I agree.
But, in case you were curious, here's how to account for the other two cases:
Code:
(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(</[^a]>)*(?!</|[^<>]*>)
Re: Regex Expression to match characters not inside of links
Posted: Fri Apr 24, 2009 1:12 pm
by Benjamin
That's doing the same thing it was doing for me. When you add the \b, it starts matching text inside of links again.
<a href="...">foo</a>
So what I did was use preg_split to split it on the links, then preg_replace on each chunk that wasn't a link. I think that should work well.
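The exact code was not posted, so the following is only a rough sketch of what such a split-then-replace approach might look like. The function name `replaceOutsideLinks`, the link-splitting pattern and the sample input are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, not from the thread: applies a replacement only to
// the parts of $html that lie outside <a>...</a> elements.
function replaceOutsideLinks(string $html, string $pattern, string $replacement): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured links are returned as well:
    // even indexes hold the text between links, odd indexes hold the links.
    $chunks = preg_split('@(<a\b[^>]*>.*?</a>)@is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) {
            $chunks[$i] = preg_replace($pattern, $replacement, $chunk);
        }
    }

    return implode('', $chunks);
}

echo replaceOutsideLinks(
    '<i>foo</i> and <a href="/foo">foo</a> and foo\'s',
    "@\\bfoo\\b(?:'s)?@",
    '[$0]'
);
// prints: <i>[foo]</i> and <a href="/foo">foo</a> and [foo's]
```

The simple replacement pattern used here would still touch "foo" inside attributes of non-link tags, so it would need the same kind of lookahead discussed above if that matters.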
Re: Regex Expression to match characters not inside of links
Posted: Sat Apr 25, 2009 2:07 am
by prometheuzz
Benjamin wrote:That's doing the same thing it was doing for me, ...
It works fine as far as I can tell:
Code:
$text = "<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's
<b><i>foo's</i></b>
<a href=\"\">foo</a>
<a href=\"\"><i>foo</i></a>
<a href=\"http://www.domain.com/foo\">some text</a>
xfoox";
preg_match_all("@(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(?:</[^a]>)*(?!</|[^<>]*>)@", $text, $matches);
print_r($matches);
/* output:
Array
(
[0] => Array
(
[0] => <i>foo</i>
[1] => foo
[2] => <b>foo</b>
[3] => <i>foo's</i>
[4] => foo's
[5] => <b><i>foo's</i></b>
)
)
*/
Re: Regex Expression to match characters not inside of links
Posted: Sat Apr 25, 2009 2:11 am
by Benjamin
Hmm, I was testing it in Kiki... Maybe it has a bug.

Re: Regex Expression to match characters not inside of links
Posted: Thu Jul 26, 2012 1:09 am
by Benjamin
Ha, didn't think I would end up back here.
I'm building a new system that requires very similar functionality. All I need to know is how to modify this so that I can either include OR exclude strings inside of double quotes.
Code:
'#(?:<[^a/][^>]*>)*\bMATCH STRING\b(?:\'s)?(</[^a]>)*(?!</|[^<>]*>)#i'
So essentially I need two versions. One that will match the following:
[...]"MATCH STRING"[...] but not [...]<a href="">"MATCH STRING"</a>[...]
And another that will NOT match the following
[...]"MATCH STRING"[...]