PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Wed Apr 22, 2009 4:37 pm
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
I'm currently searching Google for a regex that will match text that is NOT inside of anchor tags. I'm not familiar with the negation operators. If anyone can post the expression for me, that would be great.

EDIT: This kind of works:

<\s{0,2}[^a/].*?>(foo[\'s]{0,2})<\s{0,2}[^>].*?>|[^>]\s{0,2}(foo[\'s]{0,2})\s{0,2}[^<]\s{0,2}[^/]
 


The problem is that it matches (foo)'s instead of (foo's). It needs to match words with 's at the end as well.

Posted: Wed Apr 22, 2009 5:30 pm
Site Administrator
Joined: Wed Aug 25, 2004 7:54 pm
Posts: 13583
Location: New York, NY, US
I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.

Posted: Wed Apr 22, 2009 11:56 pm
Forum Regular
Joined: Fri Apr 04, 2008 5:51 am
Posts: 779
Benjamin wrote:
...
The problem is that it matches (foo)'s instead of (foo's). It needs to match words with 's at the end as well.


It is unclear to me what it is you're trying to match. Can you give a couple of examples for clarity?


Posted: Thu Apr 23, 2009 12:48 am
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
I would like to match:

<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's

But NOT match

<a href="">foo</a>

And it would be great if it went further and didn't match:

<a href=""><i>foo</i></a>

So, essentially it needs to match anything that isn't in an anchor tag.

Posted: Thu Apr 23, 2009 12:49 am
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
arborint wrote:
I think you want to make your pattern a sub-pattern and then negate it using the (?!subpattern) syntax.


Possibly, though it may be doable without it.

Posted: Thu Apr 23, 2009 1:21 am
Forum Regular
Joined: Fri Apr 04, 2008 5:51 am
Posts: 779
Benjamin wrote:
I would like to match:
...
So, essentially it needs to match anything that isn't in an anchor tag.


This regex handles all your examples, and also matches nested tags such as "<b><i>foo's</i></b>":

(?:<[^a/][^>]*>)*foo(?:'s)?(</[^a]>)*(?!</)


Posted: Thu Apr 23, 2009 1:59 am
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
That's perfect. Thank you very much. Can you explain how it works? I.e., what do ?: and ?! do?

Posted: Thu Apr 23, 2009 2:17 am
Forum Regular
Joined: Fri Apr 04, 2008 5:51 am
Posts: 779
Benjamin wrote:
That's perfect. Thank you very much. Can you explain how it works? ie what does ?: and ?! do?


(?:...) is a non-capturing group. The regex engine will not store what it matches in $1 (or \1) or any other numbered variable, which makes your regex a bit faster. But if your strings are not large, you can leave it out in favour of readability.

(?!...) is negative look ahead. Example: "a(?!b)" will match an 'a' only if not followed by a 'b'.
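
To see the two constructs side by side, here is a minimal PHP sketch. The sample strings and variable names are made up for illustration and are not from the thread:

```php
<?php
// Capturing vs. non-capturing: only a plain (...) fills a numbered slot.
preg_match("/(foo)(?:'s)?/", "foo's", $capturing);
// $capturing[0] is the whole match "foo's"; $capturing[1] is "foo".
// The (?:'s)? group took part in the match but has no index of its own.

// Negative look-ahead: a(?!b) matches an "a" that is not followed by "b".
preg_match_all('/a(?!b)/', 'ab ac a', $lookahead, PREG_OFFSET_CAPTURE);
// The "a" in "ab" is skipped; the ones at offsets 3 and 6 match.
$offsets = array_column($lookahead[0], 1);

var_dump($capturing, $offsets);
```

Note that the look-ahead only inspects the following text; it never becomes part of the match itself.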

A (short) explanation of the entire regex:

(?:               // open non-capturing group 1
  <[^a/][^>]*>    //   match an opening tag whose name does not start with "a"
)                 // close non-capturing group 1
*                 // group 1, zero or more times
foo               // match "foo"
(?:               // open non-capturing group 2
  's              //   match "'s"
)                 // close non-capturing group 2
?                 // group 2, zero or one time
(                 // open group 3 (capturing; (?: would work just as well)
  </[^a]>         //   match a one-letter closing tag other than </a>
)                 // close group 3
*                 // group 3, zero or more times
(?!               // start negative look ahead
  </              //   match "</"
)                 // stop negative look ahead


So, in plain English this would be:

Match as many opening tags (other than anchor tags) as possible, followed by
either "foo" or "foo's", followed by as many closing tags (other than anchor
tags) as possible. Once all of that has matched, the end of the match must
NOT be followed by the string "</" (the negative look-ahead).
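
The breakdown above can also be kept inside the pattern itself. The following is only a sketch, assuming PCRE's "x" (extended) flag and a "~" delimiter, neither of which appears in the thread. The third group is written as non-capturing here, which does not change what is matched, and the sample strings are the ones posted earlier:

```php
<?php
// The same pattern, written with the "x" flag: whitespace inside the
// pattern is ignored and "#" starts a comment that runs to end of line.
$pattern = <<<'REGEX'
~
(?: <[^a/][^>]*> )*   # opening tags whose name does not start with "a"
foo (?: 's )?         # "foo" or "foo's"
(?: </[^a]> )*        # one-letter closing tags other than </a>
(?! </ )              # the match must not be followed by "</"
~x
REGEX;

// Sample strings taken from the thread: five wanted, two unwanted.
$samples = [
    "<i>foo</i>",
    "foo",
    "<b>foo</b>",
    "<i>foo's</i>",
    "foo's",
    '<a href="">foo</a>',
    '<a href=""><i>foo</i></a>',
];

// Keep only the samples the pattern matches.
$matched = array_filter($samples, fn($s) => preg_match($pattern, $s) === 1);
print_r(array_values($matched));
```

Only the first five samples should come back; the two anchor examples are rejected by the look-ahead.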


HTH


Posted: Thu Apr 23, 2009 2:44 pm
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
Ok, I need to make some minor changes to the expression, but everything I try either has no effect or breaks it.

The regex will match foo when it's inside a tag's attributes, and also when it's part of a longer word. I need to modify it so that it will not match the following:

<a href="http://www.domain.com/foo">some text</a>

and

xfoox

Posted: Thu Apr 23, 2009 8:11 pm
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
I'm not sure regex is the best solution for this. It might be better to use preg_split to filter out the links and then have some simple RegEx to do the replacements. Otherwise, only a RegEx guru will be able to modify the code in the future.

Posted: Fri Apr 24, 2009 1:04 am
Forum Regular
Joined: Fri Apr 04, 2008 5:51 am
Posts: 779
Benjamin wrote:
I'm not sure regex is the best solution for this.
...


I agree.

But, in case you were curious, here's how to account for the other two cases:

(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(</[^a]>)*(?!</|[^<>]*>)
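
For anyone reading along, the two additions are the \b word boundaries around foo and the extra [^<>]*> alternative in the look-ahead. A small sketch of what each one does in isolation (the patterns are cut down and the test strings are illustrative assumptions, not from the thread):

```php
<?php
// \b on both sides: "foo" must be a whole word.
$word = '~\bfoo\b~';
// (?![^<>]*>): fail if the next angle bracket after the match is a ">",
// i.e. if the match sits inside a tag such as <a href="...foo">.
$notInTag = '~foo(?![^<>]*>)~';

$checks = [
    'xfoox is rejected by \b'        => preg_match($word, 'xfoox') === 0,
    'plain foo passes \b'            => preg_match($word, 'a foo b') === 1,
    'attribute value is rejected'    => preg_match($notInTag, '<a href="/foo">x</a>') === 0,
    'text between tags still passes' => preg_match($notInTag, '<i>foo</i>') === 1,
];
var_dump($checks);
```

Every entry in $checks should be true.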


Posted: Fri Apr 24, 2009 1:12 pm
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
That's doing the same thing it was doing for me: when you add the \b, it starts matching text inside of links again.

<a href="...">foo</a>

So what I did was use preg_split to split the text on the links, then run preg_replace on each chunk that wasn't a link. I think that should work well.
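
A rough sketch of that split-then-replace idea follows. This is not necessarily the code that was actually used: the function name replaceOutsideLinks, the callback argument, and the sample HTML are all hypothetical, and it assumes anchors are well formed and not nested. One reason the approach is more robust is that the single-pattern version only rejects foo when it sits directly before a closing tag, so link text such as <a href="">foo bar</a> would still match.

```php
<?php
/**
 * Replace whole-word occurrences of $word (optionally followed by 's)
 * everywhere except inside <a>...</a> elements.
 * Hypothetical helper; assumes well-formed, non-nested anchors.
 */
function replaceOutsideLinks(string $html, string $word, callable $replace): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured anchors are returned too,
    // so the chunks alternate: text, anchor, text, anchor, ...
    $chunks = preg_split('~(<a\b[^>]*>.*?</a>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $quoted = preg_quote($word, '~');
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 1) {
            continue; // odd indexes are the anchors: leave them untouched
        }
        // (?![^<>]*>) skips occurrences inside a tag's attributes.
        $chunks[$i] = preg_replace_callback(
            "~\\b{$quoted}\\b(?:'s)?(?![^<>]*>)~i",
            fn(array $m) => $replace($m[0]),
            $chunk
        );
    }
    return implode('', $chunks);
}

$html = 'foo and <a href="/foo">foo bar</a>, <i title="foo">foo\'s</i> xfoox';
echo replaceOutsideLinks($html, 'foo', fn($s) => "[$s]"), "\n";
```

Each pattern stays short enough to modify later without being a regex guru, which was the concern raised earlier in the thread.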

Posted: Sat Apr 25, 2009 2:07 am
Forum Regular
Joined: Fri Apr 04, 2008 5:51 am
Posts: 779
Benjamin wrote:
That's doing the same thing it was doing for me, ...


It works fine as far as I can tell:

$text = "<i>foo</i>
foo
<b>foo</b>
<i>foo's</i>
foo's
<b><i>foo's</i></b>
<a href=\"\">foo</a>
<a href=\"\"><i>foo</i></a>
<a href=\"http://www.domain.com/foo\">some text</a>
xfoox";
 
preg_match_all("@(?:<[^a/][^>]*>)*\bfoo\b(?:'s)?(?:</[^a]>)*(?!</|[^<>]*>)@", $text, $matches);
 
print_r($matches);
 
/* output:
Array
(
    [0] => Array
        (
            [0] => <i>foo</i>
            [1] => foo
            [2] => <b>foo</b>
            [3] => <i>foo's</i>
            [4] => foo's
            [5] => <b><i>foo's</i></b>
        )
 
)
*/


Posted: Sat Apr 25, 2009 2:11 am
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
Hmm, I was testing it in Kiki... Maybe it has a bug. :(

Posted: Thu Jul 26, 2012 1:09 am
Site Administrator
Joined: Sun May 19, 2002 10:24 pm
Posts: 6887
Ha, didn't think I would end up back here.

I'm building a new system that requires very similar functionality. All I need to know is how to modify this so that I can either include OR exclude strings inside of double quotes.

'#(?:<[^a/][^>]*>)*\bMATCH STRING\b(?:\'s)?(</[^a]>)*(?!</|[^<>]*>)#i'


So essentially I need two versions. One that will match the following:

[...]"MATCH STRING"[...] but not [...]<a href="">"MATCH STRING"</a>[...]

And another that will NOT match the following

[...]"MATCH STRING"[...]
