filter help

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

netstorm
Forum Newbie
Posts: 5
Joined: Mon Jun 09, 2008 4:59 am

filter help

Post by netstorm »

Ok, so I'm trying to create a filter and my problem is with word boundaries and escaping characters.

I have a $sentence I look through and a $word I'm trying to match. The $word can be anything, but it's not an expression, so I escape the regex operators: $word = addcslashes($word,".[]()^$/*+|"); Everything works fine up to this point, and the following example returns what's expected:

Code: Select all

 
$word="[xyz]*";
$sentence1="[xyz]*";
$sentence2="xyz";
 
$word = addcslashes($word,".[]()^$/*+|");  // $word becomes "/[xyz/]/*"
 
preg_match("/$word/i", $sentence1); //returns true 
preg_match("/$word/i", $sentence2); //returns false
 
That's exactly what I want it to do; the problems start when I add word boundaries:

Code: Select all

 
$word="[xyz]*";
$sentence1="[xyz]*";
$sentence2="xyz";
 
$word = addcslashes($word,".[]()^$/*+|");  // $word becomes "/[xyz/]/*"
 
preg_match("/\b$word\b/i", $sentence1); //returns false <--problem!!
preg_match("/\b$word\b/i", $sentence2); //returns false
 
Can anyone please tell me what I'm doing wrong? :(
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

A word boundary (\b) does not match the start, or end of your string. So, try to use this instead:

Code: Select all

'/(\b|^)$word(\b|$)/'
Also, I presume that your "escaping method" returns "\[xyz\]\*" instead of "/[xyz/]/*". Did you know that \Q causes the regex to ignore all the metacharacters that follow it, so you don't need to escape them? So, try something like this:

Code: Select all

'/(\b|^)\Q[xyz]*(\b|$)/'
HTH
netstorm
Forum Newbie
Posts: 5
Joined: Mon Jun 09, 2008 4:59 am

Re: filter help

Post by netstorm »

Thanks for the fast reply :D
prometheuzz wrote: I presume that your "escaping method" returns "\[xyz\]\*" instead of "/[xyz/]/*".
Yes, it does, It was a typo :)
prometheuzz wrote: Did you know that \Q causes the regex to ignore all the metacharacters that follow it, so you don't need to escape them? So, try something like this:

Code: Select all

'/(\b|^)\Q[xyz]*(\b|$)/'
HTH
I didn't know about \Q, but I tried what you said and it didn't work...

Code: Select all

preg_match('/(\b|^)\Q[xyz]*(\b|$)/', '[xyz]*') //returns false
Help again? :)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

netstorm wrote:Thanks for the fast reply :D

...

I didn't know about \Q, but I tried what you said and it didn't work...

Code: Select all

preg_match('/(\b|^)\Q[xyz]*(\b|$)/', '[xyz]*') //returns false
Help again? :)
Sorry, I forgot to mention you need to end the \Q, otherwise the entire regex after the \Q will be matched "as is", so this is really what I meant:

Code: Select all

preg_match('/(\b|^)\Q[xyz]*\E(\b|$)/', '[xyz]*')
I have no PHP interpreter at my disposal at the moment, but it should be ok.
;)

HTH
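
For a $word that is only known at run time, PHP's built-in preg_quote() is another way to get a literal match: it backslash-escapes regex metacharacters and, given a second argument, the delimiter as well. A minimal sketch, reusing the strings from the first post (the variable names $quoted and $pattern are only for illustration):

```php
<?php
$word = "[xyz]*";

// preg_quote() backslash-escapes regex metacharacters; the second
// argument also escapes the "/" delimiter if it occurs in $word.
$quoted = preg_quote($word, '/');   // \[xyz\]\*

$pattern = '/' . $quoted . '/i';

var_dump(preg_match($pattern, "[xyz]*")); // int(1)
var_dump(preg_match($pattern, "xyz"));    // int(0)
```

Unlike \Q...\E, this approach should keep working even if $word itself happens to contain the sequence \E, because the backslash gets escaped too.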
netstorm
Forum Newbie
Posts: 5
Joined: Mon Jun 09, 2008 4:59 am

Re: filter help

Post by netstorm »

Thanks! Well, it half works :D

Code: Select all

 
preg_match('/(\b|^)\Q[xyz]*\E(\b|$)/', '[xyz]*'); //returns true
preg_match('/(\b|^)\Q[xyz]*\E(\b|$)/', 'smtg [xyz]* smtg'); //returns false
 
Trying to fix it myself, I noticed that it matches 'b[xyz]*b' so the (\b|$) somehow gives the character 'b' and not the operator '\b' for word boundary. Any ideas?

Code: Select all

 
preg_match('/(\b|^)\Q[xyz]*\E(\b|$)/', 'b[xyz]*b'); //returns true
 
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

When testing the following:

Code: Select all

preg_match('/\b\Q[xyz]*\E\b/i', 'smtg [xyz]* sm[xyz]*tg', $result)
on http://regex.larsolavtorvik.com/

$result is printed as:

Code: Select all

Array
(
    [0] => [xyz]*
)
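
To see which of the two occurrences that pattern actually picks, PREG_OFFSET_CAPTURE can be passed to preg_match() so the match position is reported as well. A small sketch (the variable names are only illustrative):

```php
<?php
$subject = 'smtg [xyz]* sm[xyz]*tg';

// PREG_OFFSET_CAPTURE turns each entry into a [matched text, byte offset] pair.
preg_match('/\b\Q[xyz]*\E\b/i', $subject, $result, PREG_OFFSET_CAPTURE);

list($text, $offset) = $result[0];

echo $text, ' matched at offset ', $offset, "\n";                          // offset 14
echo 'first occurrence starts at ', strpos($subject, '[xyz]*'), "\n";      // offset 5
```

The standalone "[xyz]*" starts at offset 5, while the reported match starts at offset 14, inside "sm[xyz]*tg".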
netstorm
Forum Newbie
Posts: 5
Joined: Mon Jun 09, 2008 4:59 am

Re: filter help

Post by netstorm »

Yeah, it matches the second occurrence, not the first one :D. So instead of getting me the separate word, it gets it only if it's inside the word :|.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

w.r.t. my previous reply:

Now all of a sudden, the tool I posted in my previous reply gives something different... I'll have a look at this when I get home and can actually test the stuff I post here.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

netstorm wrote:Yeah, it matches the second occurrence, not the first one :D. So instead of getting me the separate word, it gets it only if it's inside the word :|.
Wait, it's because '[', ']' and '*' are non-word characters themselves, so \b only matches next to them when there is a word character on the other side. You could solve it using lookaround.

Both:

Code: Select all

preg_match('/(?<=^|[\s])\Q[xyz]*\E(?=[\s]|$)/i', '[xyz]*', $result);
# and
preg_match('/(?<=^|[\s])\Q[xyz]*\E(?=[\s]|$)/i', 'aaa [xyz]* aaa', $result);
should evaluate to true and match "[xyz]*".

Of course, you could expand the character class [\s] by adding punctuation marks to it.
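
Putting the lookaround idea together with a variable $word, a helper along these lines might work. This is only a sketch: the function name containsStandaloneWord is made up for illustration, preg_quote() stands in for the hard-coded \Q...\E, and "standalone" is assumed to mean "surrounded by whitespace or the string edges":

```php
<?php
// Hypothetical helper, not from the thread.
function containsStandaloneWord(string $word, string $sentence): bool
{
    // Escape metacharacters (and the "/" delimiter) so $word matches literally.
    $quoted = preg_quote($word, '/');

    // (?<=^|\s): preceded by start of string or whitespace.
    // (?=\s|$):  followed by whitespace or end of string.
    $pattern = '/(?<=^|\s)' . $quoted . '(?=\s|$)/i';

    return preg_match($pattern, $sentence) === 1;
}

var_dump(containsStandaloneWord('[xyz]*', '[xyz]*'));          // bool(true)
var_dump(containsStandaloneWord('[xyz]*', 'aaa [xyz]* aaa'));  // bool(true)
var_dump(containsStandaloneWord('[xyz]*', 'smtg sm[xyz]*tg')); // bool(false)
var_dump(containsStandaloneWord('[xyz]*', 'xyz'));             // bool(false)
```

To treat punctuation as a separator too, the \s in both lookarounds could be replaced by a character class such as [\s.,;!?].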
netstorm
Forum Newbie
Posts: 5
Joined: Mon Jun 09, 2008 4:59 am

Re: filter help

Post by netstorm »

They do and thank you very much for all your help! :D
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

netstorm wrote:They do and thank you very much for all your help! :D
Good to hear it, and you're welcome!
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: filter help

Post by GeertDD »

prometheuzz wrote:A word boundary (\b) does not match the start, or end of your string.
It does. Well, it does when there is a word boundary.

Code: Select all

preg_match('~\bhello~', 'hello'); // returns (int)1
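
The difference can be checked directly: \b matches at the start or end of the string only when the adjacent character is a word character (a letter, digit or underscore). A short sketch using the strings from this thread:

```php
<?php
// "h" is a word character, so there is a boundary at the start of the string.
var_dump(preg_match('~\bhello~', 'hello'));          // int(1)

// "[" is not a word character, so there is no boundary before it.
var_dump(preg_match('~\b\[xyz\]\*~', '[xyz]*'));     // int(0)

// A word character on each side of "[xyz]*" creates the boundaries.
var_dump(preg_match('~\b\[xyz\]\*\b~', 'b[xyz]*b')); // int(1)
```

The last line also shows why 'b[xyz]*b' matched earlier in the thread: the \b is satisfied by the transition between "b" and "[", not by a literal "b" in the pattern.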
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: filter help

Post by prometheuzz »

GeertDD wrote:
prometheuzz wrote:A word boundary (\b) does not match the start, or end of your string.
It does. Well, it does when there is a word boundary.

Code: Select all

preg_match('~\bhello~', 'hello'); // returns (int)1
You're right of course, I was a bit confused because the string to match started and ended with non-word characters. Good to have it on the record!

Well Geert, that's what happens if you leave me alone in here too long!
; )