Regular Expressions question

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

visionmaster
Forum Contributor
Posts: 139
Joined: Wed Jul 14, 2004 4:06 am

Post by visionmaster »

Hello,

I always get 'A match was found.' What am I doing wrong?
I really just want to match distinct words. I should actually get a 'A match was NOT found.', but I don't.

"Meller Str. 33 · 49082 Osnabrueck"
->should match
"Meller Str. 33 · 49082 Osnabrueck"
->should match
"Meller Str. 33 · 490821 Osnabrueck"
->should not match

Probably a small mistake with a big impact...

Code:

$arrDaten['Plz'] = '49082';
$arrDaten['Ort'] = 'Osnabrueck';
/* The \b in the pattern indicates a word boundary, so only the distinct
 * word "49082" should match, and not a longer number like "490821" */
$pattern = "'|\b".preg_quote($arrDaten['Plz'])."\b\s+\b".preg_quote($arrDaten['Ort'])."\b|i'";
$value = "Meller Str. 33 · 490821 Osnabrueck";
echo $pattern;
if (preg_match($pattern, $value))
{
    echo "A match was found.";
}
else
{
    echo "A match was NOT found.";
}
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

I don't know why you're using those single quotes next to the delimiter in your pattern, but try it without them:

Code:

$pattern = "|\b".preg_quote($arrDaten['Plz'])."\b\s+\b".preg_quote($arrDaten['Ort'])."\b|i";

I think you're trying to use the pipe char as your delimiter, but the regex engine takes your single quote as the delimiter because it comes first.

So the pipe char becomes a branch operator. Since there is nothing before the first pipe, the first branch is empty, and it might be matching "nothing". Since regexes are eager to match, the first way to complete the match is to match nothing, which the engine does, and then it doesn't even try the other branches.

just my theory though :)
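
The theory above can be checked with a small sketch. It assumes the postal code and town from the original post are written directly into the pattern; the variable names ($quoted, $piped and so on) are made up for illustration. It compares the pattern wrapped in single quotes with the same pattern using the pipe as the real delimiter:

```php
<?php
// The pattern as originally posted: the leading ' is taken as the delimiter,
// so each | is an alternation operator and the trailing i is a literal
// third alternative rather than the case-insensitive modifier.
$quoted = "'|\\b49082\\b\\s+\\bOsnabrueck\\b|i'";

// The same pattern with | as the real delimiter and i as a modifier.
$piped = "|\\b49082\\b\\s+\\bOsnabrueck\\b|i";

$value = "Meller Str. 33 · 490821 Osnabrueck";

$quotedResult = preg_match($quoted, $value, $quotedMatches);
$pipedResult  = preg_match($piped, $value, $pipedMatches);

var_dump($quotedResult);      // int(1): the empty first alternative matches
var_dump($quotedMatches[0]);  // string(0) "": what was "matched" is nothing
var_dump($pipedResult);       // int(0): 490821 is not the distinct word 49082
```

Looking at the captured text in $quotedMatches[0] is a quick way to spot this kind of problem: a successful match on an empty string usually means an alternative in the pattern is empty.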
visionmaster
Forum Contributor
Posts: 139
Joined: Wed Jul 14, 2004 4:06 am

Post by visionmaster »

Hi rehfeld,
Thanks for your hint and explanation, you're absolutely right! This is how it works:

$pattern = "|\b".preg_quote($arrDaten['Plz'])."\b\s+\b".preg_quote($arrDaten['Ort'])."\b|i";
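
For anyone who wants to try the corrected pattern against the sample addresses, here is a minimal sketch. The helper name matchesPlzOrt() is hypothetical and not from the thread, and the second sample string (extra spaces, lowercase town) is made up to exercise \s+ and the i modifier. Passing the delimiter as the second argument to preg_quote() is an addition the thread does not use; it only matters if the input could contain a pipe.

```php
<?php
// Hypothetical helper, not from the thread: builds the corrected pattern
// and reports whether postal code and town appear as distinct words.
function matchesPlzOrt(string $plz, string $ort, string $value): bool
{
    // Passing '|' as the second argument makes preg_quote() escape the
    // delimiter as well, should it ever occur in the input.
    $pattern = '|\b' . preg_quote($plz, '|') . '\b\s+\b'
             . preg_quote($ort, '|') . '\b|i';

    return preg_match($pattern, $value) === 1;
}

var_dump(matchesPlzOrt('49082', 'Osnabrueck', 'Meller Str. 33 · 49082 Osnabrueck'));   // bool(true)
var_dump(matchesPlzOrt('49082', 'Osnabrueck', 'Meller Str. 33 · 49082   osnabrueck')); // bool(true)
var_dump(matchesPlzOrt('49082', 'Osnabrueck', 'Meller Str. 33 · 490821 Osnabrueck'));  // bool(false)
```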
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

When writing regular expressions, it is wise to try to use a pattern delimiter that is not a special token, and doesn't appear in the pattern.

You use | as the delimiter. | is a special regexp token, which may lead to confusion when reading the pattern later.

Furthermore it is extremely common to use / as the delimiter. (Probably because that seems to be very common over at Perl. Don't know if it's the only one they use.) But I would encourage anyone NOT to use / as delimiter when writing regexp for XML, URL or unix-path stuff, since all these things often include / in the string we want to match.

Consider three cases:

Code:

Match tag or end tag:
/<\/?[a-z]+[^>]*?\/?>/i

Match climbing path:
/\/[^\/]*\/\.\.\//

Match [protocol]://:
/[a-z]+:\/\//i
Wouldn't it be a lot nicer to use some other delimiter that doesn't appear in the pattern?

Code:

Match tag or end tag:
%</?[a-z]+[^>]*?/?>%i

Match climbing path:
%/[^/]*/\.\./%

Match [protocol]://:
%[a-z]+://%i
I am not sure those patterns of mine actually work as intended, but my point is readability.
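
As a rough check of the % versions, the sketch below runs them through preg_match(). The variable names and sample inputs are made up for illustration, and this only shows that the patterns compile and match these few strings, not that they are robust HTML or URL parsers.

```php
<?php
// The three patterns with % as the delimiter, so that no forward slash
// needs escaping.
$tag      = '%</?[a-z]+[^>]*?/?>%i';
$climb    = '%/[^/]*/\.\./%';
$protocol = '%[a-z]+://%i';

// Sample inputs are invented for this sketch.
var_dump(preg_match($tag, '<br />'));                  // int(1)
var_dump(preg_match($tag, '</div>'));                  // int(1)
var_dump(preg_match($climb, '/var/www/../etc/'));      // int(1)
var_dump(preg_match($climb, '/var/www/etc/'));         // int(0)
var_dump(preg_match($protocol, 'http://example.com')); // int(1)
```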
Last edited by Heavy on Tue Jun 28, 2005 3:45 am, edited 2 times in total.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Regexps are notoriously difficult to read. :P
Skara
Forum Regular
Posts: 703
Joined: Sat Mar 12, 2005 7:13 pm
Location: US

Post by Skara »

They're not so hard to read if you do what Heavy said. ;)
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

Hehe, thanks!
You know, regexp can really bite you.
As soon as you get the hang of it, you start visionizing about changing all the editors and all the parsers in the world... :wink: