PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Tue Nov 01, 2005 6:33 pm
Breakbeat Nuttzer

Joined: Wed Mar 24, 2004 8:57 am
Posts: 13098
Location: Melbourne, Australia
About time I got around to doing this :)

If you didn't read the crash course before you jumped into this tutorial it may be a good idea to do so unless you already have a grasp of regex basics.

A few sites I need to point out (again):
http://www.regular-expressions.info/ (Tutorials for regex in a few programming languages)
http://www.perl.com/doc/manual/html/pod/perlre.html (Perl Documentation for TRUE perl style regex)
http://www.weitz.de/regex-coach/ (Fantastic application called Regex-Coach!)

From this point on, we might as well consider everything written here to be Perl-style regex, since the other flavours don't get particularly advanced.

A quick re-cap:

In the fast-paced crash course we looked at the metacharacters, the quantifiers, the modifiers and some PHP functions.

That gave us enough information to start constructing some simple regex. However, even with that basic knowledge you may still hit a few hurdles trying to do some fancy things with regex.

Ready? We're off!

How does the regex engine work?

In a purely technical sense I really wouldn't have a clue. But in a conceptual sense it works like this...

The regex engine walks through the pattern and the string it's checking side by side. Once the engine is satisfied that everything in the pattern has been matched, it stops and does not look any further into the remaining string (unless you ask it to, e.g. with preg_match_all()).
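
To see that stopping behaviour from PHP, here's a rough sketch. The sample string and variable names are made up for illustration; the point is only that preg_match() gives up after the first match while preg_match_all() carries on through the rest of the string.

```php
<?php
// Hypothetical sample string, not taken from the tutorial.
$subject = 'cat bat rat';

// preg_match() stops as soon as the pattern has matched once.
preg_match('/[a-z]at/', $subject, $first);

// preg_match_all() keeps scanning the remainder of the string.
preg_match_all('/[a-z]at/', $subject, $all);

print_r($first);  // holds only "cat"
print_r($all[0]); // holds "cat", "bat" and "rat"
```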

By default, the regex engine will try to match as *much* as the pattern allows it to match.

Quantifiers can change the behaviour of the regex engine quite significantly and can cause hours of confusion and annoyance among developers. This is to do with something we call "greediness" in regex terms. Let's look at this more closely.

Pattern Greediness in Regular Expressions:

If you use a quantifier which allows matching of characters up to any number of times, the regex engine will try to fulfill that requirement as best it can.

String: Foo ###123 bar
Syntax:
/^[a-z]+.*(\d+)/i


To anybody not looking closely, the above regex appears to extract the "123" from the string. In actual fact, it does not do this...

Syntax:
Array
(
    [0] => Foo ###123
    [1] => 3
)


So what happened?

"[a-z]+" .. OK that's good
".*" .. Any character any number of times. This is where it collapses.

The dot-star combination is the evilest of the greedy patterns because it really will just match everything it can (except newline chars, without the "s" modifier).

Foo was picked up by the character class [a-z] since our pattern used the "i" modifier. The .* then swallowed the rest of the string and had to give characters back one at a time until the next part of the pattern, (\d+), could match. Because the "+" quantifier only needs one digit to be satisfied, the engine stopped giving back as soon as a single "3" was free, so .* kept everything else.

So how do you fix that issue? -- Answer: You follow the greedy quantifier with a "?". This makes that part of the pattern "ungreedy" (also called "lazy").

Syntax:
/^[a-z]+.*?(\d+)/i


Produces

Syntax:
Array
(
    [0] => Foo ###123
    [1] => 123
)


Essentially, we've told the regex engine to match as little as possible with .*? and, before taking each extra character, to check whether the next part of the pattern can already match.
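
If you want to try the two patterns side by side, something like this should do it. The string and patterns are the ones from above; the variable names are just mine.

```php
<?php
$subject = 'Foo ###123 bar';

// Greedy: .* grabs as much as it can and leaves only one digit for (\d+).
preg_match('/^[a-z]+.*(\d+)/i', $subject, $greedy);

// Ungreedy: .*? grabs as little as it can, so (\d+) gets the whole number.
preg_match('/^[a-z]+.*?(\d+)/i', $subject, $lazy);

echo $greedy[1], "\n"; // 3
echo $lazy[1], "\n";   // 123
```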

Note: There is a "U" pattern modifier which flips the greediness of the entire pattern, so quantifiers are ungreedy by default and a trailing "?" makes them greedy again... use with caution!
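
A small sketch of why the caution is deserved, reusing the earlier string (variable names are my own). With "U" every quantifier turns ungreedy, including the \d+ that we actually wanted to be greedy, so the capture shrinks to a single digit unless that one quantifier is flipped back with "?".

```php
<?php
$subject = 'Foo ###123 bar';

// With U, [a-z]+, .* and \d+ are ALL ungreedy, so (\d+) settles for one digit.
preg_match('/^[a-z]+.*(\d+)/iU', $subject, $allLazy);

// Under U a trailing "?" makes a quantifier greedy again.
preg_match('/^[a-z]+.*(\d+?)/iU', $subject, $digitsGreedy);

echo $allLazy[1], "\n";      // 1
echo $digitsGreedy[1], "\n"; // 123
```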

From this point on... the tutorial is covering some advanced concepts. You'll only really need to use this stuff when you are writing very long patterns etc but anyway...

Special commands:

Regex can do some really clever things using instructions in the middle of the pattern. The syntax for providing these instructions is

Syntax:
(?instruction)


We'll bring this into play from here on.

Mid-pattern modifiers:

You've seen that you can modify the behaviour of a pattern by adding some letters after the closing delimiter. Brilliant! Guess what, we can twist and bend the behaviour of our regex mid-pattern by doing something similar ;) You'll like this.

The basic syntax is like this

Syntax:
<<< Wrap the part in parens to TURN ON the modifier for that part >>>

(?i: ... )
(?m: ... )
(?s: ... )
(?U: ... )

<<< Wrap the part in parens to TURN OFF the modifier for that part >>>

(?-i: ... )
(?-m: ... )
(?-s: ... )
(?-U: ... )

Note the colon after the modifier letter. If you leave out the colon and the wrapped part, e.g. (?i) or (?-i), the change applies from that point up to the end of the enclosing group (or the end of the pattern).

Those letters, "i", "m" etc are the same pattern modifiers we used in the crash course... but now we can use them *inside* our pattern. This is only really handy if you have some very non-uniform string to match or you are writing a very long pattern.

An example usage:

String 1: Where IS the UK?
String 2: where is the UK?

Syntax:
/[a-z\s]*?(?-i:UK)/i


Let's say we always want UK to be in uppercase but the rest of the string is likely to have uppercase and lowercase characters in different places depending on who typed it. We use the "i" modifier to account for the differences in the way people write... but what about UK being uppercase? Here we have disabled the "i" modifier for that specific part of the pattern so it matches uppercase UK specifically. Both strings above match, but "where is the uk?" would not.
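
Here's a quick sketch to check that behaviour, written with the scoped form (?-i: ... ) and, for comparison, the flag form (?-i). The third test string is one I made up to show a failing case.

```php
<?php
// Scoped form: "i" is switched off only for the wrapped "UK".
$scoped = '/[a-z\s]*?(?-i:UK)/i';
// Flag form: "i" is switched off from this point to the end of the pattern.
$flag = '/[a-z\s]*?(?-i)UK/i';

$tests = ['Where IS the UK?', 'where is the UK?', 'where is the uk?'];
$results = [];
foreach ($tests as $string) {
    // preg_match() returns 1 for a match and 0 for no match.
    $results[$string] = [preg_match($scoped, $string), preg_match($flag, $string)];
}

print_r($results); // first two strings match with both forms, the lowercase "uk" one does not
```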

Lookaheads:

Lookaheads come in two flavours. Positive and negative. What they do is check whether a particular string follows part of the pattern, without making that string part of the match. You won't often need these, since you can normally just put the string itself into the pattern, but they are handy when you want to test what comes next without consuming it.

Syntax:
pattern(?= ... )


In the above, "pattern" *must* be followed by whatever is after the "?=" in the parens.

String: Sunshine
Syntax:
/[a-z]+(?=shine)/i


The above pattern matches the word "Sun"... it would also match the "Moon" in Moonshine.

Negative lookaheads mean that part of the pattern *must not* be followed by the lookahead.

Syntax:
pattern(?! ... )


Example:
Syntax:
/sun(?!shine)[a-z]*/i


The above pattern will match any word starting with "sun" but NOT starting with "sunshine".
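
Both lookahead patterns can be tried out like so. The test sentence for the negative lookahead is just something I picked; the patterns are the ones from above.

```php
<?php
// Positive lookahead: "shine" must follow, but is not part of the match.
preg_match('/[a-z]+(?=shine)/i', 'Sunshine', $sun);
preg_match('/[a-z]+(?=shine)/i', 'Moonshine', $moon);
echo $sun[0], ' ', $moon[0], "\n"; // Sun Moon

// Negative lookahead: "sun" must NOT be followed by "shine".
preg_match_all('/sun(?!shine)[a-z]*/i', 'sunlight sunshine Sunday', $words);
print_r($words[0]); // sunlight and Sunday, but not sunshine
```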

Fixed-width Lookbehinds:

These can prove very useful. They have one drawback however: because of the way the regex engine works, it needs to know the length of whatever goes in the lookbehind, so you can't use open-ended quantifiers such as "+" or "*" inside it. Otherwise they do exactly the same as lookaheads except that they look backwards. The pattern a lookbehind applies to must, or must not, come straight after the text in the lookbehind, depending upon whether it is positive or negative.

Syntax for positive:
Syntax:
(?<= ... )pattern


Notice that the lookbehind physically goes before the pattern?

Example positive lookbehind:
Syntax:
/(?<=sun)[a-z]+/i


The above pattern will match the end of any word starting with "sun", such as "shine" or "light".

Negative lookbehind syntax:
Syntax:
(?<! ... )pattern


Example negative lookbehind:
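Syntax:
/\b[a-z]+(?<!a)\b/i


The above matches any word which does NOT end with a letter "a". The \b assertions just make sure we're at the start and the end of a word. (To skip words that *start* with "a" you'd want a negative lookahead, \b(?!a)[a-z]+, because a lookbehind placed at the start of a word only sees whatever comes before the word.)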
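
A short sketch of both kinds of lookbehind in PHP. The positive pattern is the one from above; the negative pattern and its test string are hypothetical ones chosen so the result is easy to check by eye.

```php
<?php
// Positive lookbehind: grab the letters that come straight after "sun".
preg_match('/(?<=sun)[a-z]+/i', 'Sunshine', $after);
echo $after[0], "\n"; // shine

// Negative lookbehind (hypothetical pattern): "shine" NOT preceded by "sun".
// PREG_OFFSET_CAPTURE also reports where in the string the match starts.
preg_match('/(?<!sun)shine/i', 'sunshine moonshine', $notAfter, PREG_OFFSET_CAPTURE);
echo $notAfter[0][0], ' at offset ', $notAfter[0][1], "\n"; // shine at offset 13
```

The offset shows that the "shine" inside "sunshine" was skipped and the one inside "moonshine" was picked up.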

Grouping with parens without extracting:

If you surround parts of your pattern in parens they will end up in backreferences as you've seen. These special commands with (? ... ) don't behave in that way however. There's a little command that simply tells the regex engine to group characters, but not extract them.

Syntax:
(?: ... )


Example:
Syntax:
/Foo(?:bar)+/


That matches Foobar, Foobarbar, Foobarbarbarbarbarbar ... etc etc. The advantages of using that little command are that you'll save a negligible amount of memory and speed up the matching slightly. In the real world you'll use these a fair bit and they prove to be very handy at preventing things from getting cluttered in larger patterns.
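
To see what "group but not extract" means in practice, compare the matches array with and without the "?:". This is only a sketch; the capturing variant is my own addition for contrast.

```php
<?php
// Non-capturing group: nothing extra lands in the matches array.
preg_match('/Foo(?:bar)+/', 'Foobarbar', $grouped);

// Ordinary parens: the last repetition of the group is extracted as [1].
preg_match('/Foo(bar)+/', 'Foobarbar', $captured);

print_r($grouped);  // only [0] => Foobarbar
print_r($captured); // [0] => Foobarbar, [1] => bar
```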

Extracting named backreferences:

This is nice and all, but it may confuse anyone who only knows pretty standard regex. It basically allows you to name all your backreferences (extracted parts) so that you can make more readable code.

Syntax:
(?P<Name> ... )


That's an UPPERCASE "P" and those less-than/greater-than symbols really are supposed to be there!

Example:
Let's reuse our first example
String: Foo ###123 bar

Syntax:
/^[a-z]+.*?(?P<thenumber>\d+)/i


This produces the following
Syntax:
Array
(
    [0] => Foo ###123
    [thenumber] => 123
    [1] => 123
)


Notice that it doesn't replace the numeric backreference altogether, it simply adds a named one too.
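
In PHP code the named backreference is simply an extra string key in the matches array. A minimal sketch using the pattern above (the variable name $matches is my choice):

```php
<?php
if (preg_match('/^[a-z]+.*?(?P<thenumber>\d+)/i', 'Foo ###123 bar', $matches)) {
    // The same captured text is available by name and by number.
    echo $matches['thenumber'], "\n"; // 123
    echo $matches[1], "\n";           // 123
}
```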

I feel like I've taken you far enough with this now and all you can do is to keep practising and using regex.

Don't forget you can nest these little commands too ;) ...

Syntax:
/[a-z]+\d(?-i:FOO(?=(?i:bar)))/i
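
A rough way to poke at a nested pattern like this one, written with the scoped forms (?-i: ... ) and (?i: ... ). The test strings are invented for the purpose.

```php
<?php
// Case-insensitive overall, "FOO" must be uppercase, "bar" may be any case.
$pattern = '/[a-z]+\d(?-i:FOO(?=(?i:bar)))/i';

// Hypothetical test strings.
var_dump(preg_match($pattern, 'abc1FOObar', $m)); // int(1)
echo $m[0], "\n";                                  // abc1FOO ("bar" is only looked at, not matched)
var_dump(preg_match($pattern, 'ABC1FOOBAR'));     // int(1)
var_dump(preg_match($pattern, 'abc1foobar'));     // int(0), lowercase "foo" is rejected
```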


Enjoy playing with those advanced features!

If I've made any mistakes please have a whinge so that I can correct them :D

Have fun!


Last edited by Chris Corbyn on Tue Nov 01, 2005 6:51 pm, edited 1 time in total.

Posted: Tue Nov 01, 2005 6:46 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4712
Location: Eden, Utah
woot woot! U da man d11!


Posted: Sat Nov 05, 2005 3:06 am
DevNet Resident

Joined: Fri Dec 24, 2004 3:59 am
Posts: 1452
Location: Lucknow, UP, India
named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:


Posted: Sat Nov 05, 2005 9:06 pm
Breakbeat Nuttzer

Joined: Wed Mar 24, 2004 8:57 am
Posts: 13098
Location: Melbourne, Australia
n00b Saibot wrote:
named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:


I'm not 100% sure. I know they work in all the PCRE stuff in PHP4 and 5.... In which case I guess they work in Perl itself, but it's been a while since I used perl to test that. JavaScript doesn't work with that syntax either.... at least, Firefox gives an "Invalid Quantifier" error ;)

I'd tend to avoid using it unless you have a real need for it :D

EDIT | Doesn't seem to agree very well with the perl regex engine :(

Syntax:
#!/usr/bin/perl

$foo = "Foo 123";

$foo =~ /Foo (?P<Num>\d+)/;


Errors...
Syntax:
Sequence (?P...) not recognized in regex; marked by <-- HERE in m/Foo (?P <-- HERE <Num>\d+)/ at foo.pl line 5


So all I can say is it works in PHP and apparently not a lot else :P


Posted: Sun Nov 06, 2005 11:27 am
Moderator

Joined: Mon Nov 03, 2003 7:13 pm
Posts: 5975
Location: Odessa, Ukraine
d11wtq wrote:
So all I can say is it works in PHP and apparently not a lot else :P

It works in Python (that's where it started), PCRE and .NET. The MS version, as usual, uses its own proprietary syntax, incompatible with Python and PCRE.

Support for named backreferences is planned for Perl 6.


Posted: Wed Nov 09, 2005 1:26 pm
DevNet Master

Joined: Wed Feb 11, 2004 4:23 pm
Posts: 4872
Location: Palm beach, Florida
in perl i think it is

(?<Suffix>

instead of

(?P<Suffix>

great tutorial, learned a few things
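
For what it's worth, the (?<name> form suggested above is also accepted by PHP's preg functions, at least in the reasonably recent PHP versions I'm assuming here. A small sketch comparing the two spellings on the Perl example's string:

```php
<?php
$subject = 'Foo 123';

// Python-style name, as used in the tutorial.
preg_match('/Foo (?P<Num>\d+)/', $subject, $pythonStyle);

// The (?<name> spelling suggested in the post above.
preg_match('/Foo (?<Num>\d+)/', $subject, $perlStyle);

var_dump($pythonStyle['Num'] === $perlStyle['Num']); // bool(true)
```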


Posted: Tue Jan 31, 2006 8:23 pm
DevNet Resident

Joined: Sat Jun 11, 2005 6:57 am
Posts: 1451
Location: London, UK
I have a few doubts... why can't ordinary expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think the '!' forms, (?! ... ) and (?<! ... ), are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i


Posted: Wed Feb 01, 2006 4:30 am
Breakbeat Nuttzer

Joined: Wed Mar 24, 2004 8:57 am
Posts: 13098
Location: Melbourne, Australia
raghavan20 wrote:
I have a few doubts... why can't ordinary expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think the '!' forms, (?! ... ) and (?<! ... ), are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i


You're right.... the caret doesn't act as a negation operator anywhere except inside character classes, e.g. [^abc].

The examples I gave for the lookaheads/lookbehinds weren't exactly real-life examples since in reality you'd be using these in fairly complex expressions.... and that would scare some people away from regex in the scope of this ;)
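
One practical difference between the "Original" and "Replacement" patterns quoted above is what ends up in the match: a lookaround only checks the text, whereas ordinary parens consume it. A sketch to illustrate (the replacement word "Moon" is just made up):

```php
<?php
$subject = 'sunshine';

preg_match('/(?<=sun)[a-z]+/i', $subject, $lookbehind);
preg_match('/(sun)[a-z]+/i', $subject, $group);

print_r($lookbehind); // [0] => shine
print_r($group);      // [0] => sunshine, [1] => sun

// The difference shows up in replacements too.
echo preg_replace('/[a-z]+(?=shine)/i', 'Moon', 'Sunshine'), "\n"; // Moonshine
echo preg_replace('/[a-z]+(shine)/i', 'Moon', 'Sunshine'), "\n";   // Moon
```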


Posted: Fri Oct 06, 2006 12:19 pm
Forum Newbie

Joined: Fri Jun 09, 2006 12:11 pm
Posts: 4
Thanks.

Great work on a complex subject. It helps me a lot.

Thanks again.


Posted: Wed Jun 18, 2008 4:08 am
Forum Commoner

Joined: Tue Jun 19, 2007 10:54 am
Posts: 73
Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers to where I can learn these advanced concepts in great detail (instead of a crash course)?


Posted: Wed Jun 18, 2008 8:49 am
Forum Contributor

Joined: Sun Oct 22, 2006 1:47 am
Posts: 274
Location: Belgium
claws wrote:
Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers to where I can learn these advanced concepts in great detail (instead of a crash course)?


"Mastering Regular Expressions" by Jeffrey Friedl definitely lives up to its title.

http://oreilly.com/catalog/9780596002893/


Posted: Fri Dec 18, 2009 6:46 pm
Forum Contributor

Joined: Sat Jan 03, 2009 4:27 pm
Posts: 148
raghavan20 wrote:
I have a few doubts... why can't ordinary expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


Perhaps a better example would be:

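Syntax:
/^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$/i


Which forces the string to be between 5 and 10 characters inclusive in length. Note the $ inside the lookahead: without it the lookahead only checks that at least 5 characters follow, and longer strings would still get through.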
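
A quick sketch to check the length limits, assuming the lookahead is anchored with $ so that the upper bound is enforced as well. The unanchored variant is included only for comparison, and the input strings are hypothetical.

```php
<?php
// Lookahead anchored with $ so that the upper limit of 10 is enforced too.
$anchored = '/^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$/i';
// Without the $ the lookahead only proves that at least 5 characters follow.
$unanchored = '/^(?=.{5,10})[a-z]{1,9}[0-9]{1,9}$/i';

// Hypothetical inputs: 5, 3, 10 and 11 characters long.
$inputs = ['abc12', 'ab1', 'abcdefgh12', 'abcdefgh123'];
foreach ($inputs as $input) {
    printf(
        "%-12s anchored=%d unanchored=%d\n",
        $input,
        preg_match($anchored, $input),
        preg_match($unanchored, $input)
    );
}
```

Only the 11-character input tells the two patterns apart: the anchored one rejects it, the unanchored one lets it through.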

