(b) Regex Advanced tutorial - (CRASH Course Pt. 2)

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

About time I got around to doing this :)

If you didn't read the crash course before you jumped into this tutorial it may be a good idea to do so unless you already have a grasp of regex basics.

A few sites I need to point out (again):
http://www.regular-expressions.info/ (Tutorials for regex in a few programming languages)
http://www.perl.com/doc/manual/html/pod/perlre.html (Perl Documentation for TRUE perl style regex)
http://www.weitz.de/regex-coach/ (Fantastic application called Regex-Coach!)

From this point on, we might as well consider everything written to be Perl-style regex, since the other flavours don't get particularly advanced.

A quick re-cap:

In the super-speedy paced crash course we looked at the metacharacters, the quantifiers, the modifiers and some PHP functions.

That gave us enough information to start constructing some simple regex. However, even with that basic knowledge you may still hit a few hurdles trying to do some fancy things with regex.

Ready? We're off!

How does the regex engine work?

In a pure technical sense I really wouldn't have a clue. But in a conceptual sense it works like this...

The regex engine works through the pattern and the string it's checking against together, from left to right. Once the engine is satisfied that everything in the pattern has been matched, it stops and does not look any further into the remaining string (unless you explicitly ask for every match).

By default... the regex engine will try to match *everything* the pattern tells it to match.
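
To see the "stops once the pattern is satisfied" behaviour in action, here is a minimal sketch. It is written in Python rather than PHP purely for illustration (an assumption on my part; the pattern syntax used is the same in PCRE), and the sample string is made up.

```python
import re

text = "a1 b22 c333"  # made-up sample string

# search() stops as soon as the whole pattern has matched once.
first = re.search(r"\d+", text).group()

# findall() is the "keep going" variant: it collects every match.
every = re.findall(r"\d+", text)

print(first)  # 1
print(every)  # ['1', '22', '333']
```

In PHP the same split exists between preg_match() and preg_match_all().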

Quantifiers can change the behaviour of the regex engine quite significantly and can cause hours of confusion and annoyance among developers. This is to do with something we call "greediness" in regex terms. Let's look at this more closely.

Pattern Greediness in Regular Expressions:

If you use a quantifier which allows matching of characters up to any number of times, the regex engine will try to fulfill that requirement as best it can.

String: Foo ###123 bar

Code: Select all

/^[a-z]+.*(\d+)/i
The above regex, to anybody not looking closely, appears to extract the "123" from the string. In actual fact, it does not do this...

Code: Select all

Array
(
    [0] => Foo ###123
    [1] => 3
)
So what happened?

"[a-z]+" .. OK that's good
".*" .. Any character any number of times. This is where it collapses.

The dot-star combination is the evilest of the greedy patterns because it really will just match everything it can (except newline chars, without the "s" modifier).

Foo was picked up by the character class [a-z] since our pattern used the "i" modifier. The .* then consumed the entire rest of the string, and the engine had to back off one character at a time until the next part of the pattern, (\d+), could match. Because the "+" quantifier only requires at least one character, the engine stopped backing off as soon as a single digit was available, which left just the "3" for the group.

So how do you fix that issue? -- Answer: You follow the greedy quantifier with a "?". This makes that part of the pattern "ungreedy" (also called "lazy").

Code: Select all

/^[a-z]+.*?(\d+)/i
Produces

Code: Select all

Array
(
    [0] => Foo ###123
    [1] => 123
)
Essentially, we've told the regex engine to match as little as possible, checking at every step whether the next part of the pattern *can* feasibly match from the current character.
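
The two patterns can be compared side by side. This is a small sketch in Python rather than PHP (my choice for illustration; greedy and lazy quantifiers behave the same way in PCRE), with re.IGNORECASE standing in for the trailing "i" modifier.

```python
import re

subject = "Foo ###123 bar"

# Greedy: .* grabs everything, then gives back only what it must.
greedy = re.search(r"^[a-z]+.*(\d+)", subject, re.IGNORECASE)

# Lazy: .*? grows one character at a time until (\d+) can match.
lazy = re.search(r"^[a-z]+.*?(\d+)", subject, re.IGNORECASE)

print(greedy.group(0), "|", greedy.group(1))  # Foo ###123 | 3
print(lazy.group(0), "|", lazy.group(1))      # Foo ###123 | 123
```

Note that the overall match is identical in both cases; only what ends up inside the group changes.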

Note: There is a "U" pattern modifier which makes the entire pattern ungreedy by default (and flips the meaning of "?", so that it makes a quantifier greedy again)... use with caution!

From this point on... the tutorial is covering some advanced concepts. You'll only really need to use this stuff when you are writing very long patterns etc but anyway...

Special commands:

Regex can do some really clever things using instructions in the middle of the pattern. The syntax for providing these instructions is

Code: Select all

(?instruction)
We'll bring this into play from here on.

Mid-pattern modifiers:

You've seen that you can modify the behaviour of a pattern by adding some letters after the closing delimiter. Brilliant! Guess what, we can twist and bend the behaviour of our regex mid-pattern by doing something similar ;) You'll like this.

The basic syntax is like this

Code: Select all

<<< Modify the part in parens to TURN ON the modifier >>>

(?i: ... )
(?m: ... )
(?s: ... )
(?U: ... )

<<< Modify the part in parens to TURN OFF the modifier >>>

(?-i: ... )
(?-m: ... )
(?-s: ... )
(?-U: ... )
Those letters, "i", "m" etc are the same pattern modifiers we used in the crash course... but now we can use them *inside* our pattern. Note the colon after the letters: it separates the modifier from the part of the pattern it applies to. Without the colon, as in (?i), the modifier is switched on from that point up to the end of the enclosing group. This is only really handy if you have some very non-uniform string to match or you are writing a very long pattern.

An example usage:

String 1: Where IS the UK?
String 2: where is the UK?

Code: Select all

/[a-z\s]*?(?-i:UK)/i
Let's say we always want UK to be in uppercase but the rest of the string is likely to have uppercase and lowercase characters in different places depending who typed it. We use the "i" modifier to account for the differences in the way people write... but what about UK being uppercase? Here we have disabled the "i" modifier for that specific part of the pattern so it matches uppercase UK specifically.
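
A quick way to check a scoped modifier is to run it against both spellings. The sketch below uses Python rather than PHP (an assumption for illustration; Python 3.6 or later accepts the same scoped-modifier syntax as PCRE), and the lowercase test string is made up.

```python
import re

# re.IGNORECASE plays the role of the trailing /i modifier;
# (?-i:UK) switches it off for "UK" only.
pattern = re.compile(r"[a-z\s]*?(?-i:UK)", re.IGNORECASE)

hit = pattern.search("Where IS the UK?")
miss = pattern.search("where is the uk?")  # made-up counter-example

print(hit.group())  # Where IS the UK
print(miss)         # None
```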

Lookaheads:

Lookaheads come in two flavours: positive and negative. What they do is check whether a particular string follows part of the pattern, without making that string part of the match. You won't often need these since you can normally just put the string itself into the pattern.

Syntax:

Code: Select all

pattern(?= ... )
In the above, "pattern" *must* be followed by whatever is after the "?=" in the parens.

String: Sunshine

Code: Select all

/[a-z]+(?=shine)/i
The above pattern matches the word "Sun"... it would also match the "Moon" in Moonshine.
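
Here is a small sketch of that positive lookahead, in Python rather than PHP (my choice for illustration; the lookahead syntax is the same in PCRE). The end offset shows that "shine" was checked but not consumed.

```python
import re

pattern = re.compile(r"[a-z]+(?=shine)", re.IGNORECASE)

sun = pattern.search("Sunshine")
moon = pattern.search("Moonshine")

print(sun.group(), sun.end())    # Sun 3
print(moon.group(), moon.end())  # Moon 4
```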

Negative lookaheads mean that part of the pattern *must not* be followed by the lookahead.

Syntax:

Code: Select all

pattern(?! ... )
Example:

Code: Select all

/sun(?!shine)[a-z]*/i
The above pattern will match any word starting with "sun" but NOT starting with "sunshine".
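
The negative lookahead can be tried against a handful of words. This is a sketch in Python rather than PHP (an assumption for illustration; the syntax is shared with PCRE), and the word list is made up.

```python
import re

pattern = re.compile(r"sun(?!shine)[a-z]*", re.IGNORECASE)

words = ["sunlight", "Sunday", "sun", "sunshine"]  # made-up test words

# Map each word to whether the pattern finds a match in it.
results = {word: bool(pattern.search(word)) for word in words}

print(results)
# {'sunlight': True, 'Sunday': True, 'sun': True, 'sunshine': False}
```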

Fixed-width Lookbehinds:

These can prove very useful. They have one drawback, however: whatever goes in the lookbehind must have a fixed length (so no "*" or "+" inside it, for example) due to the way the regex engine works. What they do is exactly the same as lookaheads except that they are looking backwards. The pattern they apply to must, or must not, follow the lookbehind depending upon whether it is positive or negative.

Syntax for positive:

Code: Select all

(?<= ... )pattern
Notice that the lookbehind physically goes before the pattern?

Example positive lookbehind:

Code: Select all

/(?<=sun)[a-z]+/i
The above pattern will match the end of any word starting with "sun", such as "shine" or "light".

Negative lookbehind syntax:

Code: Select all

(?<! ... )pattern
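Example negative lookbehind:

Code: Select all

/(?<!sun)shine/i
The above matches "shine" only where it is NOT directly preceded by "sun", so it finds the "shine" in "Moonshine" but nothing in "Sunshine". Remember that a lookbehind tests the text *before* the current position: something like /\b(?<!a)[a-z]+/i does not exclude words starting with "a", because at the start of a word the preceding character is never a letter. For that job you'd use a negative lookahead, /\b(?!a)[a-z]+/i.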
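
Lookbehinds are easiest to understand by trying them. The sketch below uses Python rather than PHP (an assumption for illustration; Python's re module shares the lookbehind syntax and the fixed-width rule with PCRE). The negative lookbehind pattern and the variable names are illustrative choices of mine.

```python
import re

# Positive lookbehind: letters that come directly after "sun".
after_sun = re.search(r"(?<=sun)[a-z]+", "Sunshine", re.IGNORECASE).group()

# Negative lookbehind (illustrative pattern): "shine" NOT preceded by "sun".
in_moonshine = re.search(r"(?<!sun)shine", "Moonshine", re.IGNORECASE)
in_sunshine = re.search(r"(?<!sun)shine", "Sunshine", re.IGNORECASE)

# The lookbehind must have a fixed width; "u+" does not, so this is rejected.
try:
    re.compile(r"(?<=su+n)[a-z]+")
    fixed_width_enforced = False
except re.error:
    fixed_width_enforced = True

print(after_sun)             # shine
print(in_moonshine.group())  # shine
print(in_sunshine)           # None
print(fixed_width_enforced)  # True
```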

Grouping with parens without extracting:

If you surround parts of your pattern in parens they will end up in backreferences as you've seen. These special commands with (? ... ) don't behave in that way however. There's a little command that simply tells the regex engine to group characters, but not extract them.

Syntax:

Code: Select all

(?: ... )
Example:

Code: Select all

/Foo(?:bar)+/
That matches Foobar, Foobarbar, Foobarbarbarbarbarbar ... etc etc. The advantages of using that little command are that you'll save a negligible amount of memory and speed up the matching slightly. In the real world you'll use these a fair bit and they prove to be very handy at preventing things from getting cluttered in larger patterns.
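
The difference between the two kinds of parens shows up in what gets extracted. This is a minimal sketch in Python rather than PHP (my choice for illustration; (?: ... ) works the same way in PCRE).

```python
import re

subject = "Foobarbar"

# Ordinary parens: the group is extracted (the last repetition wins).
capturing = re.fullmatch(r"Foo(bar)+", subject)

# (?: ... ) groups for the "+" quantifier but extracts nothing.
non_capturing = re.fullmatch(r"Foo(?:bar)+", subject)

print(capturing.groups())      # ('bar',)
print(non_capturing.groups())  # ()
```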

Extracting named backreferences:

This is nice and all, but it may confuse anyone who only knows pretty standard regex. It basically allows you to name all your backreferences (extracted parts) so that you can make more readable code.

Syntax:

Code: Select all

(?P<Name> ... )
That's an UPPERCASE "P" and those less-than/greater-than symbols really are supposed to be there!

Example:
Let's use our first one
String: Foo ###123 bar

Code: Select all

/^[a-z]+.*?(?P<thenumber>\d+)/i
This produces the following

Code: Select all

Array
(
    [0] => Foo ###123
    [thenumber] => 123
    [1] => 123
)
Notice that it doesn't replace the numeric backreference altogether, it simply adds a named one too.
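
The same named group can be read back by name or by number. The sketch below is in Python rather than PHP (an assumption for illustration; the (?P<Name> ... ) syntax is the same in PCRE).

```python
import re

m = re.search(r"^[a-z]+.*?(?P<thenumber>\d+)", "Foo ###123 bar", re.IGNORECASE)

by_name = m.group("thenumber")  # look the group up by its name
by_index = m.group(1)           # the numeric backreference still exists
as_dict = m.groupdict()         # only the named groups

print(by_name, by_index, as_dict)  # 123 123 {'thenumber': '123'}
```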

I feel like I've taken you far enough with this now and all you can do is to keep practising and using regex.

Don't forget you can nest these little commands too ;) ...

Code: Select all

/[a-z]+\d(?-i:FOO(?=(?i:bar)))/i
Enjoy playing with those advanced features!
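
A nested pattern like this can be checked one piece at a time. This is a sketch in Python rather than PHP (an assumption for illustration; Python 3.6 or later accepts scoped modifiers written with a colon, as in (?-i:...) and (?i:...)). The test strings and the helper name are made up.

```python
import re

# Case-insensitive overall, case-sensitive "FOO", then a lookahead
# for "bar" that is case-insensitive again.
pattern = re.compile(r"[a-z]+\d(?-i:FOO(?=(?i:bar)))", re.IGNORECASE)


def matched_text(subject):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(subject)
    return m.group() if m else None


upper_foo = matched_text("abc1FOObar")
upper_bar = matched_text("abc1FOOBAR")
lower_foo = matched_text("abc1foobar")

print(upper_foo, upper_bar, lower_foo)  # abc1FOO abc1FOO None
```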

If I've made any mistakes please have a whinge so that I can correct them :D

Have fun!
Last edited by Chris Corbyn on Tue Nov 01, 2005 5:51 pm, edited 1 time in total.
Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

woot woot! U da man d11!
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

n00b Saibot wrote:named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:
I'm not 100% sure. I know they work in all the PCRE stuff in PHP4 and 5.... In which case I guess they work in Perl itself, but it's been a while since I used Perl to test that. JavaScript doesn't work with that syntax either.... at least, Firefox gives an "Invalid Quantifier" error ;)

I'd tend to avoid using it unless you have a real need for it :D

EDIT | Doesn't seem to agree very well with the perl regex engine :(

Code: Select all

#!/usr/bin/perl

$foo = "Foo 123";

$foo =~ /Foo (?P<Num>\d+)/;
Errors...

Code: Select all

Sequence (?P...) not recognized in regex; marked by <-- HERE in m/Foo (?P <-- HERE <Num>\d+)/ at foo.pl line 5
So all I can say is it works in PHP and apparently not a lot else :P
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

d11wtq wrote: So all I can say is it works in PHP and apparently not a lot else :P
It works in Python (it's where it started from), PCRE and .NET. The MS version, as usual, uses its own proprietary syntax, incompatible with Python and PCRE.

Support for named backreferences is planned for Perl 6.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

in perl i think it is

(?<Suffix>

instead of

(?P<Suffix>

great tutorial, learned a few things
raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK

Post by raghavan20 »

I have a few doubts...why common expressions cannot be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think '!' with ? and ?< are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

raghavan20 wrote:I have a few doubts...why common expressions cannot be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think '!' with ? and ?< are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i
You're right.... the caret only acts as a negation operator inside character classes, like [^abc]. Anywhere else it's the start-of-string (or start-of-line) anchor.
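
The two roles of the caret can be seen in a short sketch. It is written in Python rather than PHP (an assumption for illustration; the caret behaves the same way in PCRE), and the test strings are made up.

```python
import re

# Inside a character class the caret negates: "anything except a, b or c".
not_abc = re.findall(r"[^abc]", "abcd")

# Outside a class it anchors to the start of the string, so it can never
# match after [a-z]+ has already consumed at least one character.
attempt = re.search(r"[a-z]+(^shine)", "moonshine", re.IGNORECASE)

print(not_abc)  # ['d']
print(attempt)  # None
```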

The examples I gave for the lookaheads/lookbehinds weren't exactly real-life examples since in reality you'd be using these in fairly complex expressions.... and that would scare some people away from regex in the scope of this tutorial ;)
harsh789
Forum Newbie
Posts: 4
Joined: Fri Jun 09, 2006 12:11 pm

Post by harsh789 »

Thanks.

Great work on a complex subject. It helps me a lot.

Thanks again.
claws
Forum Commoner
Posts: 73
Joined: Tue Jun 19, 2007 10:54 am

Post by claws »

Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers where I can learn these advanced concepts in great detail (instead of a crash course)?
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Post by GeertDD »

claws wrote:Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers where I can learn these advanced concepts in great detail (instead of a crash course)?
"Mastering Regular Expressions" by Jeffrey Friedl definitely lives up to its title.

http://oreilly.com/catalog/9780596002893/
MichaelR
Forum Contributor
Posts: 148
Joined: Sat Jan 03, 2009 3:27 pm

Post by MichaelR »

raghavan20 wrote:I have a few doubts...why common expressions cannot be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i
Perhaps a better example would be:

Code: Select all

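/^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$/i
Which forces the string to be between 5 and 10 characters inclusive in length. Note the "$" inside the lookahead: without it the lookahead only checks that at least 5 characters are present, and the upper limit is not enforced.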
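
The length check can be verified with a few sample strings. This sketch is in Python rather than PHP (an assumption for illustration; the syntax is shared with PCRE) and the samples are made up. It compares a lookahead that ends in "$" with one that does not, since only the anchored form enforces the upper bound.

```python
import re

# Lookahead anchored with $: total length must be 5 to 10.
bounded = re.compile(r"^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$", re.IGNORECASE)

# Lookahead without $: only guarantees at least 5 characters.
unbounded = re.compile(r"^(?=.{5,10})[a-z]{1,9}[0-9]{1,9}$", re.IGNORECASE)

samples = ["ab12", "abc12", "abcdefghi1", "abcdefghi12"]  # lengths 4, 5, 10, 11

bounded_results = [bool(bounded.search(s)) for s in samples]
unbounded_results = [bool(unbounded.search(s)) for s in samples]

print(bounded_results)    # [False, True, True, False]
print(unbounded_results)  # [False, True, True, True]
```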