(b) Regex Advanced tutorial - (CRASH Course Pt. 2)


Postby Chris Corbyn » Tue Nov 01, 2005 6:33 pm

About time I got around to doing this :)

If you didn't read the crash course before you jumped into this tutorial it may be a good idea to do so unless you already have a grasp of regex basics.

A few sites I need to point out (again):
http://www.regular-expressions.info/ (Tutorials for regex in a few programming languages)
http://www.perl.com/doc/manual/html/pod/perlre.html (Perl Documentation for TRUE perl style regex)
http://www.weitz.de/regex-coach/ (Fantastic application called Regex-Coach!)

From this point on, we might as well consider everything written to be perl-style regex since the rest don't get particularly advanced.

A quick re-cap:

In the super-speedy paced crash course we looked at the metacharacters, the quantifiers, the modifiers and some PHP functions.

That gave us enough information to start constructing some simple regex. However, even with that basic knowledge you may still hit a few hurdles trying to do some fancy things with regex.

Ready? We're off!

How does the regex engine work?

In a pure technical sense I really wouldn't have a clue. But in a conceptual sense it works like this...

The regex engine reads the regex as it reads the string it's checking against. If the regex engine is satisfied that everything in the pattern has been matched it does not look any further into the remaining string (without modifiers).

By default... the regex engine will try to match *everything* the pattern tells it to match.

Quantifiers can change the behaviour of the regex engine quite significantly and can cause hours of confusion and annoyance among developers. This is to do with something we call "greediness" in regex terms. Let's look at this more closely.

Pattern Greediness in Regular Expressions:

If you use a quantifier which allows matching of characters up to any number of times, the regex engine will try to fulfill that requirement as best it can.

String: Foo ###123 bar
/^[a-z]+.*(\d+)/i


To anybody not looking closely, the above regex appears to extract the "123" from the string. In actual fact, it does not do this...

Array
(
    [0] => Foo ###123
    [1] => 3
)


So what happened?

"[a-z]+" .. OK that's good
".*" .. Any character any number of times. This is where it collapses.

The dot-star combination is the evilest of the greedy patterns because it really will just match everything it can (except newline chars, without the "s" modifier).

Foo was picked up by the character class [a-z] since our pattern used the "i" modifier. The .* consumed the rest of our string, then backtracked just far enough to hand back a single digit, because the next part of the pattern (\d+) uses the "+" quantifier, which only needs one character to be satisfied.

So how do you fix that issue? -- Answer: You combine the greedy quantifier with the "?" quantifier. This makes that part of the pattern "ungreedy".

/^[a-z]+.*?(\d+)/i


Produces

Array
(
    [0] => Foo ###123
    [1] => 123
)


Essentially, we've told the regex engine to consume as little as possible, checking at each step whether the next part of the pattern *can* match the following character before taking any more.
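
To see the greedy and ungreedy behaviour side by side, here is a minimal sketch. It assumes Python's `re` module as a stand-in for PHP's preg functions (the two patterns themselves are the ones from above); the variable names are just for illustration.

```python
import re

subject = "Foo ###123 bar"

# Greedy: .* runs to the end of the string, then backtracks only until
# \d+ can match something, which happens at the final "3".
greedy = re.search(r"^[a-z]+.*(\d+)", subject, re.IGNORECASE)

# Ungreedy: .*? takes one character at a time, so \d+ gets all the digits.
lazy = re.search(r"^[a-z]+.*?(\d+)", subject, re.IGNORECASE)

print(greedy.group(0), "|", greedy.group(1))  # Foo ###123 | 3
print(lazy.group(0), "|", lazy.group(1))      # Foo ###123 | 123
```

Note that the whole match is identical in both cases; only what ends up in the backreference changes.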

Note: There is a "U" pattern modifier which makes the entire pattern ungreedy by default... use with caution!

From this point on... the tutorial is covering some advanced concepts. You'll only really need to use this stuff when you are writing very long patterns etc but anyway...

Special commands:

Regex can do some really clever things using instructions in the middle of the pattern. The syntax for providing these instructions is

(?instruction)


We'll bring this into play from here on.

Mid-pattern modifiers:

You've seen that you can modify the behaviour of a pattern by adding some letters after the closing delimiter. Brilliant! Guess what, we can twist and bend the behaviour of our regex mid-pattern by doing something similar ;) You'll like this.

The basic syntax is like this

<<< Modify the part in parens to TURN ON the modifier >>>

(?i: ... )
(?m: ... )
(?s: ... )
(?U: ... )

<<< Modify the part in parens to TURN OFF the modifier >>>

(?-i: ... )
(?-m: ... )
(?-s: ... )
(?-U: ... )


Those letters, "i", "m" etc are the same pattern modifiers we used in the crash course... but now we can use them *inside* our pattern. This is only really handy if you have some very non-uniform string to match or you are writing a very long pattern.

An example usage:

String 1: Where IS the UK?
String 2: where is the UK?

/[a-z\s]*?(?-i:UK)/i


Let's say we always want UK to be in uppercase but the rest of the string is likely to have uppercase and lowercase characters in different places depending on who typed it. We use the "i" modifier to account for the differences in the way people write... but what about UK being uppercase? Here we have disabled the "i" modifier for that specific part of the pattern, so it matches uppercase UK specifically. Both strings above match, while "where is the uk?" would not.
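
A quick way to check this is the sketch below. It assumes Python 3.6 or later, whose `re` module accepts scoped modifier groups written with a colon, as in `(?-i:UK)`; PHP's own functions are not used here and the variable names are made up for the example.

```python
import re

# The "i" modifier applies to the whole pattern, except inside (?-i: ... ),
# where case-insensitivity is switched off again.
pattern = re.compile(r"[a-z\s]*?(?-i:UK)", re.IGNORECASE)

upper = pattern.search("Where IS the UK?")
lower = pattern.search("where is the uk?")

print(upper.group(0) if upper else None)  # Where IS the UK
print(lower)                              # None
```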

Lookaheads:

Lookaheads come in two flavours: positive and negative. They check whether a particular string follows part of the pattern, without making it part of the match. You won't often need these since you can normally just put the string itself into the pattern.

Syntax:
pattern(?= ... )


In the above, "pattern" *must* be followed by whatever is after the "?=" in the parens.

String: Sunshine
/[a-z]+(?=shine)/i


The above pattern matches the word "Sun"... it would also match the "Moon" in Moonshine.
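
As a rough sketch of the positive lookahead (using Python's `re` module as an assumed stand-in for the PHP functions), the lookahead has to succeed for the match to happen, but the text it looks at is not part of the result:

```python
import re

pattern = re.compile(r"[a-z]+(?=shine)", re.IGNORECASE)

sun = pattern.search("Sunshine")
moon = pattern.search("Moonshine")
no_match = pattern.search("Sunlight")  # nothing here is followed by "shine"

print(sun.group(0))   # Sun
print(moon.group(0))  # Moon
print(no_match)       # None
```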

Negative lookaheads mean that part of the pattern *must not* be followed by the lookahead.

Syntax:
pattern(?! ... )


Example:
/sun(?!shine)[a-z]*/i


The above pattern will match "sun" plus any letters after it, as long as the "sun" is NOT immediately followed by "shine". So "sunlight" and "sunny" match, but "sunshine" does not.
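
The negative lookahead can be tried out in the same way. This is only a sketch in Python's `re` module (an assumption, not the PHP functions), with a made-up test string:

```python
import re

text = "sunlight sunshine sunny"

# "sun" is only accepted where it is not immediately followed by "shine".
matches = re.findall(r"sun(?!shine)[a-z]*", text, re.IGNORECASE)

print(matches)  # ['sunlight', 'sunny']
```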

Fixed-width Lookbehinds:

These can prove very useful. They have one drawback, however: you need to know the size of whatever goes in the lookbehind, due to the way the regex engine works. They do exactly the same as lookaheads, except that they look backwards. The pattern they apply to must, or must not, follow the lookbehind, depending upon whether it is positive or negative.

Syntax for positive:
(?<= ... )pattern


Notice that the lookbehind physically goes before the pattern?

Example positive lookbehind:
/(?<=sun)[a-z]+/i


The above pattern will match the end of any word starting with "sun", such as "shine" or "light".
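
Here is a small sketch of the positive lookbehind, again assuming Python's `re` module rather than PHP. Python happens to share the fixed-width restriction, so the second half shows what goes wrong when the lookbehind could vary in length (the `su+n` pattern is invented purely to trigger that).

```python
import re

text = "sunshine sunlight moonlight"

# Only letters that come directly after "sun" are returned.
endings = re.findall(r"(?<=sun)[a-z]+", text, re.IGNORECASE)
print(endings)  # ['shine', 'light']

# "u+" can be any length, so this lookbehind is not fixed-width.
try:
    re.compile(r"(?<=su+n)[a-z]+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)  # False
```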

Negative lookbehind syntax:
(?<! ... )pattern


Example negative lookbehind:
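/\b[a-z](?<!a)[a-z]*/i


The above matches any word which does NOT start with a letter "a". The \b assertion just makes sure we're at the start of a word. Note that the lookbehind sits after the first letter: placed directly after the \b, it would check the character before the word rather than the word's first letter.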
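
Where exactly the lookbehind sits matters a lot here, because it always inspects the characters immediately before its own position. The sketch below (Python's `re` module assumed as a stand-in, test string invented) compares a lookbehind placed straight after \b with one placed after the first letter of the word:

```python
import re

text = "apple banana Avocado cherry"

# Straight after \b the lookbehind sees the character BEFORE the word
# (a space, or nothing at all), so it never rejects anything.
before_word = re.findall(r"\b(?<!a)[a-z]+", text, re.IGNORECASE)

# After the first letter has been consumed, the lookbehind sees that letter.
after_first_letter = re.findall(r"\b[a-z](?<!a)[a-z]*", text, re.IGNORECASE)

print(before_word)         # ['apple', 'banana', 'Avocado', 'cherry']
print(after_first_letter)  # ['banana', 'cherry']
```

Because of the "i" modifier the lookbehind rejects an uppercase "A" as well.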

Grouping with parens without extracting:

If you surround parts of your pattern in parens they will end up in backreferences, as you've seen. These special commands with (? ... ) don't behave in that way, however. There's a little command that simply tells the regex engine to group characters, but not extract them.

Syntax:
(?: ... )


Example:
/Foo(?:bar)+/


That matches Foobar, Foobarbar, Foobarbarbarbarbarbar ... etc etc. The advantages of using that little command are that you'll save a negligible amount of memory and speed up the matching slightly. In the real world you'll use these a fair bit, and they prove to be very handy at preventing things from getting cluttered in larger patterns.
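
The difference between grouping and extracting shows up in what the match hands back. A minimal sketch, assuming Python's `re` module in place of the PHP functions:

```python
import re

subject = "Foobarbarbar"

non_capturing = re.search(r"Foo(?:bar)+", subject)
capturing = re.search(r"Foo(bar)+", subject)

# Both match the same text, but only the plain parens fill a backreference.
print(non_capturing.group(0), non_capturing.groups())  # Foobarbarbar ()
print(capturing.group(0), capturing.groups())          # Foobarbarbar ('bar',)
```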

Extracting named backreferences:

This is nice and all, but it may confuse anyone who only knows pretty standard regex. It basically allows you to name all your backreferences (extracted parts) so that you can make more readable code.

Syntax:
(?P<Name> ... )


That's an UPPERCASE "P" and those less-than/greater-than symbols really are supposed to be there!

Example:
Let's use our first one
String: Foo ###123 bar

/^[a-z]+.*?(?P<thenumber>\d+)/i


This produces the following
Array
(
    [0] => Foo ###123
    [thenumber] => 123
    [1] => 123
)


Notice that it doesn't replace the numeric backreference altogether, it simply adds a named one too.
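
The same (?P<Name> ... ) syntax is accepted by Python's `re` module, so here is a sketch of the example above in Python (an assumption made for illustration; the output format differs from PHP's array):

```python
import re

match = re.search(
    r"^[a-z]+.*?(?P<thenumber>\d+)", "Foo ###123 bar", re.IGNORECASE
)

print(match.group("thenumber"))  # 123
print(match.group(1))            # 123, the numeric reference still works
print(match.groupdict())         # {'thenumber': '123'}
```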

I feel like I've taken you far enough with this now and all you can do is to keep practising and using regex.

Don't forget you can nest these little commands too ;) ...

/[a-z]+\d(?-i:FOO(?=(?i:bar)))/i
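
To get a feel for the nesting, the sketch below runs a nested pattern of this kind against a few invented strings. It assumes Python 3.6+ and its `re` module, which needs the colon form of the scoped modifiers, (?-i: ... ) and (?i: ... ).

```python
import re

# Letters and a digit in any case, then FOO in uppercase only,
# followed (lookahead) by "bar" in any case.
pattern = re.compile(r"[a-z]+\d(?-i:FOO(?=(?i:bar)))", re.IGNORECASE)

samples = ["abc1FOObar", "ABC1FOOBAR", "abc1foobar", "abc1FOObaz"]
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Only the first two samples should match: the third has "foo" in lowercase and the fourth is not followed by "bar".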


Enjoy playing with those advanced features!

If I've made any mistakes please have a whinge so that I can correct them :D

Have fun!
Last edited by Chris Corbyn on Tue Nov 01, 2005 6:51 pm, edited 1 time in total.

Postby Burrito » Tue Nov 01, 2005 6:46 pm

woot woot! U da man d11!

Postby n00b Saibot » Sat Nov 05, 2005 3:06 am

named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:

Postby Chris Corbyn » Sat Nov 05, 2005 9:06 pm

n00b Saibot wrote:named backrefs are kinda new to me... thanks d11 :wink:

edit: are there any version limitation for using named backrefs :?:


I'm not 100% sure. I know they work in all the PCRE stuff in PHP4 and 5.... In which case I guess they work in Perl itself, but it's been a while since I used perl to test that. JavaScript doesn't work with that syntax either.... at least, Firefox gives an "Invalid Quantifier" error ;)

I'd tend to avoid using it unless you have a real need for it :D

EDIT | Doesn't seem to agree very well with the perl regex engine :(

#!/usr/bin/perl

$foo = "Foo 123";

$foo =~ /Foo (?P<Num>\d+)/;


Errors...
Sequence (?P...) not recognized in regex; marked by <-- HERE in m/Foo (?P <-- HERE <Num>\d+)/ at foo.pl line 5


So all I can say is it works in PHP and apparently not a lot else :P

Postby Weirdan » Sun Nov 06, 2005 11:27 am

d11wtq wrote:So all I can say is it works in PHP and apparently not alot else :P

It works in Python (it's where it started from), PCRE and .NET. The MS version, as usual, uses its own proprietary syntax, incompatible with Python and PCRE.

Support for named backreferences is planned for Perl 6.

Postby josh » Wed Nov 09, 2005 1:26 pm

in perl i think it is

(?<Suffix>

instead of

(?P<Suffix>

great tutorial, learned a few things

Postby raghavan20 » Tue Jan 31, 2006 8:23 pm

I have a few doubts... why can't common expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think '!' with ? and ?< are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i

Postby Chris Corbyn » Wed Feb 01, 2006 4:30 am

raghavan20 wrote:I have a few doubts... why can't common expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


But I think '!' with ? and ?< are useful because I think they cannot be represented like the above...
hopefully this does not work...
/[a-z]+(^shine)/i


You're right.... the caret doesn't act as a negation operator inside anything but character classes [^abc].

The examples I gave for the lookaheads/lookbehinds weren't exactly real-life examples, since in reality you'd be using these in fairly complex expressions.... and that would scare some people away from regex in the scope of this ;)

Postby harsh789 » Fri Oct 06, 2006 12:19 pm

Thanks.

Great work on a complex subject. It helps me a lot.

Thanks again.

Postby claws » Wed Jun 18, 2008 4:08 am

Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers to where I can learn these advanced concepts in greater detail (instead of a crash course)?

Postby GeertDD » Wed Jun 18, 2008 8:49 am

claws wrote:Thanks, great crash course.

But I didn't understand most of it. Can someone please post some pointers to where I can learn these advanced concepts in greater detail (instead of a crash course)?


"Mastering Regular Expressions" by Jeffrey Friedl definitely lives up to its title.

http://oreilly.com/catalog/9780596002893/

Postby MichaelR » Fri Dec 18, 2009 6:46 pm

raghavan20 wrote:I have a few doubts... why can't common expressions be used for these?

Original: /(?<=sun)[a-z]+/i
Replacement: /(sun)[a-z]+/i



Original: /[a-z]+(?=shine)/i
Replacement: /[a-z]+(shine)/i


Perhaps a better example would be:

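/^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$/i


Which forces the string to be between 5 and 10 characters inclusive in length. The $ inside the lookahead is what enforces the upper limit; without it the lookahead only checks that there are at least 5 characters.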
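
For anyone who wants to poke at the length check, here is a rough sketch in Python's `re` module (an assumed stand-in for the PHP functions; the test strings are invented). It compares a lookahead with and without a $ at its end:

```python
import re

# Lookahead without an end anchor: only proves there are at least 5 characters.
unanchored = re.compile(r"^(?=.{5,10})[a-z]{1,9}[0-9]{1,9}$", re.IGNORECASE)

# Lookahead ending in $: the whole string must be 5 to 10 characters long.
anchored = re.compile(r"^(?=.{5,10}$)[a-z]{1,9}[0-9]{1,9}$", re.IGNORECASE)

short_input = "ab1"                 # 3 characters
good_input = "abc12"                # 5 characters
long_input = "abcdefghi123456789"   # 18 characters

for value in (short_input, good_input, long_input):
    print(value, bool(unanchored.match(value)), bool(anchored.match(value)))
```

The two patterns agree on the short and the 5-character input and differ only on the 18-character one, which the unanchored lookahead lets through.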