If you didn't read the crash course before you jumped into this tutorial it may be a good idea to do so unless you already have a grasp of regex basics.
A few sites I need to point out (again):
http://www.regular-expressions.info/ (Tutorials for regex in a few programming languages)
http://www.perl.com/doc/manual/html/pod/perlre.html (Perl Documentation for TRUE perl style regex)
http://www.weitz.de/regex-coach/ (Fantastic application called Regex-Coach!)
From this point on, we might as well treat everything written here as Perl-style regex, since the other flavours don't get particularly advanced.
A quick re-cap:
In the super-speedy crash course we looked at the metacharacters, the quantifiers, the modifiers and some PHP functions.
That gave us enough information to start constructing some simple regex. However, even with that basic knowledge you may still hit a few hurdles trying to do some fancy things with regex.
Ready? We're off!
How does the regex engine work?
In a pure technical sense I really wouldn't have a clue. But in a conceptual sense it works like this...
The regex engine reads the regex as it reads the string it's checking against, working from left to right. Once it is satisfied that everything in the pattern has been matched, it stops and does not look any further into the remaining string, unless you ask for every match (the "g" modifier in Perl, or preg_match_all() in PHP).
By default the regex engine will try to match as much as the pattern allows it to match.
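To make the "stops once it's satisfied" idea concrete, here is a small PHP sketch. The sample string and variable names are made up for illustration; preg_match() and preg_match_all() are the standard PCRE functions.

```php
<?php
// Hypothetical sample string with three separate runs of digits.
$subject = 'a1 b22 c333';

// preg_match() stops as soon as the pattern has been satisfied once.
preg_match('/\d+/', $subject, $first);

// preg_match_all() keeps scanning the rest of the string for more matches.
preg_match_all('/\d+/', $subject, $all);

print_r($first);   // only the first run of digits: 1
print_r($all[0]);  // every run of digits: 1, 22, 333
```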
Quantifiers can change the behaviour of the regex engine quite significantly and can cause hours of confusion and annoyance among developers. This is to do with something we call "greediness" in regex terms. Let's look at this more closely.
Pattern Greediness in Regular Expressions:
If you use a quantifier which allows matching of characters up to any number of times, the regex engine will try to fulfill that requirement as best it can.
String: Foo ###123 bar
Code:
/^[a-z]+.*(\d+)/i
Code:
Array
(
    [0] => Foo ###123
    [1] => 3
)
"[a-z]+" .. OK that's good
".*" .. Any character any number of times. This is where it collapses.
The dot-star combination is the evilest of the greedy patterns because it really will just match everything it can (except newline chars, without the "s" modifier).
Foo was picked up by the character class [a-z] since our pattern used the "i" modifier. The .* then consumed the rest of our string and handed characters back one at a time until the next part of the pattern could match. Since that next part, (\d+), uses the "+" quantifier and so only needs one character to be satisfied, the .* kept everything except the final digit.
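If you want to see that for yourself, here's a rough PHP sketch. The second pattern is not from the tutorial: it's the same regex with an extra pair of parens around the .* (added purely for this demo) so you can see how much of the string it swallowed.

```php
<?php
$subject = 'Foo ###123 bar';

// Greedy: .* grabs as much as it can and only gives back what (\d+)
// absolutely needs in order to match.
preg_match('/^[a-z]+.*(\d+)/i', $subject, $greedy);

// Same pattern, but with the .* captured too (demo-only extra group).
preg_match('/^[a-z]+(.*)(\d+)/i', $subject, $parts);

print_r($greedy);  // [0] => Foo ###123, [1] => 3
print_r($parts);   // [1] => " ###12", [2] => 3
```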
So how do you fix that issue? -- Answer: You combine the greedy quantifier with the "?" quantifier. This makes that part of the pattern "ungreedy".
Code:
/^[a-z]+.*?(\d+)/i
Code:
Array
(
    [0] => Foo ###123
    [1] => 123
)
Note: There is a "U" pattern modifier which makes the entire pattern ungreedy by default... use with caution!
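Here's a hedged PHP sketch comparing the "?" fix with the "U" modifier on the same string. The variable names are made up. It also hints at why "U" needs caution: it flips every quantifier in the pattern, including the \d+ you probably wanted to stay greedy.

```php
<?php
$subject = 'Foo ###123 bar';

// Lazy .*? expands only as far as it must, so (\d+) gets all the digits.
preg_match('/^[a-z]+.*?(\d+)/i', $subject, $lazy);

// With U, every quantifier is ungreedy by default, so \d+ now stops
// after a single digit.
preg_match('/^[a-z]+.*(\d+)/iU', $subject, $ungreedy);

print_r($lazy);      // [0] => Foo ###123, [1] => 123
print_r($ungreedy);  // [0] => Foo ###1,   [1] => 1
```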
From this point on... the tutorial is covering some advanced concepts. You'll only really need to use this stuff when you are writing very long patterns etc but anyway...
Special commands:
Regex can do some really clever things using instructions in the middle of the pattern. The syntax for providing these instructions is
Code:
(?instruction)
Mid-pattern modifiers:
You've seen that you can modify the behaviour of a pattern by adding some letters after the closing delimiter. Brilliant! Guess what, we can twist and bend the behaviour of our regex mid-pattern by doing something similar.
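Jumping ahead slightly, here's a rough PHP sketch of a mid-pattern modifier in action. It assumes PCRE's scoped form, (?-i: ... ) with a colon, which switches case-insensitivity off for just the bracketed part. The lower case test string is made up for the demo.

```php
<?php
// The trailing i makes the whole pattern case-insensitive, but
// (?-i:UK) turns that off again for the letters "UK" only.
$pattern = '/[a-z\s]*?(?-i:UK)/i';

$upper = preg_match($pattern, 'Where IS the UK?', $m);  // "UK" is upper case
$lower = preg_match($pattern, 'where is the uk?');      // "uk" is lower case

print_r($m);               // [0] => Where IS the UK
var_dump($upper, $lower);  // int(1), int(0)
```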
The basic syntax is like this
Code:
<<< Modify the part in parens to TURN ON the modifier >>>
(?i: ... )
(?m: ... )
(?s: ... )
(?U: ... )
<<< Modify the part in parens to TURN OFF the modifier >>>
(?-i: ... )
(?-m: ... )
(?-s: ... )
(?-U: ... )
A bare (?i) or (?-i) with no colon applies from that point to the end of the enclosing group instead.
An example usage:
String 1: Where IS the UK?
String 2: where is the UK?
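Code:
/[a-z\s]*?(?-i:UK)/i
Both strings match because "UK" is in upper case in each; a lower case "uk" would not match, even though the rest of the pattern is case-insensitive.
Lookaheads: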
Lookaheads come in two flavours: positive and negative. What they do is check whether a particular string follows part of the pattern, without making that string part of the match. You won't often need these, since you can normally just put the string itself into the pattern.
Syntax:
Code:
pattern(?= ... )
String: Sunshine
Code:
/[a-z]+(?=shine)/i
Negative lookaheads mean that part of the pattern *must not* be followed by the lookahead.
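To check what the two flavours actually match, here's a small PHP sketch. It uses the positive pattern above and the negative pattern given just below; the test string "Sunday" is made up.

```php
<?php
// Positive lookahead: letters that are followed by "shine".
// The lookahead only checks; "shine" is not part of the match.
preg_match('/[a-z]+(?=shine)/i', 'Sunshine', $positive);

// Negative lookahead: "sun" as long as "shine" does not come next.
$sunday   = preg_match('/sun(?!shine)[a-z]*/i', 'Sunday', $negative);
$sunshine = preg_match('/sun(?!shine)[a-z]*/i', 'Sunshine');

print_r($positive);   // [0] => Sun
print_r($negative);   // [0] => Sunday
var_dump($sunshine);  // int(0), "Sunshine" is rejected
```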
Syntax:
Code:
pattern(?! ... )
Code:
/sun(?!shine)[a-z]*/i
Fixed-width Lookbehinds:
These can prove very useful. They have one drawback, however: you need to know the size of whatever goes in the lookbehind, due to the way the regex engine works. They do exactly the same as lookaheads except that they look backwards. The pattern a lookbehind applies to must, or must not, follow the lookbehind depending upon whether it is positive or negative.
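Here's a rough PHP sketch of both kinds, using the (?<= ... ) and (?<! ... ) syntax given just below. The positive pattern is the one from this tutorial; the negative pattern /(?<!sun)shine/i and the string "Moonshine" are made up for the demo.

```php
<?php
// Positive lookbehind: letters that come straight after "sun".
preg_match('/(?<=sun)[a-z]+/i', 'Sunshine', $after);

// Negative lookbehind: "shine" only when "sun" is NOT right before it.
$moon = preg_match('/(?<!sun)shine/i', 'Moonshine', $m);
$sun  = preg_match('/(?<!sun)shine/i', 'Sunshine');

print_r($after);        // [0] => shine
print_r($m);            // [0] => shine
var_dump($moon, $sun);  // int(1), int(0)
```

Note that the lookbehind holds a fixed three characters ("sun"), which is the fixed-width requirement mentioned above.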
Syntax for positive:
Code:
(?<= ... )pattern
Example positive lookbehind:
Code:
/(?<=sun)[a-z]+/i
Negative lookbehind syntax:
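Code:
(?<! ... )pattern
Code:
/\b(?<!a)[a-z]+/i
Be careful where you put it though: in this pattern the \b already guarantees that the character before the match is not a letter, so the lookbehind never actually rejects anything. Something like /(?<!a)b/i (a "b" that doesn't come straight after an "a") shows it off better.
Grouping with parens without extracting: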
If you surround parts of your pattern in parens they will end up in backreferences, as you've seen. These special commands with (? ... ) don't behave in that way, however. There's a little command that simply tells the regex engine to group characters, but not extract them.
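As a quick sketch of the difference in PHP, using the (?: ... ) syntax given just below. The test string "Foobarbar" and the capturing version of the pattern are made up for comparison.

```php
<?php
// Capturing group: "bar" lands in backreference 1.
preg_match('/Foo(bar)+/', 'Foobarbar', $capturing);

// Non-capturing group: same overall match, but nothing is extracted.
preg_match('/Foo(?:bar)+/', 'Foobarbar', $grouped);

print_r($capturing);  // [0] => Foobarbar, [1] => bar
print_r($grouped);    // [0] => Foobarbar
```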
Syntax:
Code:
(?: ... )
Code:
/Foo(?:bar)+/
Extracting named backreferences:
This is nice and all, but it may confuse anyone who only knows pretty standard regex. It basically allows you to name all your backreferences (extracted parts) so that you can make more readable code.
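To show what "more readable code" might look like, here's a minimal PHP sketch using the (?P<Name> ... ) syntax given just below. The variable names are made up; the pattern and string are the ones from the example that follows.

```php
<?php
preg_match('/^[a-z]+.*?(?P<thenumber>\d+)/i', 'Foo ###123 bar', $m);

// The named key says what the value is; the numbered key still works.
$byName   = $m['thenumber'];
$byNumber = $m[1];

echo $byName, "\n";    // 123
echo $byNumber, "\n";  // 123
```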
Syntax:
Code:
(?P<Name> ... )
Example:
Let's use our first one
String: Foo ###123 bar
Code:
/^[a-z]+.*?(?P<thenumber>\d+)/i
Code:
Array
(
    [0] => Foo ###123
    [thenumber] => 123
    [1] => 123
)
I feel like I've taken you far enough with this now and all you can do is to keep practising and using regex.
Don't forget you can nest these little commands too
Code:
/[a-z]+\d(?-i:FOO(?=(?i:bar)))/i
If I've made any mistakes please have a whinge so that I can correct them
Have fun!