Confusing regex

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

Eddyphp
Forum Newbie
Posts: 8
Joined: Sun Sep 20, 2009 7:43 am

Confusing regex

Post by Eddyphp »

Hi,

I read the great tutorial in this post; however, can someone explain the logic of this expression and how the results differ from the one below?

preg_match('/href="([^"]+)"/', $line, $stringBetweenQuotes);

where $line string contains the following:

<li><a href="/Contact/">Contact</a></li></ul>

preg_match('/href="([\/\w]+)/',$line,$stringBetweenQuotes);

Patrick.
jackpf
DevNet Resident
Posts: 2119
Joined: Sun Feb 15, 2009 7:22 pm
Location: Ipswich, UK

Re: Confusing regex

Post by jackpf »

I believe the first pattern will match anything that isn't a quote. The second will match any word characters.
Eddyphp
Forum Newbie
Posts: 8
Joined: Sun Sep 20, 2009 7:43 am

Re: Confusing regex

Post by Eddyphp »

When both code samples are run the array element [1] contains /Contact/

The first one seems to take a more complex approach, so I was wondering whether there is any benefit. Also, I'm not sure how the first one is logically extracting the string /Contact/.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Confusing regex

Post by ridgerunner »

Eddyphp wrote:Hi,

I read the great tutorial in this post however, can someone explain the logic of this expression and how the results differ from the one below?

preg_match('/href="([^"]+)"/', $line, $stringBetweenQuotes);

where $line string contains the following:

<li><a href="/Contact/">Contact</a></li></ul>

preg_match('/href="([\/\w]+)/',$line,$stringBetweenQuotes);

Patrick.
The first one: '[^"]+' simply means "match one or more characters that are not double quotes" (which includes the '/' slash character and all '\w' word characters). This first regex is much better than the second because it captures whatever URL sits between the double quotes, whereas the second one won't. For example, the first regex matches and captures each of the following valid links:

Code: Select all

<li><a href="/Contact/index.html">Contact</a></li></ul>
<li><a href="/Contact/?query1=value1&query2=value2">Contact</a></li></ul>
<li><a href="/Contact/#fragment">Contact</a></li></ul>
<li><a href="http://example.com/Contact/">Contact</a></li></ul>
<li><a href="http://example.com">Contact</a></li></ul>
while the second regex fails to completely capture any of them. It fails because it matches only the '/' slash and '\w' word characters and stops once it hits any '.' dot, '?' question mark, '&' ampersand, '#' hash mark, or ':' colon, all of which are perfectly valid (and common) URL characters.

The first regex is no more complex or less efficient than the second. They both use a repetition of a simple character class, although the first uses a negated character class (i.e. it begins with a caret '^' - for more on character classes read this). Also, your second regex does not require a match for the closing double quote and would thus return erroneous partial matches. For example, if you apply the second regex to the five URL examples above, you would capture the following incomplete URLs:

Code: Select all

/Contact/index
/Contact/
/Contact/
http
http
Hope this helps!
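To check the comparison above, here is a minimal sketch that runs both patterns over the five sample links and prints the two captures side by side. The variable names ($links, $negated, $wordChars, $results) are only illustrative and are not from the original posts.

```php
<?php
// The five sample links from the post above.
$links = [
    '<li><a href="/Contact/index.html">Contact</a></li></ul>',
    '<li><a href="/Contact/?query1=value1&query2=value2">Contact</a></li></ul>',
    '<li><a href="/Contact/#fragment">Contact</a></li></ul>',
    '<li><a href="http://example.com/Contact/">Contact</a></li></ul>',
    '<li><a href="http://example.com">Contact</a></li></ul>',
];

$negated   = '/href="([^"]+)"/';   // first regex: everything up to the closing quote
$wordChars = '/href="([\/\w]+)/';  // second regex: only slashes and word characters

$results = [];
foreach ($links as $line) {
    preg_match($negated, $line, $a);
    preg_match($wordChars, $line, $b);
    $results[] = [$a[1], $b[1]];   // [capture from first regex, capture from second]
    printf("%-45s | %s\n", $a[1], $b[1]);
}
```

The left column should show each complete URL, and the right column the truncated captures listed above.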
jackpf
DevNet Resident
Posts: 2119
Joined: Sun Feb 15, 2009 7:22 pm
Location: Ipswich, UK

Re: Confusing regex

Post by jackpf »

But surely wouldn't

Code: Select all

/href\=\"(.*?)\"/i
be the same?

Since it's going to stop at the next quote anyway...surely you don't have to specify you want to match anything that isn't a quote, but you can just use a wildcard instead? I may be completely wrong 8O just thought I'd ask...
Eddyphp
Forum Newbie
Posts: 8
Joined: Sun Sep 20, 2009 7:43 am

Re: Confusing regex

Post by Eddyphp »

That's very helpful; the negated character class was confusing me. Just a few clarifications: how is the quote next to the ) stopping the search? Also, where a literal search is used (e.g. href=" above), a pointer seems to be set to the next character in the string, /, which is where the capture ([^"]+) starts. Is there a way to include the full URL in the returned variable?
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Confusing regex

Post by ridgerunner »

jackpf wrote:But surely wouldn't

Code: Select all

/href\=\"(.*?)\"/i
be the same? ...
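Yes, for this input the regex is functionally equivalent and will match the exact same text. (Strictly speaking, '.*?' would also accept an empty href="", which '[^"]+' rejects, and the dot does not match a newline unless the 's' modifier is set.) Both will do the job nicely. The main difference is that this form is not quite as efficient as the '"([^"]+)"' form. To explain why, let's look in detail at how the PCRE regex engine handles matching the following text: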

Code: Select all

<li><a href="/Contact/">Contact</a></li></ul>
The first literal char in the regex 'h' is compared to the first char in the string '<'. This does not match, so the regex engine transmission "bumps along" to the next char in the string 'l'. This also does not match. This failure to match and bump along continues until the string pointer reaches the position of the 'href'. At this point the first char of the regex does match, so the regex attempts to match its second literal char 'r' to the 'r' in the string. This, too, matches. This continues until the literal string of the regex 'href="' matches the string text. So far so good. Then the regex encounters the opening parenthesis and begins a new capture group (and this is where the two different regex forms diverge).

Let's look at how your lazy star regex handles the matching. The first time the lazy star is encountered, it tries to match nothing at all (it is lazy, after all!). The engine saves a backtracking breadcrumb, exits the capturing parentheses, then attempts to match the double quote. When this fails, the engine backtracks to the previously saved breadcrumb, going back inside the capturing parentheses, then attempts to match the dot to the current character '/'. This matches successfully, but the lazy star (lazy as it is) again gives up, and the regex once again saves a backtracking breadcrumb, exits the capturing parentheses, and attempts to match the double quote. This fails, and the regex is forced to backtrack to the previously saved breadcrumb, once again going back inside the capturing parentheses. The dot then matches the 'C'. This process of saving breadcrumbs and backtracking in and out of the capturing parentheses continues until the entire '/Contact/' portion of the string is matched and finally the double quote also matches, at which time the regex declares success. In this case, every character matched between the double quotes requires the regex to save and restore backtracking information and to repeatedly move in and out of the capturing parentheses, which incurs memory and clock cycle overhead.

Now let's look at this improved regex:

Code: Select all

/href="([^"]++)"/i
Once inside the parentheses, this regex repeatedly attempts (via the possessive plus '++' quantifier) to match '[^"]' to the string. This matches the '/' char just fine. Then it matches the 'C' and then the 'o' and the 'n' and the 't' and so on until it encounters the double quote, where it fails to match. At this point the '++' quantifier has done its job, successfully matching one or more characters, so the regex moves on and exits the parentheses. It then matches the literal double quote and declares overall success. The capturing parentheses are entered and exited only once, and no backtracking was necessary at all. Between the double quotes we specify exactly what we want to match (any character that is not a double quote), and we are rewarded with efficiency for doing so.

With this simple case (and small subject text), the efficiency difference between the two regex styles is negligible. However, it is always best to specify precisely what you want, because this really pays off (time, $$$ and bandwidth) when you have a really big job to do, or if the regex is applied many times over (such as in the PHP code inside a forum software parser or within an Apache .htaccess config file!). Once again, I highly recommend reading Friedl's Mastering Regular Expressions, 3rd Edition - its chapters on the details of the regex engine and efficiency issues surely opened my eyes!
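For anyone who wants to measure this rather than take it on faith, here is a rough timing sketch. It assumes PHP 7.3 or later (for hrtime()); the iteration count and the labels are arbitrary choices, not anything from the posts above. Bear in mind that PCRE's own optimizations and its JIT compiler (where enabled) may narrow or hide the gap on a subject string this short, so treat the numbers as indicative only.

```php
<?php
$line = '<li><a href="/Contact/">Contact</a></li></ul>';

// The three forms discussed in this thread.
$patterns = [
    'lazy star'       => '/href="(.*?)"/i',
    'negated class'   => '/href="([^"]+)"/i',
    'possessive plus' => '/href="([^"]++)"/i',
];

$iterations = 100000; // arbitrary; raise it for steadier numbers
$captures = [];

foreach ($patterns as $label => $pattern) {
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $line, $m);
    }
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    $captures[$label] = $m[1]; // keep the capture to confirm all three agree
    printf("%-16s %8.2f ms  captured %s\n", $label, $elapsedMs, $m[1]);
}
```

All three patterns should report the same capture; only the timings differ, and those will vary by machine and PHP build.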
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Confusing regex

Post by ridgerunner »

Eddyphp wrote:... Is there a way to include the full url in the returned variable?
Not really sure what you are asking for specifically, but, yes indeed, the full URL is being captured in the first capture group and is available for immediate use. For example:

Code: Select all

$text = '<li><a href="http://www.example.com/Contact/">Contact</a></li></ul>';
if (preg_match('/href="([^"]+)"/', $text, $matches)) {
    $full_url = $matches[1]; // 'http://www.example.com/Contact/'
} else {
    $full_url = ""; // no href attribute found
}
The overall match is stored in $matches[0], capture group 1 is stored in $matches[1], capture group 2 is stored in $matches[2], and so on. Is that what you are asking?
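To make the numbering of the groups concrete, here is a small sketch with two capture groups. The second group (the link text) and the variable names $whole, $url and $label are additions for illustration only.

```php
<?php
$text = '<li><a href="http://www.example.com/Contact/">Contact</a></li></ul>';

// Group 1: the URL between the quotes.
// Group 2: the link text, i.e. everything up to the next '<'.
if (preg_match('/href="([^"]+)">([^<]*)</', $text, $matches)) {
    $whole = $matches[0]; // everything the regex matched, including href=" itself
    $url   = $matches[1]; // first pair of parentheses
    $label = $matches[2]; // second pair of parentheses
    echo $whole, "\n", $url, "\n", $label, "\n";
}
```

So if the goal is to get the href=" prefix back along with the URL, $matches[0] already holds it; $matches[1] holds only what the first pair of parentheses captured.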
jackpf
DevNet Resident
Posts: 2119
Joined: Sun Feb 15, 2009 7:22 pm
Location: Ipswich, UK

Re: Confusing regex

Post by jackpf »

That explanation was awesome. I never knew the "lazy star" was less efficient.

:bow:

And yeah...I'm not sure about reading an entire book on regex. I think there are two types of people who use regex. People who just use it, and can "get by", and then there are enthusiasts.

I think I am the former, you are the latter :P Which is why I use the inefficient methods that "work", and you use the efficient methods that excel. Maybe if I get bored one day and have a spare £15, I might have a look ;)

But yeah, kudos to your explanation :)