Confusing regex
Hi,
I read the great tutorial in this post; however, can someone explain the logic of this expression and how the results differ from the one below?
preg_match('/href="([^"]+)"/', $line, $stringBetweenQuotes);
where $line string contains the following:
<li><a href="/Contact/">Contact</a></li></ul>
preg_match('/href="([\/\w]+)/',$line,$stringBetweenQuotes);
Patrick.
Re: Confusing regex
I believe the first pattern will match anything that isn't a quote. The second will match any word characters.
Re: Confusing regex
When both code samples are run, array element [1] contains /Contact/
The first one seems to take a more complex approach, so I was wondering whether there is any benefit. Also, I'm not sure how the first one is logically extracting the string /Contact/
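For reference, the observation above can be reproduced with a short script. This is only a minimal sketch, assuming the same $line value as in the first post; the result array names $negated and $wordOnly are illustrative and do not come from the thread.

```php
<?php
// Same sample line as in the first post.
$line = '<li><a href="/Contact/">Contact</a></li></ul>';

// Pattern 1: capture everything up to (but not including) the closing quote.
preg_match('/href="([^"]+)"/', $line, $negated);

// Pattern 2: capture only slashes and word characters; no closing quote is required.
preg_match('/href="([\/\w]+)/', $line, $wordOnly);

echo $negated[1], "\n";   // /Contact/
echo $wordOnly[1], "\n";  // /Contact/
```

On this particular input the two captures agree because the href value happens to contain only slashes and word characters.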
- ridgerunner
- Forum Contributor
- Posts: 214
- Joined: Sun Jul 05, 2009 10:39 pm
- Location: SLC, UT
Re: Confusing regex
Eddyphp wrote:... can someone explain the logic of this expression and how the results differ from the one below?
The first one: '[^"]+' simply means: "match one or more characters that are not double quotes" (which includes the '/' slash character and all '\w' word characters). This first regex is much better than the second because it matches any valid URL whereas the second one won't. For example, the first regex matches and captures each of the following valid links:
Code: Select all
<li><a href="/Contact/index.html">Contact</a></li></ul>
<li><a href="/Contact/?query1=value1&query2=value2">Contact</a></li></ul>
<li><a href="/Contact/#fragment">Contact</a></li></ul>
<li><a href="http://example.com/Contact/">Contact</a></li></ul>
<li><a href="http://example.com">Contact</a></li></ul>
while the second regex fails to completely capture any of them. It fails because it matches only the '/' slash and '\w' word characters and stops once it hits any '.' dot, '?' question mark, '&' ampersand, '#' hash mark, or ':' colon, all of which are perfectly valid (and common) URL characters.
The first regex is no more complex or less efficient than the second. They both use a repetition of a simple character class, although the first uses a negated character class (i.e. it begins with a caret '^' - for more on character classes read this). Also, your second regex does not require a match for the closing double quote and would thus return erroneous partial matches. For example, if you apply the second regex to the five URL examples above, you would capture the following incomplete URLs:
Code: Select all
/Contact/index
/Contact/
/Contact/
http
http
Hope this helps!
Re: Confusing regex
But surely
Code: Select all
/href\=\"(.*?)\"/i
wouldn't be the same?
Since it's going to stop at the next quote anyway...surely you don't have to specify you want to match anything that isn't a quote, but you can just use a wildcard instead? I may be completely wrong
just thought I'd ask...
Re: Confusing regex
ridgerunner wrote:The first one: '[^"]+' simply means: "match one or more characters that are not double quotes" ... the second regex fails to completely capture any of them. It fails because it matches only the '/' slash and '\w' word characters and stops once it hits any '.' dot, '?' question mark, '&' ampersand, '#' hash mark, or ':' colon, all of which are perfectly valid (and common) URL characters. ... Hope this helps!
That's very helpful, the negated character class was confusing me. Just a few clarifications: how is the quote next to the ) stopping the search? Also, where a literal is searched for, e.g. href=" above, a pointer seems to be set to the next character in the string (the /), where the capture ([^"]+) starts. Is there a way to include the full URL in the returned variable?
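ridgerunner's comparison of the two patterns can be checked with a small loop. The following is a sketch only, assuming the five sample href values from that post; the array names $urls, $full and $partial are illustrative.

```php
<?php
// The five sample href values from ridgerunner's post.
$urls = [
    '/Contact/index.html',
    '/Contact/?query1=value1&query2=value2',
    '/Contact/#fragment',
    'http://example.com/Contact/',
    'http://example.com',
];

$full = [];     // captures from the negated-class pattern
$partial = [];  // captures from the [\/\w]+ pattern

foreach ($urls as $url) {
    // Wrap each value in the same markup used throughout the thread.
    $line = '<li><a href="' . $url . '">Contact</a></li></ul>';

    preg_match('/href="([^"]+)"/', $line, $m1);
    preg_match('/href="([\/\w]+)/', $line, $m2);

    $full[] = $m1[1];
    $partial[] = $m2[1];
}

print_r($full);     // the five complete values
print_r($partial);  // /Contact/index, /Contact/, /Contact/, http, http
```

The negated class keeps consuming characters until it reaches the closing double quote, which the pattern then requires as a literal; the second pattern simply stops at the first character outside its class.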
- ridgerunner
Re: Confusing regex
jackpf wrote:But surely
Code: Select all
/href\=\"(.*?)\"/i
wouldn't be the same? ...
Yes, for this input this regex is functionally equivalent and will match the exact same text (the two forms differ only in edge cases such as an empty href="" or a value containing a newline). Both will do the job nicely. The only difference is that this form is not quite as efficient as the '"([^"]+)"' form. To explain why, let's look in detail at how the PCRE regex engine handles matching the following text:
Code: Select all
<li><a href="/Contact/">Contact</a></li></ul>
Let's look at how your lazy star regex handles the matching. The first time the lazy star is encountered, it tries to match nothing at all (it is lazy after all!). The engine saves a backtracking breadcrumb, exits the capturing parentheses, then attempts to match the double quote. When this fails, the engine backtracks to the previously saved breadcrumb, going back inside the capturing parentheses, then attempts to match the dot to the current character '/'. This matches successfully, but the lazy star (lazy as it is) again gives up, and the regex once again saves a backtracking breadcrumb, exits the capturing parentheses, and attempts to match the double quote. This fails and the regex is forced to backtrack to the previously saved breadcrumb, once again going back inside the capturing parentheses. The dot then matches the 'C'.
This process of saving breadcrumbs and backtracking in and out of the capturing parentheses continues until the entire '/Contact/' portion of the string is matched and finally the double quote also matches, at which time the regex declares success. In this case, every character match between the double quotes requires the regex to save and restore backtracking information and to repeatedly move in and out of the capturing parentheses, which incurs memory and clock cycle overhead.
Now let's look at this improved regex:
Code: Select all
/href="([^"]++)"/i
Here the negated character class '[^"]++' consumes the whole '/Contact/' in one go and stops at the closing double quote, and because the '++' quantifier is possessive the engine keeps no backtracking breadcrumbs for it.
With this simple case (and small subject text), the efficiency difference between the two regex styles is negligible. However, it is always best to specify precisely what you want because this really pays off (time, $$$ and bandwidth) when you have a really big job to do, or if the regex is applied many, many times over (such as in the PHP code inside a forum software parser or within an Apache .htaccess config file!). Once again, I highly recommend reading Friedl's Mastering Regular Expressions 3rd Edition - its chapters on the details of the regex engine and efficiency issues surely opened my eyes!
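The three forms discussed in this thread can be compared side by side. This is a rough sketch, not a benchmark: timePattern() is a hypothetical helper written for this example, the run count is arbitrary, and the timings will vary with the PHP version, PCRE build and JIT settings.

```php
<?php
// Hypothetical helper: rough wall-clock time for $runs matches of $pattern.
function timePattern(string $pattern, string $subject, int $runs = 100000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match($pattern, $subject, $m);
    }
    return microtime(true) - $start;
}

$line = '<li><a href="/Contact/">Contact</a></li></ul>';

$patterns = [
    'lazy dot'      => '/href\=\"(.*?)\"/i',  // jackpf's form
    'negated class' => '/href="([^"]+)"/i',   // the form from the first post
    'possessive'    => '/href="([^"]++)"/i',  // ridgerunner's improved form
];

$captures = [];
foreach ($patterns as $label => $pattern) {
    if (preg_match($pattern, $line, $m)) {
        $captures[$label] = $m[1];
    }
    // Timings are printed without any expected value; they are machine dependent.
    printf("%-14s %.4f s\n", $label, timePattern($pattern, $line));
}

print_r($captures);  // all three capture /Contact/
```

On a subject this short any difference is likely to be lost in the noise, which is consistent with the point above that the gain only matters for large inputs or heavily repeated matching.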
- ridgerunner
Re: Confusing regex
Eddyphp wrote:... Is there a way to include the full url in the returned variable?
Not really sure what you are asking for specifically, but yes indeed, the full URL is being captured in the first capture group and is available for immediate use. For example:
Code: Select all
$text = '<li><a href="http://www.example.com/Contact/">Contact</a></li></ul>';
if (preg_match('/href="([^"]+)"/', $text, $matches)) {
    $full_url = $matches[1]; // 'http://www.example.com/Contact/'
} else {
    $full_url = ""; // no href attribute found
}
Re: Confusing regex
That explanation was awesome. I never knew the "lazy star" was less efficient.

And yeah...I'm not sure about reading an entire book on regex. I think there are two types of people who use regex. People who just use it, and can "get by", and then there are enthusiasts.
I think I am the former, you are the latter
Which is why I use the inefficient methods that "work", and you use the efficient methods that excel. Maybe if I get bored one day and have a spare £15, I might have a look 
But yeah, kudos to your explanation