jackpf wrote: But surely wouldn't
be the same? ...
Yes, this regex is functionally equivalent and will match the exact same text. Both will do the job nicely. The only difference is that this form is not quite as efficient as the '"([^"]+)"' form. To explain why, let's look in detail at how the PCRE regex engine handles matching the following text:
<li><a href="/Contact/">Contact</a></li></ul>
The first literal char in the regex, 'h', is compared to the first char in the string, '<'. This does not match, so the regex engine's "transmission" bumps along to the next char in the string, 'l'. This also does not match. This cycle of failing to match and bumping along continues until the string pointer reaches the position of the 'href'. At this point the first char of the regex does match, so the regex attempts to match its second literal char 'r' to the 'r' in the string. This, too, matches. This continues until the literal string of the regex, 'href="', has matched the string text. So far so good. Then the regex encounters the opening parenthesis and begins a new capture group (and this is where the two different regex forms diverge).
Let's look at how your lazy star regex handles the matching. The first time the lazy star is encountered, it tries to match nothing at all (it is lazy, after all!). The engine saves a backtracking breadcrumb, exits the capturing parentheses, then attempts to match the double quote. When this fails, the engine backtracks to the previously saved breadcrumb, going back inside the capturing parentheses, then attempts to match the dot to the current character '/'. This matches successfully, but the lazy star (lazy as it is) again gives up, and the regex once again saves a backtracking breadcrumb, exits the capturing parentheses, and attempts to match the double quote. This fails and the regex is forced to backtrack to the previously saved breadcrumb, once again going back inside the capturing parentheses. The dot then matches the 'C'. This process of saving breadcrumbs and backtracking in and out of the capturing parentheses continues until the entire '/Contact/' portion of the string is matched and finally the double quote also matches, at which time the regex declares success. In this case, every character matched between the double quotes requires the regex to save and restore backtracking information and to repeatedly move in and out of the capturing parentheses, which incurs memory and clock cycle overhead.
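To make the bookkeeping above a bit more concrete, here is a rough toy model in Python. It is not how PCRE is actually implemented; the function name `lazy_capture` and the counter are purely hypothetical, and the model only mimics the "try the quote first, fall back to the dot" loop for the part of the regex after 'href="'.

```python
def lazy_capture(text, start):
    """Toy model of a lazy (.*?) followed by a double quote.

    `start` is the index just after the literal href=" prefix.
    Returns (captured_text, backtracks); captured_text is None on failure.
    """
    pos = start
    backtracks = 0
    while pos < len(text):
        # Lazy behaviour: leave the group and try the closing quote first.
        if text[pos] == '"':
            return text[start:pos], backtracks
        # The quote failed, so go back into the group (one backtrack).
        backtracks += 1
        # By default the dot does not match a newline, so the match fails here.
        if text[pos] == '\n':
            return None, backtracks
        # The dot consumes exactly one more character, then we loop.
        pos += 1
    return None, backtracks


sample = '<li><a href="/Contact/">Contact</a></li></ul>'
prefix = 'href="'
start = sample.index(prefix) + len(prefix)
captured, backtracks = lazy_capture(sample, start)
print(captured, backtracks)
```

In this model there is one backtrack for every character of '/Contact/', which is the per-character overhead described above.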
Now let's look at this improved regex:
Once inside the parentheses, this regex repeatedly attempts (via the possessive plus '++' quantifier) to match '[^"]' to the string. This matches the '/' char just fine. Then it matches the 'C' and then the 'o' and the 'n' and the 't' and so on until it encounters the double quote, where it fails to match. At this point the '++' quantifier has done its job, successfully matching one or more characters, so the regex moves on and finally exits the parentheses. It then matches the literal double quote and declares overall success. Here the capturing parentheses are entered and exited only once, and no backtracking was necessary at all. Between the double quotes we specify exactly what we want to match, any character that is not a double quote, and we are rewarded with efficiency for doing so.
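As a quick sanity check, the two styles can be run side by side. This is only a sketch: it uses Python's `re` module rather than PCRE, and the two patterns below are my assumption of what the forms under discussion look like (a lazy dot-star and a negated character class with a plain greedy plus, since the possessive '++' is only accepted by newer Python versions).

```python
import re

sample = '<li><a href="/Contact/">Contact</a></li></ul>'

# Assumed reconstructions of the two regex styles being compared.
lazy = re.compile(r'href="(.*?)"')        # lazy dot-star form
negated = re.compile(r'href="([^"]+)"')   # negated character class form

lazy_match = lazy.search(sample)
negated_match = negated.search(sample)

# Both capture groups hold the same text for this subject string.
print(lazy_match.group(1))
print(negated_match.group(1))
```

One small caveat if the patterns really are as assumed here: the star form will also match an empty attribute such as href="", while the plus form requires at least one character between the quotes.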
With this simple case (and small subject text), the efficiency difference between the two regex styles is negligible. However, it is always best to specify precisely what you want, because this really pays off (time, $$$ and bandwidth) when you have a really big job to do, or if the regex is applied over and over again (such as in the PHP code inside a forum software parser or within an Apache .htaccess config file!). Once again, I highly recommend reading Friedl's Mastering Regular Expressions 3rd Edition; its chapters on the details of the regex engine and efficiency issues surely opened my eyes!
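If you want to measure the difference yourself, something along these lines should do as a starting point. Again this is a sketch using Python's `re` and `timeit`, not PCRE; the patterns are the same assumed forms as before, `time_pattern` is a hypothetical helper, and the repeat counts are arbitrary. Actual numbers will vary by engine, version and machine, so I won't quote any.

```python
import re
import timeit


def time_pattern(pattern, subject, number=200):
    """Return the seconds taken to run findall() `number` times."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.findall(subject), number=number)


# Repeat the sample line to get a somewhat larger subject text.
line = '<li><a href="/Contact/">Contact</a></li></ul>\n'
subject = line * 200

lazy_seconds = time_pattern(r'href="(.*?)"', subject)
negated_seconds = time_pattern(r'href="([^"]+)"', subject)

print(f'lazy dot-star: {lazy_seconds:.4f} s')
print(f'negated class: {negated_seconds:.4f} s')
```

Try it with longer attribute values or a much bigger subject to see how the gap changes in your own setup.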