REGEX questions
Posted: Tue Dec 14, 2010 11:58 am
by paulstanely45
Hello,
Just a quick question for my own curiosity mostly. Hopefully someone can shed light on this.
I have this code:
Code:
<?php
$string = "breaks\kristin.txt";
preg_match("/breaks\\\([0-9A-Za-z\-_]+)\.txt/i", $string, $matches);
if(isset($matches[1])) {
echo $matches[1];
}
?>
Which of course echoes 'kristin', as it should.
My main question is why I need the triple backslash. I thought I would need only two: one backslash to escape the other.
However, if I use this code:
Code:
<?php
$string = "breaks\kristin.txt";
preg_match("/breaks\\([0-9A-Za-z\-_]+)\.txt/i", $string, $matches);
if(isset($matches[1])) {
echo $matches[1];
}
?>
I get the following error:
Warning: preg_match(): Compilation failed: unmatched parentheses at offset 23 in /Volumes/DATA1/webserver/sandbox/break.php on line 5
Could someone explain? I would think that only two would be needed. Wouldn't three backslashes leave one escaped backslash, plus another one escaping the opening parenthesis?
Re: REGEX questions
Posted: Tue Dec 14, 2010 12:17 pm
by AbraCadaver
PHP does not require every backslash in a string to be escaped. A backslash only has to be doubled when it would otherwise be read as the start of an escape sequence; a backslash followed by a character that PHP doesn't recognize as an escape, such as (, is left as is. So in your example, \\( in the PHP string becomes \( by the time it reaches the regex engine, which reads it as an escaped, literal (, and the closing ) is then unmatched. With \\\( the first two backslashes collapse into one and the \( is left untouched, so the regex engine receives \\(: a literal backslash followed by the start of a group.
Hope that makes sense.
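To check this without counting by eye, a minimal sketch (not from the original posts; the variable names are made up for illustration) can print what each literal contains after PHP has parsed it, and then try both patterns. The @ is only there to hide the compile warning quoted in the first post.

```php
<?php
// Double-quoted literals: PHP collapses \\ to \ and leaves \( \- \. alone.
$two   = "/breaks\\([0-9A-Za-z\-_]+)\.txt/i";   // two backslashes in source
$three = "/breaks\\\([0-9A-Za-z\-_]+)\.txt/i";  // three backslashes in source

// These are the strings the regex engine actually receives.
echo $two, "\n";
echo $three, "\n";

$subject = "breaks\kristin.txt"; // \k is not a PHP escape, so it stays as is

// Two backslashes: the engine sees an escaped "(" and then an unmatched ")".
$resultTwo = @preg_match($two, $subject, $matchesTwo);

// Three backslashes: the engine sees a literal backslash, then a capture group.
$resultThree = preg_match($three, $subject, $matchesThree);

var_dump($resultTwo);        // bool(false): the pattern failed to compile
var_dump($resultThree);      // int(1)
echo $matchesThree[1], "\n"; // kristin
```

The first echoed line has one backslash before the parenthesis and the second has two, which is the whole difference as far as the regex engine is concerned.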
Re: REGEX questions
Posted: Tue Dec 14, 2010 12:21 pm
by Jonah Bron
Okay, first I'd like to say I was very surprised at this, and for the first few minutes it totally stumped me, but I think I have it figured out now. The problem is that there are two levels of escaping going on: the PHP string and the regex. Here's an example:
Code:
$string = "\\";
echo $string; // output: "\"
$regex = '/\/';
echo $regex; // output: "/\/"
preg_match($regex, $string, $matches); // regex parse error, no ending delimiter found. \ escaped /
$regex = '/\\/';
echo $regex; // output: "/\/"
preg_match($regex, $string, $matches); // regex parse error, no ending delimiter found. \ escaped /
$regex = '/\\\/';
echo $regex; // output: "/\\/"
preg_match($regex, $string, $matches); // successful, \ escaped \
Do you see how that works now?
@AbraCadaver: I think it's because there are two escapes happening.
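One way to see the string-level step on its own is to compare the literals directly. The sketch below is an illustration rather than part of the original example; it assumes single-quoted literals as in the snippet above, and the variable names are arbitrary.

```php
<?php
// In single quotes only \\ and \' are escapes; \/ is kept as two characters.
$one   = '/\/';     // chars: / \ /
$two   = '/\\/';    // chars: / \ /   (same string as $one)
$three = '/\\\/';   // chars: / \ \ /
$four  = '/\\\\/';  // chars: / \ \ / (same string as $three)

var_dump($one === $two);     // bool(true)
var_dump($three === $four);  // bool(true)
var_dump(strlen($three));    // int(4)

// Only the four-character pattern compiles; it matches one literal backslash.
var_dump(preg_match($three, '\\')); // int(1)
```

So one and two backslashes in the source give the regex engine the same three characters, and three or four give it the same four characters.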
Re: REGEX questions
Posted: Wed Dec 15, 2010 10:59 am
by paulstanely45
AbraCadaver wrote: PHP does not require every backslash in a string to be escaped. A backslash only has to be doubled when it would otherwise be read as the start of an escape sequence; a backslash followed by a character that PHP doesn't recognize as an escape, such as (, is left as is. So in your example, \\( in the PHP string becomes \( by the time it reaches the regex engine, which reads it as an escaped, literal (, and the closing ) is then unmatched. With \\\( the first two backslashes collapse into one and the \( is left untouched, so the regex engine receives \\(: a literal backslash followed by the start of a group.
Hope that makes sense.
I think that makes sense.
It seems to me that what needs to be passed to the preg_match function is a string that would actually evaluate to
Code:
"/breaks\\([0-9A-Za-z\-_]+)\.txt/i"
because the regex engine needs to see a double backslash to know that the backslash there is meant literally. A triple backslash followed by ( in a PHP string evaluates to a literal double backslash followed by (, since \\ collapses to \ and \( is left untouched.
I think that's right. Please do correct me if I am wrong.
Re: REGEX questions
Posted: Wed Dec 15, 2010 11:05 am
by Jonah Bron
Yes, that's right. You need three because the PHP string parser unescapes it first, and then the regex engine unescapes it again. Here's the process:
Code:
string input: "/breaks\\\([0-9a-z-_]+)\.txt/i"
Double backslash escapes to one:
        vv
"/breaks\\\([0-9a-z-_]+)\.txt/i"
"/breaks\\([0-9a-z-_]+)\.txt/i"
No more string escapes (\( and \. are not PHP escape sequences)
regex input: "/breaks\\([0-9a-z-_]+)\.txt/i"
Double backslash escapes to one:
        vv
"/breaks\\([0-9a-z-_]+)\.txt/i"
"/breaks\([0-9a-z-_]+)\.txt/i"
(the remaining \ is now a literal backslash, and ( opens the capture group)
Backslash escapes dot to literal dot
...
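If the double counting is the main annoyance, one option is to remove the string-level step altogether. The sketch below assumes a nowdoc (PHP 5.3+), which does no escape processing, so the pattern is written exactly as the regex engine should see it. The REGEX label and the variable names are arbitrary, and the character class is reordered to [0-9a-z_-] only to keep the hyphen unambiguous.

```php
<?php
// Nowdoc: no string-level escapes, so only the regex level remains.
$pattern = <<<'REGEX'
/breaks\\([0-9a-z_-]+)\.txt/i
REGEX;

// Same characters as the triple-backslash double-quoted literal.
var_dump($pattern === "/breaks\\\([0-9a-z_-]+)\.txt/i"); // bool(true)

$subject = 'breaks\kristin.txt';
if (preg_match($pattern, $subject, $matches)) {
    echo $matches[1], "\n"; // kristin
}
```

With this form the two backslashes in the source are exactly the two the regex engine reads as one literal backslash.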