REGEX questions

Posted: Tue Dec 14, 2010 11:58 am
by paulstanely45
Hello,

Just a quick question for my own curiosity mostly. Hopefully someone can shed light on this.

I have this code

Code: Select all

<?php
$string = "breaks\kristin.txt";


preg_match("/breaks\\\([0-9A-Za-z\-_]+)\.txt/i", $string, $matches);

if(isset($matches[1])) {
    echo $matches[1];
}

?>
Which of course echoes out 'kristin', as it should.

My big question is: why do I need the triple backslash? I thought I would only need two, one backslash to escape the other.

However, if I use this code:

Code: Select all

<?php
$string = "breaks\kristin.txt";


preg_match("/breaks\\([0-9A-Za-z\-_]+)\.txt/i", $string, $matches);

if(isset($matches[1])) {
    echo $matches[1];
}

?>
I get the following error:

Warning: preg_match(): Compilation failed: unmatched parentheses at offset 23 in /Volumes/DATA1/webserver/sandbox/break.php on line 5


Could someone explain? I would think that only two would be needed. Wouldn't three backslashes leave one escaped backslash, with another one escaping the opening parenthesis?

Re: REGEX questions

Posted: Tue Dec 14, 2010 12:17 pm
by AbraCadaver
PHP does not require every backslash in a string literal to be escaped. A backslash only needs escaping when, together with the character after it, it would otherwise form an escape sequence (\\, \n, \" and so on); in front of any other character, such as (, it is kept as-is. So in your example, "\\(" reaches the regex engine as \(, which the engine reads as an escaped, literal (, and the closing ) is then left without a partner. To match a literal backslash followed by a real group, the engine has to receive \\(. Writing \\\( gets you there, because PHP turns \\ into \ and leaves \( alone; writing \\\\( gives the same result and removes the ambiguity.
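
To see the string-level step on its own, here is a minimal sketch that only prints what each double-quoted literal from the question turns into before the regex engine ever looks at it. The variable names $two and $three are just illustrative.

```php
<?php
// What PHP hands to the regex engine for each double-quoted literal.
$two   = "/breaks\\([0-9A-Za-z\-_]+)\.txt/i";   // \\ collapses to a single \
$three = "/breaks\\\([0-9A-Za-z\-_]+)\.txt/i";  // \\ collapses to \, and \( is kept as-is

echo $two, "\n";   // /breaks\([0-9A-Za-z\-_]+)\.txt/i
echo $three, "\n"; // /breaks\\([0-9A-Za-z\-_]+)\.txt/i
```

The first line printed contains \( (an escaped parenthesis for the regex engine), the second contains \\( (an escaped backslash followed by an ordinary group).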

Hope that makes sense.

Re: REGEX questions

Posted: Tue Dec 14, 2010 12:21 pm
by Jonah Bron
Okay, first I'd like to say I was very surprised at this, and for the first few minutes it totally stumped me, but I think I have it figured out now. The problem is that there are two levels of escaping going on: first the PHP string literal, and then the regex. Here's an example:

Code: Select all

$string = "\\";
echo $string; // output: "\"

$regex = '/\/';
echo $regex; // output: "/\/"
preg_match($regex, $string, $matches); // regex parse error, no ending delimiter found.  \ escaped /

$regex = '/\\/';
echo $regex; // output: "/\/"
preg_match($regex, $string, $matches); // regex parse error, no ending delimiter found.  \ escaped /

$regex = '/\\\/';
echo $regex; // output: "/\\/"
preg_match($regex, $string, $matches); // successful, \ escaped \
Do you see how that works now?

@AbraCadaver: I think it's because there are two escapes happening.
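
As a small sketch of the two layers, the three-backslash literal and the fully escaped four-backslash literal should end up as the same pattern, assuming single-quoted strings as in the example above. $subject, $three and $four are illustrative names.

```php
<?php
$subject = "\\";    // one backslash character

// Single-quoted literals: \\ becomes \, and \/ is left alone.
$three = '/\\\/';   // the regex engine sees /\\/
$four  = '/\\\\/';  // the regex engine sees /\\/ too, with every backslash escaped

var_dump($three === $four);            // bool(true)
var_dump(preg_match($four, $subject)); // int(1)
```

The four-backslash form is longer, but it does not rely on PHP leaving an unrecognised escape untouched.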

Re: REGEX questions

Posted: Wed Dec 15, 2010 10:59 am
by paulstanely45
AbraCadaver wrote: PHP does not require every backslash in a string literal to be escaped. A backslash only needs escaping when, together with the character after it, it would otherwise form an escape sequence (\\, \n, \" and so on); in front of any other character, such as (, it is kept as-is. So in your example, "\\(" reaches the regex engine as \(, which the engine reads as an escaped, literal (, and the closing ) is then left without a partner. To match a literal backslash followed by a real group, the engine has to receive \\(. Writing \\\( gets you there, because PHP turns \\ into \ and leaves \( alone; writing \\\\( gives the same result and removes the ambiguity.

Hope that makes sense.
I think that makes sense.

It seems to me that what needs to be passed to the preg_match function is a string that would actually evaluate to

Code: Select all

"/breaks\\([0-9A-Za-z\-_]+)\.txt/i"
because the regex engine wants to see a double backslash to know that the backslash there is supposed to be a literal. A triple backslash in front of the ( in a PHP string evaluates to a double backslash followed by (, since PHP collapses the first two backslashes into one and leaves \( alone.

I think that's right. Please do correct me if I am wrong.
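
One way to check this reading is to write the pattern once with no string escaping at all and compare. The sketch below uses a nowdoc, which PHP does not process for escapes; the RE label and the variable names are arbitrary choices.

```php
<?php
// Nowdoc: no escape processing, so this is the pattern exactly as typed.
$raw = <<<'RE'
/breaks\\([0-9A-Za-z\-_]+)\.txt/i
RE;

// The triple-backslash literal from the first post.
$literal = "/breaks\\\([0-9A-Za-z\-_]+)\.txt/i";

var_dump($raw === $literal); // bool(true)
```

If the two strings compare equal, the triple-backslash literal really does reach preg_match() as the double-backslash pattern.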

Re: REGEX questions

Posted: Wed Dec 15, 2010 11:05 am
by Jonah Bron
Yes, that's right. You need three because the string escapes it, and then the regex escapes it. Here's the process:

Code: Select all

string input: "/breaks\\\([0-9a-z-_]+)\.txt/i"
Double backslash escapes to one:
        vv
"/breaks\\\([0-9a-z-_]+)\.txt/i"
"/breaks\\([0-9a-z-_]+)\.txt/i"

No more string escapes: \( and \. are not PHP escape sequences, so they are left as-is

regex input: "/breaks\\([0-9a-z-_]+)\.txt/i"
double backslash escapes to one:
        vv
"/breaks\\([0-9a-z-_]+)\.txt/i"
"/breaks\([0-9a-z-_]+)\.txt/i"

backslash escapes dot to literal dot
...
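
The second stage of that process can be tried side by side. This is only a sketch: it reuses the subject and the character class from the first post, and it uses @ purely to keep the compile warning off the screen.

```php
<?php
$subject = "breaks\kristin.txt";

// Three backslashes: the engine receives \\( which is a literal backslash, then a capture group.
$ok = preg_match("/breaks\\\([0-9A-Za-z\-_]+)\.txt/i", $subject, $matches);
echo $ok, " ", $matches[1], "\n"; // 1 kristin

// Two backslashes: the engine receives \( which is a literal parenthesis,
// so the later ) is unmatched and the pattern does not compile.
// preg_match() returns false on failure; @ hides the warning.
$bad = @preg_match("/breaks\\([0-9A-Za-z\-_]+)\.txt/i", $subject);
var_dump($bad); // bool(false)
```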