
Match everything that's not between quotation marks (")

Posted: Tue Oct 14, 2008 1:32 am
by JellyFish
I have the string:

Code: Select all

 
This isn't a string "this is a string" This also isn't a string "More string" This isn't a string
 
I'd like to be able to select everything that's not within quotation marks:

Code: Select all

 
[color=blue][u]This isn't a string [/u][/color]"this is a string"[color=orange][u] This also isn't a string [/u][/color]"More string"[color=blue][u] This isn't a string[/u][/color]
 
with a regular expression. I haven't been able to figure out how to do this yet, and have googled the problem. I have however been able to figure out the opposite match of what I'm looking for:

Code: Select all

/"[^"]*"/
matches:

Code: Select all

 
This isn't a string [color=blue][u]"this is a string"[/u][/color] This also isn't a string [color=orange][u]"More string"[/u][/color] This isn't a string
 
If I knew how to invert regular expressions my issue would have been solved already.
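One way to get the "inverse" without changing the regex at all is to split the text on the quoted parts and keep what is left over. A minimal JavaScript sketch, assuming the quotes are balanced and never escaped (the variable names are only illustrative):

```javascript
// Split on the quoted parts; what remains is the text outside the quotes.
const text = `This isn't a string "this is a string" This also isn't a string "More string" This isn't a string`;

const quoted = /"[^"]*"/;            // matches one complete "..." section
const outside = text.split(quoted);  // the pieces between and around the matches

console.log(outside);
// [ "This isn't a string ", " This also isn't a string ", " This isn't a string" ]
```

If the text starts or ends with a quoted part, split() returns an empty string at that end, which may need to be filtered out.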

This is where I've been testing my regexes: http://regexpal.com/

Also, this regex selects everything in and around the quotes:

Code: Select all

[^"]+

Code: Select all

[color=blue][u]This isn't a string [/u][/color]"[color=orange][u]this is a string[/u][/color]"[color=blue][u] This also isn't a string [/u][/color]"[color=orange][u]More string[/u][/color]"[color=blue][u] This isn't a string[/u][/color]
Maybe this could be useful.

I hope I made myself as clear as I could, and have helped you understand my problem.

Thanks.

Re: Match everything that's not between quotation marks (")

Posted: Tue Oct 14, 2008 2:02 am
by prometheuzz
Try this:

Code: Select all

$text = 'This isn\'t a string "this is a string" This also isn\'t a string "More string" This isn\'t a string';
$regex = '/[^"]+(?=(?:[^"]*"[^"]*"[^"]*)*$)/';
echo "$text\n";
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
The regex translated to English would (more or less) say: "match one or more non-quote characters only if there are zero or an even number of quotes to be seen when looking ahead (to the end of the string!!!)".
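Since the pattern uses only a lookahead, it should carry over to JavaScript as written. A small sketch of the same idea, assuming unescaped, balanced quotes (the variable names are just for illustration):

```javascript
const text = `This isn't a string "this is a string" This also isn't a string "More string" This isn't a string`;

// One or more non-quote characters, but only where the rest of the input
// consists of zero or more complete "..." pairs (an even number of quotes).
const regex = /[^"]+(?=(?:[^"]*"[^"]*"[^"]*)*$)/g;

const matches = text.match(regex);
console.log(matches);
// [ "This isn't a string ", " This also isn't a string ", " This isn't a string" ]
```

Text inside the quotes is rejected because an odd number of quotes remains ahead of it, so the lookahead cannot reach the end of the string.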

HTH

Edit: the forum software keeps "eating" the backslash from the backslash-single-quote occurrences in strings...

Re: Match everything that's not between quotation marks (")

Posted: Tue Oct 14, 2008 3:11 pm
by JellyFish
Thank you for your reply, prometheuzz. I've modified your regex a bit so that I can place sub-regexes inside of it, which lets me match anything (rather than everything) that's not in between quotes.

Code: Select all

 
this part here could be any regex(?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)
 
I just added a little |[^"] so now the expression says: "match the string 'this part here could be any regex' only if there are zero or an even number of quotes to be seen when looking ahead (to the end of the string!!!)". In fact, once [^"]+ is replaced by a sub-regex, the one you gave me didn't match things if they didn't have quotes after them. For example, with the string:

Code: Select all

"string" not string"string" not string

Code: Select all

[^"](?=(?:[^"]*"[^"]*"[^"]*)*$)
would match the first not string part and the last g of the whole string.

Code: Select all

[^"](?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)
would match the first and last not string part
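The "any sub-regex outside quotes" idea can be wrapped in a small helper. This is only a sketch: onlyOutsideQuotes is a hypothetical name, and it assumes quotes are balanced and never escaped.

```javascript
// Lookahead from the post above: the rest of the input must consist of
// complete "..." pairs and/or non-quote characters, up to the end.
const OUTSIDE = '(?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)';

// Hypothetical helper: builds a global regex that matches `subPattern`
// only where it occurs outside double quotes.
function onlyOutsideQuotes(subPattern) {
  return new RegExp(`(?:${subPattern})${OUTSIDE}`, 'g');
}

const text = '"string" not string"string" not string';
const result = text.replace(onlyOutsideQuotes('string'), 'STRING');

console.log(result);
// "string" not STRING"string" not STRING
```

The nested repetition in the lookahead can backtrack a lot when it fails, so this is probably best kept to short inputs.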


But there is one last issue that didn't become apparent at first. If I had the string:

Code: Select all

var foo = "hello \"world\"";
the quotes after \ would be treated as normal quotes, thus allowing world to be matched. The regex should be able to ignore quotes with a \ before them. There's the "followed by" expression, (?=), but I don't think there's a "preceded by" expression.
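The problem can be reproduced in a few lines of JavaScript. This sketch uses \w+ as an arbitrary sub-regex (an assumption made for the demonstration) in front of the lookahead from above:

```javascript
// The text is:  var foo = "hello \"world\"";
const text = 'var foo = "hello \\"world\\"";';

// Words that the lookahead considers to be outside the quotes.
const wordsOutside = /\w+(?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)/g;

const found = text.match(wordsOutside);
console.log(found);
// [ 'var', 'foo', 'world' ]
```

hello is correctly skipped, but world is reported because the two escaped quotes are counted like ordinary ones.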

I'll have to ponder this and play with regexpal some more.

Re: Match everything that's not between quotation marks (")

Posted: Tue Oct 14, 2008 4:07 pm
by prometheuzz
Assuming the escaped quotes always come in pairs, then this might work for you:

Code: Select all

$text = '"ignore" A A "string" B "ignore" C \"C\" C \"C\" CCC "ignore" DDD';
$regex = '/(?:(?:[^"]*(?<=\\\\)"[^"]*(?<=\\\\)"[^"]*)+|[^"]+)(?=(?:[^"]*"[^"]*"[^"]*)*$)/';
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
Output:

Code: Select all

Array
(
    [0] => Array
        (
            [0] =>  A A 
            [1] =>  B 
            [2] =>  C \"C\" C \"C\" CCC 
            [3] =>  DDD
        )
)

Re: Match everything that's not between quotation marks (")

Posted: Tue Oct 14, 2008 4:40 pm
by JellyFish
Actually, the escaped quotes don't always come in pairs, rather they come variably—odd, even or none at all.

Also, I'd prefer to be able to do this in JavaScript, and I don't think (?<=) is supported by JavaScript's regex parser. If I need to use (?<=) I could write my regexes in PHP, but I'd really rather have this done on the client side because of bandwidth issues.

Code: Select all

[^"](?=(?:[^"]*"[color=#FF0000][u][^"]*[/u][/color]"[^"]*)*$)
The red part states what can go in between the quotes (in this case, anything but a quotation mark, any number of times). I'm thinking we could change this part to say something like: a backslash followed by a quotation mark, and/or anything that's not a quotation mark, any number of times, in any order. I think this would solve it.

Re: Match everything that's not between quotation marks (")

Posted: Wed Oct 15, 2008 7:14 am
by prometheuzz
I don't think this can be done without negative look behind. So, if JavaScript does not support that, then you're going to have to do it using PHP (or some other language/tool that does support it).

Here's a PHP way:

Code: Select all

if(preg_match_all(
    '/(?:\\\\"|[^"])+(?=(?:(?<!\\\\)"(?:\\\\"|[^"])*(?<!\\\\)"(?:\\\\"|[^"])*)*$)/',
    'A\"A "ignore" B "ignore" \"C "ignore" D\"D\"D "ignore" EEE',
    $result)) {
 
  print_r($result); // print_r() prints by itself; echoing its return value would append a stray "1"
}
 
/* output:
Array
(
    [0] => Array
        (
            [0] => A\"A 
            [1] =>  B 
            [2] =>  \"C 
            [3] =>  D\"D\"D 
            [4] =>  EEE
        )
 
)
*/

Re: Match everything that's not between quotation marks (")

Posted: Wed Oct 15, 2008 8:12 am
by prometheuzz
Wait, here's a way without negative look behind:

Code: Select all

'/(\\\\"|[^"])+(?=("(\\\\"|[^"])*((?!\\\\).)"(\\\\"|[^"])*)*$)/'
It works in PHP (it produces the same output as in my previous post).

Re: Match everything that's not between quotation marks (")

Posted: Thu Oct 16, 2008 3:48 am
by GeertDD
@prometheuzz: I think your regex would fail on strings that have an escaped backslash in front of an unescaped double quote, following?

Friedl's book has an absolutely excellent part about allowing escaped quotes in quoted strings. He explains it very clearly and points at common pitfalls. See p.196 and p.222 in the 3rd edition.
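For the escaped-backslash case, a different technique from the lookahead regexes in this thread may be easier to get right: scan from the start and let the regex consume either a whole quoted string or a run of outside text, treating a backslash plus the following character as one unit in both. The sketch below is an assumption-laden illustration, not the pattern from the book; outsideQuotes is a hypothetical helper name, and it needs no lookbehind, so it should also run in JavaScript.

```javascript
// Hypothetical helper: returns the pieces of `text` that lie outside "..." strings.
function outsideQuotes(text) {
  // Alternative 1: a complete quoted string; a backslash escapes the next character.
  // Alternative 2 (captured): a run of text outside quotes, with the same escape rule.
  const token = /"(?:\\[\s\S]|[^"\\])*"|((?:\\[\s\S]|[^"\\])+)/g;
  const pieces = [];
  for (const m of text.matchAll(token)) {
    if (m[1] !== undefined) pieces.push(m[1]); // keep only the outside runs
  }
  return pieces;
}

// The text is:  var foo = "hello \"world\"";
console.log(outsideQuotes('var foo = "hello \\"world\\"";'));
// [ 'var foo = ', ';' ]

// The text is:  "a\\" b   (an escaped backslash right before the closing quote)
console.log(outsideQuotes('"a\\\\" b'));
// [ ' b' ]
```

Because every quoted string is consumed as a whole, the scan never starts in the middle of an escape sequence, which is where counting quotes with a lookahead tends to go wrong. A stray, unterminated quote is simply skipped.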

Re: Match everything that's not between quotation marks (")

Posted: Thu Oct 16, 2008 4:06 am
by prometheuzz
GeertDD wrote:@prometheuzz: I think your regex would fail on strings that have an escaped backslash in front of an unescaped double quote, following?
Yes, you're right. But the OP didn't say anything about needing to escape backslashes. Naturally, that would be something s/he would have mentioned in his/her original post!
;)
GeertDD wrote:Friedl's book has an absolutely excellent part about allowing escaped quotes in quoted strings. He explains it very clearly and points at common pitfalls. See p.196 and p.222 in the 3rd edition.
Can't remember that part exactly. I will definitely re-read it this evening when I get home.
Thanks.