Match everything that's not between quotation marks (")

JellyFish
DevNet Resident
Posts: 1361
Joined: Tue Feb 14, 2006 7:18 pm
Location: San Diego, CA

Match everything that's not between quotation marks (")

Post by JellyFish »

I have the string:

Code: Select all

 
This isn't a string "this is a string" This also isn't a string "More string" This isn't a string
 
I'd like to be able to select everything that's not within quotation marks (the bracketed parts below):

Code: Select all

 
[This isn't a string ]"this is a string"[ This also isn't a string ]"More string"[ This isn't a string]
 
with a regular expression. I haven't been able to figure out how to do this yet, even after googling the problem. I have, however, figured out the opposite of the match I'm looking for:

Code: Select all

/"[^"]*"/
matches (the bracketed parts):

Code: Select all

 
This isn't a string ["this is a string"] This also isn't a string ["More string"] This isn't a string
 
If I knew how to invert regular expressions my issue would have been solved already.
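
One way to get the inverse without writing a new regex, sketched here in JavaScript (the choice of language and the variable names are assumptions, not part of the post): split the text on the pattern that matches the quoted parts, and whatever is left over is the text outside the quotes.

```javascript
// Sample text from the post above.
const text = `This isn't a string "this is a string" This also isn't a string "More string" This isn't a string`;

// The "opposite" pattern from the post: one double-quoted run.
const quoted = /"[^"]*"/;

// Splitting on the quoted runs leaves only the text between them.
const outside = text.split(quoted);

console.log(outside);
```

If the text starts or ends with a quoted part, split() returns an empty string at that position, which can be dropped with filter(Boolean).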

This is where I've been testing my regexes: http://regexpal.com/

Also, this regex selects everything in and around the quotes:

Code: Select all

[^"]+

It matches each bracketed run separately:

Code: Select all

[This isn't a string ]"[this is a string]"[ This also isn't a string ]"[More string]"[ This isn't a string]
Maybe this could be useful.

I hope I made myself as clear as I could, and have helped you understand my problem.

Thanks.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Match everything that's not between quotation marks (")

Post by prometheuzz »

Try this:

Code: Select all

$text = 'This isn\'t a string "this is a string" This also isn\'t a string "More string" This isn\'t a string';
$regex = '/[^"]+(?=(?:[^"]*"[^"]*"[^"]*)*$)/';
echo "$text\n";
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
The regex translated to English would (more or less) say: "match one or more non-quote characters only if there are zero or an even number of quotes to be seen when looking ahead (to the end of the string!)".
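
The same idea carried over to JavaScript, as a rough sketch (the variable names are made up for illustration). The pattern itself is unchanged; only the PHP delimiters and preg_match_all are swapped for a regex literal with the g flag and String.prototype.match.

```javascript
const sample = `This isn't a string "this is a string" This also isn't a string "More string" This isn't a string`;

// [^"]+                     one or more non-quote characters ...
// (?= ... $)                ... but only if the rest of the input, up to its end,
// (?:[^"]*"[^"]*"[^"]*)*    is made of zero or more chunks holding exactly two quotes each
const outsideQuotes = /[^"]+(?=(?:[^"]*"[^"]*"[^"]*)*$)/g;

// With the g flag, match() returns every match as a plain array of strings.
const matches = sample.match(outsideQuotes);

console.log(matches);
```

On the sample text this should give the three unquoted runs, including their leading and trailing spaces.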

HTH

Edit: the forum software keeps "eating" the backslash from the backslash-single-quote occurrences in strings...

Re: Match everything that's not between quotation marks (")

Post by JellyFish »

Thank you for your reply, prometheuzz. I've modified your regex a bit so that I can place sub-regexes inside of it and match anything (rather than everything) that's not in between quotes.

Code: Select all

 
this part here could be any regex(?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)
 
I just added a little |[^"] so now the expression says: "match the string 'this part here could be any regex' only if there are zero or an even number of quotes to be seen when looking ahead (to the end of the string)". Without that addition, a sub-regex that doesn't run all the way to the end of the string wouldn't match when there were no quotes after it. For example, with the string:

Code: Select all

"string" not string"string" not string

Code: Select all

[^"](?=(?:[^"]*"[^"]*"[^"]*)*$)
would match the first not string part and the last g of the whole string.

Code: Select all

[^"](?=(?:[^"]*"[^"]*"[^"]*|[^"])*$)
would match the first and last not string part
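
A small JavaScript sketch of the "any sub-regex outside quotes" idea. The helper name outsideQuotes is hypothetical, and the lookahead is a rearranged form, (?:[^"]*"[^"]*")*[^"]*$, which is meant to accept the same remainders as the |[^"] variant (an even number of quotes, possibly none) while leaving the engine fewer ways to backtrack. Treat it as an assumption to test rather than a drop-in replacement.

```javascript
// Hypothetical helper: wraps a sub-pattern so that it only matches when an
// even number of quotes remains between the end of the match and the end of input.
function outsideQuotes(subPattern) {
  // (?:[^"]*"[^"]*")*  consumes the remaining quotes two at a time,
  // [^"]*$             then allows trailing text that contains no quotes.
  return new RegExp(`(?:${subPattern})(?=(?:[^"]*"[^"]*")*[^"]*$)`, 'g');
}

const input = '"string" not string"string" not string';

// Start positions of the word "string" where it occurs outside quotes.
const hits = [...input.matchAll(outsideQuotes('string'))].map(m => m.index);

// Every single non-quote character outside quotes, glued back together.
const loose = input.match(outsideQuotes('[^"]')).join('');

console.log(hits, JSON.stringify(loose));
```

Here hits should contain the positions of the two unquoted occurrences, and loose should contain both " not string" parts.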


But there is one last issue that didn't become apparent at first. If I had the string:

Code: Select all

var foo = "hello \"world\"";
the quotes after the \ would be treated as normal quotes, so world would be matched. The regex should ignore quotes that have a \ before them. There's the "followed by" expression, (?=), but I don't think there's a "preceded by" expression.

I'll have to ponder this and play with regexpal some more.
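
For reference, a "preceded by" test does exist under the name lookbehind, written (?<=...) and, negated, (?<!...). JavaScript engines that implement ES2018 or later accept it, while older engines throw a SyntaxError on the pattern. A minimal sketch, assuming such an engine (variable names are illustrative):

```javascript
// String.raw keeps the backslashes exactly as typed.
const source = String.raw`var foo = "hello \"world\"";`;

// A double quote that is NOT immediately preceded by a backslash.
const unescapedQuote = /(?<!\\)"/g;

// Collect the index of every unescaped quote.
const quotePositions = [...source.matchAll(unescapedQuote)].map(m => m.index);

console.log(quotePositions);
```

Only the opening and closing quotes of the string literal should be reported. Note that a one-character lookbehind like this does not account for a backslash that is itself escaped.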

Re: Match everything that's not between quotation marks (")

Post by prometheuzz »

Assuming the escaped quotes always come in pairs, then this might work for you:

Code: Select all

$text = '"ignore" A A "string" B "ignore" C \"C\" C \"C\" CCC "ignore" DDD';
$regex = '/(?:(?:[^"]*(?<=\\\\)"[^"]*(?<=\\\\)"[^"]*)+|[^"]+)(?=(?:[^"]*"[^"]*"[^"]*)*$)/';
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
Output:

Code: Select all

Array
(
    [0] => Array
        (
            [0] =>  A A 
            [1] =>  B 
            [2] =>  C \"C\" C \"C\" CCC 
            [3] =>  DDD
        )
)

Re: Match everything that's not between quotation marks (")

Post by JellyFish »

Actually, the escaped quotes don't always come in pairs; there can be an odd number, an even number, or none at all.

Also, I'd prefer to do this in JavaScript, and I don't think JavaScript's regex parser supports (?<=). If I need (?<=) I could write my regexes in PHP, but I'd really rather do this on the client side because of bandwidth issues.

Code: Select all

[^"](?=(?:[^"]*"[^"]*"[^"]*)*$)
Since the second [^"]* (the one between the two literal quotes) states what can go in between the quotes (in this case anything but a quotation mark, any number of times), I'm thinking we could change that part to say something like: a backslash followed by a quotation mark, and/or anything that's not a quotation mark, any number of times, in any order. I picture this solving it.
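
One way that idea could look in JavaScript without any lookbehind, as a sketch only: instead of asserting things about the rest of the input, walk through it token by token, matching either a whole quoted string or a run of outside text, and keep only the outside runs. The names token and textOutsideQuotes are made up for this example.

```javascript
// Alternative 1 (not captured): a complete quoted string.
// Alternative 2 (group 1): a run of text outside quotes.
// \\[\s\S] consumes a backslash together with the character after it,
// so an escaped quote never opens or closes a string.
const token = /"(?:\\[\s\S]|[^"\\])*"|((?:\\[\s\S]|[^"\\])+)/g;

function textOutsideQuotes(str) {
  const parts = [];
  for (const m of str.matchAll(token)) {
    if (m[1] !== undefined) parts.push(m[1]); // keep only the outside runs
  }
  return parts;
}

// String.raw keeps the backslashes exactly as typed.
console.log(textOutsideQuotes(String.raw`var foo = "hello \"world\"";`));
console.log(textOutsideQuotes(String.raw`A\"A "ignore" B "ignore" \"C "ignore" D\"D\"D "ignore" EEE`));
```

An opening quote that is never closed is simply skipped by this sketch, and the text after it is then treated as outside text, so real input may need stricter handling.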

Re: Match everything that's not between quotation marks (")

Post by prometheuzz »

I don't think this can be done without negative look behind. So, if JavaScript does not support that, then you're going to have to do it using PHP (or some other language/tool that does support it).

Here's a PHP way:

Code: Select all

if(preg_match_all(
    '/(?:\\\\"|[^"])+(?=(?:(?<!\\\\)"(?:\\\\"|[^"])*(?<!\\\\)"(?:\\\\"|[^"])*)*$)/',
    'A\"A "ignore" B "ignore" \"C "ignore" D\"D\"D "ignore" EEE',
    $result)) {
 
  print_r($result); // print_r() prints by itself; echoing its return value would append a stray "1"
}
 
/* output:
Array
(
    [0] => Array
        (
            [0] => A\"A 
            [1] =>  B 
            [2] =>  \"C 
            [3] =>  D\"D\"D 
            [4] =>  EEE
        )
 
)
*/

Re: Match everything that's not between quotation marks (")

Post by prometheuzz »

Wait, here's a way without negative look behind:

Code: Select all

'/(\\\\"|[^"])+(?=("(\\\\"|[^"])*((?!\\\\).)"(\\\\"|[^"])*)*$)/'
It works in PHP (it produces the same output as in my previous post).
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Match everything that's not between quotation marks (")

Post by GeertDD »

@prometheuzz: I think your regex would fail on strings that have an escaped backslash in front of an unescaped double quote, following?

Friedl's book has an absolutely excellent part about allowing escaped quotes in quoted strings. He explains it very clearly and points at common pitfalls. See p.196 and p.222 in the 3rd edition.
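
To make the pitfall concrete, here is a small JavaScript sketch (not quoted from the book; the variable names and the sample input are made up). It compares a pattern that treats every backslash-quote as an escaped quote with one that always consumes a backslash together with the character after it, so an escaped backslash is used up before the following quote is examined.

```javascript
// Input as the regex sees it: "C:\\" outside "x"
const code = String.raw`"C:\\" outside "x"`;

// Treats every \" as an escaped quote, even when that backslash is itself escaped.
const naiveString = /"(?:\\"|[^"])*"/g;

// Consumes backslash + next character as one unit, so \\ is a finished pair
// by the time the closing quote is reached.
const pairedString = /"(?:\\[\s\S]|[^"\\])*"/g;

const naive = code.match(naiveString);
const paired = code.match(pairedString);

console.log(naive, paired);
```

The naive pattern runs past the real closing quote and swallows the word outside, while the paired pattern finds the two separate strings.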

Re: Match everything that's not between quotation marks (")

Post by prometheuzz »

GeertDD wrote:@prometheuzz: I think your regex would fail on strings that have an escaped backslash in front of an unescaped double quote, following?
Yes, you're right. But the OP didn't say anything about needing to escape backslashes. Naturally, that would be something s/he would have mentioned in his/her original post!
;)
GeertDD wrote:Friedl's book has an absolutely excellent part about allowing escaped quotes in quoted strings. He explains it very clearly and points at common pitfalls. See p.196 and p.222 in the 3rd edition.
Can't remember that part exactly. I will definitely re-read it this evening when I get home.
Thanks.