Matching whitespace within quotes

Posted: Wed Feb 11, 2009 12:19 am
by tristanlee85
Here's what's up. I am trying to match any white space within any single or double quotes. Take this string for example:

Code: Select all

the quick "brown fox" jumped
My function is going to replace the white spaces within the quotes with a hyphen. So in the end it will look like:

Code: Select all

the quick "brown-fox" jumped
All I need to know is how to match those spaces within the quotes. I've worked it out as best as I can understand, and I've come up with this:

Code: Select all

"[^"]+"|[\s]+
and that gives me:

Code: Select all

the-quick---jumped

Re: Matching whitespace within quotes

Posted: Wed Feb 11, 2009 2:26 am
by prometheuzz
It's a bit of a tricky one:

Code: Select all

$text = "The quick \"brown fox\" jumped 'over something' and \"another brown fox\" jumped.";
echo "{$text}\n";
echo preg_replace("/\s+(?!([^'\"]*['\"][^'\"]*['\"])*[^'\"]*$)/", '-', $text);
Words of caution:
- it will go wrong if you have "unbalanced" quotes in your text (like this string: this "is a single quote: ' ok?" and that's it);
- performance may degrade on larger strings, because the look-ahead scans to the end of the string for every run of white space (test this properly if your input is large!).
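
To make the first caution concrete, here is a minimal sketch that runs the regex on a balanced string and on the unbalanced one quoted above. It also includes a possible alternative that is not the regex from this thread: match each complete quoted segment and hyphenate inside it with preg_replace_callback. The helper name hyphenate_quoted and the variable names are made up for illustration.

```php
<?php
// Look-ahead regex from the post above, in a nowdoc so no quote needs escaping.
$lookahead = <<<'REGEX'
/\s+(?!([^'"]*['"][^'"]*['"])*[^'"]*$)/
REGEX;

// Hypothetical helper (not from the thread): hyphenate white space only
// inside complete "..." or '...' segments.
function hyphenate_quoted(string $text): string
{
    return preg_replace_callback(
        '/"[^"]*"|\'[^\']*\'/',
        function (array $m): string {
            // $m[0] is one whole quoted segment, quotes included.
            return preg_replace('/\s+/', '-', $m[0]);
        },
        $text
    );
}

$balanced   = 'the quick "brown fox" jumped';
$unbalanced = 'this "is a single quote: \' ok?" and that\'s it';

echo preg_replace($lookahead, '-', $balanced), "\n";
// the quick "brown-fox" jumped
echo preg_replace($lookahead, '-', $unbalanced), "\n";
// this "is-a-single-quote:-' ok?"-and-that's it
echo hyphenate_quoted($unbalanced), "\n";
// this "is-a-single-quote:-'-ok?" and that's it
```

The callback version has its own blind spot: an apostrophe that appears before a real quoted segment (as in: that's "brown fox") can still be taken as the start of a single-quoted segment, so test it against your own input as well.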

If, after glancing at that regex, you don't exactly know how (or why) it works, feel free to post back and I'll give you an explanation.

Good luck!

Re: Matching whitespace within quotes

Posted: Wed Feb 11, 2009 11:59 am
by tristanlee85
Thank you for the reply! I wasn't going to go as far as looking into balancing the quotes. I mean, Google doesn't automatically split my words when I type without a space.

This works just like I was hoping. I'll take you up on your offer for explaining it if you would like. I've read tutorial after tutorial and RegEx is something I can't understand.

Re: Matching whitespace within quotes

Posted: Thu Feb 12, 2009 3:15 am
by prometheuzz
tristanlee85 wrote:Thank you for the reply! I wasn't going to go as far as looking into balancing the quotes. I mean, Google doesn't automatically split my words when I type without a space.

This works just like I was hoping. I'll take you up on your offer for explaining it if you would like. I've read tutorial after tutorial and RegEx is something I can't understand.
Hmmm, if you're not really comfortable with regex-es, I don't know if you're going
to grasp this fully. But I'll give it a shot:

So, this is the regex:

Code: Select all

"/\s+(?!([^'\"]*['\"][^'\"]*['\"])*[^'\"]*$)/"
 
In plain English, it would read like this:

Match one or more successive white space characters, but only if they are NOT
followed by an even number of quotes (single or double; zero counts as even)
between them and the end of the string. An odd number of quotes ahead means
the white space sits inside a quoted part.


A couple of basics:

Code: Select all

\s      // Matches a single white space character
 
X+      // One or more 'X'-s 
 
X*      // Zero or more 'X'-s
 
[XY]    // Matches either 'X' or 'Y'
 
[^XY]   // Matches any character except 'X' and 'Y'
 
X(?!Y)  // Match the character 'X' only if it isn't immediately followed by a
        // 'Y' (so, it matches the 'X' in 'XQ' and 'XC' etc. but not the one
        // in 'XY'). This is called: 'negative look-ahead'.
 
$       // Meta character for the 'end of the string'
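
The negative look-ahead is the least obvious of these basics, so here is a small sketch of it in isolation. The pattern and the test strings are made up for illustration and are not part of the regex above.

```php
<?php
// X(?!Y): an 'X' that is not immediately followed by a 'Y'.
$pattern = '/X(?!Y)/';

var_dump(preg_match($pattern, 'XQ')); // int(1): the 'X' is followed by 'Q'
var_dump(preg_match($pattern, 'XY')); // int(0): the only 'X' is followed by 'Y'
var_dump(preg_match($pattern, 'X'));  // int(1): nothing follows, so no 'Y' either

// The look-ahead consumes nothing: only the 'X' itself is part of the match.
preg_match($pattern, 'XQ', $m);
var_dump($m[0]); // string(1) "X"
```
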
With the basics above, you can piece together my original regex. Here's
what it does:

Code: Select all

\s+            // Match one or more white space characters ...
 
(?!            // start negative look-ahead
 
  (            //   open group 1
 
    [^'\"]*    //     zero or more characters of any type except single or double quotes
 
    ['\"]      //     one single or double quote
 
    [^'\"]*    //     zero or more characters of any type except single or double quotes
 
    ['\"]      //     one single or double quote
 
  )            //   close group 1
 
  *            //   group 1 can occur zero or more times (in other words, quotes 
               //   can only occur 0, 2, 4, 6, ... times, i.e. an even number of times)
 
  [^'\"]*      //   zero or more characters of any type except single or double quotes  
 
  $            // the end of the string
 
)              // stop negative look-ahead
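
If you want to keep a breakdown like this next to the real code, PCRE's x modifier lets you write the pattern with white space and # comments inside it. The following is only a sketch of the same regex in that form; the nowdoc and the variable names are my own choice, and the comments must not contain the / delimiter.

```php
<?php
// Same regex, written with the x (extended) modifier. The nowdoc keeps the
// quotes and the $ literal, so nothing needs escaping for PHP.
$pattern = <<<'REGEX'
/
\s+               # one or more white space characters ...
(?!               # ... NOT followed by:
  (               #   group 1:
    [^'"]* ['"]   #     any non-quotes, then one quote
    [^'"]* ['"]   #     any non-quotes, then a second quote
  )*              #   zero or more times, so an even number of quotes
  [^'"]*          #   then only non-quotes
  $               #   up to the end of the string
)
/x
REGEX;

$text = 'The quick "brown fox" jumped \'over something\' and "another brown fox" jumped.';

echo preg_replace($pattern, '-', $text), "\n";
// The quick "brown-fox" jumped 'over-something' and "another-brown-fox" jumped.
```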
 
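The key lies in the fact that the end-of-string meta-character is anchored inside
the look-ahead: it forces the look-ahead to account for every quote between the
white space and the end of the string. Remove it and every part of the look-ahead
can match zero characters, so the look-ahead pattern always matches, the negative
look-ahead always fails, and the regex no longer matches any white space at all.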
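
A quick way to check what the anchor contributes is to run the regex with and without it. In this sketch (variable names are made up) the variant without the $ leaves the string untouched: every part of its look-ahead can match zero characters, so the negative look-ahead rejects every position.

```php
<?php
$text = 'the quick "brown fox" jumped';

// Original regex, with the end-of-string anchor inside the look-ahead.
$withAnchor = <<<'REGEX'
/\s+(?!([^'"]*['"][^'"]*['"])*[^'"]*$)/
REGEX;

// Same regex with the $ removed.
$withoutAnchor = <<<'REGEX'
/\s+(?!([^'"]*['"][^'"]*['"])*[^'"]*)/
REGEX;

echo preg_replace($withAnchor, '-', $text), "\n";    // the quick "brown-fox" jumped
echo preg_replace($withoutAnchor, '-', $text), "\n"; // the quick "brown fox" jumped
```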

But again: it's a tricky regex, so don't feel too bad if you don't fully grasp it (yet).

Good luck!