Matching whitespace within quotes

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

tristanlee85
Forum Contributor
Posts: 172
Joined: Fri Dec 19, 2003 7:28 am

Matching whitespace within quotes

Post by tristanlee85 »

Here's what's up. I am trying to match any white space within any single or double quotes. Take this string for example:

Code: Select all

the quick "brown fox" jumped
My function is going to replace the white spaces within the quotes with a hyphen. So in the end it will look like:

Code: Select all

the quick "brown-fox" jumped
All I need to know is how to match those spaces within the quotes. I've worked it out as best as I can understand and I've come up with this:

Code: Select all

"[^"]+"|[\s]+
and that gives me:

Code: Select all

the-quick---jumped
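
For anyone who wants to reproduce that result, here is a minimal PHP sketch (the variable names are only illustrative, they are not from the post). The alternation matches each quoted chunk as a whole and also each run of white space outside the quotes, and every match becomes one hyphen, which seems to be where the three hyphens in a row come from.

```php
<?php
// Hypothetical reproduction of the attempt above; $text and $attempt are illustrative names.
$text = 'the quick "brown fox" jumped';

// Either a whole double-quoted chunk, or a run of white space.
$attempt = '/"[^"]+"|[\s]+/';

// List what the pattern actually matches.
preg_match_all($attempt, $text, $matches);
print_r($matches[0]);

// Every match, including the whole quoted chunk, is replaced by a single hyphen.
$result = preg_replace($attempt, '-', $text);
echo $result, "\n";
```

The quoted chunk is consumed as one match, so the space inside it is never seen on its own.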
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching whitespace within quotes

Post by prometheuzz »

It's a bit of a tricky one:

Code: Select all

$text = "The quick \"brown fox\" jumped 'over something' and \"another brown fox\" jumped.";
echo "{$text}\n";
echo preg_replace("/\s+(?!([^'\"]*['\"][^'\"]*['\"])*[^'\"]*$)/", '-', $text);
Words of caution:
- it will go wrong if you have "unbalanced" quotes in your text (like this string: this "is a single quote: ' ok?" and that's it);
- performance may decrease on larger strings (properly test this if this is the case!).
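
To make the first caution concrete, here is a small sketch comparing a balanced string with one that has a stray apostrophe after the quoted part. The second input is a made-up example, and the pattern is the same one as above, only written in a single-quoted PHP string.

```php
<?php
// Same pattern as in the post above, written in a single-quoted PHP string.
$pattern = '/\s+(?!([^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)/';

$balanced   = 'the quick "brown fox" jumped';
$unbalanced = 'a "brown fox" that\'s here';   // hypothetical input with a stray apostrophe

$okResult  = preg_replace($pattern, '-', $balanced);
$badResult = preg_replace($pattern, '-', $unbalanced);

echo $okResult, "\n";   // the quick "brown-fox" jumped
echo $badResult, "\n";  // a-"brown fox"-that's here
```

The apostrophe in "that's" counts as a quote, so every space to its left sees one quote more than it should and the result is inverted.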

If, after glancing at that regex, you don't exactly know how (or why) it works, feel free to post back and I'll give you an explanation.

Good luck!
tristanlee85
Forum Contributor
Posts: 172
Joined: Fri Dec 19, 2003 7:28 am

Re: Matching whitespace within quotes

Post by tristanlee85 »

Thank you for the reply! I wasn't going to go as far as to look into balancing the quotes. I mean, Google doesn't automatically split my words when I type without a space.

This works just like I was hoping. I'll take you up on your offer for explaining it if you would like. I've read tutorial after tutorial and RegEx is something I can't understand.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching whitespace within quotes

Post by prometheuzz »

Hmmm, if you're not really comfortable with regex-es, I don't know if you're going
to grasp this fully. But I'll give it a shot:

So, this is the regex:

Code: Select all

"/\s+(?!([^'\"]*['\"][^'\"]*['\"])*[^'\"]*$)/"
 
In plain English, it would read like this:

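Match one or more successive white space characters, but only if they are NOT
followed by an even number of quotes (single or double, counted together) between
them and the end of the string. White space inside a quoted part always has an odd
number of quotes after it, provided the quotes are balanced.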
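
The same rule can be spelled out without a look-ahead, which may make the "count the quotes to the right" idea easier to see. This is only a sketch of the idea, not a replacement for the regex; the function name `hyphenateInsideQuotes` is made up for this example.

```php
<?php
// Hypothetical helper: applies the "odd number of quotes ahead" rule by hand.
function hyphenateInsideQuotes(string $text): string
{
    // Find every run of white space together with its byte offset.
    preg_match_all('/\s+/', $text, $found, PREG_OFFSET_CAPTURE);

    $result = '';
    $cursor = 0;
    foreach ($found[0] as [$run, $offset]) {
        // Copy the text between the previous run and this one unchanged.
        $result .= substr($text, $cursor, $offset - $cursor);

        // Count the quote characters between this run and the end of the string.
        $rest = (string) substr($text, $offset + strlen($run));
        $quotesAhead = substr_count($rest, '"') + substr_count($rest, "'");

        // Odd count: a quote is still open here, so the run sits inside quotes.
        $result .= ($quotesAhead % 2 === 1) ? '-' : $run;

        $cursor = $offset + strlen($run);
    }

    // Copy whatever follows the last run of white space.
    return $result . substr($text, $cursor);
}

echo hyphenateInsideQuotes('the quick "brown fox" jumped'), "\n";
```

Like the regex, this treats single and double quotes as the same thing and relies on the quotes being balanced.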


A couple of basics:

Code: Select all

\s      // Matches a single white space character
 
X+      // One or more 'X'-s 
 
X*      // Zero or more 'X'-s
 
[XY]    // Matches either 'X' or 'Y'
 
[^XY]   // Matches any character except 'X' and 'Y'
 
X(?!Y)  // Match the character 'X' only if it is not directly followed by a 'Y'
        // (so it matches the 'X' in 'XQ' and 'XC' etc. but not the one in 'XY').
        // This is called a 'negative look-ahead'.
 
$       // Meta character for the 'end of the string'
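
If you want to try these building blocks one at a time, a quick sketch like the following may help. The test strings and the `$results` array are just illustrative; `preg_match` returns 1 when the pattern matches and 0 when it does not.

```php
<?php
// Each entry pairs one construct from the list above with a tiny test string.
$results = [
    '\s    on "a b"' => preg_match('/\s/', 'a b'),       // 1: contains white space
    'X+    on "XXX"' => preg_match('/^X+$/', 'XXX'),     // 1: one or more X
    'X+    on ""'    => preg_match('/^X+$/', ''),        // 0: + needs at least one X
    'X*    on ""'    => preg_match('/^X*$/', ''),        // 1: * also accepts zero X
    '[XY]  on "Y"'   => preg_match('/^[XY]$/', 'Y'),     // 1: Y is in the set
    '[^XY] on "Q"'   => preg_match('/^[^XY]$/', 'Q'),    // 1: Q is not X or Y
    '[^XY] on "X"'   => preg_match('/^[^XY]$/', 'X'),    // 0: X is excluded
    'X(?!Y) on "XQ"' => preg_match('/X(?!Y)/', 'XQ'),    // 1: X is not followed by Y
    'X(?!Y) on "XY"' => preg_match('/X(?!Y)/', 'XY'),    // 0: X is followed by Y
    'fox$  on "brown fox"' => preg_match('/fox$/', 'brown fox'), // 1: fox is at the end
];

foreach ($results as $label => $matched) {
    echo str_pad($label, 24), $matched ? 'match' : 'no match', "\n";
}
```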
With the explanation above, you can piece together my original regex. Here's
what it does:

Code: Select all

\s+            // Match one or more white space characters ...
 
(?!            // start negative look-ahead
 
  (            //   open group 1
 
    [^'\"]*    //     zero or more characters of any type except single or double quotes
 
    ['\"]      //     one single or double quote
 
    [^'\"]*    //     zero or more characters of any type except single or double quotes
 
    ['\"]      //     one single or double quote
 
  )            //   close group 1
 
  *            //   group 1 can occur zero or more times (in other words, quotes 
               //   can only occur 0, 2, 4, 6, ... times, i.e. an even number of times)
 
  [^'\"]*      //   zero or more characters of any type except single or double quotes  
 
  $            // the end of the string
 
)              // stop negative look-ahead
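
As a side note, PCRE's extended mode (the `x` modifier) lets you keep a breakdown like this inside the pattern itself. The sketch below is one possible way to write it, not the only one: it assumes a nowdoc string, `~` as the delimiter (so the comments cannot clash with it), and a non-capturing group `(?:...)` in place of group 1, which does not change what is matched.

```php
<?php
// Extended-mode version of the pattern: white space in the pattern is ignored
// and everything after a # up to the end of the line is a comment.
$pattern = <<<'REGEX'
~
\s+                 # one or more white space characters ...
(?!                 # ... NOT followed by:
  (?:               #   a group of
    [^'"]* ['"]     #     any non-quote characters, then one quote
    [^'"]* ['"]     #     any non-quote characters, then another quote
  )*                #   repeated zero or more times (an even number of quotes)
  [^'"]*            #   any remaining non-quote characters
  $                 #   the end of the string
)
~x
REGEX;

$text = "The quick \"brown fox\" jumped 'over something' and \"another brown fox\" jumped.";
$replaced = preg_replace($pattern, '-', $text);
echo $replaced, "\n";
```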
 
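The key lies in the fact that the end-of-string meta-character sits inside the
look-ahead: it forces the look-ahead to account for every quote between the white
space and the end of the string, so the count really is "even" or "odd". Removing it
lets the contents of the look-ahead match the empty string, which means the negative
look-ahead always fails and the regex no longer matches any white space at all.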
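
One way to check what the `$` contributes is to run the pattern with and without it. In this sketch (the variable names are illustrative), the variant without the anchor leaves the string untouched: the contents of the look-ahead can then match the empty string, so the negative look-ahead never succeeds.

```php
<?php
$text = 'the quick "brown fox" jumped';

// The pattern from this thread, and the same pattern with the $ removed.
$withAnchor    = '/\s+(?!([^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)/';
$withoutAnchor = '/\s+(?!([^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*)/';

$anchored   = preg_replace($withAnchor, '-', $text);
$unanchored = preg_replace($withoutAnchor, '-', $text);

echo $anchored, "\n";    // the quick "brown-fox" jumped
echo $unanchored, "\n";  // the quick "brown fox" jumped (nothing replaced)
```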

But again: it's a tricky regex, so don't feel too bad if you don't fully grasp it (yet).

Good luck!