Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 10:02 am
by Glycerine
Hey,
I'm crap with regex - it's only due to prometheuzz I'm not fired!
I have this data:
Code:
this,is,some,[info,about,within,data,bracket],other
and I want it split to:
Code:
this
is
some
[info,about,within,data,bracket]
other
using PHP. I can match commas, but making sure they're not within square brackets is tough.
Thanks in advance.
P.S. Know any good books about regex? Finding one that doesn't boggle my mind is rough.
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 12:11 pm
by prometheuzz
Okay, let's turn this into a little regex training. To solve this, your regex should do the following:
Match a comma, but only if it doesn't have a ']' in front of it with zero or more characters other than a '[' in between that comma and the ']'.
First see if the above statement is really true for the commas you want to split on. If it is, try to translate that into a regex.
Feel free to post back when you run into problems of course!
Oh, and Mastering Regular Expressions by Jeffrey Friedl is an absolute must-have. It's well written, which makes it understandable for people (fairly) new to regex.
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 12:43 pm
by Glycerine
so - English / Regex pseudo =
Right - so working backwards right!?
conditional match: , IF not (positive lookahead '.+') ] THEN match comma ELSE don't match - then something to do with Greedy something..? K - I'm going to have a go.
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 12:59 pm
by prometheuzz
Glycerine wrote:so - English / Regex pseudo =
Right - so working backwards right!?
conditional match: , IF not (positive lookahead '.+') ] THEN match comma ELSE don't match - then something to do with Greedy something..? K - I'm going to have a go.
More or less, yes, but not quite (if I understand you correctly). Here's what's needed for my proposed regex:
Code:
Match a comma, but only if it doesn't have a ']' in front of it with zero or more characters other than a '[' in between that comma and the ']'

A - "Match a comma": just a regular match
B - "but only if it doesn't have ... the ']'" (the whole condition): negative look ahead
C - "a ']'": just a regular match
D - "zero or more": greedy operator
E - "characters other than a '['": the negated character class before the greedy operator from D
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 2:36 pm
by Glycerine
So an hour's worth of faffing about:
any good so far?
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 2:44 pm
by prometheuzz
Glycerine wrote:So an hour's worth of faffing about:
any good so far?
That is probably a rhetorical question. So, you already know the answer to that: no, not good.
Questions/remarks:
- what do you think (?: ... ) (ignore the three dots) does?
- inside a character class, you should escape the square brackets;
- where's the greedy operator I recommended you use?
You can also think of your problem the other way around, so instead of thinking
"which commas do I need to split on?" you can ask yourself
"which commas will I NOT split on?". To answer that last question, one could answer like this:
You ignore the commas that have zero or more characters (a) other than an opening bracket (b) directly followed by a closing bracket (c) in front of them (d)
And there you have your regex building blocks again (perhaps now in a different order):
a - greedy operator
b - negated character class
c - a simple character match
d - look ahead
And the word "ignore" applies to the form of "look ahead" mentioned in point 'd'.
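The "which commas do I ignore" view can be checked directly. The snippet below is only a sketch, not something posted in the thread: it assumes the flat, balanced sample string from the first post and uses a positive look-ahead to list the byte offsets of the commas that would be ignored.

```php
<?php
$text = 'this,is,some,[info,about,within,data,bracket],other';

// Positive look-ahead: a comma followed by zero or more characters
// other than '[' and then a ']'. These are the commas to ignore.
// PREG_OFFSET_CAPTURE turns each match into [matched text, byte offset].
preg_match_all('/,(?=[^\[]*\])/', $text, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the ignored commas.
$ignored = array_column($matches[0], 1);
print_r($ignored);
```

Swapping the positive look-ahead for a negative one selects the other commas instead, which is where the word "ignore" comes in.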
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 3:03 pm
by Glycerine
You sir speak entire Swahili to me! HA - and it wasn't a rhetorical question.
Anyhoo - the (?: ... ) thing is a conditional, isn't it? Or sort of 'match the expression' (the three dots)?
The character class - I thought I did escape, and by doing ' [^][] ' I thought I was saying "don't match one of the square brackets".
And a greedy operator is " .+ " or " .? ", isn't it?
--
So your question can be interpreted as
"Don't split on a comma if you're between ] or [" using that negative lookahead?
I'll have another go.
And this time I'm giving you something - I have a bunch of books lying about the office. What do you want? (knowing you know the answer and you're just dangling me by a string)
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 3:25 pm
by prometheuzz
Glycerine wrote:You sir speak entire Swahili to me! HA - and it wasn't a rhetorical question.
Anyhoo - the (?: ... ) thing is a conditional, isn't it? Or sort of 'match the expression' (the three dots)?
Okay, but no, the (?: ... ) is called a non-capturing group (details about it can be found here:
http://www.regular-expressions.info/named.html). You don't need it. What you do need is negative look ahead, check out this one:
http://www.regular-expressions.info/lookaround.html (which is what you probably meant by "conditional" I presume).
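To make the difference concrete, here is a small sketch (the 'foo'/'bar' strings are made up for illustration): a non-capturing group still consumes the text it matches, while a negative look-ahead only checks what follows and consumes nothing.

```php
<?php
// Non-capturing group: groups 'bar' without storing it as a capture,
// but 'bar' is still part of the overall match.
preg_match('/foo(?:bar)/', 'foobar', $m);
$grouped = $m[0];

// Negative look-ahead: 'foo' only matches when 'bar' does NOT follow.
$rejected = preg_match('/foo(?!bar)/', 'foobar');

// Here 'baz' follows, so the match succeeds and contains only 'foo'.
preg_match('/foo(?!bar)/', 'foobaz', $m);
$accepted = $m[0];

var_dump($grouped, $rejected, $accepted);
```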
Glycerine wrote:The character class - I thought I did escape, and by doing ' [^][] ' I thought I was saying "don't match one of the square brackets".
But how does a reader (or the regex engine) know which bracket closes your character class? If you want to include a square bracket in your character class, the unambiguous way is to escape it with a backslash.
More info:
http://www.regular-expressions.info/charclass.html
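A quick sketch of the escaping advice, using a made-up sample string. One caveat, stated as far as PCRE (the engine behind PHP's preg functions) is concerned: a ']' placed first in the class, right after the '^', is also taken literally, so the unescaped form happens to behave the same. Other regex flavours may differ, which is a good reason to escape anyway.

```php
<?php
$sample = 'a[b]c';

// Escaped brackets in a negated class: strip everything except '[' and ']'.
$escaped = preg_replace('/[^\[\]]/', '', $sample);

// Unescaped variant: in PCRE the ']' directly after '[^' is a literal.
$unescaped = preg_replace('/[^][]/', '', $sample);

var_dump($escaped, $unescaped);
```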
Glycerine wrote:And a greedy operator is " .+ " or " .? ", isn't it?
Only the +, not the ?.
More info:
http://www.regular-expressions.info/repeat.html
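As a rough illustration of what "greedy" means (the sample string is made up): + grabs as much as it can and only gives characters back if the rest of the pattern would otherwise fail. Putting a ? after the + makes it lazy, which is a different use of ? than "optional".

```php
<?php
$text = '[a],[b]';

// Greedy: .+ runs to the end and backtracks to the LAST ']'.
preg_match('/\[.+\]/', $text, $greedy);

// Lazy: .+? stops at the FIRST ']' it can reach.
preg_match('/\[.+?\]/', $text, $lazy);

var_dump($greedy[0], $lazy[0]);
```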
Glycerine wrote:So your question can be interpreted as "Don't split on a comma if you're between ] or [" using that negative lookahead?
No, don't split on a comma when you can look ahead from that comma to a ']' without encountering a '['.
Glycerine wrote:I'll have another go.
And this time I'm giving you something - I have a bunch of books lying about the office. What do you want? (knowing you know the answer and you're just dangling me by a string)
Teach a man to fish instead of giving him one, and all that...
Really, if you were a bit closer to the solution, I would have shed some more light on it, but you're now simply guessing. ; )
I thought it was a rhetorical question because the answer to it could have easily been gotten by running a little test. So try a couple of things before posting a follow up. You'll be amazed by how much you learn by trial and error!
I'm off to bed right now, but I'll look at this thread tomorrow.
Best of luck, grasshopper!
; )
Re: Match ',' of which is not inside ' [ ] '
Posted: Tue May 05, 2009 3:45 pm
by Glycerine
MASTER - DON'T LEAVE ME... I HATE FISH!!!
crap...
All right then - Guess I'll have to bribe another genius... or actually figure it out...
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 12:36 am
by prometheuzz
Glycerine wrote:... or actually figure it out...
Can you post what you have so far after reading the links from my previous reply?
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 3:48 am
by Glycerine
My go:
my Test:
Code:
this,is,some,[info,about,within,data,bracket],other
kinda nearly there...
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 4:26 am
by prometheuzz
Glycerine wrote:My go:
my Test:
Code:
this,is,some,[info,about,within,data,bracket],other
kinda nearly there...
Closer, indeed, but you're making it more difficult than it is. ; )
Have a look at this:
Code:
$text = 'this,is,some,[info,about,within,data,bracket],other';
print_r(preg_split('/,(?![^\[]*])/', $text));
As you can see, the closing bracket ']' is not a special character outside a character set, so you don't need to escape it.
And to complicate it a bit, in this case you don't need to escape the opening bracket inside the character set either, i.e. this:
/,(?![^[]*])/
will work as well. But I did escape it because I think it makes the regex more readable, since in most cases such square brackets need escaping. Sometimes escaping and sometimes not escaping is IMHO confusing, so I always escape the brackets inside a character set.
A (short) explanation:
Code:
,        // match a comma
(?!      // start negative look-ahead
[^\[]    // match a single character other than '['
*        // the previous character set, zero or more times
]        // match a ']'
)        // stop negative look-ahead
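For reuse, the split can be wrapped in a small function. This is only a sketch, assuming flat (non-nested), balanced brackets; the name splitOutsideBrackets is made up for illustration and does not come from the thread.

```php
<?php
// Hypothetical helper around the preg_split call shown above.
function splitOutsideBrackets($text)
{
    // Split on every comma that cannot reach a ']' without first
    // running into a '['.
    return preg_split('/,(?![^\[]*])/', $text);
}

// The sample from the first post.
$parts = splitOutsideBrackets('this,is,some,[info,about,within,data,bracket],other');
print_r($parts);

// Several bracketed groups in one line also work.
$two = splitOutsideBrackets('a,[b,c],d,[e,f],g');
print_r($two);
```

With nested brackets such as a,[b,[c,d]],e the look-ahead stops at the inner '[' and splits in the wrong place, so a small hand-written parser is probably a better fit for that kind of input.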
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 3:05 pm
by Glycerine
Nice one buddy. I've just looked at it.
I'll get that book you suggested (stops me from bugging you)
what would you like in return? want a Flex book? AS3 animation? something along those lines? You like autodesk?
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 3:19 pm
by prometheuzz
Glycerine wrote:Nice one buddy. I've just looked at it.
I'll get that book you suggested (stops me from bugging you)
what would you like in return? want a Flex book? AS3 animation? something along those lines? You like autodesk?
Your gratitude is enough, really. But thank you for your offer.
Besides, I never do anything remotely related to graphics (2D or 3D) or web programming.
Best of luck with Mastering Regular Expressions, Jay. It definitely is a good read!
Re: Match ',' of which is not inside ' [ ] '
Posted: Wed May 06, 2009 5:26 pm
by Glycerine
Video?
What do you do - what are you interested in?