preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 12:07 am
by s.dot
I'm using the following pattern to capture certain strings between different delimiters:
Code:
$pattern = '/(' . $char1Start . '(.+?)' . $char1End . '|' . $char2Start . '(.+?)' . $char2End . ')/ism';
Which would give me something like this:
Code:
/(\[#\](.+?)\[\/#\]|\[\*\](.+?)\[\/\*\])/ism
Then I use preg_match_all($pattern, $text, $matches);
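For reference, a pattern like this can be assembled with preg_quote() so that characters such as * and / in the delimiters get escaped automatically. This is only a minimal sketch: $tags is a made-up variable name, and it holds just the two delimiter pairs shown above.

```php
<?php
// Hypothetical list of start/end delimiter pairs (an assumption, not from the post).
$tags = [
    ['[#]', '[/#]'],
    ['[*]', '[/*]'],
];

$alternatives = [];
foreach ($tags as $pair) {
    // preg_quote() escapes regex metacharacters; passing '/' as the second
    // argument also escapes the pattern delimiter.
    $alternatives[] = preg_quote($pair[0], '/') . '(.+?)' . preg_quote($pair[1], '/');
}

// The m modifier is left out: it only changes how ^ and $ behave.
$pattern = '/(' . implode('|', $alternatives) . ')/is';

$text = 'a [#]one[/#] b [*]two[/*] c';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

print_r($matches);
```

The second match still carries an empty slot at index 2, which is the behaviour described next.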
The problem is that I'm using the | (or) character: if the second condition is met, I get empty array values in $matches for the first () and the (.+?) that comes before the |.
Do I have to live with this and just array_filter() $matches when I'm done? I have a lot of empty values in my $matches array, since I'm using about 30 different |'s.
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 12:17 am
by prometheuzz
Can you also post the string that produces these empty entries?
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 12:53 am
by s.dot
Sure. I'm making a bbcode parser
My full pattern ends up being:
Code:
/(\[b\](.+?)\[\/b\]|\[u\](.+?)\[\/u\]|\[i\](.+?)\[\/i\]|\[s\](.+?)\[\/s\]|\[img\](.+?)\[\/img\]|\[center\](.+?)\[\/center\]|\[marquee\](.+?)\[\/marquee\]|\[blink\](.+?)\[\/blink\]|\[size=(.+?)\](.+?)\[\/size\]|\[color=(.+?)\](.+?)\[\/color\]|\[url(=.+?)?\](.+?)\[\/url\]|\[quote(=.+?)?\](.+?)\[\/quote\])/ism
The text I'm using to match upon:
Code:
[ b ]hi![ /b ] what\'s up with [ u ]you[ /u ], [ blink ]dude[ /blink ]? [ size=3 ]ok write me back[ /size ] [ quote ]something[ /quote ] [ quote=scott ]something else[ /quote ]
And the results
Code:
Array
(
[0] => Array
(
[0] => [ b ]hi![ /b ]
[1] => [ b ]hi![ /b ]
[2] => hi!
)
[1] => Array
(
[0] => [ u ]you[ /u ]
[1] => [ u ]you[ /u ]
[2] =>
[3] => you
)
[2] => Array
(
[0] => [ blink ]dude[ /blink ]
[1] => [ blink ]dude[ /blink ]
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] => dude
)
[3] => Array
(
[0] => [ size=3 ]ok write me back[ /size ]
[1] => [ size=3 ]ok write me back[ /size ]
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] => 3
[11] => ok write me back
)
[4] => Array
(
[0] => [ quote ]something[ /quote ]
[1] => [ quote ]something[ /quote ]
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
[17] => something
)
[5] => Array
(
[0] => [ quote=scott ]something else[ /quote ]
[1] => [ quote=scott ]something else[ /quote ]
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] => =scott
[17] => something else
)
)
I am using PREG_SET_ORDER.
EDIT| I had to space the bbcode out or else the forum would parse it.
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 3:42 am
by prometheuzz
scottayy wrote:Sure. I'm making a bbcode parser
My full pattern ends up being:
Code:
/(\[b\](.+?)\[\/b\]|\[u\](.+?)\[\/u\]|\[i\](.+?)\[\/i\]|\[s\](.+?)\[\/s\]|\[img\](.+?)\[\/img\]|\[center\](.+?)\[\/center\]|\[marquee\](.+?)\[\/marquee\]|\[blink\](.+?)\[\/blink\]|\[size=(.+?)\](.+?)\[\/size\]|\[color=(.+?)\](.+?)\[\/color\]|\[url(=.+?)?\](.+?)\[\/url\]|\[quote(=.+?)?\](.+?)\[\/quote\])/ism
...
Okay, the reason you're getting empty strings in your $matches is because of (sub) regex-es like these:
(=.+?)?
Since that group is optional (the trailing ? after the reluctant .+?), there can be times when that specific (sub) regex does not match any part of your string. When that occurs, you will end up with an empty string in your $matches. There's no way around that.
A couple of observations about your current approach:
- creating a parser solely using regex is going to be hard because of the recursive nature of many languages/grammars;
- there's no need to start and end your regex with parentheses;
- cramming your entire regex pattern into one huge string is going to be a maintenance nightmare; at least use the x-modifier, put your sub-regexes on separate lines and indent them nicely;
- since you're also matching slashes in your pattern, use a different delimiter for your regex, such as the character '@'.
Something like this:
Code:
$regex = '@
\[b\] (.+?) \[/b\]
| \[u\] (.+?) \[/u\]
| \[i\] (.+?) \[/i\]
| \[s\] (.+?) \[/s\]
| \[img\] (.+?) \[/img\]
| \[center\] (.+?) \[/center\]
| \[marquee\] (.+?) \[/marquee\]
| \[blink\] (.+?) \[/blink\]
| \[size=(.+?)\] (.+?) \[/size\]
| \[color=(.+?)\] (.+?) \[/color\]
| \[url(=.+?)?\] (.+?) \[/url\]
| \[quote(=.+?)?\] (.+?) \[/quote\]
@isx'; // no need for the m-modifier
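One option not mentioned in this thread is PCRE's branch reset group, (?| ... ), which makes every alternative reuse the same group numbers. The sketch below is only an illustration on three of the simple tags, and assumes a PHP build whose PCRE supports branch reset (PCRE 7.2 or later). The sample text is the one from earlier in the thread, without the extra spaces.

```php
<?php
// Branch reset: each alternative numbers its groups starting from 1 again,
// so the captured content always lands in group 1.
$regex = '@(?|
      \[b\]     (.+?) \[/b\]
    | \[u\]     (.+?) \[/u\]
    | \[blink\] (.+?) \[/blink\]
)@isx';

$text = '[b]hi![/b] what is up with [u]you[/u], [blink]dude[/blink]?';
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

print_r($matches);
```

Each entry then holds just the full match and the content. For tags with a value, such as size or color, group 1 would hold the value and group 2 the content, so the meaning of each index would depend on which alternative matched.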
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 11:07 am
by s.dot
The pattern is dynamically generated so maintenance isn't an issue.
So basically, using this approach there's no way to avoid the empty matches. I use array_map('array_filter', $matches); to remove the empty entries but the keys aren't renumbered. Is there an easy way to renumber array keys?
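For the renumbering question, array_values() reindexes an array from 0. A minimal sketch follows, using hand-written sample data shaped like the PREG_SET_ORDER output above. The second row is invented to show one pitfall: a bare array_filter() also drops the string "0".

```php
<?php
// Sample data shaped like the PREG_SET_ORDER results in this thread.
$matches = [
    ['[u]you[/u]', '[u]you[/u]', '', 'you'],
    ['[size=0]x[/size]', '[size=0]x[/size]', '', '', '0', 'x'],
];

$clean = array_map(function (array $set) {
    // Drop only empty strings. Without a callback, array_filter() removes
    // every falsy value, including the string "0".
    $kept = array_filter($set, function ($value) {
        return $value !== '';
    });
    // array_values() renumbers the remaining entries from 0.
    return array_values($kept);
}, $matches);

print_r($clean);
```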
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 11:52 am
by prometheuzz
You could match the two "types" of matches in two steps:
http://pastebin.com/f424fa913 (externally posted because of the forum eating up the tags)
Re: preg_match_all().. empty values in $matches
Posted: Thu Jan 22, 2009 4:19 pm
by s.dot
There's actually 3 types.
[ tag ]
[ tag=neededvaluehere ]
[ tag=optionalvaluehere ]
But looking at your regex example is very helpful! I had tried using $1 and it didn't work for me.. i guess \1 is what I was looking for.
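The pastebin code is not reproduced here, so the following is only a guess at the idea, using a made-up $plainTags list: capture the tag name once and require the closing tag to repeat it with the \1 backreference.

```php
<?php
// Hypothetical list of tags that take no value (an assumption for this sketch).
$plainTags = ['b', 'u', 'i', 's', 'blink'];

// Group 1 captures the tag name, group 2 the content.
// \1 in the closing tag must repeat whatever group 1 matched.
$pattern = '@\[(' . implode('|', $plainTags) . ')\](.+?)\[/\1\]@is';

$text = '[b]hi![/b] what is up with [u]you[/u], [blink]dude[/blink]?';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

Inside the pattern itself the backreference is written \1. The $1 form is only understood in the replacement string of preg_replace().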
Re: preg_match_all().. empty values in $matches
Posted: Fri Jan 23, 2009 1:12 am
by prometheuzz
scottayy wrote:There's actually 3 types.
[ tag ]
[ tag=neededvaluehere ]
[ tag=optionalvaluehere ]
Ah, yes, didn't notice that...
scottayy wrote:But looking at your regex example is very helpful! I had tried using $1 and it didn't work for me.. i guess \1 is what I was looking for.
Good. You realise what went wrong with your original idea, right? When matching a string with the regex:
and the 'c' is matched, the groups 1 and 2 will be empty. This, and my earlier observation of the reluctant groups, causes your empty matches.
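The regex from this post did not survive, so the pattern below is a hypothetical stand-in with three alternatives, each in its own group. It shows the effect with PREG_SET_ORDER:

```php
<?php
// Hypothetical pattern: one capture group per alternative.
$pattern = '/(a)|(b)|(c)/';

// The third alternative matches, so groups 1 and 2 are reported as ''.
preg_match_all($pattern, 'c', $forC, PREG_SET_ORDER);
print_r($forC[0]); // ['c', '', '', 'c']

// The first alternative matches; trailing unmatched groups are not listed.
preg_match_all($pattern, 'a', $forA, PREG_SET_ORDER);
print_r($forA[0]); // ['a', 'a']
```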
Good luck.