Okay, here's a possible way to tackle this:
Code:
<?php
$text = <<< BLOCK
<foobar> text <forbar,bar,foo> and <foobar,foo="<foo>",bar,foobar>
<name,value="Paul"> more noise <fullname,value="<name> Smith"> foo
<fullname> and an ultimate test:
<fullname,value="<name<nested<more-nesting!!!>>> Smith"> ok, done.
BLOCK;
$regex = '/<(?:[^>]|.(?!(?:[^"]*"[^"]*")*[^"]*$))*>/';
if (preg_match_all($regex, $text, $matches)) {
    print_r($matches);
}
/* Output:
Array
(
    [0] => Array
        (
            [0] => <foobar>
            [1] => <forbar,bar,foo>
            [2] => <foobar,foo="<foo>",bar,foobar>
            [3] => <name,value="Paul">
            [4] => <fullname,value="<name> Smith">
            [5] => <fullname>
            [6] => <fullname,value="<name<nested<more-nesting!!!>>> Smith">
        )

)
*/
?>
The logic behind the regex is actually quite simple. Let me explain:
Code:
$regex = '/
  <                                 # match a "<"
  (?:                               # open non-capturing group
    [^>]                            # match any character except ">"
    |                               # OR
    .(?!(?:[^"]*"[^"]*")*[^"]*$)    # any character that is NOT followed by an even number of double quotes, i.e. one inside a quoted value
  )                                 # close non-capturing group
  *                                 # match the group zero or more times
  >                                 # match a ">"
/x';
Note that you can copy and paste this into your code as-is: the 'x' modifier makes the regex engine ignore whitespace and #-comments in the pattern.
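If you want to convince yourself that the commented version really behaves like the one-liner, a quick check along these lines should do. It's only a sketch: the short sample string and the variable names are made up for illustration.

```php
<?php
// Made-up sample input: one plain tag and one tag with a ">" inside quotes.
$text = '<a> x <b,v="<c>"> y';

// The compact pattern from the first code block.
$compact = '/<(?:[^>]|.(?!(?:[^"]*"[^"]*")*[^"]*$))*>/';

// The same pattern in extended mode, with whitespace and comments.
$extended = '/
    <                                   # opening "<"
    (?:
        [^>]                            # anything but ">"
        |
        .(?!(?:[^"]*"[^"]*")*[^"]*$)    # or a character inside a quoted value
    )*
    >                                   # closing ">"
/x';

preg_match_all($compact, $text, $a);
preg_match_all($extended, $text, $b);

// Both patterns should find exactly the same tags.
var_dump($a[0] === $b[0]);
print_r($a[0]);
```

One thing to watch out for: the comments inside the pattern must not contain the delimiter ("/" here) or a single quote, otherwise the pattern or the PHP string ends early.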
But... there are some drawbacks (of course). On very large strings it may get slow, because every lookahead has to scan all the way to the end of the string. I must also stress that regexes do NOT make good parsers. It really looks like you need a decent parser for this.
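To give an idea of what a more parser-like approach could look like, here's a minimal sketch of a hand-written scanner that walks the string once and keeps track of whether it is inside a quoted value. The function name extract_tags is made up, and I'm assuming that double quotes are never escaped and that any "<" outside a tag opens a new one. Treat it as a starting point, not a finished parser.

```php
<?php
// Hypothetical helper: collects every <...> tag in a single pass,
// treating ">" inside double quotes as ordinary text.
// Assumption: quotes inside a tag are never escaped.
function extract_tags(string $text): array
{
    $tags = [];
    $start = null;      // offset of the "<" that opened the current tag, or null
    $inQuotes = false;  // only meaningful while inside a tag
    $len = strlen($text);

    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if ($start === null) {
            // Outside a tag: only a "<" is interesting.
            if ($ch === '<') {
                $start = $i;
                $inQuotes = false;
            }
        } elseif ($ch === '"') {
            // Inside a tag: a quote toggles the quoted state.
            $inQuotes = !$inQuotes;
        } elseif ($ch === '>' && !$inQuotes) {
            // An unquoted ">" closes the tag.
            $tags[] = substr($text, $start, $i - $start + 1);
            $start = null;
        }
    }
    return $tags;
}

$text = 'a <foobar> b <foobar,foo="<foo>",bar> c <fullname,value="<name<nested>> Smith"> d';
print_r(extract_tags($text));
```

Because the quote state is only tracked inside a tag, a stray double quote in the surrounding text doesn't confuse it. The regex counts quotes all the way to the end of the input, so for something like <a> "x <b> it would return one long match instead of two tags.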
My 2 cents.
HTH