Ok. First off, for any non-trivial regex, always write it in long, commented (free-spacing) form to get a handle on all the parentheses levels. I've taken your regex and expanded it with comments so that it is readable. Let's take a look at it:
$re_bbcode_tag = '%
    \[                        # Start BBCode opening tag.
    ([^\]=]+)                 # $1: Tag name.
    (=.+?)?                   # $2: Equals sign and everything?
    \]                        # End BBCode opening tag.
    (                         # $3: BBCode tag contents. (2 alternatives)
                              # Note: 1st alternative consumes no text!
      (?=                     # Start positive assertion. (why?)
        (                     # $4: Capture one character. (why?)
          [^\[/?\1(=.+?)?\]]  # Erroneous expression inside char class!
        )                     # End capture group $4
      )                       # End positive assertion
    |                         # or... 2nd alternative...
      .+                      # Greedily match one or more of anything!
    )                         # End capture group $3
    \[/\1\]                   # BBCode closing tag
    %x';
Some comments right off the bat:
- Group $3 is capturing the contents of the tag. But there are two alternatives, the first of which is a positive lookahead which consumes no text at all.
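- Inside the first alternative is a negated character class which is certainly not matching what you intended. (Hint: a character class always matches exactly one character, and only one. You have erroneously placed an entire sub-expression inside it, so nothing in there is treated as a group, quantifier or backreference.)
- The entire first alternative is doing absolutely nothing at all. The lookahead demands that the next character is not a "[", yet the closing tag which must follow immediately begins with a "[", so this alternative can never succeed.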
- This regex will not properly handle nested tags. (i.e. LISTs inside of LISTs or QUOTEs inside of QUOTEs).
- You are still unnecessarily using the DOT!
- The "s" single-line or dot-matches-all option is not specified. Thus your regex will not match tag contents which span more than one line.
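To make a few of these points concrete, here is a minimal sketch. The subject strings are made up for illustration and are not from your post. It shows that a character class matches a single character, that a greedy dot-plus runs through to the last closing tag, and what the "s" modifier changes:

```php
<?php
// Illustrative subject strings (assumptions, not taken from the original post).
$one_line  = '[b]one[/b] and [b]two[/b]';
$two_lines = "[quote]line one\nline two[/quote]";

// A character class matches exactly one character, never a sub-expression.
preg_match('/[^\[]/', 'xyz', $m);
$single = $m[0];

// Greedy dot-plus runs to the LAST closing tag, swallowing both [b] elements.
preg_match('%\[b\](.+)\[/b\]%', $one_line, $m);
$greedy = $m[1];

// Without the "s" modifier the dot stops at a newline, so this finds no match...
$without_s = preg_match('%\[quote\](.+)\[/quote\]%', $two_lines);

// ...and with "s" the very same pattern matches across the line break.
$with_s = preg_match('%\[quote\](.+)\[/quote\]%s', $two_lines);

var_dump($single, $greedy, $without_s, $with_s);
```

The greedy capture comes back as everything between the first opening tag and the last closing tag, which is exactly why the dot is such a blunt instrument here.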
You also need to define precisely what is allowed inside the BBCode tag. Do you allow whitespace around the equals sign? What characters are allowed inside a tag attribute? Whitespace? Square brackets? Can an attribute be enclosed within single or double quotes? If so, what characters are allowed inside the quotes? These are some of the questions that need to be answered before you can sit down and craft a regex that correctly matches your requirements.
As I said before, what you are attempting is quite advanced (and as Eran correctly points out, many would argue that regex is not even up to this parsing job). However, I personally believe that a carefully crafted regex can do what you are asking. But to give you a glimpse of the complexity required to pull it off right, here is a regex which correctly matches just an opening BBCode tag (which allows whitespace around the equals sign and quoted or non-quoted attribute values):
$re_bbcode_opening_tag = '/
    # Match BBCode opening tag with optional attribute (Rev20101028_1500)
    \[                      # Start BBCode opening tag.
    ([\w*]++)               # $1: = Tag name. (Note: [*] is a valid tag.)
    \s*+                    # Prune any whitespace following tag name.
    (?:                     # Start non-capture group for optional attribute.
      (=)                   # $2: = equals sign flags attribute existence.
      \s*+                  # Prune any whitespace following equal sign.
      (?>                   # Atomic group for attribute value alternatives.
        \'([^\']*+)\'       # $3: = single quoted value, or...
      | "([^"]*+)"          # $4: = double quoted value, or...
      | ([^\'"[\]\s]++)     # $5: = non-quoted value.
      )                     # End atomic group of attribute values.
      \s*+                  # Prune any whitespace following attribute value.
    )?+                     # Attribute is optional.
    \]                      # End BBCode opening tag.
    /x';
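Here is one way the opening-tag regex might be put to work. This is only a sketch: the pattern is repeated (with shortened comments) so the block runs on its own, and the helper name parse_opening_tag and the sample tags are hypothetical. It relies on PREG_UNMATCHED_AS_NULL (PHP 7.2+) so that the value groups which did not take part in the match come back as null rather than as empty strings:

```php
<?php
$re_bbcode_opening_tag = '/
    \[                      # Start BBCode opening tag.
    ([\w*]++)               # $1: Tag name.
    \s*+
    (?:
      (=)                   # $2: Equals sign.
      \s*+
      (?>
        \'([^\']*+)\'       # $3: Single quoted value.
      | "([^"]*+)"          # $4: Double quoted value.
      | ([^\'"[\]\s]++)     # $5: Non-quoted value.
      )
      \s*+
    )?+
    \]                      # End BBCode opening tag.
    /x';

// Hypothetical helper: returns the tag name and attribute value, or null
// when the text contains no well-formed opening tag.
function parse_opening_tag(string $re, string $text): ?array
{
    if (preg_match($re, $text, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }
    // At most one of $3, $4, $5 takes part in a match; the others are null
    // (or absent on older PHP versions, which ?? also tolerates).
    $value = $m[3] ?? $m[4] ?? $m[5] ?? null;
    return ['name' => $m[1], 'value' => $value];
}

var_dump(parse_opening_tag($re_bbcode_opening_tag, '[url = "http://example.com/a b"]'));
var_dump(parse_opening_tag($re_bbcode_opening_tag, '[url=foo bar]')); // not a valid tag
```

Note that the unquoted value with a space in it is rejected outright rather than half-matched, which is the "do not match what you are not after" side of the job.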
As you can see, it uses atomic grouping and possessive quantifiers which are helpful to both accuracy and speed for both matching and non-matching (and it does not use a single DOT). For you see, it is not good enough to just match what you are after, it is equally important to NOT match that which you are not after. In other words, your regex must be able to quickly match (or not match) any string you throw at it. This is important because one of the biggest problems you will encounter with regular expressions is catastrophic backtracking, or what some call "going super-linear". This condition must be avoided for a regex solution to work reliably. This undesirable super-linear behavior is particularly prevalent when matching against long target strings. (And this is also one of the reasons the dot-star and dot-plus expressions can be problematic. See the link.)
Once again, if you are serious about using regex for HTML/BBCode parsing, you must immerse yourself and learn the intricate details of how the underlying regex engine works, in order to be able to craft an expression that is both accurate and efficient (and doesn't go super-linear on bad input). This does require a significant time investment on your part to learn this technology, and IMHO, reading MRE3 (Jeffrey Friedl's "Mastering Regular Expressions", 3rd Edition) is the only way to go. Note that it is my experience that this time spent will pay for itself many times over once you get to the point where you can actually "think in regex".
But before you jump into the deep end, I would recommend going through the excellent tutorial at regular-expressions.info. This will give you a good starting point from which to work.
Hope this helps!
