matching all bb tags - regex almost works!

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

I want to match all text in [tag]text[/tag] format.

I have been playing with this regexp for a while now and I've nearly got it working except for two things:

1) It will match all BB tags I've thrown at it except [img] ... why is it not matching [img]?
2) The (?R) recursion. The recursion is not showing up in my output. Do I have to recursively call the preg_match_all() function to get this to work as desired? I know the recursion is working because the regex will go into a nest of tags until it cannot go deeper.

Here is the regex call:

Code: Select all

if (preg_match_all('/\[(.+?)(=.+?)?\]((?:[^\[\]|\[(?!\/?\\1(=.+?)?\])]+)|(?R))\[\/\\1\]/', $this->str, $matches, PREG_SET_ORDER))
{
	echo '<pre>'; print_r($matches); echo '</pre>';
}
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.

Re: matching all bb tags - regex almost works!

Post by s.dot »

Ah, I believe I've refined the problem further. It will match [ img ]text[ /img ] only if the "text" part doesn't have // or . in it (as a URL always does). This also applies to [ url ].

EDIT| OK, this applies for any tag with those characters inside of them. At least now I know where my regex is failing ;)
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: matching all bb tags - regex almost works!

Post by ridgerunner »

I could just give you the answer but that would be no fun. Since you have expressed some interest in learning this stuff, I'll give you a couple hints and some advice instead. First off, the hints... Don't use the all-powerful-dot unless you really need it (and you don't here). In other words: "Say what you mean and mean what you say." Regex allows you to be very precise in what you tell it to match - use this to your advantage. (e.g. Your first lazy dot ends when it reaches either an equals sign or a closing right square bracket. Wouldn't it be better to use something along the lines of: '([^\]=]*)' or possibly: '(\w+)' instead?) And to get the recursive part to work, you'll need to apply a quantifier to the group that contains it (and the PHP manual is your friend).
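To make the first hint concrete, here is a minimal sketch comparing a lazy-dot opening-tag pattern with a more explicit one. The subject string and variable names are made up for illustration, and only the opening-tag part of the regex is shown:

```php
<?php
// Illustrative subject (an assumption): one bracketed aside and one real tag pair.
$subject = 'See items [1, 2] and [b]bold[/b].';

// Lazy dot: accepts anything up to the next "]" as a tag name.
preg_match_all('/\[(.+?)(=.+?)?\]/', $subject, $lazy);

// Explicit: a name is word characters only; an attribute is "=" plus non-"]" text.
preg_match_all('/\[(\w+)(=[^\]]*)?\]/', $subject, $strict);

print_r($lazy[1]);   // names captured by the lazy-dot version: "1, 2", "b", "/b"
print_r($strict[1]); // names captured by the explicit version: "b"
```

The lazy dot treats "1, 2" and "/b" as tag names, while the explicit version only accepts "b". Saying exactly what a tag name may contain is what keeps the later closing-tag back-reference honest.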

Secondly, the (?R) construct is an advanced expression you need to be very careful with. Regular expressions in general should be handled like a loaded weapon (you can easily shoot yourself in the foot), but these more advanced features (such as (?R), (?1), (?|(A)|(B)|(C)), etc.) should be handled like old dynamite! A truly great regex needs to be carefully crafted, with detailed knowledge of how the underlying engine works, to get it working correctly under all conditions (i.e. not only matching what you want, but NOT matching what you don't want - and doing so quickly for all subject strings).

If you are serious about really learning this stuff, there is no better way than to just sit down and read: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl. I can honestly say that this is the most useful book I have ever read - highly recommended! (The third revision of the book has a whole chapter dedicated to PHP and it covers the recursive patterns in detail.)

That said, if you don't wish to spend the time learning this for yourself, I will be glad to help out (I've been writing a BBCode-2-HTML parser for the FluxBB open source forum project, and so I am intimately familiar with this specific regex you are working on.)
:)

Re: matching all bb tags - regex almost works!

Post by s.dot »

Thank you! Going to try researching your notes and see what I come up with.

I'm not too determined on learning the regex engine fully because I rarely use regex for more than simple matching - so just getting this one right will make me happy.

Re: matching all bb tags - regex almost works!

Post by ridgerunner »

Ok, let me know when you say "uncle". (There are other gotchas I haven't mentioned yet... This is most certainly NOT a simple regex!)

Re: matching all bb tags - regex almost works!

Post by s.dot »

I've sent you a private message.

Re: matching all bb tags - regex almost works!

Post by s.dot »

OK, it makes a lot of sense to use this for matching the opening tag, e.g. [tag] or [tag=something]:

Code: Select all

\[([^\]\[=]*)(=.+?)?\]
This way the part before the optional = cannot contain [, ], or =.
However, after the equals sign, I think anything should be allowed, to permit nesting such as:

Code: Select all

[quote=[url=http://www.example.com]this page[/url]]text[/quote]
Right? It also seems like there should be recursion within the part after the = sign in the opening tag to check for this type of behavior?

And, I'm screaming uncle on why it's not matching:

Code: Select all

[img]http://www.example.com/image.jpg[/img]
EDIT| I got why it wasn't matching URLs

Code: Select all

/\[([^\]\[=]*)(=.+?)?\]((?:[^\[\]|\[(?!\/\\1(=.+?)?\])]+)|(.+?)|(?R)*)\[\/\\1\]/
I had to allow for a character grouping that allowed anything rather than specifically deny something.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: matching all bb tags - regex almost works!

Post by Eran »

When you start getting into nested tags and attributes, that's when regex starts showing its limitations.
http://stackoverflow.com/questions/1732 ... 54#1732454

Re: matching all bb tags - regex almost works!

Post by s.dot »

Even if I can't figure out the regex, I can always recursively call the regex on matches or attributes found.
But for obvious reasons I'd rather get it all in one call.

I am back to this as my regex:

Code: Select all

/\[([^\]\[=]*)(=.+?)?\]((?:[^(\[\]|\(?!\/?\\1(=.+?)?\])]+)|(?R)+)\[\/\\1\]/
I'm back to not being able to match text inside tags including / or . characters and the recursion problem.

Re: matching all bb tags - regex almost works!

Post by s.dot »

OK, progress:

Code: Select all

<?php

$str = '[b][u][i]special[/i][/u][/b] [quote="scott"]one[quote="scottz"]two[/quote][/quote] [url]http://www.example.com[/url]';
$pattern = '#\[([^\]=]+)(=.+?)?\]((?=([^\[/?\\1(=.+?)?\]]))|.+)\[/\\1\]#';

echo $str;

if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER))
{
  echo '<pre>';
  print_r($matches);
  echo '</pre>';
} else
{
  echo 'No matches.';
}
Produces:

Code: Select all

[b][u][i]special[/i][/u][/b] [quote="scott"]one[quote="scottz"]two[/quote][/quote] [url]http://www.example.com[/url]

Array
(
    [0] => Array
        (
            [0] => [b][u][i]special[/i][/u][/b]
            [1] => b
            [2] => 
            [3] => [u][i]special[/i][/u]
        )

    [1] => Array
        (
            [0] => [quote="scott"]one[quote="scottz"]two[/quote][/quote]
            [1] => quote
            [2] => ="scott"
            [3] => one[quote="scottz"]two[/quote]
        )

    [2] => Array
        (
            [0] => [url]http://www.example.com[/url]
            [1] => url
            [2] => 
            [3] => http://www.example.com
        )

)
Very cool. :) It handles nesting until it finds the ending tag that matches the opening tag, and it handles matching the URL characters.

Now if I could only get that recursion working.
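One possible way to get both the nesting and per-level output is sketched below. It is not the answer given in this thread; the pattern, the variable `$re_bbcode_pair`, and the helper `bbcode_tree()` are hypothetical names, and the sketch assumes tag names are word characters and attributes contain no "]". The (?R) call only verifies that nested tags are balanced. PCRE restores the capture groups when a recursion returns, so the inner levels are exposed here by running the same pattern again on each match's contents:

```php
<?php
// Hypothetical pattern: matches one balanced [tag]...[/tag] pair, nested pairs included.
$re_bbcode_pair = '%
\[(\w+)(=[^\]]*)?\]              # $1: tag name, $2: optional "=attribute"
(                                # $3: tag contents
  (?:
    [^\[]++                      # plain text without an opening bracket
  | (?R)                         # or a complete nested tag pair
  | \[(?!/?\w+(?:=[^\]]*)?\])    # or a bracket that does not start a tag
  )*+                            # quantifier on the group that holds (?R)
)
\[/\1\]                          # closing tag with the same name
%x';

// Walk nested tags by re-running the pattern on the contents of each match.
function bbcode_tree(string $text, string $re): array
{
    $nodes = [];
    if (preg_match_all($re, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $nodes[] = [
                'tag'      => $m[1],
                'attr'     => $m[2], // raw attribute, including the "=" sign
                'children' => bbcode_tree($m[3], $re),
            ];
        }
    }
    return $nodes;
}

$str = '[b][u][i]special[/i][/u][/b] [quote="scott"]one[quote="scottz"]two[/quote][/quote] [url]http://www.example.com[/url]';
$tree = bbcode_tree($str, $re_bbcode_pair);
print_r($tree);
```

With the test string from above this yields three top-level nodes (b, quote, url), with u and i nested under b and the inner quote nested under the outer one. Unbalanced input, such as a stray closing tag inside a pair, simply fails to match in this sketch rather than being repaired.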

Re: matching all bb tags - regex almost works!

Post by ridgerunner »

Ok. First off, for any non-trivial regex, always write it in long, commented (free-spacing) form to get a handle on all the parentheses levels. I've taken your regex and expanded it with comments so that it is readable. Let's take a look at it:

Code: Select all

$re_bbcode_tag = '%
\[                        # Start BBCode opening tag.
([^\]=]+)                 # $1: Tag name.
(=.+?)?                   # $2: Equals sign and everything?
\]                        # End BBCode opening tag.
(                         # $3: BBCode tag contents. (2 alternatives)
                          # Note: 1st alternative consumes no text!
  (?=                     # Start positive assertion. (why?)
    (                     # $4: Capture one character. (why?)
      [^\[/?\1(=.+?)?\]]  # Erroneous expression inside char class!
    )                     # End capture group $4
  )                       # End positive assertion
|                         # or... 2nd alternative...
  .+                      # Greedily match one or more of anything!
)                         # End capture group $3
\[/\1\]                   # BBCode closing tag
%x';
Some comments right off the bat:
  • Group $3 is capturing the contents of the tag. But there are two alternatives, the first of which is a positive lookahead which consumes no text at all.
  • Inside the first alternative is a negated character class which is certainly not matching what you intended. (Hint: a character class always matches exactly one character (and only one character) - you have erroneously placed an entire sub-expression inside.)
  • The entire first alternative is doing absolutely nothing at all.
  • This regex will not properly handle nested tags. (i.e. LISTs inside of LISTs or QUOTEs inside of QUOTEs).
  • You are still unnecessarily using the DOT!
  • The "s" single-line or dot-matches-all option is not specified. Thus your regex will not match tag contents which span more than one line.
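Two of the points above (a character class matches a single character, and the dot does not cross newlines without the "s" modifier) can be checked with a short sketch. The test strings and variable names are made up for illustration:

```php
<?php
// Inside brackets, "(", ")" and "+" are literal set members, not grouping or
// repetition, so the class below matches exactly one character.
$class_on_abab = preg_match('/^[(ab)+]$/', 'abab'); // 0: four characters, class takes one
$class_on_plus = preg_match('/^[(ab)+]$/', '+');    // 1: "+" is a member of the set
$group_on_abab = preg_match('/^(ab)+$/', 'abab');   // 1: a real group, repeated

// Without the "s" modifier the dot stops at a newline.
$multiline = "[b]line one\nline two[/b]";
$without_s = preg_match('#\[b\](.+)\[/b\]#', $multiline);  // 0
$with_s    = preg_match('#\[b\](.+)\[/b\]#s', $multiline); // 1

var_dump($class_on_abab, $class_on_plus, $group_on_abab, $without_s, $with_s);
```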
You also need to define precisely what is allowed inside the BBCode tag. Do you allow whitespace around the equals sign? What characters are allowed inside a tag attribute? Whitespace? Square brackets? Can an attribute be enclosed within single or double quotes? If so, what characters are allowed inside the quotes? These are some of the questions that need to be answered before you can sit down and craft a regex that correctly matches your requirements.

As I said before, what you are attempting is quite advanced (and as Eran correctly points out, many would argue that regex is not even up to this parsing job). However, I personally believe that a carefully crafted regex can do what you are asking. But to give you a glimpse of the complexity required to pull it off right, here is a regex which correctly matches just an opening BBCode tag (which allows whitespace around the equals sign and quoted or non-quoted attribute values):

Code: Select all

$re_bbcode_opening_tag = '/
# Match BBCode opening tag with optional attribute (Rev20101028_1500)
\[                   # Start BBCode opening tag.
([\w*]++)            # $1: = Tag name. (Note: [*] is a valid tag.)
\s*+                 # Prune any whitespace following tag name.
(?:                  # Start non-capture group for optional attribute.
  (=)                # $2: = equals sign flags attribute existence.
  \s*+               # Prune any whitespace following equal sign.
  (?>                # Atomic group for attribute value alternatives.
    \'([^\']*+)\'    # $3: = single quoted value, or...
  | "([^"]*+)"       # $4: = double quoted value, or...
  | ([^\'"[\]\s]++)  # $5: = non-quoted value.
  )                  # End atomic group of attribute values.
  \s*+               # Prune any whitespace following attribute value.
)?+                  # Attribute is optional.
\]                   # End BBCode opening tag.
/x';
As you can see, it uses atomic grouping and possessive quantifiers, which help both accuracy and speed for matching and non-matching subjects alike (and it does not use a single DOT). It is not good enough to just match what you are after; it is equally important to NOT match that which you are not after. In other words, your regex must be able to quickly match (or not match) any string you throw at it. This is important because one of the biggest problems you will encounter with regular expressions is catastrophic backtracking, or what some call "going super-linear". This condition must be avoided for a regex solution to work reliably. This undesirable super-linear behavior is particularly prevalent when matching against long target strings. (And this is also one of the reasons the dot-star and dot-plus expressions can be problematic. See the link.)
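For anyone who wants to try the opening-tag regex, here is a minimal usage sketch. The pattern is the one above, written as a nowdoc so the quote characters need no escaping; the helper `parse_opening_tag()` and the sample tags are hypothetical additions for illustration:

```php
<?php
// The opening-tag pattern from above, in a nowdoc (no quote escaping needed).
$re_bbcode_opening_tag = <<<'REGEX'
/
\[                   # Start BBCode opening tag.
([\w*]++)            # $1: tag name.
\s*+
(?:
  (=)                # $2: equals sign flags attribute existence.
  \s*+
  (?>
    '([^']*+)'       # $3: single quoted value, or...
  | "([^"]*+)"       # $4: double quoted value, or...
  | ([^'"[\]\s]++)   # $5: non-quoted value.
  )
  \s*+
)?+
\]                   # End BBCode opening tag.
/x
REGEX;

// Hypothetical helper: returns [name, value or null], or null if nothing matches.
function parse_opening_tag(string $tag, string $re): ?array
{
    if (!preg_match($re, $tag, $m)) {
        return null;
    }
    $has_attr = isset($m[2]) && $m[2] === '=';
    // At most one of $3, $4, $5 takes part in a match; the others are empty or absent.
    $value = ($m[3] ?? '') . ($m[4] ?? '') . ($m[5] ?? '');
    return [$m[1], $has_attr ? $value : null];
}

$samples = ['[b]', '[quote = "scott"]', '[url=http://www.example.com]', '[*]', '[not a tag]'];
foreach ($samples as $tag) {
    var_dump(parse_opening_tag($tag, $re_bbcode_opening_tag));
}
```

The first four samples parse into a name plus an optional value, and "[not a tag]" is rejected. Because every quantifier is possessive, the rejection happens without retrying shorter tag names.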

Once again, if you are serious about using regex for HTML/BBCode parsing, you must immerse yourself and learn the intricate details of how the underlying regex engine works, in order to be able to craft an expression that is both accurate and efficient (and doesn't go super-linear on bad input). This does require a significant time investment on your part to learn this technology, and IMHO, reading MRE3 is the only way to go. Note that it is my experience that this time spent will pay for itself many times over once you get to the point where you can actually "think in regex".

But before you jump into the deep end, I would recommend going through the excellent tutorial at regular-expressions.info. This will give you a good starting point from which to work.

Hope this helps!
:)
Last edited by ridgerunner on Sat Oct 30, 2010 2:51 pm, edited 1 time in total.

Re: matching all bb tags - regex almost works!

Post by Eran »

I really like your regex commenting style, ridgerunner, I think I will borrow it with your permission :)

Re: matching all bb tags - regex almost works!

Post by ridgerunner »

Eran wrote: I really like your regex commenting style, ridgerunner, I think I will borrow it with your permission :)
Borrow Away! :)
With a language as concise as regex, in my mind it would be *insane* to write a complex expression without comments. I worked for many years writing in 100% assembly language which is similar to regex in its use of simple tokens. In ASM, for me, it was/is imperative to comment EVERY SINGLE LINE (so that I would be able to understand it 6 months later). I just got in the habit of commenting every line that way, (and it is easiest to add when actually writing the code - rather than adding it later). This commenting style has carried over for me into the world of regex.

Would anyone prefer to write even the simplest of C programs like this?

Code: Select all

#include <stdio.h> main(){printf("hello, world\n");}
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: matching all bb tags - regex almost works!

Post by josh »

ridgerunner wrote: sit down and read: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl.
You've convinced me.