Get <object> with regex

Posted: Sun Apr 25, 2010 5:01 am
by hwdesign
I'm trying to use preg_match_all to find the <object></object> tags and everything in between.

I'm having a hard time wrapping my head around the regex. I tried this:

Code:

/\<object(.*)\<\/object\>/is

But it excluded the object tags themselves. I want to grab the entire <object></object> code block.

Can anyone point me towards the right regex to achieve this?

Re: Get <object> with regex

Posted: Mon Apr 26, 2010 2:59 pm
by tr0gd0rr
First, you need an ungreedy (lazy) quantifier: use `.*?` instead of `.*`. Without it, a string containing two or more object tags gives you a single match that runs from the first opening tag to the last closing tag.

Second, with `preg_match($regex, $subject, $match)`, `$match[0]` holds the entire match, including the tags.

Third, you don't need the parentheses if you only need the entire match.
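
Since the original question was about preg_match_all, here is a minimal sketch of the three points above. The `$subject` string is a made-up sample, and the `[^>]*` after `<object` is my own addition so that tags with attributes still match. With the default flags, preg_match_all puts every full match (tags included) into element `[0]` of the result array.

```php
<?php
// Hypothetical sample input: two separate (non-nested) <object> blocks.
$subject = '<p>a</p><object data="one.swf">first</object><p>b</p><object data="two.swf">second</object>';

// Greedy: .* runs to the LAST </object>, so both blocks merge into one match.
preg_match_all('/<object\b[^>]*>.*<\/object>/is', $subject, $greedy);

// Lazy: .*? stops at the FIRST </object> after each opening tag.
preg_match_all('/<object\b[^>]*>.*?<\/object>/is', $subject, $lazy);

// Element [0] of each result holds the full matches, tags included.
echo count($greedy[0]), " greedy match(es)\n";
echo count($lazy[0]), " lazy match(es)\n";
foreach ($lazy[0] as $block) {
    echo $block, "\n";
}
```

No capturing parentheses are needed here because only the full matches are used.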

Re: Get <object> with regex

Posted: Tue Apr 27, 2010 10:48 am
by ridgerunner
You almost had it. tr0gd0rr was right: you need to use a lazy quantifier. Here is the simplest solution:

Code:

// Example 1: Don't use capturing parentheses.
// The "i" and "s" modifiers ignore case and let "." match newlines;
// [^>]* allows attributes inside the opening tag.
if (preg_match('%<object[^>]*>.*?</object>%is', $contents, $matches)) {
    # Successful match
    echo(sprintf("The whole match = %s\n", $matches[0]));
} else {
    # Match attempt failed
    echo("No match\n");
}

// Example 2: Use capturing parentheses to isolate <object> contents.
if (preg_match('%<object[^>]*>(.*?)</object>%is', $contents, $matches)) {
    # Successful match
    echo(sprintf("The <object> contents = %s\n", $matches[1]));
} else {
    # Match attempt failed
    echo("No match\n");
}

However, the simple solution above does not properly handle <object> tags when they are nested inside each other. If you have some subject text like this:

Code:

<object>level one stuff <object>level two stuff</object> more level one stuff</object>

Then the above code will produce the following erroneous output:

Code:

The whole match = <object>level one stuff <object>level two stuff</object>
The <object> contents = level one stuff <object>level two stuff

As you can see, the match isn't quite right: it pairs the outer opening tag with the inner closing tag. Here is another script, which correctly matches only the innermost <object></object> pair:

Code:

// Example 3: Capture contents of innermost nested <object>...</object> tags
if (preg_match('%
    <object[^>]*>      # match any opening <object att="value" ...> tag
    (                  # group $1: begin capturing <object>\'s contents
      [^<]*            # match zero or more non-left-angle-brackets
      (?:              # begin non-capture group to apply * quantifier
        (?!</?object)  # if this < is not the start of a <object or </object
        <              # then go ahead and match the <
        [^<]*          # and continue matching more non-left-angle-brackets
      )*               # keep doing this until the closing </object> is found
    )                  # end capture group $1 with <object>\'s contents
    </object>          # match the closing </object> tag.
    %ix', $contents, $matches)) {
    # Successful match
    echo(sprintf("The whole match = %s\n", $matches[0]));
    echo(sprintf("The <object> contents = %s\n", $matches[1]));
} else {
    # Match attempt failed
    echo("No match\n");
}

You can also write a regex that correctly matches the outermost <object> pair, but this requires a recursive sub-expression, which is rather complex.
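
For the curious, here is a rough sketch of what such a recursive pattern could look like. It is one possible way to write it, not a tested production solution. It relies on PCRE's `(?R)` construct, which re-applies the whole pattern at that point, and the `$contents` value is just the nested sample text from earlier in this post.

```php
<?php
// Sample input taken from the nested example earlier in this post.
$contents = '<object>level one stuff <object>level two stuff</object> more level one stuff</object>';

$pattern = '%
    <object\b[^>]*>          # opening tag, with optional attributes
    (?:
        [^<]++               # text that contains no "<"
      | <(?!/?object\b)      # a "<" that does not begin an object tag
      | (?R)                 # a complete nested <object>...</object> pair
    )*+
    </object\s*>             # the closing tag that balances the opening one
    %ix';

if (preg_match($pattern, $contents, $matches)) {
    echo "The outermost match = {$matches[0]}\n";
} else {
    echo "No match\n";
}
```

With the sample text above, `$matches[0]` should be the whole string, since the inner pair is consumed by the recursion before the final closing tag is matched. Unbalanced markup will make the match fail or land somewhere unexpected, so for real-world HTML an actual parser is usually the safer tool.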

Hope this helps! :)