Get <object> with regex

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

hwdesign
Forum Newbie
Posts: 9
Joined: Fri Dec 18, 2009 4:40 am

Get <object> with regex

Post by hwdesign »

I'm trying to use preg_match_all to find the <object></object> tags and everything in between.

I'm having a hard time wrapping my head around the regex. I tried this:

Code: Select all

/\<object(.*)\<\/object\>/is
But it excluded the object tags themselves. I want to grab the entire <object></object> code block.

Can anyone point me towards the right regex to achieve this?
tr0gd0rr
Forum Contributor
Posts: 305
Joined: Thu May 11, 2006 8:58 pm
Location: Utah, USA

Re: Get <object> with regex

Post by tr0gd0rr »

First, you need a lazy (ungreedy) quantifier: use `.*?` instead of `.*`. Without it, a string containing two or more object tags will produce one match that runs from the first opening tag to the last closing tag.

Second, with `preg_match($regex, $subject, $match)`, `$match[0]` holds the entire match, including the tags. With `preg_match_all`, `$match[0]` is an array of all the full matches.

Third, you don't need the parentheses if you only need the entire match.
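
To make the greedy/lazy difference concrete, here is a minimal sketch using `preg_match_all`. The `$html` sample string and the variable names are made up for illustration and are not from the original question:

```php
<?php
// Hypothetical sample input with two sibling <object> blocks.
$html = '<p>a</p><object id="1">one</object><p>b</p><object id="2">two</object>';

// Greedy: .* runs to the LAST </object>, so both blocks merge into one match.
preg_match_all('%<object\b[^>]*>.*</object>%is', $html, $greedy);

// Lazy: .*? stops at the FIRST </object>, giving one match per block.
preg_match_all('%<object\b[^>]*>.*?</object>%is', $html, $lazy);

echo count($greedy[0]), "\n"; // number of greedy matches
echo count($lazy[0]), "\n";   // number of lazy matches
print_r($lazy[0]);            // each element is a full match, tags included
```

With this input the greedy pattern should report one match and the lazy pattern two, and every entry of `$lazy[0]` includes the surrounding tags.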
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Get <object> with regex

Post by ridgerunner »

You almost had it. tr0gd0rr is right: you need a lazy quantifier. Here is the simplest solution:

Code: Select all

// Example 1: Don't use capturing parentheses.
// [^>]* allows attributes; i = case-insensitive, s = dot also matches newlines.
if (preg_match('%<object\b[^>]*>.*?</object>%is', $contents, $matches)) {
    # Successful match
    echo(sprintf("The whole match = %s\n", $matches[0]));
} else {
    # Match attempt failed
    echo("No match\n");
}

// Example 2: Use capturing parentheses to isolate <object> contents.
if (preg_match('%<object\b[^>]*>(.*?)</object>%is', $contents, $matches)) {
    # Successful match
    echo(sprintf("The <object> contents = %s\n", $matches[1]));
} else {
    # Match attempt failed
    echo("No match\n");
}
However, the simple solution above does not properly handle <object> tags when they are nested inside each other. If you have some subject text like this:

Code: Select all

<object>level one stuff <object>level two stuff</object> more level one stuff</object>
Then the above code will produce the following erroneous output:

Code: Select all

The whole match = <object>level one stuff <object>level two stuff</object>
The <object> contents = level one stuff <object>level two stuff
As you can see, the match isn't quite right: the lazy quantifier stops at the first </object>, which closes the inner tag, not the outer one. Here is another script which correctly matches only the innermost <object></object> pair:

Code: Select all

// Example 3: Capture contents of innermost nested <object>...</object> tags
if (preg_match('%
    <object[^>]*>      # match any opening <object att="value" ...> tag
    (                  # group $1: begin capturing <object>\'s contents
      [^<]*            # match zero or more non-left-angle-brackets
      (?:              # begin non-capture group to apply * quantifier
        (?!</?object)  # if this < is not the start of a <object or </object
        <              # then go ahead and match the <
        [^<]*          # and continue matching more non-left-angle-brackets
      )*               # keep doing this until the closing </object> is found
    )                  # end capture group $1 with <object>\'s contents
    </object>          # match the closing </object> tag.
    %ix', $contents, $matches)) {
    # Successful match
    echo(sprintf("The whole match = %s\n", $matches[0]));
    echo(sprintf("The <object> contents = %s\n", $matches[1]));
} else {
    # Match attempt failed
    echo("No match\n");
}
You can also write a regex that will correctly match the outermost <object> pair, but this requires using a recursive sub-expression which is rather complex.
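
For reference, here is one possible sketch of such a recursive pattern. It is an illustration under stated assumptions, not a tested production solution: it uses PCRE's `(?R)` construct to recurse into the whole pattern, it assumes every <object> tag is properly closed, and the `$outermost` variable name is made up for this example:

```php
<?php
// The nested sample text from earlier in this post.
$contents = '<object>level one stuff <object>level two stuff</object> more level one stuff</object>';

// Assumption: tags are balanced. (?R) re-applies the entire pattern.
$outermost = '%
    <object\b[^>]*>        # opening tag, with or without attributes
    (?:                    # repeat any of the following alternatives
        [^<]++             # a run of text containing no left angle bracket
      | (?R)               # a complete nested object element (recurse)
      | <(?!/?object\b)    # a left angle bracket that starts some other tag
    )*
    </object>              # the closing tag that balances the opening tag
    %ix';

if (preg_match($outermost, $contents, $matches)) {
    echo "Outermost match = ", $matches[0], "\n";
} else {
    echo "No match\n";
}
```

On the nested sample text this should match the whole string, from the first <object> to the last </object>. Like Example 3, it treats any `<object` it sees as a real tag, so the same text inside comments or scripts would confuse it.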

Hope this helps! :)