Get <object> with regex
Posted: Sun Apr 25, 2010 5:01 am
by hwdesign
I'm trying to use preg_match_all to find the <object></object> tags and everything in between.
I'm having a hard time wrapping my head around the regex. I tried this:
But it excluded the object tags themselves. I want to grab the entire <object></object> code block.
Can anyone point me towards the right regex to achieve this?
Re: Get <object> with regex
Posted: Mon Apr 26, 2010 2:59 pm
by tr0gd0rr
First, you need an ungreedy (lazy) quantifier: use `.*?` instead of `.*`. Without it, a string containing 2 or more object tags will match from the first opening tag all the way through the last closing tag, which is more than you want.
Second, with `preg_match($regex, $subject, $match)` you should see `$match[0]` show the entire match including the tags.
Third, you don't need the parentheses if you only need the entire match.
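To see the difference between the greedy and lazy forms, here is a minimal sketch. The sample string in `$subject` is made up for illustration and is not from the original post. It also shows `preg_match_all`, which the question mentioned, collecting each lazy match separately.

```php
<?php
// Hypothetical sample input with two separate <object> blocks.
$subject = 'a <object>one</object> b <object>two</object> c';

// Greedy: .* runs on to the last </object>, swallowing both blocks.
preg_match('%<object>.*</object>%', $subject, $greedy);

// Lazy: .*? stops at the first </object> it can reach.
preg_match('%<object>.*?</object>%', $subject, $lazy);

// preg_match_all collects every lazy match; $all[0] holds the whole matches.
preg_match_all('%<object>.*?</object>%', $subject, $all);

echo $greedy[0], "\n"; // <object>one</object> b <object>two</object>
echo $lazy[0], "\n";   // <object>one</object>
print_r($all[0]);      // both blocks, one array element each
```

In every case index 0 holds the full match including the tags, so no parentheses are needed just to get the whole block.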
Re: Get <object> with regex
Posted: Tue Apr 27, 2010 10:48 am
by ridgerunner
You almost had it. tr0gd0rr was right: you need to use a lazy quantifier. Here is the simplest solution:
// Example 1: Don't use capturing parentheses.
if (preg_match('%<object>.*?</object>%', $contents, $matches)) {
    // Successful match: $matches[0] holds the whole match, tags included.
    printf("The whole match = %s\n", $matches[0]);
} else {
    // Match attempt failed
    echo "No match\n";
}
// Example 2: Use capturing parentheses to isolate <object> contents.
if (preg_match('%<object>(.*?)</object>%', $contents, $matches)) {
    // Successful match: $matches[1] holds only the text between the tags.
    printf("The <object> contents = %s\n", $matches[1]);
} else {
    // Match attempt failed
    echo "No match\n";
}
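However, the simple solution above does not properly handle <object> tags when they are nested inside each other. (It also matches only a bare <object> tag with no attributes, and `.` will not cross line breaks unless you add the `s` modifier.) If you have some subject text like this: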
<object>level one stuff <object>level two stuff</object> more level one stuff</object>
Then the above code will produce the following erroneous output:
The whole match = <object>level one stuff <object>level two stuff</object>
The <object> contents = level one stuff <object>level two stuff
As you can see, the match stops at the first closing tag, so it isn't quite right! Here is another script which correctly matches only the innermost <object></object> pair:
// Example 3: Capture contents of innermost nested <object>...</object> tags
if (preg_match('%
    <object[^>]*>        # match any opening <object att="value" ...> tag
    (                    # group $1: begin capturing the contents of <object>
      [^<]*              # match zero or more non-left-angle-brackets
      (?:                # begin non-capture group to apply * quantifier
        (?!</?object)    # if this < is not the start of <object or </object
        <                # then go ahead and match the <
        [^<]*            # and continue matching more non-left-angle-brackets
      )*                 # keep doing this until the closing </object is found
    )                    # end capture group $1 with the contents of <object>
    </object>            # match the closing </object> tag
    %ix', $contents, $matches)) {
    // Successful match
    printf("The whole match = %s\n", $matches[0]);
    printf("The <object> contents = %s\n", $matches[1]);
} else {
    // Match attempt failed
    echo "No match\n";
}
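Since the original question asked about `preg_match_all`, here is a rough sketch of the Example 3 pattern used that way. The regex is the same one written on a single line without the comments; the sample `$contents` string is an assumption made up for this illustration.

```php
<?php
// Hypothetical input: one nested pair, then one standalone tag that has
// an attribute and spans two lines.
$contents = '<object>outer <object>inner</object> tail</object>'
          . "\n" . '<object width="10">solo' . "\n" . 'text</object>';

// Example 3 pattern, compacted onto one line (so no x modifier is needed).
$regex = '%<object[^>]*>([^<]*(?:(?!</?object)<[^<]*)*)</object>%i';

// Returns the number of full matches found.
$count = preg_match_all($regex, $contents, $matches);

// $matches[0] = whole matches, $matches[1] = contents of each innermost pair.
printf("Found %d innermost blocks\n", $count);
print_r($matches[1]);
```

Because the pattern relies on negated character classes such as `[^<]*` rather than `.`, it crosses line breaks without the `s` modifier, and `[^>]*` lets the opening tag carry attributes.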
You can also write a regex that will correctly match the outermost <object> pair, but this requires a recursive sub-expression, which is rather complex.
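For the curious, here is one way such a recursive pattern might look, using PCRE's `(?R)` construct, which recurses into the whole pattern. Treat it as a sketch only: the sample input is hypothetical, it assumes the tags are balanced, and it has not been tried against real-world HTML.

```php
<?php
// Hypothetical input: one nested pair followed by one standalone pair.
$contents = 'x <object>A <object>B</object> C</object> y <object>D</object> z';

$regex = '%
    <object\b[^>]*>         # opening tag, attributes allowed
    (?:                     # body is any mix of:
        [^<]++              #   text containing no <
      | <(?!/?object\b)     #   a < that does not begin <object or </object
      | (?R)                #   a complete nested pair (recurse whole pattern)
    )*
    </object>               # closing tag that balances the opening one
%ix';

// Each element of $matches[0] is one outermost block, inner pairs included.
$count = preg_match_all($regex, $contents, $matches);

printf("Found %d outermost blocks\n", $count);
print_r($matches[0]);
```

The possessive `[^<]++` is there to keep the engine from retrying the same text in different ways when a match fails. If the nesting is unbalanced, the outer attempt fails and the engine falls back to whatever inner pair it can still match.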
Hope this helps!
