Case in point: I'm trying to extract everything in some .htm files that's inside the body tag. Now, I'd *love* to just use HTMLPurifier, but it's a little too aggressive: I get some dropped images, and a whole lot of hatted a's in my text. Instead, I figured what I needed could be done with a fairly simple regex. Here's what I have:
Code:
function get_html_body($html)
{
    $pattern = '/<body.*>(.*)<\/body>/';
    $matches = array();
    if (preg_match($pattern, $html, $matches))
        return $matches[1];
    else
        return false;
}
Looking at it here, I'm thinking that maybe the .* after the first "body" might be eating everything?
I changed the pattern to:
Code:
'/<body[^>]*>(.*)<\/body>/'
Any thoughts or ideas?
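For reference, here's a minimal sketch of where I suspect this is heading. The `s` and `i` modifiers and the lazy `(.*?)` are guesses on my part rather than a confirmed fix: without `s`, the dot does not match newlines, so a body that spans several lines would never match at all. The sample `$html` string is made up for illustration.

```php
<?php
// Sketch only: the "is" modifiers and the lazy quantifier are assumptions,
// not something confirmed against the real .htm files.
function get_html_body($html)
{
    // [^>]* keeps the opening-tag match from running past its own ">".
    // s lets "." match newlines, i accepts <BODY>, and (.*?) stops at
    // the first </body> instead of the last one.
    $pattern = '/<body[^>]*>(.*?)<\/body>/is';
    $matches = array();
    if (preg_match($pattern, $html, $matches)) {
        return $matches[1];
    }
    return false;
}

// Hypothetical multi-line input, just to exercise the function.
$html = "<html>\n<body class=\"x\">\n<p>Hello</p>\n</body>\n</html>";
var_dump(get_html_body($html));
```

If it works the way I think, the multi-line case is the one the original pattern was failing on, since `preg_match` treats `.` as "anything except a newline" by default.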
EDIT: I realized my post title, while witty, was not helpful to those who may be able to help me... so I changed it.