Extracting html body with regex [SOLVED]
Posted: Mon Jul 30, 2007 1:34 pm
Everything I've read says that, unless you tell it otherwise, regex is greedy and takes in everything it possibly can. However, on my system, it never seems to work that way. I think my regex might be anorexic.
Case in point: I'm trying to extract everything inside the body tag of some .htm files. Now, I'd *love* to just use HTMLPurifier, but it's a little too aggressive: I get some dropped images and a whole lot of hatted a's in my text. Instead, I figured what I needed could be done with a fairly simple regex. Here's what I have:
Code:
function get_html_body($html)
{
    $pattern = '/<body.*>(.*)<\/body>/';
    $matches = array();
    if (preg_match($pattern, $html, $matches))
        return $matches[1];
    else
        return false;
}
which returns a whole lot of nothing.
Looking at it here, I'm thinking that maybe the .* after the first "body" might be eating everything?
I changed the pattern to:
Code:
'/<body[^>]*>(.*)<\/body>/'
but that doesn't seem to work either.
Any thoughts or ideas?
EDIT: I realized my post title, while witty, was not helpful to those that may be able to help me... so I changed it.
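For anyone landing here with the same symptom: one likely cause, assuming the .htm files have line breaks between the body tags, is that "." in PCRE does not match a newline unless the s (DOTALL) modifier is set. If so, greediness is not the problem; the (.*) simply stops at the first line break and never reaches the closing tag. The sketch below is only an illustration under that assumption. The function name get_html_body_dotall is made up for this example, and the i modifier is added on the guess that some files use uppercase tags.

```php
<?php
// Sketch only: get_html_body_dotall() is a hypothetical name, not from the original post.
// Assumption: the input contains a single <body> element and may span several lines.
function get_html_body_dotall($html)
{
    // s: "." also matches newlines; i: accepts <BODY> as well as <body>.
    // [^>]* keeps the opening-tag part from running past the first ">".
    $pattern = '/<body[^>]*>(.*)<\/body>/is';
    $matches = array();
    if (preg_match($pattern, $html, $matches)) {
        return $matches[1];
    }
    return false;
}

$html = "<html>\n<body class=\"x\">\n<p>Hello</p>\n</body>\n</html>";

// Without the s modifier the dot stops at the first newline, so nothing matches.
var_dump(preg_match('/<body[^>]*>(.*)<\/body>/', $html)); // int(0)

// With it, the capture group spans the line breaks.
var_dump(get_html_body_dotall($html));
```

Because the capture is greedy, it runs to the last </body> in the input, which is usually what you want for a whole file. This is still a regex over HTML, so something like a literal "</body>" inside a script or comment would need a real parser.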