Extracting html body with regex [SOLVED]

Posted: Mon Jul 30, 2007 1:34 pm
by ReverendDexter
Everything I've read says that regex is greedy unless you tell it not to be, and takes in everything it possibly can. However, on my system, it never seems to work that way. I think my regex might be anorexic :)

Case in point: I'm trying to extract everything in some .htm files that's inside the body tag. Now, I'd *love* to just use HTMLPurifier, but it's a little too aggressive: I get some dropped images and a whole lot of hatted a's in my text. Instead, I figured what I needed could be done with a fairly simple regex. Here's what I have:

function get_html_body($html)
{
	$pattern = '/<body.*>(.*)<\/body>/';
	$matches = array();
	if (preg_match($pattern, $html, $matches))
		return $matches[1];
	else 
		return false;
}
which returns a whole lot of nothing.

Looking at it here, I'm thinking that maybe the .* after the first "body" might be eating everything?

I changed the pattern to:

'/<body[^>]*>(.*)<\/body>/'
but that doesn't seem to work either.
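
For what it's worth, the suspicion about the first .* does seem to hold up on a one-line input. Here's a rough check; the $html string is a made-up sample, not one of the real .htm files, and the variable names are just for illustration:

```php
<?php
// Hypothetical single-line sample, not taken from the real files.
$html = '<body class="x"><p>hi</p></body>';

// Greedy .* after "<body" runs to the end of the line, then backtracks only
// as far as the last ">" that still leaves "</body>" to match.
preg_match('/<body.*>(.*)<\/body>/', $html, $greedy);

// [^>]* cannot cross a ">", so it stops at the end of the opening tag.
preg_match('/<body[^>]*>(.*)<\/body>/', $html, $bounded);

var_dump($greedy[1]);  // string(0) ""
var_dump($bounded[1]); // string(9) "<p>hi</p>"
```

So on a single line the first pattern matches but captures an empty string, while the second one captures the body contents.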

Any thoughts or ideas?

EDIT: I realized my post title, while witty, was not helpful to those that may be able to help me... so I changed it.

Posted: Mon Jul 30, 2007 2:06 pm
by feyd
By default, the dot in PCRE does not match newline characters, so .* cannot span more than one line and the pattern fails when the body runs over several lines. Add the pattern modifier "s" (PCRE_DOTALL) so that the dot matches newlines too.
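
A minimal sketch of what that could look like, assuming the second pattern from the original post with the "s" modifier appended. The multi-line $html string is a made-up sample for illustration only:

```php
<?php
function get_html_body($html)
{
    // "s" lets "." match newlines, so (.*) can span the whole body.
    // [^>]* keeps the first part of the pattern inside the opening tag.
    $pattern = '/<body[^>]*>(.*)<\/body>/s';
    $matches = array();
    if (preg_match($pattern, $html, $matches))
        return $matches[1];
    else
        return false;
}

// Hypothetical multi-line sample document.
$html = "<html>\n<body class=\"x\">\n<p>hi</p>\n</body>\n</html>";

var_dump(get_html_body($html)); // string(11) "\n<p>hi</p>\n" (with real newlines)
```

Without the "s" modifier the same pattern finds no match in this sample, because the body contents sit on different lines from the opening and closing tags.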

Posted: Mon Jul 30, 2007 3:01 pm
by ReverendDexter
Feyd, you're my hero.