Extracting html body with regex [SOLVED]

ReverendDexter
Forum Contributor
Posts: 193
Joined: Tue May 29, 2007 1:26 pm
Location: Chico, CA

Post by ReverendDexter

Everything I've read says that regex is greedy unless you tell it not to be, and takes in everything it possibly can. However, on my system, it never seems to work that way. I think my regex might be anorexic :)

Case in point: I'm trying to extract everything in some .htm files that's inside the body tag. Now, I'd *love* to just use HTMLPurifier, but it's a little too aggressive: I get some dropped images and a whole lot of hatted a's in my text. Instead, I figured what I needed could be done with a fairly simple regex. Here's what I have:

Code:

function get_html_body($html)
{
	$pattern = '/<body.*>(.*)<\/body>/';
	$matches = array();
	if (preg_match($pattern, $html, $matches))
		return $matches[1];
	else 
		return false;
}
which returns a whole lot of nothing.

Looking at it here, I'm thinking that maybe the .* after the first "body" might be eating everything?

I changed the pattern to:

Code:

'/<body[^>]*>(.*)<\/body>/'
but that doesn't seem to work either.
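
For what it's worth, the suspicion about the first .* looks right when the whole document sits on one line. The sketch below compares the two patterns from this post on a made-up single-line string ($html, $greedy and $bounded are illustration names, not part of the original code):

```php
<?php
// Hypothetical single-line input, used only for illustration.
$html = '<body class="x"><p>Hi</p></body>';

// The greedy ".*" after "<body" runs to the LAST ">" that still lets the
// rest of the pattern match, so the capture group ends up empty.
preg_match('/<body.*>(.*)<\/body>/', $html, $greedy);

// "[^>]*" cannot cross a ">", so it stops at the end of the opening tag.
preg_match('/<body[^>]*>(.*)<\/body>/', $html, $bounded);

var_dump($greedy[1]);   // string(0) ""
var_dump($bounded[1]);  // string(9) "<p>Hi</p>"
```

So the second pattern does fix the greediness problem. If it still returns nothing on real files, the cause is probably something else, such as the input spanning several lines.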

Any thoughts or ideas?

EDIT: I realized my post title, while witty, was not helpful to those that may be able to help me... so I changed it.
Last edited by ReverendDexter on Mon Jul 30, 2007 3:02 pm, edited 1 time in total.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd

By default, the dot in PCRE does not match newline characters, so a .* match has to be found within a single line or it will not match at all. Add the pattern modifier "s" (PCRE_DOTALL) after the closing delimiter so that the dot matches newlines too.
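
A minimal sketch of the function from the first post with the "s" modifier applied. The "i" modifier and the sample $html string are assumptions added for illustration, not something the thread asked for:

```php
<?php
// Sketch of the original function with the "s" (PCRE_DOTALL) modifier added.
function get_html_body($html)
{
    // "s" lets "." match newlines; "i" (an added assumption) tolerates <BODY>.
    $pattern = '/<body[^>]*>(.*)<\/body>/is';
    $matches = array();
    if (preg_match($pattern, $html, $matches)) {
        return $matches[1];
    }
    return false;
}

// Hypothetical multi-line input for illustration.
$html = "<html>\n<body class=\"x\">\n<p>Hi</p>\n</body>\n</html>";
var_dump(get_html_body($html));
```

Because (.*) is greedy, the capture runs to the last </body> in the input, which is usually what you want for a single document. The returned string keeps the newlines just inside the body tags, so trim() it if those matter.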
ReverendDexter
Forum Contributor
Posts: 193
Joined: Tue May 29, 2007 1:26 pm
Location: Chico, CA

Post by ReverendDexter

Feyd, you're my hero.