Regular Expression returning empty Array ( )

Posted: Thu Oct 07, 2010 4:43 am
by Miteshsach86
Hi fellow developers,

I'm having a real problem at the moment. I'm trying to capture everything between the <body></body> tags using the following code, but it does not print anything:

Code:

$lines = file("http://www.bbc.co.uk/");

foreach ($lines as $line_num => $line) {
$thecontent .= htmlspecialchars($line) . "<br />\n";
}
preg_match('/<body.*?>(.*?)<\/body >/', $thecontent, $htmltext);
$moretext = $htmltext[1];
echo $moretext;

When I add print($thecontent); to the code, the entire HTML for the site (whatever the website is) does display, but I want to capture only the HTML between the body tags. I've tried everything, but I just can't get this to work. :? :banghead:

I would appreciate anyone's help and I'd like to thank you in advance.

M

Re: Regular Expression returning empty Array ( )

Posted: Thu Oct 07, 2010 5:11 am
by Benjamin
.* doesn't match new lines.

Code:

preg_match('#<body[^>]*>([\s\S]*)<\s?/body>#i', $thecontent, $htmltext);

Re: Regular Expression returning empty Array ( )

Posted: Thu Oct 07, 2010 5:43 am
by requinix
Benjamin wrote: .* doesn't match new lines.
...by default. Add the 's' flag and it will.
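
To illustrate the point about the 's' flag, here is a minimal sketch. The sample string is made up for the example and is not the page from the original post:

```php
<?php
// Made-up input that spans several lines, like a real page would.
$html = "<body>\nHello\n</body>";

// Without the s modifier, . stops at a newline, so the pattern cannot match.
$without = preg_match('/<body>(.*?)<\/body>/', $html, $m1);

// With the s modifier (PCRE_DOTALL), . matches newlines as well.
$with = preg_match('/<body>(.*?)<\/body>/s', $html, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
var_dump($m2[1]);   // the captured text, newlines included
```

The [\s\S] trick in the earlier reply should have the same effect, because a class of "whitespace or non-whitespace" covers every character, newlines included.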

Re: Regular Expression returning empty Array ( )

Posted: Thu Oct 07, 2010 7:20 am
by Miteshsach86
Hi Guys,

Thanks for your replies. Unfortunately, that's also giving me an empty "Array ( )" :(

Is there anything else that you think I'm doing wrong? :?

M

Re: Regular Expression returning empty Array ( )

Posted: Thu Oct 07, 2010 7:43 am
by twinedev
The problem is that you are running htmlspecialchars() on the data before you apply the regular expression, so the text no longer contains a literal < or > for the pattern to match. (This is in addition to needing the s option at the end of the expression to allow newlines to be matched.)
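
A small sketch of why the match fails, assuming a made-up one-line page in place of the real one:

```php
<?php
// Made-up input for illustration only.
$raw     = '<body>Hi</body>';
$escaped = htmlspecialchars($raw);

echo $escaped, "\n"; // &lt;body&gt;Hi&lt;/body&gt;

$pattern = '/<body.*?>(.*?)<\/body>/s';

// The raw markup still contains literal < and >, so this matches.
$onRaw = preg_match($pattern, $raw, $m);

// The escaped text has &lt; and &gt; instead, so the same pattern finds nothing.
$onEscaped = preg_match($pattern, $escaped);

var_dump($onRaw, $onEscaped); // int(1) int(0)
```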

Two choices here:

1. Change your expression to match the escaped text:

Code:

preg_match('/&lt;body.*?&gt;(.*?)&lt;\/body&gt;/s', $thecontent, $htmltext);

2. (My recommendation) Wait until after you have captured the body before converting it. In that case you only need to add the s to the end of your original expression so that newlines are matched. (IMO, it's always best to work with the original "raw" data for as long as possible and only convert it right before you need it converted.)

Also a note: when I copied and pasted the code you posted here, there was a space between </body and the closing >. There shouldn't be one. If the markup you are grabbing may have one by mistake, end the pattern with <\/body.*?> instead.
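
Roughly, the second approach could look like the sketch below. The $page string is a made-up stand-in for the fetched page (the original code reads it with file()), and $body is just a name chosen for this example:

```php
<?php
// Stand-in for the downloaded page; note the stray space in "</body >".
$page = "<html>\n<body class=\"home\">\n<p>Hello & welcome</p>\n</body >\n</html>";

$body = '';
// Match against the raw markup first. The s modifier lets . cross newlines,
// and <\/body.*?> tolerates a stray space before the closing >.
if (preg_match('/<body.*?>(.*?)<\/body.*?>/s', $page, $htmltext)) {
    // Escape only now, right before the text is displayed.
    $body = htmlspecialchars($htmltext[1]);
}

// nl2br() adds <br /> before each newline for display in a browser.
echo nl2br($body);
```

Checking the return value of preg_match() before reading $htmltext[1] also avoids an undefined index notice when nothing matches.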

-Greg

Re: Regular Expression returning empty Array ( )

Posted: Fri Oct 08, 2010 6:16 am
by Miteshsach86
Thanks for all your help guys!

Much appreciated :)