
Grabbing/parsing HTML from a remote website (regular expressions)

Posted: Mon Sep 29, 2003 12:28 pm
by kendall
Hello,

The following is code I wrote to grab the news headlines inside <h2> tags and output them:

Code: Select all

$fp = fopen('http://www.trinidadexpress.com','r');
$content = fread($fp,15000);
fclose($fp);
$exp = "`(<h2[.*>]+>(\\n|.*)<\/h2>)<font[.*>]+>(\\n|.*)<\/font>|(<p[.*>]+>(\\n|.*)<\/p>)`i";
$news = preg_match_all($exp,$content,$headlines);
echo $news.'<br>';
for($i=0;$i<=count($headlines);$i++){
	echo $headlines[$i][0].'<br>';
	echo $headlines[$i][1].'<br>';
}
The problem is that I am getting 0 matches, which shouldn't happen because the content source specifically has <h2> tags in it to identify headline news.
I take it my expression is wrong, but what exactly is incorrect here?

Kendall

Posted: Mon Sep 29, 2003 3:01 pm
by pootergeist
My regex is pants. What I do know is that preg_match_all will return the full matched strings in the first index and only yield the captured results in the second index:

$headlines[1][0] - first match on that preg_match_all call
$headlines[1][1] - second match
etc etc

$headlines[0][0] would contain something of the order <h2>kjshafkjhas
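
For reference, a minimal sketch of how preg_match_all() lays out its results, using a made-up HTML string rather than the real page. In the default PREG_PATTERN_ORDER, index 0 holds every full match and index 1 holds every capture of the first group; PREG_SET_ORDER groups the results per match instead. The variable names here are illustrative only.

```php
<?php
// Made-up sample input (an assumption, not the real page).
$html = '<h2>First headline</h2><h2>Second headline</h2>';
$pattern = '`<h2>([^<]+)</h2>`i';

// Default PREG_PATTERN_ORDER: $m[0] = full matches, $m[1] = group 1.
$count = preg_match_all($pattern, $html, $m);
echo $m[0][0], "\n"; // full text of the first match, tags included
echo $m[1][0], "\n"; // group 1 of the first match
echo $m[1][1], "\n"; // group 1 of the second match

// PREG_SET_ORDER: $sets[$i] holds match $i; [0] = full match, [1] = group 1.
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], "\n";
}
```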

Posted: Wed Oct 01, 2003 9:42 am
by Derfel Cadarn
My regex isn't pants, it's worse.
But I got the script working after I changed it to this (searching for the <title> tags instead):

Code: Select all

<?php

$fp = fopen('http://www.trinidadexpress.com','r');

while ($content = fgets($fp, 1024)) {
    $exp = "/\<title\>?([^\/]+)\<\/title\>/";
    $news = preg_match_all($exp,$content,$headlines,PREG_SET_ORDER);
    if ($news) {
        for($i=0;$i<=count($headlines);$i++){
           echo $headlines[$i][0].': ';
           echo $headlines[$i][1]."<br>\n";
        }
    }
}

?>
You'll have to design your regex yourself, I'm afraid.

I hope it's of some use!

Edit: I changed it because I noticed that there were no <h2> tags in the first 15000 characters!



Posted: Wed Oct 01, 2003 2:04 pm
by kendall
Derfel Cadarn,

Actually, I had gotten this far:

Code: Select all

$exp = "`((<(h2)[^>]*><(font)[^>]*>+[\w\s\n]+)\n*)+`s";
which gets me what I want up to a point. It finds the opening <h2 att> and the text that is "formatted" (which is what I'm after), but for some reason I can't get it to retrieve the </h2> tags as well. Furthermore, it refuses to match punctuation, so a dollar sign or an apostrophe is ignored.

Can anyone add to it?

Kendall

Posted: Wed Oct 01, 2003 5:00 pm
by kendall
Just to update:

$exp = "`<(h2)[^>]*><(font)[^>]*>+[\w\s\n\!,\$]+\n*+`Sm";

This expression is working, but I don't know how to get the apostrophes. Even if I include a ' in the expression, I don't get them.

Kendall

Posted: Wed Oct 01, 2003 6:05 pm
by JAM
If we are comparing knowledge and pants, I wouldn't have any...

$exp = "'<(h2)[^>]*><(font)[^>]*>+[\w\s\n''\!,\$]+\n*+'Sm";
Added: ''

Would that help?

Posted: Thu Oct 02, 2003 4:24 am
by Derfel Cadarn
That regex seems to work, JAM, but it doesn't go past a line break.
I mean: when the HTML is written like this:

Code: Select all

<h2 align="left" style="margin-top: 0; margin-bottom: 0"><font face="Tahoma" size="3">Police Association calls
for end to acting post</font></h2>
only "Police Association calls" is returned and not the second part of the text "for end to acting post".

I'll try some more..
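
One possible explanation for the missing second line, offered as an assumption: a loop that reads the page with fgets() hands the pattern one line at a time, so a headline that wraps onto the next line is never seen whole. A minimal sketch that matches against the whole text in one go, using the sample markup quoted above ($clean is a hypothetical name; the pattern is an illustration, not the expression from the thread):

```php
<?php
// The sample markup quoted above, held in one string with its line break.
$content = '<h2 align="left" style="margin-top: 0; margin-bottom: 0">'
         . '<font face="Tahoma" size="3">Police Association calls' . "\n"
         . 'for end to acting post</font></h2>';

// "s" lets "." cross line breaks, "i" ignores tag case,
// and "(.*?)" captures any characters up to the first </font>.
$exp = '`<h2[^>]*>\s*<font[^>]*>(.*?)</font>\s*</h2>`is';

$clean = array();
$count = preg_match_all($exp, $content, $matches);
for ($i = 0; $i < $count; $i++) {
    // Fold the line break and any extra whitespace into single spaces.
    $clean[] = trim(preg_replace('/\s+/', ' ', $matches[1][$i]));
}
print_r($clean);
```

Because `(.*?)` accepts any character, punctuation such as apostrophes or dollar signs does not need to be listed in a character class.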

Posted: Thu Oct 02, 2003 9:44 am
by kendall
Derfel,

That is strange. I get the newline characters fine, but for some reason I don't get the apostrophe at ALL, even though '' is in the expression. Do I probably need to do a '' \w? I'm trying some angles

but don't seem to be making much progress.

Kendall

Posted: Thu Oct 02, 2003 9:51 am
by JAM
I can't test on my server at the moment, so I'm just throwing out ideas.

You could work around the line-break problem using str_replace() or something similar, if that is your problem.

Posted: Thu Oct 02, 2003 10:36 am
by kendall
JAM,

No, I don't know why Derfel is not matching the newline. What I am not able to match is the apostrophe.

Kendall

Posted: Thu Oct 02, 2003 11:04 am
by Derfel Cadarn
It's getting mysterious to me....
My code now looks like this:

Code: Select all

<?php

$fp = fopen('http://www.trinidadexpress.com','r');

while ($content = fgets($fp, 1024)) {
	$exp = "`<(h2)[^>]*><(font)[^>]*>+[\w\s\n\!,\$]+\n*+`Sm";  //kendall's
	//$exp = "'<(h2)[^>]*><(font)[^>]*>+[\w\s\n''\!,\$]+\n*+'Sm";  //jam's
    $news = preg_match_all($exp,$content,$headlines,PREG_SET_ORDER);
    if ($news) {
        for($i=0;$i<=count($headlines);$i++){
           echo $headlines[$i][0].': ';
           echo $headlines[$i][1]."<br>\n";
        }
    }
}

?>
When I upload it to my host I get the error:

Code: Select all

Warning: Compilation failed: nothing to repeat at offset 39 in .../hello.php on line 34
Line 34 is the line containing

Code: Select all

$news = preg_match_all($exp,$content,$headlines,PREG_SET_ORDER);
When I test it offline I get:

Code: Select all

No water at new $16m PoS wing: h2
:
COURT : h2
:
Kidnapped : h2
:
QRC back on top: h2
:
On my host I have PHP 4.2.2; on my own PC I've got 4.3.1.
I'm puzzled! 8O

I'm probably overlooking a comma or something....
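
A possible reading of the warning, offered as an assumption rather than a confirmed diagnosis: the pattern ends in `\n*+`, where a second quantifier directly follows the first, and the reported offset falls in that trailing part. PCRE releases with possessive-quantifier support read `*+` as a possessive `*`, while older releases reject it with "nothing to repeat", so the behaviour would depend on the PCRE library a given PHP build uses rather than on the script. A sketch of the same pattern without the extra `+`, run on made-up sample input:

```php
<?php
// Made-up sample input (an assumption, not the real page).
$content = "<h2 align=\"left\"><font size=\"3\">QRC back on top</font></h2>\n";

// Same pattern as in the post above, minus the "+" that followed "\n*".
// Single quotes keep the backslashes and the "$" away from PHP's own
// string interpolation, so PCRE sees them exactly as typed.
$exp = '`<(h2)[^>]*><(font)[^>]*>+[\w\s\n\!,\$]+\n*`Sm';

$news = preg_match_all($exp, $content, $headlines, PREG_SET_ORDER);
// "<" rather than "<=" avoids reading one element past the end.
for ($i = 0; $i < count($headlines); $i++) {
    echo strip_tags($headlines[$i][0]) . "\n";
}
```

The `<` loop bound would probably also remove the blank ":" lines in the output above, which look like the result of reading index count($headlines).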

Posted: Thu Oct 02, 2003 11:17 am
by kendall
Derfel Cadarn,

Believe it or not, Derfel, I was doing the match while reading the file and got that error too. What I did instead was read all the contents first and then try to match them:

Code: Select all

while (!feof($fp)){
	$content .= fread($fp,150000);
} 
fclose($fp);
$exp = "`<(h2)[^>]*><(font)[^>]*>+[\w\s\n\!\,\$\:(''\w'')]+\n*+`Sm";
$news = preg_match_all($exp,$content,$headlines);
if ($news) {
	for($i=0;$i<=count($headlines);$i++)
		echo strip_tags($headlines[0][$i])."<br><hr>";
}
This gives me what I want, including the newlines. However, I just can't seem to match the apostrophe. Are you able to?

Kendall
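
One thing worth checking, stated as a guess rather than a diagnosis: the page may not contain a plain ' at all. An apostrophe can also be written as an HTML entity or as a typographic quote, and neither matches ' in a character class. A small sketch, with made-up sample text and a simplified pattern, that shows what actually follows the point where the match stops:

```php
<?php
// Made-up sample: the apostrophe here is written as an HTML entity,
// which is one way a page can contain an "apostrophe" that is not '.
$content = '<h2><font size="3">Minister&#146;s plan</font></h2>';

// Simplified character class that does include a plain apostrophe.
$exp = '`<h2[^>]*><font[^>]*>[\w\s\'!,$]+`';

$next = '';
if (preg_match($exp, $content, $m, PREG_OFFSET_CAPTURE)) {
    // $m[0][0] is the matched text, $m[0][1] its byte offset in $content.
    $end = $m[0][1] + strlen($m[0][0]);
    // Show the bytes right after the match, as text and as hex.
    $next = substr($content, $end, 8);
    echo $next, ' = ', bin2hex($next), "\n";
}
```

If the bytes printed are not a plain ', the character class needs to allow for whatever the page really uses.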

Posted: Fri Oct 03, 2003 4:43 am
by Derfel Cadarn
Hey Kendall, that code works... on my own PC, but not on my host 8O
The complete code now looks like:

Code: Select all

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Test</title>
</head>
<body>
<form action="<?php echo $PHP_SELF; ?>" method=POST>

<?php

$fp = fopen('http://www.trinidadexpress.com','r');

while (!feof($fp)){
   $content .= fread($fp,150000);
}

fclose($fp);

$exp = "`<(h2)[^>]*><(font)[^>]*>+[\w\s\n\!\,\$\''\w'')]+\n*+`Sm";
$news = preg_match_all($exp,$content,$headlines);

if ($news) {
   for($i=0;$i<=count($headlines);$i++)
      echo strip_tags($headlines[0][$i])."<br><hr>";
}


?>
</body>
</html>
I still get the error message:
Warning: Compilation failed: nothing to repeat at offset 47 in /usr/local/psa/home/.../httpdocs/test/hello.php on line 20
Does it work correctly for you? Is it possible there are differences between PHP 4.3.1 and 4.2.2 on this point? I'll have to check php.net for the error messages; perhaps that will help...

Posted: Fri Oct 03, 2003 8:26 am
by kendall
Derfel,

Yeah, it works for me. I think I've got PHP 4.2, except, like I said, it isn't matching the apostrophes ('). I don't think it's a version issue, but it would be very strange if it is. What about you? Are you able to match (') with your expressions?

Kendall

Posted: Fri Oct 03, 2003 8:36 am
by Derfel Cadarn
I haven't tried it with (') yet; I've been searching php.net for an explanation of these different results. Haven't found one though.... :evil:

I'm afraid I can't work on it again until Sunday, because I'll be away for the weekend.... 8)
I'll give it a go then!