[Solved] How do I parse out HTML body text?

Posted: Wed Mar 31, 2004 10:43 am
by chinagirl
I have a fairly simple HTML file (myfile.html). It looks something like this:
<html>
<head>
<meta...>
<link...>
<title>...</title>
<style>...</style>
</head>
<body>
<h>ABC</h>
<p>XYZ</P>
</body>
</html>

I read it in my PHP code like this:
<?php
$file = 'myfile.html';
// Open the file read-only and load its entire contents into a string.
$fp = fopen($file, 'r');
$contents = fread($fp, filesize($file));
fclose($fp);
?>

But instead of reading the entire file, I only want the portion inside the HTML <body>...</body>. Furthermore, I want to separate the text in <h>...</h> from the text in <p>...</p>.

Can anyone provide an example of how to do this? Thanks much.

Posted: Wed Mar 31, 2004 11:11 am
by kettle_drum
Just read the whole file, and then keep parsing it. Say explode('<body>', $contents); or something similar, and keep parsing until you have what you want.

Posted: Wed Mar 31, 2004 11:15 am
by Illusionist
Exploding on the <body> tag will do nothing but split the file into two parts, which is not very helpful. It would be better to use regular expressions, or just use substr(), strpos() and other string functions to parse through the file and get what you want.

I would recommend researching regular expressions though, as they help a lot!
If I get time later, I'll see if I can get some regexps working for you.
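
The string-function and regular-expression approach suggested above might look roughly like the sketch below. The helper names extract_body() and extract_tag_text() are made up for illustration and do not come from the thread, and the sample markup is inlined so the snippet runs on its own. Regular expressions like this only cope with simple, non-nested markup.

```php
<?php
// Hypothetical helper: returns everything between <body ...> and </body>,
// or null if either tag is missing.
function extract_body($html)
{
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }
    // Skip to the end of the opening tag, which may carry attributes.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++;
    $end = stripos($html, '</body>', $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Hypothetical helper: collects the text inside every <tag>...</tag> pair.
// Case-insensitive, and "." also matches newlines (the "i" and "s" flags).
function extract_tag_text($html, $tag)
{
    $name = preg_quote($tag, '#');
    $pattern = '#<' . $name . '\b[^>]*>(.*?)</' . $name . '\s*>#is';
    preg_match_all($pattern, $html, $matches);
    return array_map('trim', $matches[1]);
}

// Sample input modeled on the file in the question.
$contents = "<html>\n<head>\n<title>t</title>\n</head>\n"
          . "<body>\n<h>ABC</h>\n<p>XYZ</P>\n</body>\n</html>";

$body       = extract_body($contents);
$headings   = extract_tag_text($body, 'h');
$paragraphs = extract_tag_text($body, 'p');

print_r($headings);
print_r($paragraphs);
```

To run it against the real file, $contents could be filled with file_get_contents('myfile.html') instead of the inline string.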

thanks for the tip

Posted: Thu Apr 01, 2004 8:05 am
by chinagirl
explode did not work well, and neither did any single expression. I used a combination of fgets, stristr and eregi. It kind of worked, but it is still not dynamic enough for me. I guess I will do some more research. Thank you for your reply.

Posted: Thu Apr 01, 2004 8:22 am
by patrikG
Sounds very much as if you'd want to be parsing HTML as an instance of XML.

Have a look at http://sourceforge.net/projects/php-html/
http://sourceforge.net/projects/php-html/ wrote: Object oriented PHP based HTML parser. The HtmlParser class allows you to iterate through HTML nodes and get their attributes, names and values. It also comes with an example class for converting HTML to formatted ASCII text.
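
To illustrate the general idea of walking HTML as a node tree, here is a minimal sketch. It does not use the HtmlParser class from the project above, whose API is not shown in this thread. It assumes PHP's bundled DOM extension (DOMDocument, PHP 5 and later) is available. The helper name texts_of() is made up, and the sample uses the standard <h1> tag in place of the <h> from the question.

```php
<?php
// Sample input modeled on the file in the question, with <h1> standing in
// for the non-standard <h> tag.
$html = "<html><head><title>t</title></head>"
      . "<body><h1>ABC</h1><p>XYZ</P></body></html>";

// Collect parser complaints internally instead of emitting PHP warnings;
// loadHTML() tolerates sloppy markup such as the mismatched-case </P>.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Restrict the search to the <body> element.
$body = $doc->getElementsByTagName('body')->item(0);

// Hypothetical helper: text content of every <$tag> element under $root.
function texts_of(DOMElement $root, $tag)
{
    $out = array();
    foreach ($root->getElementsByTagName($tag) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

$headings   = texts_of($body, 'h1');
$paragraphs = texts_of($body, 'p');

print_r($headings);
print_r($paragraphs);
```

Because the parser builds a tree, nested tags and attributes are handled without hand-written patterns, which is the main advantage over string splitting for markup that varies.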

Posted: Fri Apr 02, 2004 5:53 pm
by chinagirl
That parser worked. Thanks Patrick.