
How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:15 pm
by kenvin
I need to extract the article content from a news page.
For example, take the news page http://news.yahoo.com/s/ap/20081116/ap_ ... own_summit.
Once I have the HTML, I need to extract the text content as follows:

Code: Select all

WASHINGTON – World leaders battling a dire and deepening economic crisis vowed Saturday to cooperate more closely, keep a sharper eye out for red-flag problems and give bigger roles to fast-rising nations — but kicked many hard details down the road for their next summit after President-elect Barack Obama takes office.
 
Perhaps as important as the modest concrete steps they took, the leaders of the planet's richest nations — and some of the fastest-developing — made clear their recognition of the world's increasingly interconnected financial architecture and the responsibilities that go along with it.
...
 
Also significant at the summit: the inclusion of a far broader range of countries than the elite, old-guard group that usually holds such summit meetings.

Right now I use grep-style matching on HTML tags such as table, tr, and div.
The result looks like this: http://202.106.63.4/cp/page.php?referur ... own_summit

Can anyone suggest a better method? Thanks very much.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:26 pm
by requinix
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:32 pm
by kenvin
tasairis wrote:
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
Yes, it only uses regex to analyse HTML tags. I need a better way.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:47 pm
by requinix
Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:51 pm
by kenvin
tasairis wrote:Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
It's just an example. I only need a way to extract the article content.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 10:53 pm
by Syntac
Regardless of your intentions, you are not allowed to post that article without permission of The Associated Press.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 11:02 pm
by kenvin
Oh my god.

Right now we are only talking about the technical approach.

That news page is just an example!

Understand?!

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 11:07 pm
by Syntac
If you at least make an effort to obey simple copyright rules, I'll help you.

Re: How to analyse article content from an article page?

Posted: Sat Nov 15, 2008 11:31 pm
by kenvin
Syntac wrote:If you at least make an effort to obey simple copyright rules, I'll help you.
The following are the analysis results for my own blog:
http://202.106.63.4/cp/page.php?referur ... es/10.html

Re: How to analyse article content from an article page?

Posted: Sun Nov 16, 2008 12:44 am
by requinix
If you want something simple to get the text content of the page you can use strip_tags to get rid of all the HTML tags that you don't want.
It means you can keep tags like <h#> and <p> and get rid of all the others.

Code: Select all

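// Fetch the raw HTML (file_get_contents() returns false on failure).
$html = file_get_contents("http://example.com/page.html");
// Keep only headings and paragraphs; strip every other tag.
$text = strip_tags($html, "<h1><h2><h3><p>");

echo $text;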
To get the stuff between <body> tags you can use a regular expression.

Code: Select all

// Anchor the pattern to the whole document so everything outside <body> is dropped too.
$html = preg_replace('/^.*?<body[^>]*>(.*?)<\/body>.*$/is', '$1', $html);
Just remember that whatever you do, if you scrape specific sites then you must follow copyright laws and their terms of use.
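
One caveat with the strip_tags approach above: strip_tags() removes the tags themselves but leaves the text between <script> and <style> tags behind. A minimal sketch of one way around that, assuming a small inline sample page (the $html string here is made up for illustration) and a regex that is good enough for simple pages but not a full HTML parser:

```php
<?php
// Hypothetical sample page; in practice $html would come from file_get_contents().
$html = '<html><head><title>Demo</title><style>p { color: red; }</style></head>'
      . '<body><script>var x = 1;</script><h1>Headline</h1><p>First paragraph.</p></body></html>';

// Drop <script> and <style> elements together with their contents,
// because strip_tags() removes only the tags and would leave the code behind.
$clean = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', '', $html);

// Keep headings and paragraphs, strip everything else.
$text = strip_tags($clean, '<h1><h2><h3><p>');

echo $text, "\n";
```

The <title> text still ends up in the output, since only its tags are stripped, so this is a cleanup step rather than real article extraction.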

Re: How to analyse article content from an article page?

Posted: Sun Nov 16, 2008 12:53 am
by kenvin
I need only the article content, NOT the HTML body.

Please don't use such simple code to show off your ability.

Please read my topic carefully first.

Re: How to analyse article content from an article page?

Posted: Sun Nov 16, 2008 1:16 am
by requinix
kenvin wrote:I need only the article content, NOT the HTML body.

Please don't use such simple code to show off your ability.

Please read my topic carefully first.
You can't do what you want perfectly unless you target specific sites. Y! News does their thing differently than MSNBC, and they both do it differently than CNN's website.

So pick one: a generic solution that works with just about anything, or a specific solution that only works on a handful of sites.
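
As a rough illustration of the generic option, here is a sketch of one common heuristic: find the element whose direct <p> children hold the most text and treat it as the article. This is only an assumption about how a generic extractor might work, not a tested solution for any real site; the function name extractMainText and the sample markup are made up for the example.

```php
<?php
// Hypothetical sample page standing in for a downloaded article.
$html = <<<HTML
<html><body>
<div id="nav"><p>Home</p><p>World</p></div>
<div id="story">
  <p>This is the first long paragraph of the article body, with plenty of words.</p>
  <p>This is the second long paragraph, which continues the same article text.</p>
</div>
<div id="footer"><p>Contact us</p></div>
</body></html>
HTML;

// Returns the paragraphs of the container holding the most paragraph text.
function extractMainText(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scores = [];      // node path => total paragraph text length
    $best = null;      // container with the highest score so far
    $bestScore = 0;
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        $path = $parent->getNodePath();
        $scores[$path] = ($scores[$path] ?? 0) + strlen(trim($p->textContent));
        if ($scores[$path] > $bestScore) {
            $bestScore = $scores[$path];
            $best = $parent;
        }
    }

    $paragraphs = [];
    if ($best !== null) {
        foreach ($best->childNodes as $child) {
            if ($child->nodeName === 'p') {
                $paragraphs[] = trim($child->textContent);
            }
        }
    }
    return $paragraphs;
}

echo implode("\n\n", extractMainText($html)), "\n";
```

A heuristic like this will misfire on pages where comments or link lists outweigh the story, which is the trade-off of going generic.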

Oh, and if you want the page title:

Code: Select all

// [^>]* tolerates attributes; the s flag lets the title span several lines.
preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m);
$m[1] is the page title.
And I told you how to get the <body> content. If that's not the "content" you want then you need to be more specific.
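
For the site-specific option, a sketch along these lines may help. It assumes the target site wraps its story in an element with the class "article-body"; that class name and the sample markup are hypothetical, so check the real page source and adjust the XPath query for each site.

```php
<?php
// Hypothetical sample page; "article-body" is an assumed class name.
$html = '<html><body><div class="sidebar"><p>Related links</p></div>'
      . '<div class="main article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>'
      . '</body></html>';

$doc = new DOMDocument();
// Collect parser warnings from sloppy markup instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// Match <p> elements inside any element whose class list contains "article-body".
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " article-body ")]//p';

$paragraphs = [];
foreach ($xpath->query($query) as $p) {
    $paragraphs[] = trim($p->textContent);
}

echo implode("\n\n", $paragraphs), "\n";
```

Using DOMDocument and DOMXPath instead of regular expressions means nested or badly indented tags do not break the match, at the cost of maintaining one query per site.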