How to analyse article content from an article page?

kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

How to analyse article content from an article page?

Post by kenvin »

I need to extract the article content from a news page.
For example, take the news page http://news.yahoo.com/s/ap/20081116/ap_ ... own_summit.
Once I have the HTML, I need to extract the text content, like this:

Code: Select all

WASHINGTON – World leaders battling a dire and deepening economic crisis vowed Saturday to cooperate more closely, keep a sharper eye out for red-flag problems and give bigger roles to fast-rising nations — but kicked many hard details down the road for their next summit after President-elect Barack Obama takes office.
 
Perhaps as important as the modest concrete steps they took, the leaders of the planet's richest nations — and some of the fastest-developing — made clear their recognition of the world's increasingly interconnected financial architecture and the responsibilities that go along with it.
...
 
Also significant at the summit: the inclusion of a far broader range of countries than the elite, old-guard group that usually holds such summit meetings.

At the moment I use grep to match HTML tags such as table, tr and div.
The result looks like http://202.106.63.4/cp/page.php?referur ... own_summit

Can anyone suggest a better method? Thanks very much.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: how to Analyse article content from a article page?

Post by requinix »

Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

Re: how to Analyse article content from a article page?

Post by kenvin »

tasairis wrote:
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
Yes, it only uses regex to match HTML tags. I need a better way.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: how to Analyse article content from a article page?

Post by requinix »

Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

Re: how to Analyse article content from a article page?

Post by kenvin »

tasairis wrote:Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
It's just an example. I only need a way to extract the article content.
Syntac
Forum Contributor
Posts: 327
Joined: Sun Sep 14, 2008 7:59 pm

Re: how to Analyse article content from a article page?

Post by Syntac »

Regardless of your intentions, you are not allowed to post that article without permission of The Associated Press.
kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

Re: how to Analyse article content from a article page?

Post by kenvin »

Oh my god.

Now,
we
are
only
talking
about
the technical approach.

That news page is just an example!

Understand?!
Syntac
Forum Contributor
Posts: 327
Joined: Sun Sep 14, 2008 7:59 pm

Re: how to Analyse article content from a article page?

Post by Syntac »

If you at least make an effort to obey simple copyright rules, I'll help you.
kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

Re: how to Analyse article content from a article page?

Post by kenvin »

Syntac wrote:If you at least make an effort to obey simple copyright rules, I'll help you.
The following is the analysis result for my own blog:
http://202.106.63.4/cp/page.php?referur ... es/10.html
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: how to Analyse article content from a article page?

Post by requinix »

If you want something simple to get the text content of the page, you can use strip_tags to get rid of all the HTML tags that you don't want.
Its second argument lists the tags to keep, so you can keep tags like <h#> and <p> and get rid of all the others.

Code: Select all

$html = file_get_contents("http://example.com/page.html");
$text = strip_tags($html, "<h1><h2><h3><p>");
 
echo $text;
To get the stuff between <body> tags you can use a regular expression.

Code: Select all

// Keep only what is inside <body>; leave $html unchanged if there is no <body>.
$html = preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m) ? $m[1] : $html;
Just remember that whatever you do, if you scrape specific sites then you must follow copyright laws and their terms of use.
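
One caveat with the strip_tags approach above: it removes the tags but leaves the text inside <script> and <style> blocks. A minimal sketch that drops those blocks first is shown below. The helper name html_to_text() and its default tag list are assumptions made for illustration, not something from this thread.

```php
<?php
// Hypothetical helper (not from the thread): plain text plus a few kept tags.
function html_to_text(string $html, string $keep = '<h1><h2><h3><p>'): string
{
    // Drop <script> and <style> blocks entirely, including their contents.
    // preg_replace() returns null on failure, so fall back to the input.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html) ?? $html;

    // Remove every tag except the ones listed in $keep.
    $text = strip_tags($html, $keep);

    // Collapse runs of blank lines left behind by the removed markup.
    $text = preg_replace('/\n\s*\n+/', "\n\n", $text) ?? $text;

    return trim($text);
}
```

Call it on the fetched page, e.g. html_to_text($html), in place of the bare strip_tags() call.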
kenvin
Forum Newbie
Posts: 6
Joined: Sat Nov 15, 2008 10:05 pm

Re: how to Analyse article content from a article page?

Post by kenvin »

I only need the article content, NOT the HTML body.

Please don't show off with such simple code.

Please read my post carefully first.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: how to Analyse article content from a article page?

Post by requinix »

kenvin wrote:I only need the article content, NOT the HTML body.

Please don't show off with such simple code.

Please read my post carefully first.
You can't do what you want perfectly unless you target specific sites. Y! News does their thing differently than MSNBC, and they both do it differently than CNN's website.

So pick one: a generic solution that works with just about anything, or a specific solution that only works on a handful of sites.
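
A rough sketch of what the generic route could look like: parse the page with DOMDocument, add up the text length of the <p> elements under each parent, and take the parent with the most paragraph text as the article. This is only a heuristic under the assumption that the article is the largest group of sibling paragraphs; guess_article_text() is a made-up name, and pages that put their text in <div> or <br> soup will fool it.

```php
<?php
// Hypothetical helper (not from the thread): guess the main article text.
function guess_article_text(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Sum the text length of <p> elements per parent node.
    $scores = [];
    $parents = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        $key = $parent->getNodePath();
        $parents[$key] = $parent;
        $scores[$key] = ($scores[$key] ?? 0) + strlen(trim($p->textContent));
    }
    if (!$scores) {
        return '';
    }

    // The parent holding the most paragraph text is taken as the article.
    arsort($scores);
    $best = $parents[array_keys($scores)[0]];

    // Collect the text of that parent's direct <p> children.
    $paragraphs = [];
    foreach ($best->childNodes as $child) {
        if ($child->nodeName === 'p') {
            $paragraphs[] = trim($child->textContent);
        }
    }

    return implode("\n\n", $paragraphs);
}
```

It trades accuracy for coverage: no per-site rules, but navigation or comment blocks with a lot of paragraph text can win over the real article.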

Oh, if you want the page title

Code: Select all

preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m);
$m[1] is the page title.
And I told you how to get the <body> content. If that's not the "content" you want then you need to be more specific.
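
For the site-specific route, one possible shape is an XPath query per site that points at the element holding the article. The sketch below assumes you have looked at the page source and found such a container; extract_by_xpath() and the id "article-body" are placeholders for illustration, not the markup of any real site.

```php
<?php
// Hypothetical helper (not from the thread): pull text out with a per-site XPath.
function extract_by_xpath(string $html, string $query): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Keep libxml warnings about sloppy markup out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return ''; // malformed XPath expression
    }

    // Join the text of every matched node with a blank line between them.
    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = trim($node->textContent);
    }

    return implode("\n\n", $parts);
}

// Placeholder rule: paragraphs inside a container with a made-up id.
$query = '//div[@id="article-body"]//p';
```

You would keep one query per site and update it whenever that site changes its layout.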