how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:15 pm
by kenvin
I need to extract the article content from a news page.
For example, take this news page:
http://news.yahoo.com/s/ap/20081116/ap_ ... own_summit
Once I have the HTML, I need to extract the text content, as follows:
Code:
WASHINGTON – World leaders battling a dire and deepening economic crisis vowed Saturday to cooperate more closely, keep a sharper eye out for red-flag problems and give bigger roles to fast-rising nations — but kicked many hard details down the road for their next summit after President-elect Barack Obama takes office.
Perhaps as important as the modest concrete steps they took, the leaders of the planet's richest nations — and some of the fastest-developing — made clear their recognition of the world's increasingly interconnected financial architecture and the responsibilities that go along with it.
...
Also significant at the summit: the inclusion of a far broader range of countries than the elite, old-guard group that usually holds such summit meetings.
Right now I use grep to analyse HTML tags such as table, tr and div.
The result looks like this:
http://202.106.63.4/cp/page.php?referur ... own_summit
Can anyone suggest some better methods? Thanks very much.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:26 pm
by requinix
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:32 pm
by kenvin
tasairis wrote:Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
Yes, it only uses regex to analyse HTML tags. I need a better way.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:47 pm
by requinix
Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:51 pm
by kenvin
tasairis wrote:Maybe you didn't understand the first time?
Copyright © 2008 The Associated Press. All rights reserved. The information contained in the AP News report may not be published, broadcast, rewritten or redistributed without the prior written authority of The Associated Press.
It's just an example. I only need a way to Analyse the article content.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 10:53 pm
by Syntac
Regardless of your intentions, you are not allowed to post that article without permission of The Associated Press.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 11:02 pm
by kenvin
Oh my god.
Right now we are only talking about the technical approach. That news page is just an example!
Understand?!
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 11:07 pm
by Syntac
If you at least make an effort to obey simple copyright rules, I'll help you.
Re: how to Analyse article content from a article page?
Posted: Sat Nov 15, 2008 11:31 pm
by kenvin
Syntac wrote:If you at least make an effort to obey simple copyright rules, I'll help you.
The following are the analysis results for my own blog:
http://202.106.63.4/cp/page.php?referur ... es/10.html
Re: how to Analyse article content from a article page?
Posted: Sun Nov 16, 2008 12:44 am
by requinix
If you want something simple to get the text content of the page, you can use strip_tags to get rid of all the HTML tags that you don't want.
Its second argument lists the tags to keep, so you can keep tags like <h#> and <p> and get rid of all the others.
Code:
$html = file_get_contents("http://example.com/page.html");
$text = strip_tags($html, "<h1><h2><h3><p>");
echo $text;
To get the stuff between <body> tags you can use a regular expression.
Code:
// Capture what is between the <body> tags (preg_replace would leave <head> etc. in place).
$html = preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m) ? $m[1] : $html;
Just remember that whatever you do, if you scrape specific sites then you must follow copyright laws and their terms of use.
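One caveat about the strip_tags approach above: strip_tags removes the tags themselves but keeps the text between them, so the contents of <script> and <style> blocks end up in the output. A minimal sketch of one way around that, assuming a hypothetical helper named html_to_text (not something from this thread) and a made-up sample page:

```php
<?php
// Hypothetical helper (not from the thread): strip <script>/<style> blocks
// first, because strip_tags() removes tags but keeps the text inside them.
function html_to_text(string $html, string $keep = '<h1><h2><h3><p>'): string
{
    // Drop whole script and style elements, including their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Remove every remaining tag except the ones listed in $keep.
    $text = strip_tags($html, $keep);
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page, for illustration only.
$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><div><h1>Title</h1><p>Hello <b>world</b></p>'
        . '<script>alert(1)</script></div></body></html>';

echo html_to_text($sample), "\n";
```

This is still a regex-based cleanup, so it shares the usual limits of parsing HTML with regular expressions; it is meant only to show the order of the two steps.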
Re: how to Analyse article content from a article page?
Posted: Sun Nov 16, 2008 12:53 am
by kenvin
I only need the article content, NOT the HTML body.
Please don't show off with such simple code.
Please read my topic carefully first.
Re: how to Analyse article content from a article page?
Posted: Sun Nov 16, 2008 1:16 am
by requinix
kenvin wrote:I only need the article content, NOT the HTML body.
Please don't show off with such simple code.
Please read my topic carefully first.
You can't do what you want perfectly unless you target specific sites. Y! News does their thing differently than MSNBC, and they both do it differently than CNN's website.
So pick one: a generic solution that works with just about anything, or a specific solution that only works on a handful of sites.
Oh, if you want the page title
Code:
// [^>]* allows attributes on <title>; the s flag lets the title span several lines.
preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m);
$m[1] is the page title.
And I told you how to get the <body> content. If that's not the "content" you want then you need to be more specific.
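For the "generic solution" side of that trade-off, one possible heuristic is to parse the page with DOMDocument and pick the container whose direct <p> children hold the most text. The sketch below is only an illustration of that idea: the function name guess_article_text and the sample markup are assumptions, not something from this thread, and the heuristic will guess wrong on pages that do not put the article in <p> tags.

```php
<?php
// Hypothetical heuristic (not from the thread): treat the element whose
// direct <p> children contain the most text as the article container.
function guess_article_text(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $best    = [];
    $bestLen = 0;

    // Every element with at least one <p> as a direct child is a candidate.
    foreach ($xpath->query('//*[p]') as $container) {
        $paras = [];
        $len   = 0;
        foreach ($xpath->query('p', $container) as $p) {
            $t = trim(preg_replace('/\s+/', ' ', $p->textContent));
            if ($t !== '') {
                $paras[] = $t;
                $len    += strlen($t);
            }
        }
        // Keep the candidate with the largest total paragraph length.
        if ($len > $bestLen) {
            $bestLen = $len;
            $best    = $paras;
        }
    }
    return implode("\n\n", $best);
}

// Made-up sample page: a short nav block, a longer story block, a footer.
$sample = '<html><body>'
        . '<div id="nav"><p>Home</p><p>World</p></div>'
        . '<div id="story"><p>First paragraph of the story.</p>'
        . '<p>Second paragraph of the story.</p></div>'
        . '<div id="footer"><p>Contact us</p></div>'
        . '</body></html>';

echo guess_article_text($sample), "\n";
```

A site-specific solution would replace the '//*[p]' query with an XPath expression aimed at that site's own article container, which is more accurate but breaks whenever the site changes its markup.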