PHP Scrape
Posted: Thu May 06, 2010 10:59 am
by majorpayne
I'm looking to see if this can be achieved. I understand I will most likely have to use cURL, but I'm not sure beyond that.
1) Scrape an RSS feed
2) Check a Joomla database to see if an article with the same title already exists
3) If no article is found, post the article.
Re: PHP Scrape
Posted: Thu May 06, 2010 11:05 am
by AbraCadaver
Yes it can be achieved. There are plenty of PHP snippets out there to read and parse an XML feed. After that you do a simple select on your DB to see if the title exists, and if not do an insert.
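As a rough sketch of that select-then-insert flow (not Joomla-specific): the `articles` table, its columns, the `importNewItems()` helper and the in-memory SQLite database are all made up for illustration, so swap in your real connection and the content table your Joomla install actually uses. It assumes a current PHP version with the PDO SQLite driver available.

```php
<?php
// Sketch only: table and column names are illustrative, not Joomla's real schema.

/**
 * Insert each feed item whose title is not already in the table.
 * Returns the list of titles that were inserted.
 */
function importNewItems(PDO $db, SimpleXMLElement $feed): array
{
    $exists = $db->prepare('SELECT COUNT(*) FROM articles WHERE title = ?');
    $insert = $db->prepare('INSERT INTO articles (title, body) VALUES (?, ?)');
    $added  = [];

    foreach ($feed->channel->item as $item) {
        $title = trim((string) $item->title);
        $body  = trim((string) $item->description);

        // Step 2: does an article with this title already exist?
        $exists->execute([$title]);
        $count = (int) $exists->fetchColumn();
        $exists->closeCursor();

        // Step 3: only insert when nothing was found.
        if ($count === 0) {
            $insert->execute([$title, $body]);
            $added[] = $title;
        }
    }
    return $added;
}

// Demo with an in-memory SQLite database and an inline feed.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)');
$db->exec("INSERT INTO articles (title, body) VALUES ('Old news', 'Already stored')");

$feed = simplexml_load_string(
    '<rss version="2.0"><channel>'
    . '<item><title>Old news</title><description>Already stored</description></item>'
    . '<item><title>Fresh news</title><description>Not stored yet</description></item>'
    . '</channel></rss>'
);

$added = importNewItems($db, $feed);
print_r($added);
```

Prepared statements keep titles containing quotes from breaking the query. Matching on the exact title is the simplest possible check, so a title that changes slightly in the feed would be treated as a new article.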
Re: PHP Scrape
Posted: Thu May 06, 2010 11:28 am
by majorpayne
Thank you for the information. I'll look around, and if I have any questions I'll pipe up.
Re: PHP Scrape
Posted: Thu May 06, 2010 12:51 pm
by majorpayne
Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.
I know I need to use preg_match and I have looked at it, but I'm dumbfounded.
Re: PHP Scrape
Posted: Thu May 06, 2010 3:03 pm
by AbraCadaver
majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.
I know I need to use preg_match and I have looked at it, but I'm dumbfounded.
I don't know the structure of your XML feed, but:
Code: Select all
$xml = simplexml_load_file('http://www.example.com/feed.xml');
// use var_dump($xml) to see the structure, you may need something like
echo $xml->channel->item[0]->title;
// or you may need to loop through the items, etc...
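If it helps, here is a minimal sketch of the looping variant that needs no preg_match at all. The `readItems()` helper and the inline sample feed are invented for the example; it assumes the usual RSS 2.0 layout of `channel` containing `item` elements, so check yours with var_dump first.

```php
<?php
// Hypothetical helper: collect title and description for every <item>.
function readItems(SimpleXMLElement $xml): array
{
    $items = [];
    foreach ($xml->channel->item as $item) {
        $items[] = [
            // Casting to string returns the element text, CDATA included.
            'title'       => trim((string) $item->title),
            'description' => trim((string) $item->description),
        ];
    }
    return $items;
}

// Inline sample so the sketch runs without a network call.
$sample = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First item</title>
      <description><![CDATA[Some <b>text</b>.]]></description>
    </item>
    <item>
      <title>Second item</title>
      <description>More text.</description>
    </item>
  </channel>
</rss>
XML;

$items = readItems(simplexml_load_string($sample));
print_r($items);
```

For a live feed you would pass the result of simplexml_load_file() to the same helper instead of the inline string.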
Re: PHP Scrape
Posted: Thu May 06, 2010 3:58 pm
by mikosiko
majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.
I know I need to use preg_match and I have looked at it, but I'm dumbfounded.
Here is a script that you can use:
http://www.scriptol.com/rss/rss-reader.php
Download it and look at the examples.
Re: PHP Scrape
Posted: Fri May 07, 2010 7:27 am
by majorpayne
Most grateful!
I will plug along and see what I come up with. The source RSS feed is in this format:
Code: Select all
<?xml version="1.0" encoding="iso-8859-1"?><?xml-stylesheet type="text/xsl" href="http://feeds.rapidfeeds.com/style/style3.xml"?>
<?xml-stylesheet type="text/css" href="http://feeds.rapidfeeds.com/style/style3.css"?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule" >
<channel>
<title>MSRC Latest MS News</title>
<link>http://feeds.rapidfeeds.com/?fid4ct=3058</link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="via" href="feeds.rapidfeeds.com/3058/" type="application/rss+xml"></atom:link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.rapidfeeds.com/3058/" type="application/rss+xml" />
<description>
<![CDATA[blah blah blah info]]>
</description>
<pubDate>Thu, 08 Apr 2010 11:24:00 EST</pubDate>
<lastBuildDate>Thu, 08 Apr 2010 03:51:00 EST</lastBuildDate>
<docs>http://backend.userland.com/rss</docs>
<generator>RapidFeeds v0.1 -- http://www.rapidfeeds.com</generator>
<managingEditor></managingEditor>
<language>en</language>
<webMaster>Email addy</webMaster>
<image> <url>URL for a image i don't care about</url>
<title>Title of the feed</title>
<link>Feed Link</link>
<width>149</width>
<height>150</height>
</image>
<item>
<title>name of the title</title>
<description>bunch of blah blah blah
</description>
Re: PHP Scrape
Posted: Mon May 10, 2010 11:27 am
by majorpayne
Using the code below I'm able to get the page to show a title and description. I'm seeing 2 issues:
1) In the description, a new line starts after every period.
2) I'm getting stray characters appearing in the text. I assume these are because of " or ' characters in the articles. I would like to have these removed.
Code: Select all
<pre>
<?php
$xml = simplexml_load_file('http://feeds.rapidfeeds.com/3058/');
// Use var_dump($xml) or print_r($xml) to see the structure.
// Uncomment this next line to test and see for yourself:
// print_r($xml);
// To access the data you may need something like
// echo $xml->channel->item[0]->title; (or the field(s) that you want to use)
// or you may need to loop through the items with something like this:
foreach ($xml->channel->item as $value) {
    // Here you can include the code that you want, for example
    // check whether the title is already in your DB and proceed accordingly.
    echo $value->title . "<br />";
    echo $value->description . "<br />";
}
?>
</pre>
Re: PHP Scrape
Posted: Mon May 10, 2010 12:42 pm
by AbraCadaver
I don't know what you mean by #1, but for #2: SimpleXML stores text as UTF-8, while the feed is ISO-8859-1. Try this:
Code: Select all
echo iconv('UTF-8', 'ISO-8859-1', $value->title);
echo iconv('UTF-8', 'ISO-8859-1', $value->description);
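To show what that conversion does at the byte level, here is a small self-contained sketch. The inline feed and the title "Café" are made up for the example; the point is only that the same character is two bytes in UTF-8 and one byte in ISO-8859-1, which is why UTF-8 output on an ISO-8859-1 page shows extra characters.

```php
<?php
// "\xC3\xA9" is the UTF-8 byte sequence for "é". SimpleXML hands text back as UTF-8.
$feed = simplexml_load_string(
    "<rss><channel><item><title>Caf\xC3\xA9</title></item></channel></rss>"
);
$utf8Title = (string) $feed->channel->item[0]->title;

// Option 1: convert to ISO-8859-1 for a page that is served in that charset.
$latin1Title = iconv('UTF-8', 'ISO-8859-1', $utf8Title);

// Byte lengths differ: "é" takes two bytes in UTF-8 but one in ISO-8859-1.
echo strlen($utf8Title), ' bytes as UTF-8, ', strlen($latin1Title), " bytes as ISO-8859-1\n";

// Option 2 (an alternative, not shown running here): leave the text as UTF-8
// and declare the page as UTF-8 instead, e.g. by sending
//   header('Content-Type: text/html; charset=utf-8');
// before any output.
```

Note that iconv() can fail on characters that have no ISO-8859-1 equivalent (curly quotes are a common case), so declaring the page as UTF-8 may turn out to be the more robust route.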
Re: PHP Scrape
Posted: Wed May 12, 2010 3:41 pm
by majorpayne
What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. If you look at the scrape that I'm pulling, every sentence ends with a period and then the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.
Re: PHP Scrape
Posted: Wed May 12, 2010 4:00 pm
by AbraCadaver
majorpayne wrote: What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. If you look at the scrape that I'm pulling, every sentence ends with a period and then the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.
Remove the <pre> tags?
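Inside <pre> the browser shows every newline in the text literally, and outside it all whitespace collapses into single spaces. If the description really has a line break after each sentence, one option is to rebuild the paragraphs yourself. This is only a sketch under assumptions I can't verify from here: that the description is plain text, that a blank line separates paragraphs, and that single newlines are just wrapping. The `descriptionToHtml()` helper is made up for the example.

```php
<?php
// Hypothetical helper: turn feed text with hard line breaks into HTML paragraphs.
// Assumes a blank line separates paragraphs and single newlines are just wrapping.
function descriptionToHtml(string $text): string
{
    // Normalise Windows/old-Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));

    $html = '';
    // Split on blank lines to find the paragraphs.
    foreach (preg_split('/\n\s*\n/', $text) as $paragraph) {
        // Collapse any run of whitespace (including single newlines) to one space.
        $paragraph = trim(preg_replace('/\s+/', ' ', $paragraph));
        if ($paragraph !== '') {
            // Escape the text because it is treated as plain text here.
            $html .= '<p>' . htmlspecialchars($paragraph, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
    }
    return $html;
}

$raw = "First sentence.\nSecond sentence.\n\nNew paragraph here.\nStill the same paragraph.";
echo descriptionToHtml($raw);
```

If the descriptions contain HTML markup of their own, the htmlspecialchars() call would display the tags as text, so you would need to drop or adjust that step.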
Re: PHP Scrape
Posted: Wed May 19, 2010 1:28 pm
by majorpayne
Without the pre tags it becomes a jumbled mess. Well, I think I'm going to hang my hat on this one. I can't seem to get it done the way I need it.