PHP Scrape

Posted: Thu May 06, 2010 10:59 am
by majorpayne
I'm looking to see if this can be achieved. I understand I will most likely have to use cURL, but I'm not sure beyond that.

1) Scrape an RSS feed.
2) Check a Joomla database to see whether an article with the same title already exists.
3) If no article is found, post the article.

Re: PHP Scrape

Posted: Thu May 06, 2010 11:05 am
by AbraCadaver
Yes, it can be achieved. There are plenty of PHP snippets out there to read and parse an XML feed. After that, you do a simple SELECT on your DB to see if the title exists, and if not, do an INSERT.
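
A minimal sketch of that select-then-insert step, assuming PDO and a hypothetical `articles` table with `title` and `body` columns (a real Joomla install has its own content table with more required columns, so the queries would need adjusting). An in-memory SQLite database stands in for the real DB here so the example runs on its own, and the `insert_if_new` name is made up for illustration:

```php
<?php
// Hypothetical helper: insert an article only if no row with the same title exists.
// Returns true if a row was inserted, false if the title was already present.
function insert_if_new(PDO $db, string $title, string $body): bool
{
    $check = $db->prepare('SELECT COUNT(*) FROM articles WHERE title = ?');
    $check->execute([$title]);
    if ((int) $check->fetchColumn() > 0) {
        return false; // already stored, skip it
    }

    $insert = $db->prepare('INSERT INTO articles (title, body) VALUES (?, ?)');
    $insert->execute([$title, $body]);
    return true;
}

// Demo against an in-memory SQLite database (a stand-in for the real DB).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)');

$first  = insert_if_new($db, 'Example title', 'Example body'); // inserted
$second = insert_if_new($db, 'Example title', 'Example body'); // skipped, duplicate title
$count  = (int) $db->query('SELECT COUNT(*) FROM articles')->fetchColumn();
```

Prepared statements are used so that titles containing quotes do not break the query.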

Re: PHP Scrape

Posted: Thu May 06, 2010 11:28 am
by majorpayne
Thank you for the information. I'll look around, and if I have any questions I'll pipe up.

Re: PHP Scrape

Posted: Thu May 06, 2010 12:51 pm
by majorpayne
Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match and I have looked at it, but I'm dumbfounded.

Re: PHP Scrape

Posted: Thu May 06, 2010 3:03 pm
by AbraCadaver
majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match and I have looked at it, but I'm dumbfounded.
I don't know the structure of your XML feed, but:

Code:

$xml = simplexml_load_file('http://www.example.com/feed.xml');
// use var_dump($xml) to see the structure, you may need something like
echo $xml->channel->item[0]->title;
// or you may need to loop through the items, etc...
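
A rough sketch of the looping variant, assuming an RSS 2.0 layout with `channel` and `item` elements. The feed is inlined (made-up sample data) so it runs without network access; with a real feed you would go back to `simplexml_load_file()` and its URL:

```php
<?php
// Inline sample feed so the sketch runs without network access.
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First item</title>
      <description><![CDATA[First description]]></description>
    </item>
    <item>
      <title>Second item</title>
      <description>Second description</description>
    </item>
  </channel>
</rss>
XML;

// LIBXML_NOCDATA merges CDATA sections into plain text nodes,
// so they also show up in print_r()/var_dump() output.
$xml = simplexml_load_string($feed, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($xml === false) {
    exit("Could not parse the feed\n");
}

$items = [];
foreach ($xml->channel->item as $item) {
    // Cast to string: each property is a SimpleXMLElement, not plain text.
    $items[] = [
        'title'       => trim((string) $item->title),
        'description' => trim((string) $item->description),
    ];
}

print_r($items);
```

Collecting the values into a plain array first makes the later steps (comparing titles, inserting rows) independent of SimpleXML.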

Re: PHP Scrape

Posted: Thu May 06, 2010 3:58 pm
by mikosiko
majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match and I have looked at it, but I'm dumbfounded.
Here is a script that you can use: http://www.scriptol.com/rss/rss-reader.php. Download it and look at the examples.

Re: PHP Scrape

Posted: Fri May 07, 2010 7:27 am
by majorpayne
Most grateful!

I will plug along and see what I come up with. The source RSS feed is in this format:

Code:

<?xml version="1.0" encoding="iso-8859-1"?><?xml-stylesheet type="text/xsl" href="http://feeds.rapidfeeds.com/style/style3.xml"?>
<?xml-stylesheet type="text/css" href="http://feeds.rapidfeeds.com/style/style3.css"?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule" >
  <channel>
	<title>MSRC  Latest MS News</title>
    <link>http://feeds.rapidfeeds.com/?fid4ct=3058</link>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="via" href="feeds.rapidfeeds.com/3058/" type="application/rss+xml"></atom:link>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.rapidfeeds.com/3058/" type="application/rss+xml" />
    <description>

        <![CDATA[blah blah blah info]]>
    </description>
    <pubDate>Thu, 08 Apr 2010 11:24:00 EST</pubDate>
    <lastBuildDate>Thu, 08 Apr 2010 03:51:00 EST</lastBuildDate>
    <docs>http://backend.userland.com/rss</docs>
    <generator>RapidFeeds v0.1 -- http://www.rapidfeeds.com</generator>
    <managingEditor></managingEditor>

    <language>en</language>
    <webMaster>Email addy</webMaster>
    <image>
      <url>URL for a image i don't care about</url>
      <title>Title of the feed</title>
      <link>Feed Link</link>
      <width>149</width> 
      <height>150</height>

    </image>
    <item>
      <title>name of the title</title>
      <description>bunch of blah blah blah

    </description>

Re: PHP Scrape

Posted: Mon May 10, 2010 11:27 am
by majorpayne
Using the code below, I'm able to get the page to show a title and description. I'm seeing two issues:

1) In the description, a new line starts after every period.
2) I'm getting stray characters appearing in the text. I assume these are because of the " or ' characters in the articles. I would like to have these removed.

Code:

<pre>
<?php
$xml = simplexml_load_file('http://feeds.rapidfeeds.com/3058/');

// use var_dump($xml) or print_r($xml) to see the structure

// Uncomment this next line to test and see for yourself
// print_r($xml);

// To access the data you may need something like
// echo $xml->channel->item[0]->title;  (or the field(s) that you want to use)
// or you may need to loop through the items with something like this

foreach ($xml->channel->item as $value){
       
        // Here you can include the code that you want, for example:
        // check whether the title is already in your DB and proceed accordingly
    echo $value->title . "<br />";
    echo $value->description . "<br />";

}
?>
</pre>

Re: PHP Scrape

Posted: Mon May 10, 2010 12:42 pm
by AbraCadaver
I don't know what you mean by #1, but for #2: SimpleXML returns its strings as UTF-8, while the feed is ISO-8859-1. Try this:

Code:

// Cast explicitly: these properties are SimpleXMLElement objects, not strings
echo iconv('UTF-8', 'ISO-8859-1', (string) $value->title);
echo iconv('UTF-8', 'ISO-8859-1', (string) $value->description);
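
An alternative sketch, assuming you control the page that displays the text: since SimpleXML hands back UTF-8, the page can be served as UTF-8 and the strings left unconverted. That also sidesteps characters such as curly quotes, which have no ISO-8859-1 equivalent. The `render_item` helper name and the sample strings are made up for illustration:

```php
<?php
// Hypothetical helper: build an HTML fragment from UTF-8 strings.
function render_item(string $title, string $description): string
{
    // htmlspecialchars() escapes <, >, & and quotes; the charset argument
    // tells it the input is UTF-8 so multibyte characters pass through intact.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $d = htmlspecialchars($description, ENT_QUOTES, 'UTF-8');
    return "<h3>{$t}</h3>\n<p>{$d}</p>\n";
}

// Declare the encoding before any output so the browser decodes it as UTF-8.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// "\u{2019}" is a right single curly quote, written as an escape for clarity.
$html = render_item("It\u{2019}s a title", 'Tom & Jerry');
echo $html;
```

If the text is later stored in a database, the connection and table charset would need to agree with whichever encoding you settle on.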

Re: PHP Scrape

Posted: Wed May 12, 2010 3:41 pm
by majorpayne
What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. In the scrape that I'm pulling, every sentence ends with a period and the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.

Re: PHP Scrape

Posted: Wed May 12, 2010 4:00 pm
by AbraCadaver
majorpayne wrote: What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. In the scrape that I'm pulling, every sentence ends with a period and the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.
Remove the <pre> tags?
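
If dropping `<pre>` on its own runs everything together, one possibility (an assumption, since it depends on how the feed text is wrapped) is that the description carries a hard line break after each sentence, which `<pre>` displays literally. A rough sketch that joins single line breaks into spaces and treats blank lines as paragraph breaks; the `description_to_html` name is hypothetical:

```php
<?php
function description_to_html(string $text): string
{
    // Normalise Windows/Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));

    // Split on blank lines: each chunk becomes one paragraph.
    $paragraphs = preg_split('/\n\s*\n/', $text);

    $html = '';
    foreach ($paragraphs as $p) {
        // Collapse remaining line breaks and runs of whitespace into single spaces.
        $p = trim(preg_replace('/\s+/', ' ', $p));
        if ($p !== '') {
            $html .= '<p>' . htmlspecialchars($p, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
    }
    return $html;
}

$out = description_to_html("First sentence.\nSecond sentence.\n\nNew paragraph.");
echo $out;
```

Inside the loop from the earlier post, this would replace the plain `echo $value->description` line, with the value cast to a string first.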

Re: PHP Scrape

Posted: Wed May 19, 2010 1:28 pm
by majorpayne
Without the pre tags it becomes a jumbled mess. Well, I think I'm going to hang my hat on this one. I can't seem to get it done the way I need it.