PHP Scrape

majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

PHP Scrape

Post by majorpayne »

I'm looking to see if this can be achieved. I understand I will most likely have to use cURL, but I'm not sure beyond that.

1) Scrape an RSS feed.
2) Check a Joomla database to see whether an article with the same title already exists.
3) If no article is found, post the article.
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas

Re: PHP Scrape

Post by AbraCadaver »

Yes, it can be achieved. There are plenty of PHP snippets out there to read and parse an XML feed. After that, you run a simple SELECT on your DB to see if the title exists, and if not, do an INSERT.
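The select-then-insert step could be sketched roughly like this. It is only an illustration under stated assumptions: the `articles` table with `title` and `body` columns and the `insertIfNew()` helper are made up for the example, and a real Joomla content table has more required columns than this. PDO with an in-memory SQLite database is used purely so the sketch runs on its own; against MySQL only the connection string would change.

```php
<?php
/**
 * Insert the article unless a row with the same title already exists.
 * Returns true when a row was inserted, false when the title was found.
 * NOTE: the `articles` table and its columns are hypothetical.
 */
function insertIfNew(PDO $db, string $title, string $body): bool
{
    // Step 1: look for an existing row with this exact title.
    $check = $db->prepare('SELECT COUNT(*) FROM articles WHERE title = ?');
    $check->execute([$title]);
    if ((int) $check->fetchColumn() > 0) {
        return false; // already posted, nothing to do
    }

    // Step 2: no match, so insert the new article.
    $insert = $db->prepare('INSERT INTO articles (title, body) VALUES (?, ?)');
    $insert->execute([$title, $body]);
    return true;
}

// Demo database: in-memory SQLite, created fresh on every run.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)');

var_dump(insertIfNew($db, 'First post', 'Some text')); // bool(true): inserted
var_dump(insertIfNew($db, 'First post', 'Some text')); // bool(false): title already there
```

Prepared statements are used so that quotes in a feed title cannot break the query. Matching on the exact title is the simplest possible duplicate check; a feed that edits its titles later would slip past it.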
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQL extension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

Thank you for the information. I'll look around, and if I have any questions I'll pipe up.
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match, and I have looked at it, but I'm dumbfounded.
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas

Re: PHP Scrape

Post by AbraCadaver »

majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match, and I have looked at it, but I'm dumbfounded.
I don't know the structure of your XML feed, but:

Code: Select all

$xml = simplexml_load_file('http://www.example.com/feed.xml');
// use var_dump($xml) to see the structure, you may need something like
echo $xml->channel->item[0]->title;
// or you may need to loop through the items, etc...
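Looping through the items could look something like the sketch below. The `rssItems()` helper and the sample feed are made up for illustration, and it assumes an RSS 2.0 layout (`channel` containing `item` elements); the real feed would be loaded from its URL instead of a string.

```php
<?php
/**
 * Return every <item> of an RSS 2.0 document as
 * ['title' => ..., 'description' => ...].
 */
function rssItems(string $rssXml): array
{
    $xml = simplexml_load_string($rssXml);
    if ($xml === false) {
        return []; // not well-formed XML
    }

    $items = [];
    foreach ($xml->channel->item as $item) {
        $items[] = [
            // Casting to string pulls out the text, including CDATA content.
            'title'       => trim((string) $item->title),
            'description' => trim((string) $item->description),
        ];
    }
    return $items;
}

// Made-up sample feed, only here so the sketch runs without a network call.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First title</title>
      <description><![CDATA[First description]]></description>
    </item>
    <item>
      <title>Second title</title>
      <description>Second description</description>
    </item>
  </channel>
</rss>
XML;

print_r(rssItems($sample));
```

With an XML parser doing the work there is no need for preg_match at all; the channel-level <title> and <description> are skipped automatically because only the <item> elements are visited.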
mikosiko
Forum Regular
Posts: 757
Joined: Wed Jan 13, 2010 7:22 pm

Re: PHP Scrape

Post by mikosiko »

majorpayne wrote: Currently I'm able to pull the data into a page, but since it's an RSS feed it's not keeping the formatting. I just want it to grab everything between <title></title> and <description></description>.

I know I need to use preg_match, and I have looked at it, but I'm dumbfounded.
Here is a script that you can use: http://www.scriptol.com/rss/rss-reader.php (download it and look at the examples).
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

Most grateful!

I will plug along and see what I come up with. The source of the RSS feed is in this format:

Code: Select all

<?xml version="1.0" encoding="iso-8859-1"?><?xml-stylesheet type="text/xsl" href="http://feeds.rapidfeeds.com/style/style3.xml"?>
<?xml-stylesheet type="text/css" href="http://feeds.rapidfeeds.com/style/style3.css"?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule" >
  <channel>
	<title>MSRC  Latest MS News</title>
    <link>http://feeds.rapidfeeds.com/?fid4ct=3058</link>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="via" href="feeds.rapidfeeds.com/3058/" type="application/rss+xml"></atom:link>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.rapidfeeds.com/3058/" type="application/rss+xml" />
    <description>

        <![CDATA[blah blah blah info]]>
    </description>
    <pubDate>Thu, 08 Apr 2010 11:24:00 EST</pubDate>
    <lastBuildDate>Thu, 08 Apr 2010 03:51:00 EST</lastBuildDate>
    <docs>http://backend.userland.com/rss</docs>
    <generator>RapidFeeds v0.1 -- http://www.rapidfeeds.com</generator>
    <managingEditor></managingEditor>

    <language>en</language>
<webMaster>Email addy</webMaster>
    <image>      <url>URL for a image i don't care about</url>
      <title>Title of the feed</title>
      <link>Feed Link</link>
      <width>149</width> 
      <height>150</height>

    </image>
    <item>
      <title>name of the title</title>
      <description>bunch of blah blah blah

    </description>
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

Using the code below, I'm able to get the page to show with a title and description. I'm seeing 2 issues:

1) In the description, a new line starts after every period.
2) I'm getting stray characters in the text. I assume these are because of " or ' characters that are in the articles. I would like to have these removed.

Code: Select all

<pre>
<?php
$xml = simplexml_load_file('http://feeds.rapidfeeds.com/3058/');

// use var_dump($xml) or print_r($xml)to see the structure

// Uncomment this next line to test and see for yourself
// print_r($xml);

// To access the data you may need something like
// echo $xml->channel->item[0]->title;  (or the field(s) that you want to use)
// or you may need to loop through the items with something like this

foreach ($xml->channel->item as $value) {

    // Here you can include the code that you want. For example,
    // check whether the title is already in your DB and proceed accordingly.
    echo $value->title . "<br />";
    echo $value->description . "<br />";

}
?>
</pre>
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas

Re: PHP Scrape

Post by AbraCadaver »

I don't know what you mean by #1, but for #2: SimpleXML stores text as UTF-8, but the feed is ISO-8859-1. Try this:

Code: Select all

echo iconv('UTF-8', 'ISO-8859-1', $value->title);
echo iconv('UTF-8', 'ISO-8859-1', $value->description);
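The same conversion can be wrapped in a small helper, sketched below. The `toLatin1()` name is made up for the example, and it assumes the page really is served as ISO-8859-1. If the page can be served as UTF-8 instead (for example with header('Content-Type: text/html; charset=utf-8')), no conversion should be needed at all.

```php
<?php
/**
 * Convert a UTF-8 string (what SimpleXML hands back) to ISO-8859-1.
 * Falls back to the original string if iconv reports a failure.
 */
function toLatin1(string $utf8): string
{
    // @ silences the notice iconv raises for characters it cannot convert.
    $converted = @iconv('UTF-8', 'ISO-8859-1', $utf8);
    return $converted === false ? $utf8 : $converted;
}

// "é" is two bytes in UTF-8 (c3 a9) but a single byte in ISO-8859-1 (e9).
echo bin2hex("\u{E9}"), "\n";           // c3a9
echo bin2hex(toLatin1("\u{E9}")), "\n"; // e9
```

When those two UTF-8 bytes are sent to a browser that expects ISO-8859-1, each byte is shown as its own character, which is the usual source of stray characters around accented letters and curly quotes.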
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. If you look at the scrape that I'm pulling, every sentence ends with a period and then the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas

Re: PHP Scrape

Post by AbraCadaver »

majorpayne wrote: What I mean by issue #1 is that if you look at a paragraph in the normal RSS feed, it looks like a paragraph. If you look at the scrape that I'm pulling, every sentence ends with a period and then the next sentence starts on the next line. There is never more than one sentence on a line, no matter how short.
Remove the <pre> tags?
majorpayne
Forum Newbie
Posts: 7
Joined: Thu May 06, 2010 10:54 am

Re: PHP Scrape

Post by majorpayne »

Without the <pre> tags it becomes a jumbled mess. Well, I think I'm going to hang my hat on this one. I can't seem to get it done the way I need it.
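One possible middle ground between <pre> and no markup at all is to collapse the line breaks inside each description and wrap every item in its own heading and paragraph. The sketch below is only a guess at what would help here: the `renderItem()` helper is made up, and it assumes the title and description are plain UTF-8 text. If the feed embeds HTML in its descriptions, the escaping step would show the tags literally.

```php
<?php
/**
 * Render one feed item as HTML: a heading plus a single paragraph.
 * Line breaks and runs of spaces inside the text are collapsed to one space,
 * so sentences flow together the way they do in a normal paragraph.
 */
function renderItem(string $title, string $description): string
{
    $clean = function (string $text): string {
        // Collapse spaces, tabs, and newlines into single spaces.
        $text = trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
        // Escape &, <, >, and quotes so the text cannot break the page markup.
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    };

    return '<h3>' . $clean($title) . "</h3>\n"
         . '<p>' . $clean($description) . "</p>\n";
}

echo renderItem(
    'name of the title',
    "First sentence.\nSecond sentence.\n   Third sentence.\n"
);
// <h3>name of the title</h3>
// <p>First sentence. Second sentence. Third sentence.</p>
```

Inside the foreach loop this would replace the two echo lines, with the surrounding <pre> tags removed.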