Approaching large RSS project

steel_rose
Forum Newbie
Posts: 15
Joined: Thu May 13, 2010 9:17 am

Approaching large RSS project

Post by steel_rose »

Hello everybody,

I'm trying to put together an application that will gather information from several major RSS feeds.
Users will be able to retrieve the feeds sorted by subject or source, or through a more complex set of options and filters.

I have serious concerns about the application's performance: the number of sources is relatively high (e.g. 20), the feeds are updated several times a day, and users will need to search the feed content to display only certain news items.

I was wondering...
There are so many scripts for accessing RSS feeds (MagPie, SimplePie, Rss2html). Does anyone have any recommendations? Or should I start from SimpleXML and extend it as I see fit?
Should I save the feeds in a database and access them from there? Would that make display and search operations easier?
Caching... do you think I will need it? How should I approach it?

Thank you very much for any pointers/suggestions that might be given!

SR
mecha_godzilla
Forum Contributor
Posts: 375
Joined: Wed Apr 14, 2010 4:45 pm
Location: UK

Re: Approaching large RSS project

Post by mecha_godzilla »

Some suggestions:

1. Definitely store this information in a database - you'll benefit from whatever search tools your DB offers.

2. Whenever you store a feed, extract the timestamp from it and put it in a separate column in your record; this will make it very easy to find the newest version of the feed and also to expire old ones. You could also generate a hash of the feed - you then only have to insert the feed again if the data has changed, rather than re-inserting it automatically at set intervals even when the feed is exactly the same.

3. Write a cron job or DB trigger to expire old feeds, or use some unique information in the feed to UPDATE the record rather than keep adding newer instances of the same feed and having to expire older ones.

4. Caching would definitely make sense - I'm not sure where the best place to store this would be though :D Again, use things like hashes and timestamps to keep the cache current.

5. I think Magpie is worth looking at initially - I don't really use it but it's useful for learning how cURL and XML parsing work.
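
Points 2 and 3 above can be sketched roughly as follows. This is only an illustration under some assumptions: `$store` is a plain array standing in for a database table, the feed URL is used as the unique key, and `feedHash()`, `feedTimestamp()` and `storeFeed()` are made-up helper names, not part of Magpie or any other library. It also assumes an RSS 2.0 feed with a `lastBuildDate` or `pubDate` in its channel.

```php
<?php
// Sketch only: $store is an in-memory array standing in for a DB table.

/** Return a fingerprint of the raw feed body. */
function feedHash(string $xml): string
{
    return hash('sha256', $xml);
}

/** Return the channel's timestamp as Unix time, or null if absent or unparseable. */
function feedTimestamp(string $xml): ?int
{
    $rss = @simplexml_load_string($xml);
    if ($rss === false || !isset($rss->channel)) {
        return null;
    }
    // Prefer lastBuildDate, fall back to pubDate (both are optional in RSS 2.0).
    $raw = trim((string) $rss->channel->lastBuildDate);
    if ($raw === '') {
        $raw = trim((string) $rss->channel->pubDate);
    }
    if ($raw === '') {
        return null;
    }
    $ts = strtotime($raw);
    return $ts === false ? null : $ts;
}

/**
 * Insert or update the record for one feed, keyed by its URL.
 * Returns true if the record was written, false if the content was unchanged.
 */
function storeFeed(array &$store, string $url, string $xml): bool
{
    $hash = feedHash($xml);
    if (isset($store[$url]) && $store[$url]['hash'] === $hash) {
        return false; // identical content, so skip the write
    }
    $store[$url] = [
        'hash'       => $hash,
        'updated_at' => feedTimestamp($xml) ?? time(), // fall back to "now"
        'xml'        => $xml,
    ];
    return true;
}
```

With a real database the array write would become an UPDATE (or an INSERT for a new source), and the `updated_at` column is what a cron job would look at to expire old records.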

One last point: remember that some feeds are compressed (GZIPped, I think), so you need to take account of this - I wrote a post about this here

viewtopic.php?f=1&t=116738

but (in essence) you just have to send an additional header in your request that specifies which compression types are supported. If you don't do this and the feed is compressed, you'll usually get redirected to an error message, so the problem is fairly easy to spot. If the feed is compressed then you'll obviously need to decompress it as well (you might want to check if Magpie does this already).
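
As a rough sketch of the idea (not the code from the linked post), the request can advertise gzip support with an `Accept-Encoding` header and the response can be decompressed only when it actually looks gzipped. `fetchFeed()` and `decodeFeedBody()` are hypothetical helper names, and the 10 second timeout is an arbitrary choice.

```php
<?php
// Sketch only: fetchFeed() and decodeFeedBody() are hypothetical helpers.

/** Decompress the body if it starts with the gzip magic bytes, else return it unchanged. */
function decodeFeedBody(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

/** Fetch a feed while telling the server that gzip is acceptable. Returns null on failure. */
function fetchFeed(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_HTTPHEADER     => ['Accept-Encoding: gzip'],
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : decodeFeedBody($body);
}
```

Alternatively, setting cURL's `CURLOPT_ENCODING` option to an empty string makes it send an `Accept-Encoding` header listing every encoding it supports and decompress the response itself, in which case the manual step is not needed.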

HTH,

Mecha Godzilla
DaveTheAve
Forum Contributor
Posts: 385
Joined: Tue Oct 03, 2006 2:25 pm
Location: 127.0.0.1

Re: Approaching large RSS project

Post by DaveTheAve »

@ Point #6: That's exactly why I try to program with cURL as much as possible. It is so robust and makes site scraping a breeze. It's especially great for scraping sections of sites that require authentication, as it handles cookies easily.

On an unethical note,
Couldn't have bypassed the new Rapidshare security system without cURL.... it's just that powerful.
steel_rose
Forum Newbie
Posts: 15
Joined: Thu May 13, 2010 9:17 am

Re: Approaching large RSS project

Post by steel_rose »

Thank you very much for all the information; it clarified some doubts and pointed my exploration in new directions.
I have been looking into cURL and the curl_multi_* function family; they seem very powerful and useful.
My project is one little step closer now :D
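
For anyone following along, a minimal sketch of how the curl_multi_* functions could fetch a batch of feeds concurrently might look like this. `fetchAll()` is a made-up helper name, the timeout is an arbitrary choice, and the sketch assumes each URL appears only once in the list since URLs are used as array keys.

```php
<?php
// Sketch only: fetchAll() is a hypothetical helper name.

/**
 * Fetch several URLs concurrently.
 * Returns an array mapping each URL to its body, or to null if the transfer failed.
 */
function fetchAll(array $urls, int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select failed, so back off briefly instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect per-transfer results; anything not reported as OK stays null.
    $results = array_fill_keys(array_keys($handles), null);
    while (($info = curl_multi_info_read($multi)) !== false) {
        $url = array_search($info['handle'], $handles, true);
        if ($url !== false && $info['result'] === CURLE_OK) {
            $results[$url] = curl_multi_getcontent($info['handle']);
        }
    }

    foreach ($handles as $ch) {
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Calling `fetchAll($feedUrls)` from a cron job keeps the slow network work out of the page request, so users only ever read from whatever has already been stored.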

SR
DaveTheAve
Forum Contributor
Posts: 385
Joined: Tue Oct 03, 2006 2:25 pm
Location: 127.0.0.1

Re: Approaching large RSS project

Post by DaveTheAve »

I forgot to mention this earlier, and I apologize, but it's worth noting that cURL is enabled by default on most servers today.