RSS to JSON

Posted: Mon Sep 22, 2008 5:31 pm
by VirtuosiMedia
I'm working on an RSS reader and I'm running into a little snag, so I thought I'd ask what people think the best approach might be. I want to convert the feed to JSON before I hand it over to the JavaScript, so I'm using the SimpleXML and JSON functions. It's working okay, except that it also converts any embedded HTML into JSON objects, which makes things hard for the JavaScript class I wrote for it. What I'm thinking about doing is entity-encoding everything inside the (RSS) description tag, but I'm not sure of the best way to do that, or whether that's the best solution at all, because I might run into problems if the feed is about HTML or JavaScript. Any suggestions?

Here's my code so far:

$url = urldecode($_POST['url']);
$rawFeed = file_get_contents($url);
$rawFeed = preg_replace('#<!\[CDATA\[(.*)\]\]>#', '', $rawFeed);
$feed = simplexml_load_string($rawFeed);
echo json_encode($feed);
On a side note, should I be stripping script tags as well? Also, if you have any further suggestions for this, I'd also like to hear them. Thanks.

Re: RSS to JSON

Posted: Mon Sep 22, 2008 5:44 pm
by josh
I'd start by pinpointing where the logic is actually breaking down. I didn't reproduce this locally, so this is just a hunch,

but the problem has to be either

SimpleXML is interpreting the HTML as XML, in which case you need better feed parsing

or

the JSON encoding step is choking on the HTML. I don't know JSON in enough technical detail, but I believe a string of HTML is perfectly legal. If it weren't, an easy way around that would be to base64-encode the string before JSON-encoding your view
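
For what it's worth, the second hunch is easy to check in isolation. The snippet below is only a rough sketch with a made-up description value (the `$html` string and the `description` key are assumptions for illustration, not taken from any real feed). It shows json_encode() carrying markup as an ordinary string, with no base64 step needed.

```php
<?php
// Hypothetical sample value standing in for a feed item's description.
$html = '<p>Hello <a href="http://example.com/">world</a></p>';

// json_encode() escapes the double quotes (and, by default, the forward
// slashes), so the markup travels as a plain JSON string.
$json = json_encode(array('description' => $html));
echo $json, "\n";

// Decoding gives back the original markup unchanged.
$back = json_decode($json, true);
var_dump($back['description'] === $html);
```

If that round trip holds, the JSON step is not what turns the HTML into objects, which points back at the XML parsing.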

...

HTML filtering is another topic; personally, I use HTML Purifier to handle that for me

Re: RSS to JSON

Posted: Mon Sep 22, 2008 5:55 pm
by VirtuosiMedia
The problem is that SimpleXML is interpreting the HTML as XML, which I guess it should. I'm just trying to figure out the best way to stop it, because I don't want the HTML nodes to be converted into parts of the eventual JSON object; I want them as a string instead. However, I do want to convert everything else into the object.

Re: RSS to JSON

Posted: Mon Sep 22, 2008 6:05 pm
by josh
Did you escape the HTML with CDATA tags? Why are you replacing the CDATA tags? If you escape it with CDATA, it should be interpreted literally instead of being parsed by the DOM engine.

Re: RSS to JSON

Posted: Mon Sep 22, 2008 6:14 pm
by VirtuosiMedia
The reason I'm replacing them is that the CDATA section becomes an object when I use the json_encode function. I don't want it, or any of the HTML tags it encloses, to become an object; I want them to remain a string. Plus, the CDATA part is useless once the JavaScript gets it anyway.

Re: RSS to JSON

Posted: Mon Sep 22, 2008 6:40 pm
by josh
Remove it after you parse the XML, before you construct your JSON object.
the manual wrote: In an XML document or external parsed entity, a CDATA section is a section of element content that is marked for the parser to interpret as only character data, not markup.
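
One way to act on this without touching the raw feed text is libxml's LIBXML_NOCDATA option, which asks the parser to merge CDATA sections into ordinary text nodes. This is a minimal sketch against a made-up feed (the `$rawFeed` contents are an assumption for illustration), not a tested drop-in for every real-world feed.

```php
<?php
// Hypothetical minimal feed; a real one would come from file_get_contents().
$rawFeed = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description><![CDATA[<p>Some <b>HTML</b> here</p>]]></description>
    </item>
  </channel>
</rss>
XML;

// LIBXML_NOCDATA merges CDATA sections into plain text nodes while parsing,
// so the description is seen as character data rather than as child elements.
$feed = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);

// Round-trip through JSON to inspect what the JavaScript side would receive.
$data = json_decode(json_encode($feed), true);
echo $data['channel']['item']['description'], "\n";
```

With the CDATA left in place, the parser never treats the description's HTML as markup, so there is nothing to strip afterwards. Feeds that embed raw, non-CDATA markup in the description would still need separate handling.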

Re: RSS to JSON

Posted: Mon Sep 22, 2008 7:01 pm
by VirtuosiMedia
jshpro2 wrote:remove it after you parse the xml, before you construct your json object
The CDATA part hasn't been the problem at all; it's been everything else. I've updated my code a little, so I'll post it here.

$url = urldecode($_POST['url']);
$rawFeed = file_get_contents('http://feeds.feedburner.com/SmashingMagazine'); //This url is just for testing
$rawFeed = preg_replace('#<!\[CDATA\[#', '', $rawFeed);
$rawFeed = preg_replace('#\]\]\>\</#', '</', $rawFeed);
$pattern[0] = "/\<description\>(.*?)\<\/description\>/is";
$replace[0] = '<description>'.htmlentities("$1").'</description>'; //I'd like to entity encode all of the html data within the description tags, but I'm having some problems with that
$rawFeed = preg_replace($pattern, $replace, $rawFeed);
//$feed = simplexml_load_string($rawFeed);
//echo json_encode($feed);
echo $rawFeed; //Just for testing

Re: RSS to JSON

Posted: Mon Sep 22, 2008 7:28 pm
by josh
VirtuosiMedia wrote:The cdata part hasn't been the problem at all
You said the issue was that SimpleXML is interpreting your markup as part of the XML and not as a literal string. The solution to that is CDATA, and the code you posted defeats CDATA.
VirtuosiMedia wrote: $replace[0] = '<description>'.htmlentities("$1").'</description>'; //I'd like to entity encode all of the html data within the description tags, but I'm having some problems with that
htmlentities is the proper function, but what is "$1"? You also haven't mentioned what problems the entity encoding is giving you. You asked what the best approach might be, so I was telling you. HTML entity encoding is entirely different from your first problem, which, as you said, is that SimpleXML is interpreting your data wrong.

Re: RSS to JSON

Posted: Tue Sep 23, 2008 12:20 am
by VirtuosiMedia
I'm using this code to parse an RSS feed, and I would like to transform the RSS tags into JSON, with each description tag's value as a literal string. Many RSS feeds already use CDATA so that they can format their feed with HTML. However, when I use the SimpleXML functions, they also translate the CDATA and the HTML into JSON, when I need the HTML as a string, not parsed. The CDATA is unnecessary for my purposes, so I'm stripping it, and that was never an issue. The issue is that I also need to make sure that the HTML, because it is also XML, isn't translated into JSON. What I need to know is the best way to go about doing that. That's the question I've been asking since the beginning, though maybe not as clearly as I thought. At this point, I'm thinking that using htmlentities will prevent the HTML from being converted to JSON, and once the feed has been converted, I can decode the HTML. I'm just looking for the best way to do that, which is why the $1 is present in the code I posted: I was trying to replace the HTML with entity-encoded HTML using preg_replace(). I'm okay at simple regex, but I haven't worked with the replace functions very much, so I'm not sure of the best way to go about it.
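
A note on the "$1" attempt: in that code, htmlentities("$1") runs once, on the literal two-character string "$1", before preg_replace() ever sees a match, so nothing gets encoded. A callback is the usual way to transform captured text. The sketch below assumes PHP 5.3 or later (for the anonymous function) and a made-up `$rawFeed` with raw, non-CDATA markup. It uses htmlspecialchars() rather than htmlentities(), since XML only predefines a handful of named entities and something like `&nbsp;` would not parse.

```php
<?php
// Hypothetical sample: a description containing raw (non-CDATA) markup.
$rawFeed = '<item><title>Post</title>'
         . '<description><p>Some <b>HTML</b></p></description></item>';

// The callback runs once per match, so the encoding is applied to the
// captured text instead of to the literal string "$1".
$encoded = preg_replace_callback(
    '#<description>(.*?)</description>#is',
    function ($m) {
        return '<description>'
             . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8')
             . '</description>';
    },
    $rawFeed
);

// The XML parser decodes the entities again, so the description arrives
// as one text node holding the original markup.
$feed = simplexml_load_string($encoded);
$data = json_decode(json_encode($feed), true);
echo $data['description'], "\n";
```

Because the parser undoes the encoding on its own, no separate decode step should be needed after the conversion. Descriptions that are already wrapped in CDATA would have to be handled separately so they are not encoded a second time.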