RSS v Site Parsing

Ye' old general discussion board. Basically, for everything that isn't covered elsewhere. Come here to shoot the breeze, shoot your mouth off, or whatever suits your fancy.
This forum is not for asking programming related questions.

Moderator: General Moderators

Post Reply
malcolmboston
DevNet Resident
Posts: 1826
Joined: Tue Nov 18, 2003 1:09 pm
Location: Middlesbrough, UK

RSS v Site Parsing

Post by malcolmboston »

I'm currently building a parser for web sites, to extract the data I want...

However, I'm pretty much doing exactly what an RSS feed does, except with file_get_contents and then preg to extract the exact segments I want directly from the HTML, not the RSS way.
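A minimal sketch of that file_get_contents-plus-preg approach. The headline markup is hypothetical, and a short HTML snippet is inlined in place of a real fetched page so the example is self-contained:

```php
<?php
// Sketch of the scrape-and-preg approach. In real use $html would come
// from file_get_contents($url); a hypothetical snippet is inlined here.
$html = '<div><h2 class="headline">First story</h2>'
      . '<h2 class="headline">Second story</h2></div>';

// Pull out every segment that matches a known chunk of markup.
preg_match_all('#<h2 class="headline">(.*?)</h2>#s', $html, $matches);

$headlines = array_map('trim', $matches[1]);
// $headlines is now array('First story', 'Second story')
```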

From my first thoughts, RSS has the advantage of pre-defined "headers" such as <item>item here</item> and <description>desc here</description>, which makes it much more "standardised" to parse.

Is the way I'm doing it an acceptable method? I prefer my way, as a lot of sites still don't have RSS, and my class can parse anything based on passed arguments describing what to parse...

Your thoughts please...
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

RSS and Atom are preferable by a long margin. If the webmaster of the site you're extracting data from changes the code even a little bit then your parser will stop getting the right data, and you'll have to change it. That's the whole point of RSS: It's well defined.
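For comparison, pulling items out of a well-defined feed needs no site-specific patterns at all. A sketch using PHP's SimpleXML, with a tiny inline feed standing in for a real fetched one:

```php
<?php
// Sketch: the same extraction against RSS's pre-defined tags, using
// SimpleXML. The inline feed is a hypothetical stand-in for a real one.
$rss = '<?xml version="1.0"?><rss version="2.0"><channel>'
     . '<item><title>Hello</title><description>World</description></item>'
     . '</channel></rss>';

$feed  = simplexml_load_string($rss);
$items = array();

foreach ($feed->channel->item as $item) {
    $items[] = (string) $item->title . ': ' . (string) $item->description;
}
// $items is now array('Hello: World')
```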
malcolmboston
DevNet Resident
Posts: 1826
Joined: Tue Nov 18, 2003 1:09 pm
Location: Middlesbrough, UK

Post by malcolmboston »

That was my first thought; however, the sites I need to parse have no RSS feeds and are very unlikely to ever develop them...

... would you recommend building something like

Code: Select all

if (!empty($_GET['rss_avail']))
{
  // use RSS methodology
}
else
{
  // use my custom parser
}
Or should I just do it with my own parser?
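One way to fill in that rss_avail flag automatically, rather than passing it in, is to check whether the page advertises a feed via the standard autodiscovery <link> tag. A sketch, with hypothetical inline markup standing in for a fetched page:

```php
<?php
// Sketch: detect an advertised RSS feed via the autodiscovery <link>
// tag instead of hard-coding a flag. Hypothetical inline page markup.
$html = '<html><head><link rel="alternate" type="application/rss+xml"'
      . ' href="/feed.xml"></head><body></body></html>';

$rssAvailable = (bool) preg_match(
    '#<link[^>]+type="application/rss\+xml"[^>]+href="([^"]+)"#i',
    $html,
    $m
);
// $rssAvailable is true, and $m[1] holds the feed URL '/feed.xml'
```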
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

If the site has an RSS feed then I would use it. If it doesn't then I'd ask them to get one. Also, be aware that some companies get terribly annoyed by people taking their data without permission. Just because it's freely available online doesn't necessarily mean you're free to reuse what you're getting.
Simulacrum
Forum Newbie
Posts: 13
Joined: Wed Apr 13, 2005 11:58 pm

Grouper

Post by Simulacrum »

Your approach would be more valuable if you could convert the data of interest into an RSS feed. Then you have a world of new options (plug it into a desktop reader, get notifications of new content, aggregate it with other feeds, etc.).
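A sketch of that conversion step: building a minimal RSS 2.0 document from already-scraped items. The item data and channel details here are hypothetical:

```php
<?php
// Sketch: republish scraped items as a minimal RSS 2.0 feed so any
// standard reader or aggregator can consume them. Hypothetical data.
$items = array(
    array('title' => 'First story', 'link' => 'http://example.com/1'),
);

$xml     = new SimpleXMLElement('<rss version="2.0"><channel/></rss>');
$channel = $xml->channel;
$channel->addChild('title', 'Scraped feed');
$channel->addChild('link', 'http://example.com/');
$channel->addChild('description', 'Items harvested from HTML');

foreach ($items as $i) {
    $entry = $channel->addChild('item');
    $entry->addChild('title', $i['title']);
    $entry->addChild('link', $i['link']);
}

$rss = $xml->asXML();
```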

I would suggest taking a look at the source code for Grouper (http://www.geckotribe.com/rss/grouper/manual/) if you want to develop parsers for multiple sites. Grouper comes equipped with knowledge of how to harvest data from the Google and Yahoo news HTML pages.

I liked the Grouper solution because it *almost* reduced the knowledge required to parse a site down to an assoc. array containing a smart regex.
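That assoc-array idea can be sketched like this; the site keys, patterns, and HTML below are hypothetical, not Grouper's actual code:

```php
<?php
// Sketch of the Grouper-style idea: per-site parsing knowledge reduced
// to an associative array holding one regex per site.
$parsers = array(
    'examplenews' => '#<h2 class="headline">(.*?)</h2>#s',
    'exampleblog' => '#<a class="post-title"[^>]*>(.*?)</a>#s',
);

function harvest($site, $html, array $parsers)
{
    if (!isset($parsers[$site])) {
        return array(); // no knowledge of this site
    }
    preg_match_all($parsers[$site], $html, $m);
    return array_map('trim', $m[1]);
}

$items = harvest('examplenews', '<h2 class="headline">Breaking</h2>', $parsers);
// $items is now array('Breaking')
```

Adding support for a new site then means adding one array entry rather than writing a new parser, which is exactly why the pattern stays fragile: each entry still depends on that site's HTML not changing.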

However, as noted, these solutions are fragile, since they depend on particular HTML patterns remaining present.