RSS v Site Parsing

Posted: Fri May 13, 2005 4:54 am
by malcolmboston
I'm currently building a parser for web sites to extract the data I want...

However, I am pretty much doing exactly what an RSS feed does, except with file_get_contents() and then the preg functions to extract the exact segments I want directly from the HTML code, not the RSS way.

My first thought is that RSS has the advantage of pre-defined tags, such as <item>item here</item> and <description>desc here</description>, which makes it much more "standardised" to parse.

Is the way I'm doing it an acceptable method? I prefer my way because a lot of sites still don't have RSS, and my class has the ability to parse anything based on passed arguments specifying what to parse...

Your thoughts, please...
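For reference, here is a minimal sketch of the approach I described (the helper name and the headline pattern are just made up for illustration; in real use the HTML would come from file_get_contents($url)):

Code: Select all

```php
<?php
// Hypothetical helper: pull all capture-group-1 matches out of raw HTML.
// The pattern assumes headlines are wrapped in <h2 class="headline"> tags.
function extractHeadlines($html, $pattern)
{
    if (preg_match_all($pattern, $html, $matches)) {
        return $matches[1];
    }
    return array();
}

// Static HTML stands in for file_get_contents($url) here.
$html = '<h2 class="headline">First story</h2>'
      . '<h2 class="headline">Second story</h2>';

$headlines = extractHeadlines($html, '#<h2 class="headline">(.*?)</h2>#');
print_r($headlines); // First story, Second story
```

The obvious weakness (as discussed below) is that the regex breaks the moment the site's markup changes.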

Posted: Fri May 13, 2005 4:59 am
by onion2k
RSS and Atom are preferable by a long margin. If the webmaster of the site you're extracting data from changes the code even a little bit then your parser will stop getting the right data, and you'll have to change it. That's the whole point of RSS: It's well defined.

Posted: Fri May 13, 2005 5:05 am
by malcolmboston
That was my first thought too; however, the sites I need to parse have no RSS feeds and are very unlikely to ever develop them...

... Would you recommend building something like

Code: Select all

if (!empty($_GET['rss_avail']))
{
  // use RSS methodology
}
else
{
  // use my custom parser
}
or should I just do it with my own parser?

Posted: Fri May 13, 2005 5:25 am
by onion2k
If the site has an RSS feed then I would use it. If it doesn't then I'd ask them to get one. Also, be aware that some companies get terribly annoyed by people taking their data without permission. Just because it's freely available online doesn't necessarily mean you're free to reuse what you're getting.

Grouper

Posted: Wed May 25, 2005 10:39 am
by Simulacrum
Your approach would be more valuable if you could convert the data of interest into an RSS feed. Then you have a world of new options (plug it into a desktop reader, get notifications of new content, aggregate it with other feeds, etc.).

I would suggest taking a look at the source code for Grouper (http://www.geckotribe.com/rss/grouper/manual/) if you want to develop parsers for multiple sites. Grouper comes equipped with knowledge of how to harvest data from the Google and Yahoo news HTML pages.

I liked the Grouper solution because it *almost* reduces the knowledge required to parse a site down to an associative array containing a smart regex.

However, as noted, these solutions are fragile, since they depend on particular HTML patterns remaining present.
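To make the "associative array of regexes" idea concrete, here is a rough sketch of that style (the site keys and patterns are invented, not taken from Grouper's actual source):

Code: Select all

```php
<?php
// Per-site knowledge reduced to one regex each (hypothetical sites).
$siteConfig = array(
    'examplenews' => '#<a class="story" href="[^"]*">(.*?)</a>#',
    'exampleblog' => '#<h3>(.*?)</h3>#',
);

// Look up the site's pattern and return all capture-group-1 matches.
function parseSite($site, $html, $config)
{
    if (!isset($config[$site])) {
        return array();
    }
    preg_match_all($config[$site], $html, $matches);
    return $matches[1];
}

$html = '<h3>Post A</h3><h3>Post B</h3>';
print_r(parseSite('exampleblog', $html, $siteConfig));
```

Adding a new site then means adding one array entry rather than writing new parsing code, though each entry is still only as durable as the HTML pattern it matches.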