I'm currently building a parser for websites to extract the data I want...
However, I'm pretty much doing exactly what an RSS feed does, except with file_get_contents() and preg functions to extract the exact segments I want directly from the HTML code, rather than the RSS way.
From my first thoughts, RSS has the advantage of having pre-defined "headers" such as <item>item here</item><description>desc here</description>, which makes it much more "standardised" to parse.
Is the way I'm doing it an acceptable method? I prefer my way, as a lot of sites still don't have RSS, and my class has the ability to parse anything based on passed arguments of what to parse...
Your thoughts please..
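For illustration, a minimal sketch of the approach described above: fetch a page's HTML and pull out segments with a caller-supplied regular expression. The class and method names here are hypothetical, not the poster's actual code, and an inline HTML string stands in for a real file_get_contents() call.

```php
<?php
// Hypothetical sketch of the "parse anything based on passed arguments" idea.
class SiteParser
{
    // $pattern is supplied by the caller; the first capture group is returned.
    public function parse(string $html, string $pattern): array
    {
        preg_match_all($pattern, $html, $matches);
        return $matches[1] ?? [];
    }
}

// In real use the HTML would come from file_get_contents($url);
// an inline string keeps the example self-contained.
$html = '<h2 class="headline">First story</h2><h2 class="headline">Second story</h2>';
$parser = new SiteParser();
$items = $parser->parse($html, '/<h2 class="headline">(.*?)<\/h2>/');
print_r($items);
```

The pattern is deliberately passed in rather than hard-coded, so one class can serve many sites.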
RSS v Site Parsing
-
malcolmboston
- DevNet Resident
- Posts: 1826
- Joined: Tue Nov 18, 2003 1:09 pm
- Location: Middlesbrough, UK
That was my first thought; however, the sites I need to parse have no RSS feeds and are very unlikely to ever develop them...
... would you recommend building something like
Code: Select all
if ($_GET['rss_avail'] == TRUE)
{
    // use the RSS methodology
}
else
{
    // use my custom parser
}
... or should I just do it with my own parser?

If the site has an RSS feed then I would use it. If it doesn't, then I'd ask them to provide one. Also, be aware that some companies get terribly annoyed by people taking their data without permission. Just because something is freely available online doesn't necessarily mean you're free to reuse what you're getting.
-
Simulacrum
- Forum Newbie
- Posts: 13
- Joined: Wed Apr 13, 2005 11:58 pm
Grouper
Your approach would be more valuable if you were able to convert the data of interest into an RSS feed. Then you have a world of new options (plug into a desktop reader, get notifications of new content, aggregate with other feeds, etc.).
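The conversion suggested above can be sketched in a few lines: wrap each scraped item in an RSS 2.0 <item> element inside a <channel>. The channel title, link, and description below are placeholders, not anything from the original poster's project.

```php
<?php
// Rough illustration of turning scraped strings into minimal RSS 2.0 XML.
function buildRss(array $items): string
{
    $xml = "<?xml version=\"1.0\"?>\n<rss version=\"2.0\"><channel>";
    $xml .= '<title>Scraped feed</title>'
          . '<link>http://example.com/</link>'
          . '<description>Items harvested from HTML</description>';
    foreach ($items as $item) {
        // Escape the scraped text so it is safe inside XML.
        $xml .= '<item><title>' . htmlspecialchars($item) . '</title></item>';
    }
    return $xml . '</channel></rss>';
}

$rss = buildRss(['First story', 'Second story']);
echo $rss;
```

A real feed would also want per-item <link> and <pubDate> elements, but this is enough for most desktop readers to consume.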
I would suggest taking a look at the source code for Grouper (http://www.geckotribe.com/rss/grouper/manual/) if you want to develop parsers for multiple sites. Grouper comes equipped with knowledge of how to harvest data from the Google and Yahoo news HTML pages.
I liked the Grouper solution because it *almost* reduced the knowledge required to parse a site down to an associative array containing a smart regular expression.
However, as noted, these solutions are fragile, since they depend on particular HTML patterns remaining present.
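The "associative array per site" idea mentioned above might look something like the following. This is not Grouper's actual code; the site keys and patterns are made up for the example.

```php
<?php
// Hypothetical per-site config: one regular expression per site key.
$siteConfig = [
    'example-news' => '/<a class="story">(.*?)<\/a>/',
    'example-blog' => '/<h3>(.*?)<\/h3>/',
];

// Harvest the first capture group for a known site; unknown sites yield [].
function harvest(string $site, string $html, array $config): array
{
    if (!isset($config[$site])) {
        return [];
    }
    preg_match_all($config[$site], $html, $matches);
    return $matches[1] ?? [];
}

$html = '<h3>Post one</h3><h3>Post two</h3>';
$titles = harvest('example-blog', $html, $siteConfig);
print_r($titles);
```

The fragility noted above applies here too: if the site changes its markup, the pattern silently stops matching, so checking for empty results is worthwhile.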