
Parsing blogs (grab the Title and Date)

Posted: Fri Nov 21, 2008 11:12 pm
by samsono
I'm not familiar with XML or how feeds work. What I was trying to do was create a page scraper for blogs that gets the title and date of a post by looking at the layout patterns. I found that with Blogger and TypePad all the pages follow the same format, which makes them easy to crawl, but WordPress blogs were completely inconsistent from blog to blog and version to version.

What I want to know is whether it is possible, and if so how, to enter the URL of a particular post and get the title and date of that post, provided the blog supports RSS. And not just the most recent posts, but any post from any year (2004, 2005..2008).

Re: Parsing blogs (grab the Title and Date)

Posted: Sat Nov 22, 2008 5:36 am
by koen.h
Normally RSS or Atom feeds follow a strict format (some required elements, some optional). If you see differences between WordPress feeds, that's because they use a different version or different options.

http://en.wikipedia.org/wiki/Web_feed
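
As a rough illustration of reading a feed instead of scraping the page layout, here is a minimal Python sketch that pulls the title and date for one post out of an RSS 2.0 or Atom 1.0 document. The function names (feed_entries, find_post), the sample feeds and the example.com URLs are all made up for this example, and it assumes the feed text has already been downloaded. Keep in mind that a feed often lists only the most recent entries, so an older post may simply not be in it.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 elements.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feeds, only here so the sketch can be run as-is.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <link>http://example.com/</link>
  <description>Hypothetical feed</description>
  <item>
    <title>First post</title>
    <link>http://example.com/2008/11/first-post</link>
    <pubDate>Fri, 21 Nov 2008 23:12:00 GMT</pubDate>
  </item>
</channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>Second post</title>
    <link href="http://example.com/2008/11/second-post"/>
    <id>http://example.com/2008/11/second-post</id>
    <updated>2008-11-22T05:36:00Z</updated>
  </entry>
</feed>"""


def feed_entries(xml_text):
    """Return a list of (title, date, link) tuples from an RSS 2.0 or Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == "rss":
        # RSS 2.0: posts are <item> elements; the date is in <pubDate>.
        for item in root.iter("item"):
            entries.append((
                item.findtext("title", default=""),
                item.findtext("pubDate", default=""),
                item.findtext("link", default=""),
            ))
    elif root.tag == ATOM + "feed":
        # Atom: posts are <entry> elements; prefer <published>, else <updated>.
        for entry in root.iter(ATOM + "entry"):
            # Simplification: only the first <link> of the entry is used.
            link = entry.find(ATOM + "link")
            date = (entry.findtext(ATOM + "published")
                    or entry.findtext(ATOM + "updated", default=""))
            entries.append((
                entry.findtext(ATOM + "title", default=""),
                date,
                link.get("href", "") if link is not None else "",
            ))
    return entries


def find_post(xml_text, post_url):
    """Return (title, date) for the entry whose link equals post_url, else None."""
    for title, date, link in feed_entries(xml_text):
        if link == post_url:
            return title, date
    return None


if __name__ == "__main__":
    print(find_post(SAMPLE_RSS, "http://example.com/2008/11/first-post"))
    print(find_post(SAMPLE_ATOM, "http://example.com/2008/11/second-post"))
```

The dates come back as the raw strings found in the feed; RSS and Atom use different date formats, so they would still need to be parsed if you want to compare them.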