
Designing a "price comparison" script - Pitfalls?

Posted: Mon Mar 26, 2007 9:26 am
by onion2k
I want to write a script that does something similar to a price comparison script. I'm not actually going to be comparing prices, but my project idea does need to build up a set of price data from different sites. Some have feeds that I can utilise (Amazon for example), but most don't.

As it stands my method is to crawl the 'new releases' pages using a regexp to get ids, then crawl the individual product pages every few days using another regexp to get the price. Is there a better method than that? I imagine it'll be fragile as hell.
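A minimal sketch of the regexp approach described above, in Python. The URL fragments and patterns here are made up for illustration; the real regexes depend entirely on each site's markup, which is exactly why this method is fragile.

```python
import re

# Hypothetical patterns -- real ones must be written per site,
# and rewritten whenever the site's HTML changes.
ID_PATTERN = re.compile(r'href="/product/(\d+)"')
PRICE_PATTERN = re.compile(r'class="price">\$([\d.]+)<')

def extract_ids(listing_html):
    """Pull product ids out of a 'new releases' listing page."""
    return ID_PATTERN.findall(listing_html)

def extract_price(product_html):
    """Pull the price out of a single product page, or None if
    the pattern no longer matches (i.e. the site changed)."""
    m = PRICE_PATTERN.search(product_html)
    return float(m.group(1)) if m else None
```

Returning None on a failed match (rather than crashing) at least lets the crawler log which sites have broken patterns.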

Has anyone here written some sort of price comparator? What did you find trickiest? What should I watch out for? How did you link similar products? (I'm thinking of using the name with Levenshtein distances.)
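The name-matching idea could be sketched like this: compute the Levenshtein distance between two product names and treat them as the same product when the distance is a small fraction of the longer name. The 0.2 threshold is an arbitrary guess, not something from the thread.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def same_product(name_a, name_b, threshold=0.2):
    """Treat two names as one product when the edit distance is a
    small fraction of the longer name's length (threshold is a guess)."""
    dist = levenshtein(name_a.lower(), name_b.lower())
    return dist / max(len(name_a), len(name_b)) <= threshold
```

Normalising case before comparing helps; stripping punctuation and common filler words ("the", "new", pack sizes) would probably help more.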

Posted: Thu Mar 29, 2007 8:10 am
by Benjamin
You know, one thing that people often forget in the business world is that it's not all about prices. Communicating value is a key to success.

How can you communicate value...

1. When I order a product and you say it will be here on the xth day of the month, will it be?
2. Will the product be as described?
3. Can I count on good customer service?
4. Is this product worth the price?
5. etc. etc. etc..

I have seen so many businesses fail because they think that they can grab all the customers by undercutting the competition, but they don't consider loyalty and trust.

Just MHO

Posted: Thu Mar 29, 2007 10:23 am
by onion2k
My project is not to do with prices directly. It's to do with trends. Prices are merely the root of the data.

Posted: Thu Mar 29, 2007 12:42 pm
by Christopher
Rather than parsing the HTML you could just use the web service feeds from sites that have them. You can get pricing back in delimited or XML format, which will be much easier to deal with.
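For comparison with the regexp approach, here's how small the parsing step gets once a site gives you structured data. The feed layout below is invented; real feeds (Amazon's web services, for instance) each have their own schema, so the tag names would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout -- purely for illustration.
SAMPLE_FEED = """
<products>
  <product id="123"><name>Widget</name><price>9.99</price></product>
  <product id="456"><name>Gadget</name><price>24.50</price></product>
</products>
"""

def parse_feed(xml_text):
    """Map product id -> price from a simple XML price feed."""
    prices = {}
    for product in ET.fromstring(xml_text).iter('product'):
        prices[product.get('id')] = float(product.findtext('price'))
    return prices
```

No regexes to maintain, and the feed format is a published contract rather than markup that can change under you.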

Posted: Thu Mar 29, 2007 1:28 pm
by onion2k
arborint wrote:Rather than parsing the HTML you could just use the web service feeds from sites that have them. You can get pricing back in delimited or XML format, which will be much easier to deal with.
Wherever a site has a feed I make full use of it. I'd be daft not to, really. Where a site doesn't, I crawl it in full accordance with the site's terms, privacy policy, and robots.txt file. My current script doesn't request any content more than once per 1.75s on average (it uses a random delay between 0.5s and 3s). I'm not doing anything that Google doesn't do already. It's all completely above board. This is getting off topic now.
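The rate limiting described above can be sketched in a few lines: draw a uniform random delay between requests, which with the stated bounds averages (0.5 + 3.0) / 2 = 1.75 seconds. This is a generic sketch of that scheme, not the poster's actual script.

```python
import random
import time

def polite_delay(min_s=0.5, max_s=3.0):
    """Wait a random interval before the next request. With these
    defaults the average gap is (0.5 + 3.0) / 2 = 1.75 seconds."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Python's standard library also offers `urllib.robotparser` for honouring robots.txt, which covers the compliance side of the same policy.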