Looking for a web crawler
Posted: Wed Jun 03, 2009 10:46 pm
I'm looking for a multithreaded web crawler, or at least something faster than Sphider (sphider.eu).
I only need it to visit sites to a specific depth, and collect the text from the pages it visits and store it in a database.
(I don't need any search or indexing functionality.)
A solution that doesn't use PHP would be fine too.
I've looked at Heritrix recently (the crawler used by archive.org), but it's several times more complex than what I need.
Suggestions?
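To make the requirement concrete, here is roughly the shape of what I'm after, as a minimal Python sketch using only the standard library. All names in it (`crawl`, `process`, `PageParser`, the `pages` table and its columns) are made up for illustration, SQLite is just a stand-in for whatever database ends up being used, and it leaves out robots.txt handling, rate limiting and same-site filtering, so treat it as a description of the behaviour rather than a finished tool.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1  # ignore text inside script/style blocks
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def fetch_url(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def process(url, fetch):
    """Fetch one page; return (url, text, absolute links), or None on failure."""
    try:
        html = fetch(url)
    except Exception:
        return None  # unreachable or broken pages are simply skipped
    parser = PageParser()
    parser.feed(html)
    parser.close()
    links = [urldefrag(urljoin(url, h)).url for h in parser.links]
    return url, " ".join(parser.text), links


def crawl(seeds, max_depth, db_path="pages.db", fetch=fetch_url, workers=8):
    """Breadth-first crawl; each depth level is fetched by a thread pool."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, depth INTEGER, body TEXT)"
    )
    seen, frontier = set(seeds), list(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for depth in range(max_depth + 1):
            next_frontier = []
            # Pages are fetched in parallel; rows are written from this thread only.
            for result in pool.map(lambda u: process(u, fetch), frontier):
                if result is None:
                    continue
                url, text, links = result
                db.execute(
                    "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                    (url, depth, text),
                )
                for link in links:
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            db.commit()
            frontier = next_frontier
    return db
```

Something like `crawl(["http://example.com/"], max_depth=2)` would then fetch the start page plus everything up to two links away and leave the extracted text in the `pages` table. I'd much rather use an existing, tested tool that does this than maintain my own.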