
[SOLVED] - Website crawling and extraction Limits

Posted: Mon May 16, 2005 4:13 am
by anjanesh
I have many PHP scripts that crawl sites and extract a lot of data - sometimes a run takes hours to finish (1-3 hrs depending on the site and its links).

What I would like to know is: at what point does this become a violation of the hosting rules?
The code has a loop which iterates many times. Each iteration may take several minutes - but after each iteration, I have included the line usleep(15); intending to sleep for 15 msecs. (Note: usleep() actually takes microseconds, so usleep(15) sleeps only 15 µs; 15 ms would be usleep(15000).) Will that do any good in freeing the CPU before going back to the intensive crawling? There are some 10 preg_match() calls for each URL pulled.
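A minimal sketch of the kind of throttled extraction loop described above (the function name, the pages, and the single pattern are illustrative stand-ins, not the original code; in the real script each page would come from an HTTP fetch):

```php
<?php
// Extract the <title> from one page of HTML -- a stand-in for the
// ~10 preg_match() calls per URL mentioned in the post.
function extract_title(string $html): ?string {
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Hypothetical already-fetched pages (placeholder data).
$pages = [
    '<html><head><title>Page A</title></head></html>',
    '<html><head><title>Page B</title></head></html>',
];

foreach ($pages as $html) {
    echo extract_title($html), "\n";
    // usleep() takes MICROseconds: a 15 ms pause between iterations
    // is usleep(15000), not usleep(15).
    usleep(15 * 1000);
}
```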

Currently my site has no problems, but I want to make sure, as I need to crawl many many more sites - and this time I'll make it more automated and less manual (like input of categories etc.)

Thanks

Posted: Mon May 16, 2005 4:35 am
by malcolmboston
I would say that if you are crawling one site for between 1 and 3 hours, the server the site is hosted on is obviously getting hammered, and you could find yourself in trouble.

3 hours crawling one site is ridiculous, to be honest.

Posted: Mon May 16, 2005 4:46 am
by anjanesh
Well, I'll need to be extracting that much data.
But shouldn't the usleep() call keep it from using full resources? What if I give a larger value, say 100 msecs? Shouldn't that make it OK?
What if this was done on a dedicated server?

Posted: Mon May 16, 2005 6:32 am
by John Cartwright
anjanesh wrote: What if this was done on a dedicated server ?
Your performance is obviously going to be less affected by other hosting packages.
I would in fact suggest you move onto a dedicated server.
anjanesh wrote: What if I give a value of 100ms ?
The longer the wait, the less impact you'll have on your CPU.
What I would suggest is to compile some stats such as,

time per iteration,
total time elapsed,
etc

and play around with different usleep values to see the impact on these times. Maybe you can even optimize your preg calls?

Posted: Mon May 16, 2005 7:39 am
by malcolmboston
Jcart wrote: Maybe you can even optimize your preg calls ?
This is by far the most important factor you should be taking into consideration. I'm assuming you're using the same preg calls over and over again, so even a slight performance increase over 1000s of calls would save you minutes to hours of time.

To be honest, this is the first thing that struck me about your post. I have just finished a site parser module for an app I'm selling that opens the site, passes it to an array, then various preg_match() calls get the specific info I want; I then do str_replace() to restyle the content I'm getting back.

All this in under 0.3 seconds, with 4 preg_match() calls and 12 str_replace() calls.

Is it just my coding, or is that standard?
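The extract-then-restyle pipeline described above might look roughly like this (the pattern and the restyling rules are invented for illustration; a real parser would have several such patterns):

```php
<?php
// Sketch of a preg + str_replace parse pipeline: pull out the pieces
// of interest with one preg_match_all() pass, then restyle each hit.
function parse_and_restyle(string $html): array {
    $items = [];
    // Grab every <h2>...</h2> heading (illustrative "specific info").
    if (preg_match_all('/<h2>(.*?)<\/h2>/is', $html, $m)) {
        foreach ($m[1] as $heading) {
            // Restyle the extracted content with str_replace().
            $items[] = str_replace(
                ['<b>', '</b>'],
                ['<strong>', '</strong>'],
                trim($heading)
            );
        }
    }
    return $items;
}
```

One cheap optimization along the lines discussed earlier in the thread: a single preg_match_all() pass compiles the pattern once and scans the document once, rather than re-running preg_match() repeatedly over the same string.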

Re: Website crawling and extraction Limits

Posted: Mon May 16, 2005 8:05 am
by JayBird
anjanesh wrote:What I would like to know is on what basis will be a violation of rules ?
I'd be more worried about copyright issues... do you have permission to be extracting the data?

Re: Website crawling and extraction Limits

Posted: Mon May 16, 2005 8:18 am
by anjanesh
Pimptastic wrote: I'd be more worried about copyright issues... do you have permission to be extracting the data?
I'm least bothered about that - that's my client's headache - it's his server (shared).
But what worries me is whether the code will take up so much CPU that the account gets suspended by the hosting company - that would screw me up, because it's my code on his (my client's) server.