I have many PHP scripts that crawl sites and extract a lot of data. A run can take hours to finish (1-3 hours, depending on the site and the number of links).
What I would like to know is: on what basis would this be considered a violation of the rules?
The code has a loop that iterates many times, and each iteration may take several minutes. After each iteration I have included the line usleep(15000); to sleep for 15 ms (note that usleep() takes microseconds, so usleep(15) would pause only 15 µs). Will that do any good in freeing the CPU for 15 ms before going back to the intensive crawling? There are some 10 preg_match() calls for each URL pulled.
Currently my site has no problems, but I want to make sure, as I need to crawl many, many more sites, and this time I'll make it more automated and less manual (e.g. input of categories).
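For reference, a minimal sketch of the kind of throttled loop described above. The URLs, the title-extraction pattern, and the storage step are all placeholders, not the original code:

```php
<?php
// Hypothetical list of URLs to crawl (placeholder data).
$urls = ['http://example.com/page1', 'http://example.com/page2'];

foreach ($urls as $url) {
    $html = @file_get_contents($url);   // fetch the page
    if ($html === false) {
        continue;                       // skip pages that fail to load
    }
    // Example extraction: pull the <title> tag
    // (stands in for the ~10 preg_match() calls per URL).
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
        $title = trim($m[1]);
        // ... store/process the extracted data here ...
    }
    // Yield the CPU between iterations.
    // NOTE: usleep() takes MICROseconds, so a 15 ms pause is
    // usleep(15000) -- usleep(15) would pause only 15 µs.
    usleep(15000);
}
```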
Thanks
[SOLVED] - Website crawling and extraction Limits
Last edited by anjanesh on Fri Jun 03, 2005 1:31 am, edited 1 time in total.
- John Cartwright
- Site Admin
- Posts: 11470
- Joined: Tue Dec 23, 2003 2:10 am
- Location: Toronto
anjanesh wrote: What if this was done on a dedicated server?

Your performance is obviously going to be less obstructed than on other hosting packages. In fact, I would suggest you move onto a dedicated server.

anjanesh wrote: What if I give a value of 100 ms?

The longer the wait, the less impact you'll have on your CPU. What I would suggest is to compile some stats, such as:
- time per iteration,
- total time elapsed,
- etc.
and play around with different usleep() values to see the impact on these times. Maybe you can even optimize your preg calls?
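The stats gathering suggested above can be sketched like this. The URL list, the sleep value under test, and the omitted fetch/extract step are placeholders:

```php
<?php
// Collect per-iteration timing stats so different usleep() values
// can be compared across runs.
$urls        = ['http://example.com/a', 'http://example.com/b']; // placeholder list
$sleepMicros = 15000;                                            // value under test (15 ms)
$times       = [];
$runStart    = microtime(true);

foreach ($urls as $url) {
    $t0 = microtime(true);
    // ... fetch + preg_match() extraction would go here ...
    $times[] = microtime(true) - $t0;   // time for this iteration only
    usleep($sleepMicros);               // throttling pause under test
}

$total = microtime(true) - $runStart;   // includes the sleeps
printf("iterations: %d, avg: %.4fs, total: %.2fs\n",
       count($times),
       array_sum($times) / max(count($times), 1),
       $total);
```

Running this once per candidate sleep value (15 ms, 100 ms, ...) shows how much of the wall-clock time is real work versus throttling.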
-
malcolmboston
- DevNet Resident
- Posts: 1826
- Joined: Tue Nov 18, 2003 1:09 pm
- Location: Middlesbrough, UK
Jcart wrote: Maybe you can even optimize your preg calls?

This is by far the most important factor you should be taking into consideration. I'm assuming you're using the same preg calls over and over again, so even a slight performance increase over thousands of calls would save you minutes to hours.

To be honest, this is the first thing that struck me about your post. I have just finished a site-parser module for an app I'm selling that opens the site, passes it to an array, then uses various preg_match() calls to get the specific info I want; I then do str_replace() to restyle the content I'm getting back.

All of this runs in under 0.3 seconds, with 4 preg_match() calls and 12 str_replace() calls.

Is it just my coding, or is that standard?
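One easy win along these lines: several single-pattern str_replace() passes can be collapsed into one call with arrays, so the string is scanned once per pair instead of in a dozen separate calls. The tags and brand names below are purely illustrative:

```php
<?php
// Placeholder markup standing in for fetched page content.
$html = '<b>Old Brand</b> price: $10';

// One str_replace() call with arrays of search/replace pairs,
// instead of many separate calls (pairs are applied in order).
$restyled = str_replace(
    ['<b>', '</b>', 'Old Brand'],
    ['<strong>', '</strong>', 'New Brand'],
    $html
);

echo $restyled; // <strong>New Brand</strong> price: $10
```

Because the pairs are applied in order, make sure an earlier replacement doesn't produce text that a later pair would then rewrite.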
Re: Website crawling and extraction Limits
anjanesh wrote: What I would like to know is: on what basis would this be a violation of the rules?

I'd be more worried about copyright issues... do you have permission to be extracting the data?
Re: Website crawling and extraction Limits
Pimptastic wrote: I'd be more worried about copyright issues... do you have permission to be extracting the data?

I'm least bothered about that; that's my client's headache, since it's his (shared) server. What worries me is whether the code will take so much CPU that the account gets suspended by the hosting company. That would screw me up, because it's my code running on his (my client's) server.