[SOLVED] - Website crawling and extraction Limits

anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

[SOLVED] - Website crawling and extraction Limits

Post by anjanesh »

I have many PHP scripts that crawl sites and extract a lot of data - sometimes a run takes hours to finish (1-3 hrs depending on the site and its links).

What I would like to know is: at what point does this become a violation of the hosting rules?
The code has a loop that iterates many times, and each iteration can take several minutes - but after each iteration I have included the line usleep(15000); -> sleep for 15 ms (usleep() takes microseconds). Will that do any good in freeing the CPU for those 15 ms before going back to the intensive crawling? There are some 10 preg_match() calls for each URL pulled.
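The loop described above can be sketched roughly like this. The page contents are hardcoded strings here so the sketch is self-contained (in the real crawler each $html would come from fetching the URL), and the title pattern stands in for the ~10 preg_match() calls per page:

```php
<?php
// Minimal sketch of a throttled extraction loop (hypothetical pattern).
// Note: usleep() takes MICROseconds, so usleep(15) pauses only 15 µs;
// usleep(15000) is 15 ms, and usleep(500000) is half a second.

$pages = [
    '<html><head><title>Page One</title></head></html>',
    '<html><head><title>Page Two</title></head></html>',
];

$titles = [];
foreach ($pages as $html) {
    // the original post runs ~10 preg_match() calls per URL; one shown here
    if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
        $titles[] = $m[1];
    }
    usleep(15000); // 15 ms pause between iterations
}

print_r($titles);
```

Note that a pause this short mostly matters for your own CPU; to be polite to the *target* server, delays between requests are usually measured in whole seconds.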

Currently my site has no problems, but I want to make sure, as I need to crawl many more sites, and this time I'll make it more automated and less manual (like input of categories etc.).

Thanks
Last edited by anjanesh on Fri Jun 03, 2005 1:31 am, edited 1 time in total.
malcolmboston
DevNet Resident
Posts: 1826
Joined: Tue Nov 18, 2003 1:09 pm
Location: Middlesbrough, UK

Post by malcolmboston »

I would say that if you are crawling one site for 1-3 hours, the server the site is hosted on is obviously getting hammered, and you could find yourself in trouble.

To be honest, 3 hours crawling one site is ridiculous.
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

Well, I'll need to extract that much data.
But shouldn't the usleep() call keep it from using the full CPU? What if I give it a value of 100 ms - usleep(100000)? Shouldn't that make it OK?
What if this was done on a dedicated server?
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto
Contact:

Post by John Cartwright »

anjanesh wrote: What if this was done on a dedicated server?
Your performance is obviously going to be less obstructed than on shared hosting, where you compete with other accounts on the machine. I would in fact suggest you move to a dedicated server.
anjanesh wrote: What if I give a value of 100 ms?
The longer the wait, the less impact you'll have on your CPU.
What I would suggest is to compile some stats, such as:

time per iteration,
total time elapsed,
etc.

and play around with different usleep() values to see their impact on those times. Maybe you can even optimize your preg calls?
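The stats suggested above can be collected with microtime(). This is a minimal sketch - the commented-out middle of the loop stands in for the real fetch-and-match code, and the usleep() value is just an example to experiment with:

```php
<?php
// Record per-iteration time and total elapsed time while testing a
// particular usleep() value.

$sleepMicroseconds = 15000;   // value under test (15 ms here)
$iterationTimes   = [];
$start            = microtime(true);

for ($i = 0; $i < 5; $i++) {
    $t0 = microtime(true);

    // ... fetch a URL and run the preg_match() calls here ...

    usleep($sleepMicroseconds);
    $iterationTimes[] = microtime(true) - $t0;
}

$total = microtime(true) - $start;
printf("iterations: %d\n", count($iterationTimes));
printf("avg per iteration: %.4f s\n", array_sum($iterationTimes) / count($iterationTimes));
printf("total elapsed: %.4f s\n", $total);
```

Comparing these numbers across different $sleepMicroseconds values shows directly how much of the total run time the sleeps account for versus the actual crawling work.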
malcolmboston
DevNet Resident
Posts: 1826
Joined: Tue Nov 18, 2003 1:09 pm
Location: Middlesbrough, UK

Post by malcolmboston »

Jcart wrote: Maybe you can even optimize your preg calls?
This is by far the most important factor you should be taking into consideration. I'm assuming you're using the same preg calls over and over again, so even a slight performance increase, multiplied over thousands of calls, would save you minutes to hours.

To be honest, this was the first thing that struck me about your post. I have just finished a site-parser module for an app I'm selling: it opens the site, loads it into an array, runs various preg_match() calls to get the specific info I want, then uses str_replace() to restyle the content I'm getting back.

All of this in under 0.3 seconds, with 4 preg_match() calls and 12 str_replace() calls.

Is it just my coding or is that standard?
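Two cheap optimizations along these lines, sketched with hypothetical markup and patterns (neither the HTML nor the regex here is from either poster's actual code): str_replace() accepts arrays, so a dozen separate calls can become one, and a strpos() guard skips the regex engine entirely when the anchor text isn't on the page:

```php
<?php
$html = '<html><body><b>Price:</b> $42 <i>old</i> markup</body></html>';

// (1) one str_replace() call instead of many: search and replace arrays
$html = str_replace(
    ['<b>', '</b>', '<i>', '</i>'],              // search array
    ['<strong>', '</strong>', '<em>', '</em>'],  // replace array
    $html
);

// (2) only run the regex if the cheap literal check passes first;
//     strpos() is far faster than firing up the PCRE engine
$price = null;
if (strpos($html, 'Price:') !== false
        && preg_match('#Price:</strong>\s*\$(\d+)#', $html, $m)) {
    $price = (int) $m[1];
}

var_dump($price);
```

On pages where the anchor text is usually absent, the strpos() guard alone can cut most of the regex cost.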
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK
Contact:

Re: Website crawling and extraction Limits

Post by JayBird »

anjanesh wrote:What I would like to know is on what basis will be a violation of rules ?
I'd be more worried about copyright issues... do you have permission to be extracting the data?
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Re: Website crawling and extraction Limits

Post by anjanesh »

Pimptastic wrote: I'd be more worried about copyright issues...do you have permission to be extracting the data!
I'm not too bothered about that - that's my client's headache, and it's his (shared) server.
What worries me is whether the code will take so much CPU that the account gets suspended by the hosting company - that would really screw me up, because it's my code on his (my client's) server.