Crawling an entire site
Posted: Wed Mar 23, 2005 10:49 pm
by anjanesh
I have written a small script to crawl an entire site and extract some data from it. It takes a long time to run. I've uploaded it to my web host, and I want it to run just once.
What I would like to know is: if the process takes a lot of time, will my site be banned by the web host? Is this illegal?
Thanks
Posted: Wed Mar 23, 2005 10:59 pm
by feyd
uh.. that's a question for your host...
You can help your case by making it not use 100% processor though.. by adding sleep() or usleep() calls..
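A minimal sketch of that idea, assuming the crawl is a plain loop over a list of URLs. The function name crawlWithPause and its parameters are made up for illustration; only usleep() and file_get_contents() are real PHP functions. Pausing lowers average CPU use, but whether that is enough for a given host is still up to the host.

```php
<?php
// Hypothetical helper: fetch each URL, pausing between requests so the
// script does not keep the processor busy continuously.
function crawlWithPause(array $urls, callable $fetch, int $pauseMicros = 500000): array
{
    $pages = [];
    foreach ($urls as $url) {
        $pages[$url] = $fetch($url);   // e.g. pass 'file_get_contents' here
        usleep($pauseMicros);          // give up the processor between requests
    }
    return $pages;
}
```

It could be called as crawlWithPause($urls, 'file_get_contents', 1000000) for a one-second pause after each page.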
Posted: Wed Mar 23, 2005 11:16 pm
by anjanesh
1. So if I do a sleep(1) after every file_get_contents(), the process will not use the processor for the next 1 sec. Am I safe this way? It won't hurt other shared sites' processes?
2. My data transfer limit is 1GB a month - does the size of what file_get_contents() downloads count against it?
3. I have used a lot of regular expressions and all are in a loop - will this consume a lot of CPU?
Posted: Wed Mar 23, 2005 11:50 pm
by feyd
talk to your host.
Posted: Thu Mar 24, 2005 3:32 am
by onion2k
If you only need to run it once, why not run it from your home or office PC, and then upload whatever you're saving to your webspace?
Posted: Thu Mar 24, 2005 3:56 am
by anjanesh
file_get_contents() is in a long loop - it's far slower on my PC as I have a 64kbps cable line.
Posted: Thu Mar 24, 2005 3:44 pm
by Ambush Commander
The main problem is preventing the script from timing out. Can you set up your computer to repeatedly "ping" the script, with the crawl split into manageable segments? If you can get it down to around 10 seconds per request, you should be in the green.
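One way the segmenting could look, as a rough sketch only. It assumes the list of URLs is known up front and that the script may write a small state file. crawlBatch, $stateFile and $handle are hypothetical names, not anything from the original script.

```php
<?php
// Hypothetical helper: handle at most $batchSize URLs per call and remember
// the position in $stateFile so the next request resumes where this one stopped.
function crawlBatch(array $urls, string $stateFile, int $batchSize, callable $handle): bool
{
    // Read the saved offset, or start from 0 on the first run.
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
    $batch  = array_slice($urls, $offset, $batchSize);
    foreach ($batch as $url) {
        $handle($url);
    }
    $offset += count($batch);
    file_put_contents($stateFile, (string) $offset);
    return $offset >= count($urls);   // true once every URL has been handled
}
```

Each request to the script then does one short batch, and the "pinging" just repeats the request until the function returns true.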
Posted: Thu Mar 24, 2005 6:04 pm
by m3mn0n
As long as you don't break the terms of service (TOS) or membership agreement you should be fine.
Make sure they allow you to do whatever it is you want to do. If it is not clearly stated within the TOS, then ask a customer service rep.
Posted: Sat Mar 26, 2005 12:18 pm
by anjanesh
What I have is a cheap web host at $17 a year - 10MB space + domain registration - and I did not find any TOS. There's no customer support or anything. The web hosting company is run by a single person. He's a reseller, actually. The server is in the US - I think its name is mars, because that name shows up in the detailed headers of the emails. Judging by the server's clock, it is at GMT -5:00.
I ran a script that extracts some info from a site - it took some 380 secs.
With a different filter: 33 min.
Are run times like these allowed on other web hosts, like the ones you host your sites on?
I want to know what the majority do and do not allow.
Posted: Sat Mar 26, 2005 12:54 pm
by feyd
Shared hosts mostly allow only so much average CPU usage per account. This prevents someone from taking up too much processor time and making all sites on the server respond slowly or not at all. I believe most set a limit of 5%. If you use too much, they will typically disable your account until they can contact you.
If this needs to run often, you need to upgrade your hosting, either to dedicated or colocated, as the host then usually won't care how much processor you use.
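For a rough feel of what an average share like that means for pausing: if the script works for w seconds and then sleeps for s seconds, its share of the time is w / (w + s), so staying at a share p needs s = w * (1 - p) / p. The sketch below just does that arithmetic. The function name is made up, the 5% is only the guess from this post, and wall-clock work time overstates CPU time because waiting on the network uses almost no processor.

```php
<?php
// Hypothetical helper: seconds to pause after $workSeconds of work so that
// work / (work + pause) equals $targetShare (e.g. 0.05 for 5%).
function pauseForTargetShare(float $workSeconds, float $targetShare): float
{
    return $workSeconds * (1.0 - $targetShare) / $targetShare;
}
```

With a 5% target, every 0.1 sec of work would need about 1.9 sec of pause, i.e. 19 times the work time.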
Posted: Sat Mar 26, 2005 5:18 pm
by m3mn0n
I dislike resellers generally, but in cases like this it helps because corporate policies and bureaucracy don't stop you from doing things like what you want to do.
I say go ahead and do it and if he has a problem with it, he knows how to contact you about it.
Posted: Sat Mar 26, 2005 9:11 pm
by anjanesh
feyd wrote:Shared hosts mostly allow only so much average CPU usage per account. This prevents someone from taking up too much processor time and making all sites on the server respond slowly or not at all.
Will a sleep(1) at regular intervals help - say, inside the loop? That way my process won't be taking the entire CPU all at once.
feyd wrote:I believe most set a limit of 5%.
Uh-oh - I used set_time_limit(0);
In the long loop I tried an echo "|"; after each iteration to make it look like a progress bar. But the entire |||||... gets output in one go after the loop is over. I need this progress so that I know it's getting somewhere and isn't stuck in an infinite loop.
{
...
echo "|";
}
Thanks
Posted: Sun Mar 27, 2005 11:05 am
by John Cartwright
There have been a few topics on progress bars..
Search "Progress bar" with all terms selected.
Posted: Sun Mar 27, 2005 12:06 pm
by anjanesh
flush() is not working but flush();ob_flush(); is.
But checking the site
http://php.net/manual/en/function.ob-flush.php
it looks as if ob_flush() won't parse the rest of the data in the buffer?
In a loop,
{
// Retrieve some details
ob_flush();
echo "|";
}
Is it possible that in the middle of forced output some of the details won't be retrieved?
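For reference, a small sketch of the flushing step on its own. As far as the manual describes it, ob_flush() sends whatever is currently in the output buffer and empties it; it does not stop the script, so the retrieval in the loop carries on. progressTick is a made-up name, and the web server or browser may still hold the output back on their side.

```php
<?php
// Hypothetical helper: print one progress mark and push it toward the browser.
function progressTick(string $mark = '|'): void
{
    echo $mark;
    if (ob_get_level() > 0) {
        ob_flush();   // pass the current output buffer's contents on and empty it
    }
    flush();          // ask PHP to hand its pending output to the web server
}
```

Calling progressTick() at the end of each iteration, after the details are retrieved, puts the mark out once that iteration's work is done.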
Posted: Sun Mar 27, 2005 12:09 pm
by feyd
Why not write to a file or a database? That way your connection to the server can drop without you losing track of how the script is doing.
I'd use a database personally.
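A minimal sketch of the file variant, assuming the script is allowed to write to some path on the host. logProgress and the message text are invented for illustration; a database table with a timestamp and a message column would serve the same purpose.

```php
<?php
// Hypothetical helper: append one timestamped line to a progress log.
function logProgress(string $logFile, string $message): void
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . PHP_EOL;
    // FILE_APPEND keeps earlier lines; LOCK_EX avoids interleaved writes.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

The loop would call logProgress() after each page, and the log can be opened in the browser at any time to see how far the crawl has got.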