Crawling an entire site

Ye' old general discussion board. Basically, for everything that isn't covered elsewhere. Come here to shoot the breeze, shoot your mouth off, or whatever suits your fancy.
This forum is not for asking programming related questions.

Moderator: General Moderators

User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Crawling an entire site

Post by anjanesh »

I have written a small script to crawl an entire site and extract some data from it. Parsing takes a lot of time. I've uploaded it to my web host, and I only need it to run once.
What I would like to know is: if the process takes a long time, will my site be banned by the web host? Is this against the rules?
Thanks
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

uh.. that's a question for your host...

You can help your case by making it not use 100% of the processor, though, by adding sleep() or usleep() calls.
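A minimal sketch of the kind of throttled loop feyd is describing — fetch a page, do your extraction, then yield the CPU before the next request. The URL list and the "extraction" step (here just strlen()) are placeholders for your own code:

```php
<?php
// Sketch of a throttled crawl loop: fetch, parse, then pause so the
// script doesn't peg the shared server's CPU or hammer the remote site.
function crawl(array $urls, int $pauseMicroseconds = 500000): array
{
    $results = [];
    foreach ($urls as $url) {
        $html = @file_get_contents($url); // @ suppresses warnings on failure
        if ($html !== false) {
            $results[$url] = strlen($html); // stand-in for real extraction
        }
        usleep($pauseMicroseconds); // yield the CPU between requests
    }
    return $results;
}
```

Half a second per page is an arbitrary starting point; on a 5% CPU cap you may want to pause longer between requests.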
User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

1. So if I do a sleep(1) after every file_get_contents(), the process won't use the processor for the next second. Am I safe that way? It won't hurt the other shared sites' processes?
2. My data transfer limit is 1 GB a month - does the size downloaded by file_get_contents() count against that?
3. I use a lot of regular expressions, all inside a loop - will that consume a lot of CPU?
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

talk to your host.
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

If you only need to run it once, why not run it from your home or office PC, and then upload whatever you're saving to your webspace?
User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

file_get_contents() is in a long loop - it's far slower on my PC because I'm on a 64 kbps cable line.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

The main problem is preventing the script from timing out. Can you set up your computer to "ping" the script, and break the crawling into manageable segments? If you can get each request down to about 10 seconds, you should be in the green.
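One way to sketch that segmented approach: each HTTP request to the script processes only a small batch of URLs and records where it stopped, so repeated "pings" from your own PC drive the whole crawl and no single request runs long enough to time out. The file names (`urls.txt`, `position.txt`) and the batch size are assumptions:

```php
<?php
// Process one small batch of URLs starting at $pos; return the new
// position so the next request can resume where this one stopped.
function crawlBatch(array $urls, int $pos, int $batchSize): int
{
    $end = min($pos + $batchSize, count($urls));
    for ($i = $pos; $i < $end; $i++) {
        $html = @file_get_contents($urls[$i]);
        // ... extract and store data from $html ...
    }
    return $end; // resume point for the next request
}

// Per-request driver: load the saved position, do one batch, save it.
$urls = is_file('urls.txt')
    ? file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : [];
$pos = is_file('position.txt') ? (int) file_get_contents('position.txt') : 0;
$pos = crawlBatch($urls, $pos, 5);
file_put_contents('position.txt', (string) $pos);
echo $pos >= count($urls) ? "done" : "at $pos of " . count($urls);
```

The "ping" can then be as simple as a cron job or a loop on your home PC requesting the page until it prints "done".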
User avatar
m3mn0n
PHP Evangelist
Posts: 3548
Joined: Tue Aug 13, 2002 3:35 pm
Location: Calgary, Canada

Post by m3mn0n »

As long as you don't break the terms of service (TOS) or membership agreement you should be fine.

Make sure they allow you to do whatever it is you want to do. If it is not clearly stated within the TOS, then ask a customer service rep.
User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

What I have is a cheap web host - $17 a year for 10 MB of space plus domain registration - and I didn't find any TOS. There's no customer support or anything; the hosting company is run by a single person. He's actually a reseller. The server is in the US - I think its name is "mars", because the emails have that name in the detailed header. Going by the server's clock, it's at GMT-5.
I ran a script that extracts some info from a site - it took some 380 seconds.
With a different filter: 33 minutes.
Would this be allowed on other web hosts, like the ones you host your sites on?
I want to know what the majority allow and don't allow.
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Shared hosts mostly allow only so much average CPU usage per account. This prevents someone from taking up too much processor time and making every site on the server respond slowly or not at all. I believe most set a limit around 5%. If you use too much, they will typically disable your account until they can contact you.

If this needs to run often, you need to upgrade your hosting - either to a dedicated or colocated server, since then the host usually won't care how much processor you use.
User avatar
m3mn0n
PHP Evangelist
Posts: 3548
Joined: Tue Aug 13, 2002 3:35 pm
Location: Calgary, Canada

Post by m3mn0n »

I generally dislike resellers, but in cases like this it helps, because corporate policies and bureaucracy don't stop you from doing things like what you want to do.

I say go ahead and do it, and if he has a problem with it, he knows how to contact you.
User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

feyd wrote:Shared hosts mostly allow only so much average CPU usage per account. This prevents someone from taking up too much processor time and making every site on the server respond slowly or not at all.
Will sleep(1) at regular intervals - say, inside the loop - help? That way my process won't be taking the entire CPU all at once.
feyd wrote:I believe most set a limit around 5%.
Uh oh - I've set set_time_limit(0);

In the long loop I tried an echo "|"; after each iteration to make it look like a progress bar, but the entire |||||... gets output in one go after the loop finishes. I need this progress indicator so I know the script is getting somewhere and isn't stuck in an infinite loop.
{
...
echo "|";
}

Thanks
User avatar
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto
Contact:

Post by John Cartwright »

There have been a few topics on progress bars.

Search for "progress bar" with all terms selected.
User avatar
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

flush() on its own isn't working, but flush(); ob_flush(); is.
But checking http://php.net/manual/en/function.ob-flush.php
it looks as if ob_flush() won't process the rest of the data in the buffer?
In a loop,
{
// Retrieve some details
ob_flush();
echo "|";
}
Is it possible that, in the middle of the forced output, some of the details won't be retrieved?
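For what it's worth, ob_flush() only pushes what is already sitting in PHP's output buffer down to the web server; it doesn't skip or discard any of your processing. A common pattern is to emit the marker first and then flush both buffers, as sketched below (whether the browser actually renders each chunk immediately also depends on server-side buffering such as gzip, which PHP can't control):

```php
<?php
// Crude progress indicator: do the work, emit a marker, then flush
// PHP's buffer to the server and the server's buffer to the client.
for ($i = 0; $i < 10; $i++) {
    // ... retrieve some details for step $i ...
    echo '|';
    if (ob_get_level() > 0) {
        ob_flush(); // move PHP's output buffer to the web server...
    }
    flush();        // ...and ask the server to push it to the browser
}
```

The ob_get_level() guard avoids a notice when no output buffer is active (for example, when testing from the command line).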
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Why not write to a file or a database? That would let your connection to the server drop without losing track of whether the script is still running.

I'd use a database, personally.
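The file-based variant can be as small as appending one timestamped line per page, so you can disconnect and still check progress later (e.g. with `tail -f crawl.log`). The log file name here is an assumption:

```php
<?php
// Append a timestamped progress line to a log file. LOCK_EX guards
// against interleaved writes if the script ever runs concurrently.
function logProgress(string $message, string $logFile = 'crawl.log'): void
{
    $line = date('Y-m-d H:i:s') . "  $message\n";
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

Call logProgress("fetched $url"); after each page. A database table with a timestamp column works the same way and is easier to query afterwards.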
Post Reply