timvw wrote: Untested, just an idea
Ok! But that kind of defeats the object, because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth, but not on server load.
How to control Googlebots bandwidth usage?
Well, the topic was about saving bandwidth, not CPU cycles
Apparently most implementations use filemtime() or getlastmod()... but (IMHO) that's not really useful if the database contents change
Probably a checksum isn't a good idea after all... the layers below should handle that
Well, the topic was about saving bandwidth, not CPU cycles
Actually, my concern was about the server CPU usage. I stated that the Googlebots can "really bog down the site"; maybe I should have been clearer. Bandwidth is of little concern, since it's still well under the allotted amount. But when pages that should take only a fraction of a second to process end up taking 5, 10, or 20 seconds, it's not good.
Apparently most implementations use filemtime() or getlastmod()... but (IMHO) that's not really useful if the database contents change
Yes, this is why I made it into a function into which I can simply plug a timestamp taken from the database.
Ok! But that kind of defeats the object, because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth, but not on server load.
I agree: if it has to render the page just to load it into a buffer and run MD5 on it, that defeats the purpose.
This is the code I put together, mixing and matching from three sources that were all intended to do the same thing. Unfortunately, I just don't know how to test whether it works. It's certainly not hurting anything, because the page I've tried it on works fine for me. Does anyone have any suggestions on how to see if this works?
Code:
function check_lastmod_header($UnixTimeStamp)
{
    ob_start();
    // Convert the local timestamp to GMT by subtracting the server's UTC offset
    $MTime = $UnixTimeStamp - date("Z");
    $GMT_MTime = date('D, d M Y H:i:s', $MTime).' GMT';
    $ETag = '"'.md5($GMT_MTime).'"';
    $INM = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) : NULL;
    $IMS = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? stripslashes($_SERVER['HTTP_IF_MODIFIED_SINCE']) : NULL;
    if (($INM && $INM == $ETag) || ($IMS && $IMS == $GMT_MTime))
    {
        header("HTTP/1.1 304 Not Modified");
        ob_end_clean();
        exit;
    }
    header("Last-Modified: $GMT_MTime");
    header("ETag: $ETag");
    ob_end_flush();
}
How does this code look? Any ideas on testing it?
Thanks everyone!
Swede

Firefox has an extension, Live HTTP Headers, which gives you access to the headers...
Bandwidth is of little concern, since it's still well under the allotted amount.
I can understand that bandwidth is not your main concern, but then I wonder why you typed the following as the topic: "How to control Googlebots bandwidth usage?"
I agree: if it has to render the page just to load it into a buffer and run MD5 on it, that defeats the purpose.
That's exactly how I do it.
I can understand that bandwidth is not your main concern, but then I wonder why you typed the following as the topic: "How to control Googlebots bandwidth usage?"
Sorry, I guess I originally figured that what I'm trying to do (keep Google from using up too much server resources) would solve both problems: first the bandwidth problem, and, since Google would be looking at less content, the CPU usage as well (but not if you have to process the page just to MD5 it). I should have titled it "How to control Googlebots bandwidth/CPU usage?"
But anyway, both are a concern to a lot of people. It's not the worst problem to have, since it's much harder to get Google to visit your site regularly in the first place. But the only solution they readily make available is to have them slow the rate at which they crawl your site, which ultimately lowers your rankings. Google is smart and has plenty of cash; I wish they would create a method (via robots.txt, headers, whatever) that let you control WHEN their bots visit you. I can afford to lose CPU power between 3 and 5 a.m., but during the day it really bogs down the server. And people are impatient when it comes to website response time; I'm that way myself.
Thanks, Swede
I don't really understand this. What is your bandwidth? The server should always be powerful enough that the connection is the bottleneck. Also, how many Google bots are visiting? It would have to be literally hundreds per hour to have any effect.
Well, last week this site had about 36,500 views in 320 visits. Hmmm... also, I have a site called become.com that had about 21,000 views in 4 visits! I don't know much about them, but they're using up a lot of resources. I think that number for Google is pretty typical, though. The way I track concurrent users is by checking the number of sessions. I don't know how accurate it is, but during the hard-hit times each day it usually shows 500-600 users (sometimes up to 900). This usually lasts about 10-15 minutes, two or three times per day.
If they could spread the visits out, it'd be OK. The site can handle 100-200 simultaneous users without any noticeable effect, but after that it seems to slow down a bit, and when it shows 400+ you notice quite a difference.
The host allows 100 GB, and we use about half. I don't think the connection is really the bottleneck; it's the processor. 99% of the pages are PHP/MySQL driven. The pages that are straight HTML load up fine during those times.
An extreme solution would be to redo the site so that it no longer builds content dynamically on each request, but instead generates a separate static HTML page for everything: once when initially created, and again whenever it's updated. Of course, this would be very time-consuming to retrofit. But it would probably help with rankings too, as Google prefers this type of page over dynamic ones.
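To make that pre-rendering idea concrete, here is a minimal sketch of the snapshot step; every path, URL, and the OUTDIR variable are hypothetical placeholders, not anything from this thread. In a real setup a cron job would run wget (or a CLI PHP script) against the live pages, so the rendering cost is paid once, off-peak, instead of on every Googlebot request.

```shell
#!/bin/sh
# Hypothetical sketch: serve bots static snapshots instead of live PHP.
# OUTDIR and the URL below are placeholders.
OUTDIR="${OUTDIR:-/tmp/static_cache}"
mkdir -p "$OUTDIR"

# In production this step would fetch the dynamic page once, e.g.:
#   wget -q -O "$OUTDIR/index.html" "http://localhost/index.php"
# Here we stand in for that fetch with a placeholder page:
printf '<html><body>snapshot</body></html>\n' > "$OUTDIR/index.html"

# Run this from cron (say, at 3am) so the CPU cost lands off-peak,
# which is the "control WHEN the bots cost you CPU" idea from above.
ls "$OUTDIR"
```

Only pages whose underlying data actually changed would need regenerating, which is the same Last-Modified idea applied at generation time rather than per request.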
Swede
I've added the Live HTTP Headers plugin into Firefox, and when I look at the test page that I've applied my function to, I get a "HTTP/1.x 200 OK" message each time. Googlebots somehow pass along this "if-modified-since" date to the server. How do I replicate this, so that I can see if my code works?
Thank you, Swede
Code:
<?php
// Use the script file's own mtime as the Last-Modified value.
// Note: gmdate() would give a true GMT time; date() with literal \G\M\T only
// works here because the same string is compared verbatim on the next request.
$last_modified = date("D, d M Y H:i:s \G\M\T", filemtime($_SERVER['DOCUMENT_ROOT'].$_SERVER['PHP_SELF']));
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    // The client sent back the date we gave it earlier; if nothing
    // has changed, answer 304 and skip sending the body entirely
    if ($last_modified == $_SERVER['HTTP_IF_MODIFIED_SINCE']) {
        header("HTTP/1.x 304 Not Modified");
        exit;
    }
}
header("Last-Modified: $last_modified");
?>
Well, if you look at my immediately previous post: the code doesn't produce any output, of course, but if you run it with Live HTTP Headers, the first time the page is called it returns 200, and then after a press of the refresh button it returns 304. Try it. You've got Firefox and Live HTTP Headers, so give it a go. Do it with my code, as I know for sure it works.
My code above works as it should at root or in a sub-directory. Make sure you are pressing the refresh button (to the left of the address bar) and not the go button (to the right of the address bar). Only the refresh button will do a reload. Pressing the go button will just display the header that relates to the previous request.
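For anyone who prefers the command line to a browser extension, the If-Modified-Since header can also be replayed by hand; this is a sketch, and the URL below is a placeholder, with `date -d` in its GNU coreutils form. The key point is that the replayed date string must match the Last-Modified value byte for byte:

```shell
# Build an RFC 1123 date in GMT, the format the Last-Modified header uses.
# Example epoch 1136073600 is midnight, Jan 1 2006 UTC.
LM=$(date -u -d @1136073600 '+%a, %d %b %Y %H:%M:%S GMT')
echo "$LM"   # → Sun, 01 Jan 2006 00:00:00 GMT

# First request: note the Last-Modified (and ETag) the page sends back
#   curl -sI http://example.com/test.php
# Replay it: a working implementation should answer "304 Not Modified"
#   curl -sI -H "If-Modified-Since: $LM" http://example.com/test.php
```

If the second request still comes back 200, the comparison in the PHP is most likely failing on a formatting mismatch between the sent and stored date strings.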