Posted: Sat Jul 16, 2005 6:45 am
by bokehman
timvw wrote:Untested, just an idea

Ok! But that kind of defeats the object because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth but not on server load.
Posted: Sat Jul 16, 2005 7:22 am
by timvw
Well, the topic was about saving bandwidth, not CPU cycles.
Apparently most implementations use filemtime() or getlastmod()... But (imho) that's not really useful if the database contents change.
Probably a checksum isn't a good idea after all... The layers below should handle that.

Posted: Tue Jul 19, 2005 10:49 am
by Swede78
Well, the topic was about saving bandwidth, not CPU cycles.
Actually, my concern was about the server CPU usage. I stated that googlebots can "really bog down the site". Maybe I should have been clearer. The bandwidth is of little concern, because it's still way under the allotted amount. But when the pages should only take a fraction of a second to process and they end up taking 5, 10, 20 seconds, it's not good.
Apparently most implementations use filemtime() or getlastmod()... But (imho) that's not really useful if the database contents change.
Yes, this is why I made it into a function that I could simply plug in a timestamp that's taken from the database.
Ok! But that kind of defeats the object because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth but not on server load.
I agree, if it has to render the page to load it into buffer and do an MD5 on it, that's defeating the purpose.
This is the code I put together, mixing and matching from 3 sources that were all intended to do the same thing. Unfortunately, I just don't know how to test if it works. It's certainly not hurting, because the page that I've tried it on works fine for me. Does anyone have any suggestions on how to see if this works?
Code:
function check_lastmod_header($UnixTimeStamp)
{
    ob_start();
    // Express the timestamp as an RFC 1123 date in GMT
    // (gmdate() replaces the manual "$UnixTimeStamp - date('Z')" offset juggling)
    $GMT_MTime = gmdate('D, d M Y H:i:s', $UnixTimeStamp).' GMT';
    $ETag = '"'.md5($GMT_MTime).'"';
    $INM = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) : NULL;
    $IMS = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? stripslashes($_SERVER['HTTP_IF_MODIFIED_SINCE']) : NULL;
    if (($INM && $INM == $ETag) || ($IMS && $IMS == $GMT_MTime))
    {
        header("HTTP/1.1 304 Not Modified");
        ob_end_clean();
        exit;
    }
    header("Last-Modified: $GMT_MTime");
    header("ETag: $ETag");
    ob_end_flush();
}
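For what it's worth, here's a sketch of how a timestamp taken from the database could be fed into the function above. The column name and query result are invented for illustration:

```php
<?php
// Hypothetical example: the column name and query result are invented.
// A MySQL DATETIME comes back as a string, so convert it into the Unix
// timestamp that check_lastmod_header() expects.
$row = array('last_updated' => '2005-07-19 10:49:00'); // pretend this came from the database
$timestamp = strtotime($row['last_updated'].' UTC');   // treat the stored time as UTC

// Call the function from the post above (guarded so this sketch runs standalone)
if (function_exists('check_lastmod_header')) {
    check_lastmod_header($timestamp);
}
// ... then build and output the page as normal ...
```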
Also, I'm not sure how important it is to use the ETag. Most examples and discussions I've found on this seem to rely only on comparing the If-Modified-Since date to your last modified date. The example that I found simply MD5'd the date string to use as the ETag. Someone pointed out that you can use anything as the ETag, so I don't know what the reason for MD5'ing it was.
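On the MD5 question: it doesn't add any security, it just collapses whatever inputs you choose into a short, fixed-length token that's safe to quote in a header. Any stable string works. A sketch, with made-up mtime/size inputs:

```php
<?php
// The md5() doesn't add security; it just collapses whatever inputs you
// choose into a fixed-length, header-safe token. The inputs (here a
// made-up mtime and file size) are what actually matter: the ETag only
// has to change whenever the page content changes.
function make_etag($mtime, $size)
{
    return '"'.md5($mtime.'-'.$size).'"';
}

$etag = make_etag(1121770140, 4096); // e.g. header("ETag: $etag");
```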
How does this code look? Any ideas on testing it?
Thanks everyone!
Swede
Posted: Tue Jul 19, 2005 10:55 am
by timvw
<off-topic>
The bandwidth is of little concern, because it's still way under the allotted amount.
I can understand that bandwidth is not your main concern, but then I wonder why you chose the following as the topic title: How to control Googlebots bandwidth usage?
</off-topic>
I agree, if it has to render the page to load it into buffer and do an MD5 on it, that's defeating the purpose.
That's exactly how I do it.
Firefox has an extension, Live HTTP Headers, which gives you access to the headers...
Posted: Tue Jul 19, 2005 11:57 am
by Swede78
I can understand that bandwidth is not your main concern, but then I wonder why you chose the following as the topic title: How to control Googlebots bandwidth usage?
Sorry, I guess I originally figured that what I'm trying to do (keep Google from using up too much server resources) would solve both problems. First, it would solve the bandwidth problem, and because you have Google looking at less content, you would therefore have less CPU usage. (But not if you have to process the page just to MD5 it.) I should have titled it "How to control Googlebots bandwidth/CPU usage?".
But anyway... both are a concern to a lot of people. It's not the worst problem to have, since it's usually a lot harder to get Google to visit your site regularly in the first place. But the only solution they readily make available is to have them slow the rate at which they crawl your site, which ultimately lowers your rankings. Google is a smart company with plenty of cash... I wish they would create a method (via robots.txt, headers, whatever) that allowed you to control WHEN their bots visit you. I can afford to lose CPU power between 3-5am. But during the day, it really bogs down the server. And people are impatient when it comes to website response time. I'm that way myself.
Thanks, Swede
Posted: Tue Jul 19, 2005 11:58 am
by Swede78
I'll take a look at that LiveHTTPHeaders thing for Firefox.
Posted: Tue Jul 19, 2005 1:12 pm
by bokehman
I don't really understand this. What is your bandwidth? The server should always be powerful enough that the connection is the bottleneck. Also, how many Google bots are visiting? It would have to be literally hundreds per hour to have any effect.
Posted: Tue Jul 19, 2005 3:12 pm
by Swede78
I don't really understand this. What is your bandwidth? The server should always be powerful enough that the connection is the bottleneck. Also, how many Google bots are visiting? It would have to be literally hundreds per hour to have any effect.
Well, last week, Google had about 36,500 views in 320 visits to this site. Hmmmm... also, a crawler called become.com had about 21,000 views in just 4 visits! I don't know much about them, but they're using up a lot of resources. I think that number for Google is pretty typical, though. The way I track concurrent users is by checking the number of sessions. I don't know how accurate that is, but during the hard-hit times each day it usually shows 500-600 users (sometimes up to 900). This will usually last about 10-15 minutes, 2 or 3 times per day.
If they could spread their visits out, it'd be OK. The site can handle 100-200 simultaneous users without any noticeable effect. But after that, it seems to slow down a bit. And when it shows 400+, you notice quite a difference.
The host allows 100 GB, and we use about half, so I don't think the connection is really the bottleneck. It's the processor: 99% of the pages are PHP/MySQL driven, and the pages that are straight HTML load up fine during these periods.
An extreme solution would be to recreate the site so that it no longer builds the content dynamically on each request, but instead builds a separate HTML page for everything, once when initially created and again whenever it's updated. Of course, this would be very time-consuming to redo. But it would probably help with rankings too, as Google prefers these types of pages over dynamic ones.
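A rough sketch of a middle ground between fully dynamic and fully static, for illustration only: capture the generated output once, serve the saved copy until it goes stale, and only rebuild then. The file location and lifetime here are invented:

```php
<?php
// Hypothetical sketch of page-output caching: serve a saved HTML copy when
// one exists and is fresh, otherwise build the page once and save it.
// The file location and 300-second lifetime are invented for illustration.
function cached_page($cache_file, $max_age, $builder)
{
    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $max_age) {
        return file_get_contents($cache_file);   // cheap: serve the static copy
    }
    $html = $builder();                          // expensive: dynamic PHP/MySQL build
    file_put_contents($cache_file, $html);       // save for subsequent requests
    return $html;
}

// Usage sketch:
echo cached_page(sys_get_temp_dir().'/page_cache_demo.html', 300, function () {
    return "<html><body>expensive page</body></html>";
});
```

This avoids redoing the whole site as static HTML: the dynamic code stays, but each page is only built when the cached copy has expired.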
Swede
Posted: Tue Jul 19, 2005 5:19 pm
by Swede78
I've added the Live HTTP Headers plugin into Firefox, and when I look at the test page that I've applied my function to, I get a "HTTP/1.x 200 OK" message each time. Googlebots somehow pass along this "if-modified-since" date to the server. How do I replicate this, so that I can see if my code works?
Thank you, Swede
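One way to replicate what a bot does, sketched here as an idea: make the first request yourself, note the Last-Modified value that comes back, then repeat the request with that value in an If-Modified-Since header and check whether the status line is a 304. The URL below is a placeholder for your own test page:

```php
<?php
// Hypothetical test harness: send a plain GET, then a conditional GET, and
// report the status line. The URL used in the comments is a placeholder.
function build_conditional_headers($if_modified_since)
{
    return $if_modified_since === null ? '' : "If-Modified-Since: $if_modified_since\r\n";
}

function probe($url, $if_modified_since = null)
{
    $context = stream_context_create(array('http' => array(
        'header'        => build_conditional_headers($if_modified_since),
        'ignore_errors' => true,   // a 304 has no body, so don't treat it as a failure
    )));
    @file_get_contents($url, false, $context);
    return isset($http_response_header[0]) ? $http_response_header[0] : 'no response';
}

// Usage (point at your own test page):
// echo probe('http://localhost/test.php'), "\n";                                  // should show 200
// echo probe('http://localhost/test.php', 'Tue, 19 Jul 2005 10:49:00 GMT'), "\n"; // should show 304
```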
Posted: Tue Jul 19, 2005 6:01 pm
by bokehman
Code:
<?php
// gmdate() so the literal "GMT" suffix actually matches the time zone used
$last_modified = gmdate("D, d M Y H:i:s \G\M\T", filemtime($_SERVER['DOCUMENT_ROOT'].$_SERVER['PHP_SELF']));
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    if ($last_modified == $_SERVER['HTTP_IF_MODIFIED_SINCE']) {
        header("HTTP/1.1 304 Not Modified");
        exit;
    }
}
header("Last-Modified: $last_modified");
?>
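One caveat with the straight string comparison above: it only matches when the client echoes the Last-Modified value back byte-for-byte (which well-behaved bots do). A more tolerant variant, sketched here as an alternative, parses both sides back to timestamps before comparing:

```php
<?php
// Sketch: parse the client's date back to a timestamp instead of comparing
// raw strings, so any valid HTTP-date format the client sends still matches.
function not_modified($file_mtime, $if_modified_since)
{
    $client_time = strtotime($if_modified_since);   // false if unparseable
    return $client_time !== false && $client_time >= $file_mtime;
}
```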
Posted: Tue Jul 19, 2005 6:04 pm
by bokehman
Swede78 wrote:Googlebots somehow pass along this "if-modified-since" date to the server.
They can only do this if you passed it to them in the first place.
Posted: Wed Jul 20, 2005 9:37 am
by Swede78
Yes, I know. The code I put up is essentially like yours. But, how do you know that your code is working? How do you know it is sending the correct header?
Posted: Wed Jul 20, 2005 10:16 am
by bokehman
Well, if you look at my immediately previous post: the code doesn't produce any output, of course, but if you run it with 'Live HTTP Headers', the first time the page is called it returns '200', and then after a press of the refresh button it returns '304'. Try it. You've got Firefox and 'Live HTTP Headers', so give it a go. Do it with my code, as I know for sure it works.
Posted: Wed Jul 20, 2005 5:59 pm
by Swede78
Yep, tried that. And now, after a couple of hours of trying to figure this out, I am still stumped. This is very strange. The code works (both yours and mine), but only from the website's root directory. If I try to use this code in a subdirectory, it doesn't work. I don't get it. Any ideas?
Posted: Thu Jul 21, 2005 4:52 am
by bokehman
My code above works as it should at root or in a sub-directory. Make sure you are pressing the refresh button (to the left of the address bar) and not the go button (to the right of the address bar). Only the refresh button will do a reload. Pressing the go button will just display the header that relates to the previous request.