How to control Googlebot's bandwidth usage?

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

timvw wrote:Untested, just an idea :)
Ok! But that kind of defeats the object because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth but not on server load.
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

Well, the topic was about saving bandwidth, not CPU cycles ;)

Apparently most implementations use filemtime or getlastmod... But (imho) that's not really useful if the database contents change ;)

Probably a checksum isn't a good idea after all... The layers below should handle that ;)
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

Well, the topic was about saving bandwidth, not CPU cycles
Actually, my concern was about server CPU usage. I stated that Googlebots can "really bog down the site". Maybe I should have been clearer. The bandwidth is of little concern, because it's still way under the allotted amount. But when pages that should only take a fraction of a second to process end up taking 5, 10, 20 seconds, it's not good.
Apparently most implementations use filemtime or getlastmod... But (imho) that's not really useful if the database contents change
Yes, this is why I made it into a function that I could simply plug in a timestamp that's taken from the database.
Ok! But that kind of defeats the object because you have to build the page only to find out there has been no change. Yes, it saves on bandwidth but not on server load.
I agree, if it has to render the page to load it into buffer and do an MD5 on it, that's defeating the purpose.


This is the code I put together, mixing and matching from 3 sources that were all intended to do the same thing. Unfortunately, I just don't know how to test whether it works. It's certainly not hurting, because the page that I've tried it on works fine for me. Does anyone have any suggestions on how to see if this works?

Code: Select all

function check_lastmod_header($UnixTimeStamp)
{
	ob_start();
	
	// Format the timestamp directly in GMT; gmdate() avoids
	// having to adjust by the local offset with date("Z").
	$GMT_MTime = gmdate('D, d M Y H:i:s', $UnixTimeStamp).' GMT';
	$ETag = '"'.md5($GMT_MTime).'"';
	
	$INM = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) : NULL;
	$IMS = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? stripslashes($_SERVER['HTTP_IF_MODIFIED_SINCE']) : NULL;
	
	if( ($INM && $INM == $ETag) || ($IMS && $IMS == $GMT_MTime) )
	{
		header("HTTP/1.1 304 Not Modified");
		ob_end_clean();
		exit;
	}
	
	header("Last-Modified: $GMT_MTime");
	header("ETag: $ETag");
	ob_end_flush();
}
Also, I'm not sure how important it is to use the ETag. Most examples and discussion I've found on this seem to only rely on comparing the If-Modified-Since date to your last modified date. The example that I found simply MD5'd the date string to use as the ETag. Someone pointed out that you can use anything as the ETag, so I don't know what the reason for MD5ing it was.
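For what it's worth, here is a minimal sketch of just the GMT formatting and ETag derivation (the helper name is illustrative, not from any library). The md5 only serves to turn the date string into a fixed-length opaque token; any stable string would work as an ETag.

```php
<?php
// Sketch: build a Last-Modified value and a matching ETag from a
// Unix timestamp (e.g. one pulled from the database). gmdate()
// formats directly in GMT, so no offset arithmetic is needed.
function http_date_gmt($unixTimeStamp)
{
    return gmdate('D, d M Y H:i:s', $unixTimeStamp) . ' GMT';
}

$lastModified = http_date_gmt(0);
$etag = '"' . md5($lastModified) . '"';   // any opaque, stable token works

echo $lastModified, "\n";   // Thu, 01 Jan 1970 00:00:00 GMT
?>
```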

How does this code look? Any ideas on testing it?

Thanks everyone!
Swede
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

<off-topic>
The bandwidth is of little concern, because it's still way under the allotted amount.
I can understand that bandwidth is not your main concern, but then I wonder why you gave the topic the title: How to control Googlebot's bandwidth usage?
</off-topic>
I agree, if it has to render the page to load it into buffer and do an MD5 on it, that's defeating the purpose.
That's exactly how I do it ;)


Firefox has an extension, Live HTTP Headers, which gives you access to the headers...
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

I can understand that bandwidth is not your main concern, but then I wonder why you gave the topic the title: How to control Googlebot's bandwidth usage?
Sorry, I guess I originally figured that what I'm trying to do (keep Google from using up too many server resources) would solve both problems. First, it would solve the bandwidth problem, and because Google is looking at less content, you therefore have less CPU usage. (But not if you have to process the page to MD5 it.) I should have titled it "How to control Googlebot's bandwidth/CPU usage?". :)

But anyway... both are a concern to a lot of people. It's not the worst problem to have, since it's a lot harder to get Google to visit your site regularly in the first place. But the only solution they readily make available is to have them slow the rate at which they crawl your site, which ultimately lowers your rankings. Google is smart and has plenty of cash... I wish they would create a method (via robots.txt, headers, whatever) that allowed you to control WHEN their bots visit you. I can afford to lose CPU power between 3 and 5 am. But during the day, it really bogs down the server. And people are impatient when it comes to website response time. I'm that way myself.
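The closest thing robots.txt offers is the Crawl-delay directive, a sketch of which is below. Note the caveat: Googlebot is documented to ignore Crawl-delay (Google's crawl rate is adjusted through its own webmaster tools instead), but some other well-behaved crawlers honor it.

```text
# robots.txt sketch: ask crawlers to wait 10 seconds between
# requests. Googlebot ignores this directive; other bots may
# honor it.
User-agent: *
Crawl-delay: 10
```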

Thanks, Swede
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

I'll take a look at that LiveHTTPHeaders thing for Firefox.
User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

I don't really understand this. What is your bandwidth? The server should always be powerful enough that the connection is the bottleneck. Also, how many Google bots are visiting? It would have to be literally hundreds per hour to have any effect.
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

I don't really understand this. What is your bandwidth? The server should always be powerful enough that the connection is the bottleneck. Also, how many Google bots are visiting? It would have to be literally hundreds per hour to have any effect.
Well, last week this site had about 36,500 views in 320 visits. Hmmmm... also, there's a visitor called become.com that had about 21,000 views in 4 visits! I don't know much about them, but they're using up a lot of resources. I think that number for Google is pretty typical, though. The way I track concurrent users is by checking the number of sessions. I don't know how accurate that is, but during the hard-hit times each day it usually shows 500-600 users (sometimes up to 900). This will usually last about 10-15 minutes, 2 or 3 times per day.

If they could spread them out, it'd be OK. The site can handle 100-200 simultaneous users without any noticeable effect. But after that, it seems to slow down a bit. And when it shows 400+, you notice quite a difference.

The host allows 100 GB, and we use about half. I don't think the connection really is the bottleneck; it's the processor. 99% of the pages are PHP/MySQL driven. The pages that are straight HTML load up fine during this time.

An extreme solution would be to recreate the site so that it no longer builds the content dynamically on each request, but instead builds separate HTML pages for everything: once when initially created, and rebuilt on updates. Of course, this would be very time-consuming to redo. But it would probably help with rankings too, as Google prefers these types of pages over dynamic ones.
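A middle ground between fully dynamic and fully static is caching the rendered output per page; a minimal sketch (cache path, lifetime, and the builder callback are illustrative assumptions, not from any framework):

```php
<?php
// Sketch of output caching: serve a saved copy of the rendered
// page until it expires, so repeated bot hits skip the PHP/MySQL
// work. The cache path and lifetime here are illustrative.
function cached_page($cacheFile, $maxAgeSeconds, $builder)
{
    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);   // cheap static hit
    }
    $html = $builder();                          // expensive dynamic build
    file_put_contents($cacheFile, $html);        // refresh the cache
    return $html;
}

$cacheFile = sys_get_temp_dir() . '/page_cache_demo.html';
@unlink($cacheFile);   // start clean for the demo

echo cached_page($cacheFile, 300, function () {
    return "<html><body>expensive page</body></html>";
});
?>
```

On a real site you would key the cache file on the request URL and delete it whenever the underlying database rows change, so bots and users alike get the static copy most of the time.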

Swede
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

I've added the Live HTTP Headers plugin to Firefox, and when I look at the test page that I've applied my function to, I get an "HTTP/1.x 200 OK" message each time. Googlebot somehow passes along this "If-Modified-Since" date to the server. How do I replicate that, so I can see if my code works?
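One way to replicate it without waiting for the bot is to send the If-Modified-Since header yourself; a sketch (the URL is a placeholder, point it at your own test page):

```php
<?php
// Sketch: send our own If-Modified-Since header, the way the bot
// would, and inspect the status line that comes back.
$url = 'http://localhost/test.php';   // placeholder test URL

$context = stream_context_create(array(
    'http' => array(
        'method'        => 'GET',
        'header'        => "If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT\r\n",
        'ignore_errors' => true,   // keep going on 304/4xx responses
    ),
));

@file_get_contents($url, false, $context);

// PHP fills $http_response_header with the raw response headers;
// the first element is the status line (200 vs 304).
if (isset($http_response_header[0])) {
    echo $http_response_header[0], "\n";
}
?>
```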

Thank you, Swede
User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

Code: Select all

<?php
// Compare the file's modification time (formatted in GMT via
// gmdate) with what the client echoes back in If-Modified-Since.
$last_modified = gmdate("D, d M Y H:i:s", filemtime($_SERVER['DOCUMENT_ROOT'].$_SERVER['PHP_SELF']))." GMT";
if(isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])){
	if($last_modified == $_SERVER['HTTP_IF_MODIFIED_SINCE']){
		header("HTTP/1.1 304 Not Modified");
		exit;
	}
}
header("Last-Modified: $last_modified");
?>
User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

Swede78 wrote:Googlebots somehow pass along this "if-modified-since" date to the server.
They can only do this if you passed it to them in the first place.
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

Yes, I know. The code I put up is essentially like yours. But, how do you know that your code is working? How do you know it is sending the correct header?
User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

Well, if you look at my immediately previous post: the code doesn't produce any output, of course, but if you run it with Live HTTP Headers, the first time the page is called it returns '200', and then after a press of the refresh button it returns '304'. Try it. You've got Firefox and Live HTTP Headers, so give it a go. Do it with my code, as I know for sure it works.
Swede78
Forum Contributor
Posts: 198
Joined: Wed Mar 12, 2003 12:52 pm
Location: IL

Post by Swede78 »

Yep, tried that. And now, after a couple of hours of trying to figure this out, I am still stumped. This is very strange. The code works (both yours and mine), but only from the website's root directory. If I try to use it in a subdirectory, it doesn't work. I don't get it. Any ideas?
User avatar
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

My code above works as it should, at root or in a sub-directory. Make sure you are pressing the refresh button (to the left of the address bar) and not the Go button (to the right of the address bar). Only the refresh button will do a reload; pressing the Go button will just display the headers that relate to the previous request.
Post Reply