
How to control Googlebots bandwidth usage?

Posted: Tue Jul 12, 2005 2:25 pm
by Swede78
Over the past few months, Googlebot seems to be taking over one site that I made. It's good for rankings, but it can really bog down the site on a daily basis. I want to control this without hurting rankings. Google will only slow the rate at which they visit. But updates are done on a daily basis, so I want them to visit every day.

Doing a bit of research, I came across the If-Modified-Since (IMS) HTTP header. This seems to be exactly what I want. I'm just not sure if I'm setting this up correctly, and I don't know how to test it.

I include this function below and pass the "page last modified date" (as a Unix timestamp) to it. I've seen this code in a few different pages. I made a few minor changes. The original code used getallheaders() which doesn't work with IIS.

Code: Select all

function check_lastmod_header($UnixTimeStamp)
{
	ob_start();
	
	$MTime = $UnixTimeStamp - date("Z");
	$GMT_MTime = date('D, d M Y H:i:s', $MTime).' GMT';
	
	if( isset($_SERVER["If-Modified-Since"]) && $_SERVER["If-Modified-Since"] == $GMT_MTime )
	{
		header("HTTP/1.1 304 Not Modified");
		ob_end_clean();
		exit;
	}
	
	header("Last-Modified: ".$GMT_MTime);
	ob_end_flush();
}
But what doesn't seem right to me is that it checks whether the IMS date is equal to the "page last modified" date. Shouldn't it be...

Code: Select all

...
if( isset($_SERVER["If-Modified-Since"]) && $_SERVER["If-Modified-Since"] <= $GMT_MTime ) {
...
If it's correct the way it is, does anyone have any suggestions on how I can test it? I suppose I can try it and see if Googlebot still bogs down the site. But I'd prefer a more scientific way. Otherwise, does anyone know of a good way to solve the problem I'm having with these bots using another method?

Thanks in advance!
Swede
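
One way to test this without waiting for a crawler is to send the conditional request yourself and look at the status line that comes back. The following is only a rough sketch: the function names are made up for illustration, the URL is whatever page you want to check, and it assumes PHP 7.1 or later (where get_headers() accepts a stream context) with allow_url_fopen enabled.

```php
<?php
// Hypothetical helpers for checking conditional-GET behaviour by hand.

// Build stream-context options that send an If-Modified-Since request header.
function conditional_get_options($lastModified)
{
    return array(
        'http' => array(
            'method'        => 'GET',
            'header'        => 'If-Modified-Since: ' . $lastModified . "\r\n",
            'ignore_errors' => true, // still return headers for non-200 responses
        ),
    );
}

// Extract the numeric status code from a line such as "HTTP/1.1 304 Not Modified".
function status_code_from_line($statusLine)
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Request $url (supplied by the caller) and return the status code, or null on failure.
function conditional_get_status($url, $lastModified)
{
    $context = stream_context_create(conditional_get_options($lastModified));
    $headers = get_headers($url, false, $context);
    return $headers ? status_code_from_line($headers[0]) : null;
}
```

To use it, first fetch the page normally and note the exact Last-Modified value it sends, then pass that same string to conditional_get_status(). If the check in the page is working, the result should be 304; with any other date it should be 200.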

Posted: Thu Jul 14, 2005 1:50 pm
by Swede78
Please... any suggestions would be greatly appreciated.

Posted: Thu Jul 14, 2005 2:43 pm
by bokehman
Swede78 wrote:Please... any suggestions would be greatly appreciated.
Well, as far as I can see it should be == because it is a string being compared, not an integer. I have written a quick include that might help you. Don't forget that PHP files don't send a Last-Modified header by default, so one needs to be sent with the 200 response (initial connection) so the script can check against it when the client returns. I have included this in my function.

Code: Select all

$file = $_SERVER['DOCUMENT_ROOT'].$_SERVER['PHP_SELF']; // Fill in path and filename
$last_modified = gmdate("D, d M Y H:i:s \G\M\T", filemtime($file)); // Date last modified, in GMT
$headers = apache_request_headers(); // Get headers sent from client browser (Apache only)

if(isset($headers['If-Modified-Since']) && $last_modified == $headers['If-Modified-Since']){
	// Client's copy is current: send 304 with no body
	header("HTTP/1.1 304 Not Modified");
	exit;
}

// Otherwise send Last-Modified so the client can return it next time
header("Last-Modified: $last_modified");

Posted: Thu Jul 14, 2005 3:12 pm
by ol4pr0
Google has huge pages dedicated for questions like that.

Posted: Thu Jul 14, 2005 4:01 pm
by bokehman
ol4pr0 wrote:Google has huge pages dedicated for questions like that.
Very helpful! What search terms should he have used?

Posted: Thu Jul 14, 2005 4:35 pm
by ol4pr0
:idea: Search term: Googlebots bandwidth usage :roll:

Posted: Thu Jul 14, 2005 5:28 pm
by bokehman
OK! Here is some more info for you.
This is legal:

$headers = apache_request_headers();
$headers['If-Modified-Since'];

and so is this:

$_SERVER['HTTP_IF_MODIFIED_SINCE']

But the following is certainly not:

$_SERVER["If-Modified-Since"]
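
The reason is that, under CGI-style server interfaces, request headers are exposed in $_SERVER with an HTTP_ prefix, upper-cased, and with dashes turned into underscores. A small sketch of that mapping follows; the helper names are hypothetical, and note that Content-Type and Content-Length are exceptions (they appear as CONTENT_TYPE and CONTENT_LENGTH without the prefix).

```php
<?php
// Map a request header name to the $_SERVER key used for it.
// (Does not handle the Content-Type / Content-Length exceptions.)
function server_key_for_header($name)
{
    return 'HTTP_' . strtoupper(str_replace('-', '_', $name));
}

// Read a request header from a $_SERVER-like array; returns null if it was not sent.
function request_header(array $server, $name)
{
    $key = server_key_for_header($name);
    return isset($server[$key]) ? $server[$key] : null;
}
```

Calling request_header($_SERVER, 'If-Modified-Since') should then work the same on IIS and Apache, since it does not depend on apache_request_headers().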

Posted: Thu Jul 14, 2005 5:41 pm
by timvw
Below is a little snippet of my code... But I should read http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html again to verify stuff ;)

Code: Select all

<?php

$etag = isset($_SERVER['HTTP_IF_NONE_MATCH'])? $_SERVER['HTTP_IF_NONE_MATCH'] : null;
$lm = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;
// The ETag below is sent wrapped in double quotes, so strip them before comparing
if (($etag && trim($etag, '"') == $pubdate) || ($lm && $lm == $pubdate))
{
    header('HTTP/1.1 304 Not Modified');
    exit();
}


header('Content-type: text/xml; charset=UTF-8');
header('Last-Modified: ' . $pubdate);
header('ETag: "' . $pubdate . '"');
echo '<?xml version="1.0" encoding="UTF-8"?>';
echo $feed;
?>

Posted: Fri Jul 15, 2005 2:35 pm
by Swede78
The first place I found the concept of doing this was posted on Google's forums. But, there was a lot of confusion among the posts. You can read the discussion here: http://www.webmasterworld.com/forum3/6005.htm

Bokehman, ahhh... I shouldn't have assumed that there was a $_SERVER['If-Modified-Since'] since I don't have Apache and had to translate the code. Thanks for catching that.

Also, you mention that it should use == because it is a string being compared and not an integer. Not quite sure I follow you there. You can compare integers with ==, can't you?

Code: Select all

if( isset($_SERVER["HTTP_IF_MODIFIED_SINCE"]) && $_SERVER["HTTP_IF_MODIFIED_SINCE"] == $GMT_MTime )
In the code above, I was wondering why <= wouldn't be better. Because what if the "If-Modified-Since" date that Google stores is older than the date the page was last modified? You wouldn't want to send a "Not Modified" header then, right?

Also, the reason I made mine as a function, was so that I can throw in dynamic dates, as the php file itself isn't necessarily being updated.

Just a bit confused as to how this all works.


Thanks for the responses, I will give your code a try timvw.

Swede

Posted: Fri Jul 15, 2005 2:50 pm
by bokehman
Swede78 wrote:Also, you mention that it should use == because it is a string being compared and not an integer. Not quite sure I follow you there. You can compare integers with ==, can't you?

Code: Select all

if( isset($_SERVER["HTTP_IF_MODIFIED_SINCE"]) && $_SERVER["HTTP_IF_MODIFIED_SINCE"] == $GMT_MTime )
In the code above, I was wondering why <= wouldn't be better. Because what if the "If-Modified-Since" date that Google stores is older than the date the page was last modified? You wouldn't want to send a "Not Modified" header then, right?
Swede
What you are comparing is a pattern and not a numeric value. Unless you convert the date to a numeric value and compare it that way you cannot use '<' or '>'.
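
A minimal sketch of that numeric comparison, assuming strtotime() can parse the date the client sends (it handles the usual "D, d M Y H:i:s GMT" form). The function name is made up for illustration. Note the direction of the test: a 304 is appropriate when the content's last-modified time is no later than the date the client sent, not the other way round.

```php
<?php
// Decide whether a 304 can be sent, comparing dates as numbers rather than strings.
// $ifModifiedSince: raw header value, or null if the client did not send one.
// $lastModifiedTs:  Unix timestamp of when the content last changed.
function not_modified_since($ifModifiedSince, $lastModifiedTs)
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // no conditional header: send the full page
    }
    $since = strtotime($ifModifiedSince);
    if ($since === false) {
        return false; // unparseable date: play safe and send the full page
    }
    // Unchanged if the content is no newer than the client's copy.
    return $lastModifiedTs <= $since;
}
```

With this, a client holding a copy dated after the last change still gets a 304, while a client holding an older copy gets the full page.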

Posted: Fri Jul 15, 2005 3:22 pm
by Swede78
Yes, true. I had a brainfart, I guess. I usually work with Unix timestamps or dates in a string format of YYYY-MM-DD, which allow you to use those types of operators. I just wasn't thinking that these are strings now. But, as you mention, you could convert the If-Modified-Since string into a number and compare them. Of all the samples I've seen that do this, though, they all use ==. So I'll hope and assume that it's good enough.

If Google stores the "If-Modified-Since" date based on the "Last-Modified" header that you pass, then checking whether they're equal will work. It will just take a while before Google goes through each page and stores the correct date to use for comparison. It won't work the first time it sees this code. But that's OK, as long as it catches up eventually.

Also, something that Timvw may want to check with his code... I just started looking into the ETag header. It seems to me, from examples I've found, that the ETag is stored as an MD5. Just something you may want to look into.

Posted: Fri Jul 15, 2005 6:17 pm
by Roja
Swede78 wrote: Also, something that Timvw may want to check with his code... I just started looking into the ETag header. It seems to me, from examples I've found, that the ETag is stored as an MD5. Just something you may want to look into.
Nope. Here is the ENTIRE entry from the RFC about what an etag is:
The ETag response-header field provides the current value of the entity tag for the requested variant. The headers used with entity tags are described in sections 14.24, 14.26 and 14.44. The entity tag MAY be used for comparison with other entities from the same resource (see section 13.3.3).

ETag = "ETag" ":" entity-tag

Examples:

ETag: "xyzzy"
ETag: W/"xyzzy"
ETag: ""
That's it. It can be literally *anything* as long as it's "ETag" ":" entity-tag.

MD5 just happened to be convenient for quickly checking that an entire page's contents hadn't changed without knowing whether the contents changed (sounds funny, but true). As a result, most implementations use it, but it's by no means required.

I am gunning for next year's IATM award, Timvw. Muahahah..

Posted: Fri Jul 15, 2005 6:43 pm
by timvw
Swede78 wrote:Yes, true. I had a brainfart, I guess. I usually work with Unix timestamps or dates in a string format of YYYY-MM-DD, which allow you to use those types of operators. I just wasn't thinking that these are strings now. But, as you mention, you could convert the If-Modified-Since string into a number and compare them. Of all the samples I've seen that do this, though, they all use ==. So I'll hope and assume that it's good enough.
Maybe you will feel better if you interpret the code like this:

- If the client sends the exact same datetime, he receives a 304.
- In all other cases (more recent, less recent) he will receive the data.
Swede78 wrote: Also, something that Timvw may want to check with his code... I just started looking into the ETag header. It seems to me, from examples I've found, that the ETag is stored as an MD5. Just something you may want to look into.
Before I wrote the code I also had a look at many examples that used md5 to generate an ETag. But after reading the RFC (link in previous post) I didn't see why I would need to md5 it.

Roja wrote: MD5 just happened to be convenient for quickly checking that an entire page's contents hadn't changed without knowing whether the contents changed (sounds funny, but true). As a result, most implementations use it, but it's by no means required.
Now I can see why I would want to md5 it ;) Never thought about the concept of passing a checksum of my content in the header :)

Roja wrote: I am gunning for next year's IATM award, Timvw. Muahahah..
That's ok, but dinosaurs only exist in movies these days :))

Posted: Sat Jul 16, 2005 1:06 am
by bokehman
timvw wrote:Now I can see why I would want to md5 it ;) Never thought about the concept of passing a checksum of my content in the header :)
How would that be done?

Posted: Sat Jul 16, 2005 6:36 am
by timvw
Untested, just an idea :)

Code: Select all

ob_start();

// do stuff....

// retrieve page content
$content = ob_get_contents();

// calculate hash as etag
$etag = md5($content);
header('ETag: "' . $etag . '"'); // entity tags are sent as quoted strings

// output 
ob_end_flush();
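
The idea above only sends the tag. To actually save bandwidth the script also has to compare it with what the client sends back in If-None-Match and answer 304 when they agree. A rough, equally untested sketch of that step follows; the function names are hypothetical, and MD5 is just the convenient choice discussed earlier, not a requirement.

```php
<?php
// Build a quoted entity tag from the page content.
function make_etag($content)
{
    return '"' . md5($content) . '"';
}

// True when the client's If-None-Match value lists our tag (or is "*").
function etag_matches($ifNoneMatch, $etag)
{
    if ($ifNoneMatch === null || $ifNoneMatch === '') {
        return false;
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        if (strpos($candidate, 'W/') === 0) {
            $candidate = substr($candidate, 2); // ignore the weak-validator prefix
        }
        if ($candidate === '*' || $candidate === $etag) {
            return true;
        }
    }
    return false;
}

// Possible use in a page script (left as comments so the helpers can be tried on their own):
// ob_start();
// ... generate the page ...
// $content = ob_get_clean();
// $etag    = make_etag($content);
// $sent    = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? $_SERVER['HTTP_IF_NONE_MATCH'] : null;
// header('ETag: ' . $etag);
// if (etag_matches($sent, $etag)) {
//     header('HTTP/1.1 304 Not Modified');
//     exit;
// }
// echo $content;
```

Keep in mind that this approach still builds the whole page on every request before hashing it, so it reduces bandwidth but not the work the server does per hit.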