Problem with output buffering
Posted: Tue Dec 06, 2005 5:40 pm
by Swede78
A while back I started a discussion about controlling googlebot traffic:
viewtopic.php?t=35498&highlight=
My problem was too many bots hitting one of my sites at the same time and bogging the server down. Based on suggestions from google's forum, I found some code that sent headers telling the googlebots that the page was not new. This code didn't work for me at the time, because I had output_buffering turned ON in my php.ini file. That apparently caused the "304 Not Modified" headers to be overwritten by PHP's default header creation.
Well, it's been a while since I've worked on this dilemma. But recently I turned output_buffering OFF, and the code does actually work. However, I'm having one problem that I can't solve. If I try to start a session after this function is called, I get a warning that the headers have already been sent. It's just a warning, and doesn't seem to cause problems with the page itself. But I don't like having my error log fill with these warnings, so I can't just let it be.
Code:
// $LastMod_UnixTS is a Unix timestamp of the page's last modification
function check_lastmod_header($LastMod_UnixTS)
{
    ob_start();
    // Build the validators: an RFC 1123 date in GMT, plus an ETag derived from it.
    // gmdate() gives GMT directly, without the manual date("Z") offset math.
    $GMT_MTime = gmdate('D, d M Y H:i:s', $LastMod_UnixTS) . ' GMT';
    $ETag = '"' . md5($GMT_MTime) . '"';
    $INM = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) : NULL;
    $IMS = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? stripslashes($_SERVER['HTTP_IF_MODIFIED_SINCE']) : NULL;
    // If the client's validators match ours, answer 304 and send no body
    if (($INM && $INM == $ETag) || ($IMS && $IMS == $GMT_MTime))
    {
        header('Status: Not Modified', true, 304); // third argument sets the actual 304 code
        ob_end_clean();
        exit();
    }
    header("Last-Modified: $GMT_MTime");
    header("ETag: $ETag");
    ob_end_flush();
}
My question is... can I have output_buffering turned on, write a header (like above), and get PHP to not overwrite that header? OR, can I keep output_buffering off, use this function and start a session after it without getting an error?
If I start the session before this function is called, the function just doesn't work. I'm really close to just giving up. The only other solution I can think of is to check whether the visitor is a googlebot through HTTP_USER_AGENT and feed them a different response. I've read that this can upset the google gods and hurt page rankings. How they would know is beyond me, but google certainly has the resources, so I wouldn't doubt that it's possible.
Thanks in advance,
Swede
Posted: Wed Dec 07, 2005 4:21 am
by foobar
This is a strange error you've got there. Try putting ob_start() after you set the session variables.
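Roughly this order (a sketch; check_lastmod_header() being your function from the first post, and $page_mtime a stand-in for your timestamp):

```php
<?php
// Start the session first, then begin buffering, then do the
// conditional-GET check. The point of this sketch is the ordering.
session_start();                     // session headers go out here
ob_start();                          // everything after this is buffered
check_lastmod_header($page_mtime);   // $page_mtime: your last-modified timestamp
// ... render the rest of the page ...
```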
Posted: Wed Dec 07, 2005 10:25 am
by John Cartwright
Moved to PHP-Code.
Posted: Wed Dec 07, 2005 5:00 pm
by Swede78
foobar wrote:This is a strange error you've got there. Try putting ob_start() after you set the session variables.
I did try that, and then the session error doesn't happen. Which makes sense; I understand that you can't start a session after the headers have already been sent. But if I start the session before calling this function, then the function itself doesn't work properly. What happens is that it does not give the "304 Not Modified" header when it should (after reloading a page that hasn't changed).
I think I have a catch-22 situation here. Either use this function to keep the googlebot overload down to a minimum and have tons of log errors, or let them infest the site error-free.
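Edit: one thing I've since read (not yet tested, so treat this as a guess) is that session_start() sends its own anti-cache headers - session.cache_limiter defaults to "nocache", which pushes out Pragma/Cache-Control/Expires headers that stop clients from ever revalidating. If that's what breaks the 304 check when the session comes first, something like this might let me start the session first after all:

```php
<?php
// Suppress PHP's default "nocache" session headers, then start the
// session before the conditional-GET check. Untested guess on my part.
session_cache_limiter('');           // don't send Pragma/Cache-Control/Expires
session_start();
check_lastmod_header($page_mtime);   // $page_mtime: the page's last-mod timestamp
```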
Thanks, Swede
Posted: Wed Dec 07, 2005 5:31 pm
by Ambush Commander
I'd suggest doing some hardcore profiling and tuning.
Posted: Fri Dec 09, 2005 9:08 am
by Swede78
Profiling? Of what, the googlebots? And what do you mean by hardcore? Is there any other way to detect them besides using http_user_agent? What do you suggest I tune that would help solve my problem? Please be a bit more specific.
Thanks, Swede
Posted: Fri Dec 09, 2005 9:12 am
by Ambush Commander
Basically, having Googlebots invade your system shouldn't cause it to crash. This is indicative of two problems:
1. Rendering pages takes too long
2. Your average page size is too big (not including imgs: I don't think google grabs those)
You are trying to hide a more fundamental problem. What if your server gets Slashdotted?
I don't know what to tune. That's why you profile to find the slow spots, and then you tune. Tune the slowest parts first.
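Even before installing a real profiler, a crude microtime() wrapper will point at the biggest offenders (a sketch; microtime(true) needs PHP 5, and render_front_page is a made-up example name):

```php
<?php
// Crude wall-clock timing for a suspect section of code; good enough
// to find the worst spots before reaching for a real profiler like APD.
function time_section($label, $func)
{
    $start = microtime(true);   // PHP 5; on PHP 4 use explode(' ', microtime())
    call_user_func($func);
    error_log(sprintf('%s took %.4f s', $label, microtime(true) - $start));
}

// e.g. time_section('front page', 'render_front_page');  // hypothetical function
```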
Profiling will be a little tricky depending on the configuration of your host: you may have to do it on a subpar machine locally. I've had success with APD, but it was a bugger to install and configure on my machine, so I'll be more than happy to help.
Posted: Fri Dec 09, 2005 9:40 am
by Roja
Ambush Commander wrote:Basically, having Googlebots invade your system shouldn't cause it to crash. This is indicative of two problems:
1. Rendering pages takes too long
2. Your average page size is too big (not including imgs: I don't think google grabs those)
You are oversimplifying. The googlebots have a well-known, heavy impact on sites. A great example is phpBB: when google spiders it, the session table can fill up in a matter of minutes, because each spider gets a new session id. With thousands of spiders, boom, the session table is overloaded.
That's not time to render, and that's not a page size that is too large.
However, the OP is having truly bizarre behavior in multiple categories, so I can't hope to offer a solution.
Posted: Fri Dec 09, 2005 9:43 am
by Ambush Commander
Roja wrote:A great example is phpBB: when google spiders it, the session table can fill up in a matter of minutes, because each spider gets a new session id.
Does that mean I can DOS any site with phpBB on it? I've never agreed with phpBB's approach of issuing session cookies for everyone (I mean, can't an anonymous reader get by without a session?). Here's a performance issue bundled with it.
Roja wrote:That's not time to render, and that's not a page size that is too large.
That is true. I totally forgot about Session.
Posted: Fri Dec 09, 2005 9:53 am
by Roja
Ambush Commander wrote:A great example is phpBB: when google spiders it, the session table can fill up in a matter of minutes, because each spider gets a new session id.
Does that mean I can DOS any site with phpBB on it? I've never agreed with phpBB's approach of issuing session cookies for everyone (I mean, can't an anonymous reader get by without a session?). Here's a performance issue bundled with it.
An anonymous reader can get by without a session. However, by default php adds session tags (trans_sid) to the links, and google follows them. With hundreds (thousands?) of bots hitting at once, poof.
As to whether you can DOS any site with it, probably not. Many sites have added a fix or a plugin that prevents it. Olympia (the upcoming version of phpBB) has a different method for dealing with sessions. And finally, you'd have to follow the links the same way google does, and at the same speed - if they didn't have any of those fixes in place.
But yes, by default, if you do the same thing google does, in my experience it will take phpbb down.
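The usual fix, by the way, is to keep session ids out of URLs entirely, so the spiders never collect them in the first place (a sketch; both settings can also live in php.ini):

```php
<?php
// Keep session ids out of URLs so each spider request doesn't mint
// a fresh session. Equivalent php.ini settings:
//   session.use_trans_sid = 0
//   session.use_only_cookies = 1
ini_set('session.use_trans_sid', 0);    // no PHPSESSID appended to links
ini_set('session.use_only_cookies', 1); // ignore ids passed in the URL
session_start();
```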
Ambush Commander wrote:
That's not time to render, and that's not a page size that is too large.
That is true. I totally forgot about Session.
There is also the page that renders quickly and loads quickly, but that eats a pile of memory. Imagine a simple php page that defines a 100,000 element array and then does "hello world".
There are a number of underlying problems that mass connections can uncover, ranging from memory to server response (http OR db) to network issues, and more. All of which just shores up your point - it's not a "googlebot" problem, it's a fundamental site problem. He will definitely need a robust solution that fixes the issue across the board - or slashdot will cripple him!
Posted: Fri Dec 09, 2005 11:04 am
by Swede78
Wow... you guys are quick to respond. The googlebots don't crash this site. They just bog it down. For example... if a page takes 1 second to load with a normal amount of visitors, it'll take 5, 6 or 7 seconds when the googlebots come. And I do use sessions on many pages to detect whether they're a logged in visitor or not. I don't see any other practical way to keep track of that. 6 or 7 seconds to me, when I know it should be nearly instant, is too much. Maybe the dial-up users won't see any difference. But, I really think that a lot of users would get fed up with the slowness while googlebots are there.
I can see that when they come, they come in the hundreds. I've seen as many as 900 at a time. The site still doesn't crash, but it certainly slows it down to a crawl. On average, they'll show up with 300-400 spiders per day for an hour or so at a time. Google is the only one that does this, the other bots, MSN, etc. never bog it down. I've tried contacting google, they will only decrease the frequency of when they come, not how many visits at once. Which may be an option, but I want google to come, as everyone who wants better ranks does.
My only other solution is to start building static HTML pages instead of using dynamic pages. But, on one hand, I don't like that idea, because I'd have thousands of pages to build (and rebuild when updating). Not to mention that doing so is quite a large project in itself. And if I ever just wanted to change a menu link or graphic, it'd take so long to re-build every page. On the other hand, maybe that wouldn't be such a bad idea, since it could be done automatically. It'd just be a once-in-a-while complete re-build. Also, not only would this help performance, it would be more google-friendly in two ways. First, "Not Modified" headers would be automatic with plain html pages. And second, I believe google rates html pages higher than dynamic pages that have a request variable in the url.
I just thought of something. Not only would I have about 6000 "detail" pages to build, each of these has 8 ways to resort the data, and some have multiple rows of data that I split into x amount of subpages. I'd be looking at possibly 100,000 html pages to build/re-build. This in itself could possibly bog the system down. Most of these pages would not change after their initial build. But, like I said, any minor graphic change or link change, would require a complete rebuild of all the html pages. Not sure if it would be wise to do this.
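For what it's worth, the build step I'm imagining is just output buffering plus a file write, something like this (rough sketch; build_detail_page() and the cache path are made up):

```php
<?php
// Render one dynamic "detail" page and save the output as static HTML.
// build_detail_page() is a made-up stand-in for my real rendering code.
function cache_detail_page($id)
{
    ob_start();
    build_detail_page($id);                        // hypothetical renderer
    $html = ob_get_clean();
    file_put_contents("cache/detail_$id.html", $html);  // file_put_contents needs PHP 5
}
```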
Most of my pages use mysql queries. But, all visitor-accessible pages only use simple SELECT queries. No page that the bots see should take more than a second to parse. So, I don't think I can do much to improve performance (besides converting everything to use pre-built HTML pages).
Maybe upgrading the server hardware would help. Right now it's on a 1.8 GHz Celeron with 500 MB of RAM. It may need a memory upgrade.
Thanks for the replies!
Posted: Fri Dec 09, 2005 5:40 pm
by bokehman
Sorry to be disappointing, but Google doesn't send either of those headers. I know this because I keep a custom Apache log containing those two items. I've also run a regex over the logs of one of my static sites going back 8 months, and google has never been sent a 304 (which indicates the same thing).
Posted: Tue Dec 13, 2005 1:39 pm
by Swede78
I wouldn't be able to test that myself. All I know is that the original instructions for doing this actually come from google. They're the ones who recommend this as a way to tell their spiders that they've already been to a dynamic page that hasn't changed. When I had that code running, it did make a difference. I was getting those "Can't send headers" warnings, but I could tell that the googlebots were spending much less time at a time on the site.
Posted: Tue Dec 13, 2005 2:14 pm
by bokehman
How are you checking the page isn't modified? Or are you just trying to save bandwidth?
I use 3 lines of defence...
1) send 304 headers for unchanged pages
2) cache all pages (relevant cached pages are destroyed after database updates)
3) gzip all output
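The gzip part, at least, costs almost nothing to add (assuming PHP's zlib extension is compiled in):

```php
<?php
// ob_gzhandler buffers the page and gzips it for clients that send
// Accept-Encoding: gzip; everyone else gets the page uncompressed.
ob_start('ob_gzhandler');
// ... generate the page as usual ...
```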
Posted: Wed Dec 14, 2005 4:50 pm
by Swede78
bokehman wrote:How are you checking the page isn't modified? Or are you just trying to save bandwidth?
I use Firefox's plugin called "Live HTTP Headers".
bokehman wrote:I use 3 lines of defence...
1) send 304 headers for unchanged pages
2) cache all pages (relevant cached pages are destroyed after database updates)
3) gzip all output
I'm a bit confused, Bokehman... a few days ago you wrote that google doesn't send those headers (referring to the code in my initial post, I believe). So why do you send 304 headers for unchanged pages? Is that just so regular users' browsers pull a cached version from memory?
99% of my pages contain dynamic content. I don't think caching them would be a good idea, even if it were possible on Win/IIS with dynamic pages (correct me if I'm wrong). And compressing the data may also not be a good idea with mostly dynamic content; it would create a lot of overhead on the server to constantly re-compress the same pages over and over.
Thanks, Swede