Page 3 of 3
Posted: Thu Jul 21, 2005 9:28 am
by Swede78
Huh? What's an address bar?
I initially started testing this code on a page located a few directories deep. It was a no-go. As I stated several posts ago, it would give me the 200 OK message on the first load and when refreshing. I put your code in a new page in the root directory, and what do ya know... it worked.
I thought, ok, I guess my code doesn't work. But, out of curiosity I tried my code in a page in the root dir as well, and it worked. I would get a 304 "Undescribed" code. Not sure why it says "undescribed" - maybe that has to do with the server. But, at least it's not a 200 OK message. So, now my code was working too (so I thought). But, as soon as I move that page to a subdir, it stops working.
So, I think my next step here (after I learn the intricacies of the refresh vs go button - just kidding) is to see whether or not my server supports if-modified-since and 304 codes. I would think both are built-in by default in IIS6, but maybe not.
Thanks, Swede
Posted: Thu Jul 21, 2005 9:40 am
by bokehman
Easiest way to check that is just to put a straight HTML document in the sub-directory and try reloading that. In the case of an HTML document you can watch the server's default behaviour and headers.
Posted: Thu Jul 21, 2005 10:12 am
by Swede78
Ok, good suggestion. Tried that, and I do get a 304 Not Modified message.
Also, I found that the sub-dir thing is not an issue. I tried those pages again, and it is working as it should. Except that the message is 304 Undescribed on my test pages. My test pages include the header code and I echo "hello" after it, just to see something. But, the code doesn't work when I attach it to the pages that I really want it to work on.
I think I know what my problem could be. My server has output buffering turned on. So, I think what is happening is that the header is getting overwritten. But, in my code I use the ob_end_clean() function, which is supposed to discard the buffer and turn buffering off. Hmmm... maybe this isn't working because it's being include()d. Let me try putting it directly in the page....
Nope, no difference. I don't get this, but I have to figure it out. Right now, there are 624 visitors. Maybe 30 of which are not bots.
Well, bokehman (and all others). I thank you for your effort to help me. And I at least know that my server does support what I'm trying to do. I just need to figure out why it works on one page, but not the other.
I really suspect the output buffering now. But, it's not being turned off for my simple test (hello) pages. So, why do those work?
Thanks, swede
Posted: Thu Jul 21, 2005 10:23 am
by bokehman
Could you post a whole page complete with the headers script inserted so I can try it on my server to try to find out what is wrong.
Posted: Thu Jul 21, 2005 11:26 am
by Swede78
Well, my page calls several included files, all of which could potentially be causing the problem. So, I was putting all the included files directly into the page, to send to you. But, I started testing things myself. After excluding certain includes, I found that my header.php (I use these includes as a template), was causing the problem. When I did not include it, I would get the 304 code. I narrowed it down further to exactly this function causing the problem.
Code:
if( !session_id() ) { session_start(); }
But, it shouldn't even be getting to this part of the page. After it sends the 304 header, the ob_end_clean() function is called and then exit(). I believe that because output_buffering is ON in php, it's still parsing the rest of the page, which is overriding the 304 header being sent. Just my guess.
I can't just remove this, as I need it. And output_buffering also can't be turned off. I think I'm screwed.
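The headers script under discussion is never actually posted in the thread, so purely as a sketch, a conditional-GET handler along the lines Swede78 describes (send the 304, clear the buffer, exit before the rest of the template runs) might look like this. The filemtime() source for the timestamp is an assumption; a database timestamp would work the same way.

```php
<?php
// Sketch only: the actual headers script was never posted in the thread.
// $lastModified is assumed to come from filemtime().
function notModifiedSince($ifModifiedSince, $lastModified)
{
    // Browsers send an HTTP date, e.g. "Thu, 21 Jul 2005 09:28:00 GMT";
    // strtotime() parses it and returns false on garbage.
    $since = strtotime($ifModifiedSince);
    return $since !== false && $lastModified <= $since;
}

$lastModified = filemtime(__FILE__);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && notModifiedSince($_SERVER['HTTP_IF_MODIFIED_SINCE'], $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    // Discard every open buffer so nothing follows the 304, then stop
    // before the rest of the template (sessions etc.) ever runs.
    while (ob_get_level() > 0) {
        ob_end_clean();
    }
    exit;
}

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
```

Note the while loop: a single ob_end_clean() only closes the innermost buffer, so with output_buffering = On in php.ini there can still be an open buffer when exit() is called - one possible explanation for the behaviour described above.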
Posted: Thu Jul 21, 2005 2:34 pm
by Swede78
I was able to have output_buffering turned off. Then it worked. But, the reason output_buffering was on, was because of performance issues. With it on, pages load 5 times faster. This wasn't an issue until the server was moved to IIS6. So, I'm not sure which way to go here.
I guess I have 2 choices:
1) turn off OB to get the code working, and deal with slower page loads
or
2) leave OB on, have quicker page loads, and dynamically create and maintain plain html pages
Posted: Thu Jul 21, 2005 2:40 pm
by bokehman
I can't think of a good reason not to be using Apache. In my experience output buffering always slows things down. Without output buffering the page starts to load as soon as output is available. Once the browser has received the head of the page it can request the CSS. With output buffering on you have to wait for the whole page to be built before anything is sent to the browser, so it can't even request the CSS.
Posted: Thu Jul 21, 2005 8:58 pm
by josh
Not really on topic, but an alternative solution to your problem is sitemaps: basically a list of when each file was last updated that google checks before it does a crawl. This way googlebot doesn't even have to bother with the if-modified-since dance at all.
Read up on it here:
http://google.com/webmasters/sitemaps
Posted: Thu Jul 21, 2005 9:53 pm
by Roja
bokehman wrote:In my experience output buffering always slows things down.
In more than a few situations, it can actually speed things up.
First, understand that HTTP travels over TCP, which is packet based. Each packet has overhead - open socket, send, receive. In other words, an additional delay for each additional send.
With buffering, you send in a steady stream, so (in theory) the packet spacing is optimized. That's at layer 4.
Then, also understand that many browsers have to "reflow" the page if they receive new layout information. So, while you may be showing the first half of a page, the browser may have to reflow for each of the two tables on the bottom half. Each reflow takes time, and may have a noticeable effect. (Most noticeable on IE, imho.)
bokehman wrote:
With output buffering on you have to wait for the page to be built before sending output to the browser hence it can't even request the CSS.
That's correct. However, for additional page loads, it's going to be cached.
It all depends on what you are trying to optimize. Personally, I output buffer *everything*, so people never see a reflow, and I never have to worry about header problems.
Plus - and this is the real "for the win" moment with OB - you can also set callbacks to utf-8 encode the content, do htmlentities to ensure all entities are corrected, AND gzip so it's compressed - saving even more time.
All in all, it brings quite a bit to the table.
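As a sketch of the callback idea Roja describes: callbacks are passed to ob_start() and run on the buffer when it flushes, with nested buffers flushing inner-first. The entity-fixing callback below is an invented example, not code from the thread; ob_gzhandler is PHP's built-in gzip handler.

```php
<?php
// Invented example of an ob_start() callback chain; only ob_gzhandler
// is a real built-in, the entity fixer is a hypothetical transform.
function fixEntities($buffer)
{
    // Turn bare ampersands into entities, leaving existing entities alone.
    return preg_replace('/&(?![a-zA-Z]+;|#\d+;)/', '&amp;', $buffer);
}

ob_start('ob_gzhandler');   // outer buffer: compresses the final output
ob_start('fixEntities');    // inner buffer: runs first, fixes the markup

echo 'Fish & chips';
// Both callbacks run when the buffers flush at the end of the script.
```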
Posted: Fri Jul 22, 2005 9:30 am
by Swede78
Not really on topic but an alternative solution to your problem is sitemaps
It definitely is on the topic if it'll help me reduce googlebot redundancy. Thank you, I'll check it out. I already have a sitemap, but I didn't know that there was a way to show googlebots which links they should visit.
As for the output buffering... as it states in the php.ini, the performance will depend on the web server. Since upgrading to Win2003/IIS6 (from 2000/IIS5), I started having page load problems. I'm sure I have some previous posts here about that as well. Turning the OB to "On" solved those problems. The difference is night and day. So, I really want to avoid having to turn it off in order to use the "if-modified-since" code.
What I've done temporarily is added code which checks if a visitor's IP matches one of the several googlebots' IPs. If so, I make it sleep for 5 seconds before continuing. I hope that doesn't hurt my rankings. But, as I said, it's temporary. I plan to find a better solution over the weekend when I have more time. But, it actually is working nicely. I still had 600+ bot sessions going on, but the website was much more responsive.
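A minimal sketch of that temporary throttle might look like the following; the IP prefixes are placeholders, since the thread never lists which googlebot ranges were checked.

```php
<?php
// Sketch of the temporary throttle described above. The prefixes are
// placeholder values, not an authoritative list of Googlebot ranges.
$botPrefixes = array('66.249.', '64.233.');

function isThrottledBot($ip, array $prefixes)
{
    foreach ($prefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

if (isset($_SERVER['REMOTE_ADDR'])
    && isThrottledBot($_SERVER['REMOTE_ADDR'], $botPrefixes)) {
    sleep(5);  // slow the crawler down so real visitors stay responsive
}
```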
My other solution, that I was considering doing before this issue even came about, is to revamp the whole site so that it creates separate plain html pages instead. That should help with google rankings as well, which was my original reason for considering doing that. But, a side effect that I could use now is that it will automatically send the 304 code when it hasn't been modified.
Thanks for the help!
Posted: Mon Feb 06, 2006 6:18 pm
by Swede78
Just wanted to get this put on the forum. The problem I had was that too many googlebots came to the site at the same time. The ideal solution for this is to control how often spiders hit the site, or the rate at which they hit it. I found that Yahoo has an optional parameter that you can add to your robots.txt file to limit how fast they crawl the site. Exactly what I was hoping Google would have. Obviously, Yahoo is more in tune with the needs of webmasters.
Here's a blurb from
http://help.yahoo.com/help/us/ysearch/s ... rp-03.html
You can add a "Crawl-delay: xx" instruction, where "xx" is the delay in seconds between successive crawler accesses. If the crawler rate is a problem for your server, you can set the delay up to 5 or 20 or a comfortable value for your server.
Setting a crawl-delay of 20 seconds for Yahoo! Slurp would look something like:
User-agent: Slurp
Crawl-delay: 20
Why can't Google do this? It seemed like no one else in this forum had even heard of my problem. But, over the past several months, I've found plenty of others with the same dilemma. Unfortunately for me, most support and fixes were only available to Apache users (by modifying httpd.conf). I use IIS, so I can't do that. There are webmasters out there who have completely banned spiders from their servers. But, that's an extreme that I can't afford. Anyway, I hope the above code helps someone out there.
Posted: Mon Feb 06, 2006 6:54 pm
by bokehman
How many hits a second does it take to stall your server? I don't understand it... I would have thought the pipe would be the bottleneck, not the server. It certainly is with my server.
Posted: Tue Feb 07, 2006 11:18 am
by Swede78
How many hits a second does it take to stall your server?
Almost any time I checked, there were at least 100-200 bots scanning the site. That alone would not make a very noticeable difference. But, many times, it got up over 500 at a time. So, I'm not sure how many hits per second this translated into.
I don't understand it... I would have thought the pipe would be the bottleneck, not the server. It certainly is with my server.
The bottleneck was the server processing power for me. I imagine the bottleneck will differ depending on your server and what you're doing with it. I have a couple search pages with advanced search capabilities. I use GET so that people can bookmark their searches. And I also take advantage of that by having pre-made links that will perform a search. I realize that because it's possible for these searches to use wildcards, indexing the database doesn't always make a difference, and can make these searches take a few seconds. These links are available to anyone, and therefore available to the bots. So, this in itself is costing a lot of resources. But, I really didn't want to lose that functionality.
I've done a few things over the past several months, and now my server is running great. First, as suggested by someone here in this discussion, I make sure that googlebots do not start a session, as they do not need one. I believe that helps keep resource hogging down, because now I don't have 500+ unnecessary sessions being handled by the server. Second, I've banned a few obvious bad bots. A lot of discussion on this is available - there are many USER AGENT lists for bad bots out there. And the last major thing I did was to upgrade to a P4 with 1 GB of RAM. I was on a Celeron machine before with 500 MB of RAM. So, that in itself made a huge difference. The server has not been bogged down once since.
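The session-skipping part could be sketched like this; detecting bots by User-Agent substring is an assumption, since the thread doesn't show how the bots were actually identified.

```php
<?php
// Sketch of skipping sessions for crawlers, as described above.
// The User-Agent substrings are assumptions, not a complete bot list.
function isCrawler($userAgent)
{
    return (bool) preg_match('/googlebot|slurp|msnbot/i', $userAgent);
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (!isCrawler($ua) && !session_id()) {
    session_start();  // only real visitors get a session
}
```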
Thanks, Swede
Posted: Tue Feb 07, 2006 12:11 pm
by josh
You could also try caching the results of searches.
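A simple way to do that, sketched here under assumed file paths and cache lifetime, is to key a file cache on the normalized GET parameters, so repeated (bot) requests for the same search skip the expensive database query entirely.

```php
<?php
// Sketch of caching search results keyed by GET parameters.
// The cache directory and one-hour lifetime are assumptions.
function cachePathFor(array $params, $dir = '/tmp/search-cache')
{
    ksort($params);  // same search, same key, regardless of parameter order
    return $dir . '/' . md5(serialize($params)) . '.html';
}

function fetchCached($path, $maxAge = 3600)
{
    if (is_file($path) && time() - filemtime($path) < $maxAge) {
        return file_get_contents($path);
    }
    return null;  // missing or stale
}

$path = cachePathFor($_GET);
$html = fetchCached($path);
if ($html === null) {
    // Placeholder for the real (expensive) search query.
    $html = '<p>search results here</p>';
    @mkdir(dirname($path), 0777, true);
    file_put_contents($path, $html);
}
echo $html;
```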