Page 2 of 2

Posted: Wed Dec 14, 2005 5:01 pm
by Swede78
I forgot to answer your question about bandwidth...
bokehman wrote:Or are you just trying to save bandwidth?
No, my host gives me much more bandwidth than I'll ever need. I'm trying to save server resources (CPU/memory). I'm trying to send the googlebots a "304 Not Modified" header and exit before the rest of the page (MySQL queries, etc.) is processed. That way, they'll move on to the next page before using too much in the way of resources. And when I get enough of these googlebots, say 400 or 500 opening dynamic page after dynamic page, the resources they use up grow tremendously.
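A minimal sketch of that early exit (the helper and the timestamp source here are hypothetical; any cheap per-page "last changed" value would do):

```php
<?php
// Answer conditional GETs before any MySQL work is done.
// is_not_modified() is a hypothetical helper; $lastModified could come from
// a file mtime or a tiny lookup -- anything cheaper than building the page.
function is_not_modified($ifModifiedSince, $lastModified) {
    return $ifModifiedSince !== null
        && strtotime($ifModifiedSince) >= $lastModified;
}

$lastModified = 1134576000; // assumption: known cheaply per page

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
$ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;
if (is_not_modified($ims, $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    exit; // the bot moves on; no queries were run
}
// ...MySQL queries and page rendering continue here...
```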

Bandwidth, not a problem. But, these googlebots keep checking the same pages every day, bogging down the site.

I've decided to upgrade our server processor and memory in hopes that it will improve performance while googlebots make their rounds.

Posted: Wed Dec 14, 2005 6:18 pm
by bokehman
I send a 304 for unchanged content depending on the ETag. A 304 means no content is sent. But from Google I have never received an ETag back. Here's a line from my log which shows the headers sent by googlebot:
crawl-66-249-66-2.googlebot.com [14/Dec/2005:18:54:52 +0100] Country: "US" Request: "GET / HTTP/1.1" Response: "200" Bytes: "2581" Referer: "-" User-Agent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" Cookie: "-" Accept: "*/*" Accept-Language: "-" If-None-Match: "-" Accept-Charset: "-" Accept-Encoding: "gzip" Language sent: "english" Duration: "0" As you can see If-None-Match is empty.
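The ETag approach described above, roughly sketched (the body here is a placeholder). Note that the page still has to be generated to compute the hash, so this saves bandwidth rather than CPU, and, as the log shows, Googlebot sends no If-None-Match anyway:

```php
<?php
// Hash the rendered body and short-circuit when the client echoes the tag
// back in If-None-Match. The $body value is a stand-in for the real page.
$body = '<html><body>...</body></html>'; // the generated page
$etag = '"' . md5($body) . '"';

header('ETag: ' . $etag);
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    $_SERVER['HTTP_IF_NONE_MATCH'] === $etag) {
    header('HTTP/1.1 304 Not Modified');
    exit; // no body sent
}
echo $body;
```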

Posted: Wed Dec 14, 2005 9:55 pm
by Ambush Commander
I still think you're looking at the problem from the wrong perspective.
The googlebots don't crash this site. They just bog it down. For example... if a page takes 1 second to load with a normal amount of visitors, it'll take 5, 6 or 7 seconds when the googlebots come.
A DoS attack overloads the computational resources of a system, and a precursor to that almighty crash is a dramatic slowdown.
And I do use sessions on many pages to detect whether they're a logged in visitor or not. I don't see any other practical way to keep track of that.
If you don't have a session, you're not logged in. Simple. Of course, you'd use sessions for logged in users, but Googlebots don't log in... do they?
6 or 7 seconds to me, when I know it should be nearly instant, is too much.
I don't argue with you there.
I can see that when they come, they come in the hundreds. I've seen as many as 900 at a time.
So 900 anonymous users would slow down your site? (I'm trying to rephrase this in different terms)
On average, they'll show up with 300-400 spiders per day for an hour or so at a time.
300-400 heavy-duty users for an hour. In my opinion, any reasonable system should be able to scale to that point.
My only other solution is to start building static HTML pages instead of using dynamic pages. But, on one hand, I don't like that idea, because I'd have thousands of pages to build (and rebuild when updating). [snip]
But, all visitor-accessible pages only use simple SELECT queries. No page that the bots see should take more than a second to parse.
How do you know that? SELECT queries are absolutely devastating if not used properly. With no index, MySQL has to inspect every single row in a table. If it's truly a large website, there'll be millions of records. On a lazy day, you may squeak by, but when you get an influx of traffic, the site bogs down, hmm?
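To illustrate the point (table and column names here are made up, not from the site in question):

```sql
-- Without an index on topic_id, EXPLAIN reports type=ALL: MySQL does a
-- full table scan on every request. With the index it becomes a cheap
-- ref lookup against a handful of rows.
EXPLAIN SELECT title FROM posts WHERE topic_id = 42;

CREATE INDEX idx_posts_topic ON posts (topic_id);

EXPLAIN SELECT title FROM posts WHERE topic_id = 42;
```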

Of course, I might be totally wrong. That's why you PROFILE.
Maybe upgrading the server hardware would help. Right now, it's on a Celeron 1.8 GHz with 500 MB of RAM. This may need a memory upgrade.
Throwing hardware at the problem can help. Personally, I think you should have 1GB of memory, but I'm not a hardware hacker so I might be wrong. Remember: not all solutions scale well, and if memory isn't your problem, then you just wasted your money. I hope your purchase helps.

You should also examine your Server, PHP and MySQL logs for anomalies (i.e. long load times).

Posted: Thu Dec 15, 2005 11:29 am
by Swede78
Ambush Commander wrote:Quote:
And I do use sessions on many pages to detect whether they're a logged in visitor or not. I don't see any other practical way to keep track of that.

If you don't have a session, you're not logged in. Simple. Of course, you'd use sessions for logged in users, but Googlebots don't log in... do they?
Maybe I worded it wrong. When I said I use sessions to detect if they're logged in, I meant that I check for a specific session variable's existence, which would indicate they are logged in. Same thing as what you said, googlebots won't have user session variables. That was my point, the session issue shouldn't be causing a bog down.
Ambush Commander wrote:So 900 anonymous users would slow down your site? (I'm trying to rephrase this in different terms)
Really??? I could only wish my site were popular enough to attract 900 simultaneous users. This site typically has 15-30 simultaneous users; rarely does it top 50. I would guess that 900+ users on most average sites (that use dynamic content) would cause a lag in response time similar to what I'm experiencing (exceptions would include Amazon, Google, eBay, etc. - but they probably spend more than $79 a month to host their sites). Don't forget that normal users don't go from page to page as quickly as the server can feed them. I expect the average user actually looks at the page they're on for a minute or so.
Ambush Commander wrote:300-400 heavy-duty users for an hour. In my opinion, any reasonable system should be able to scale to that point.
Actually, my site is far from crashing (I've never stated that it crashes, just bogs or slows down) when there are 300 bots scanning it. At that point, delays vary. Sometimes the server spits out a dynamic page as quickly as when there are just a few visitors. But usually it will take 2 or 3 seconds, and sometimes 6 or 7. I prefer it to be instant. Maybe I'm being too demanding of my server.
Ambush Commander wrote:Quote:
But, all visitor-accessible pages only use simple SELECT queries. No page that the bots see should take more than a second to parse.

How do you know that? SELECT queries are absolutely devastating if not used properly. With no index, MySQL has to inspect every single row in a database. If it's truly a large website, there'll be millions of records. On a lazy day, you may squeak by, but when you get an influx of traffic, the site bogs down, hmm?
Well, I know because I designed the pages. I use indexes. I don't pull data that I don't use. They're as refined as they can get. There are not millions of records, only thousands of rows in the biggest of tables. My most complicated queries that users will use (searches), involve 4 tables max, and based on performance testing I did, take 1/10 of a second to run on average. I'm not worried that the queries are written poorly.

But, I have a lot of pages, and I have links that perform searches, so googlebots are actually doing searches. And they constantly check the same pages over and over again. I don't see why it's hard to believe that even a well-programmed site would bog down when 300-900 bots are performing even simple queries that take only .1 or even .01 seconds to run.
Ambush Commander wrote:Throwing hardware at the problem can help. Personally, I think you should have 1GB of memory, but I'm not a hardware hacker so I might be wrong. Remember: not all solutions scale well, and if memory isn't your problem, then you just wasted your money. I hope your purchase helps.

You should also examine your Server, PHP and MySQL logs for anomalies (i.e. long load times).
I'm hoping an upgrade in hardware will help. I agree, I think a gig of memory is much more suitable for a server. This server handles the email as well, so the more memory, the better. And I haven't really followed what's new in the processor world, but Celerons have always seemed to have performance issues compared to Pentiums. So, maybe going to a P4 will make a difference.

I constantly check my PHP logs, but I never check MySQL logs. I'll take a look at those.


Thank you, Ambush, for your feedback on this. I do appreciate it greatly.

Posted: Thu Dec 15, 2005 2:30 pm
by Ambush Commander
All I'm doing is throwing out random ideas on what could be going wrong, and you dispute them with your own knowledge.

Let's get these facts straight:

1. This site is usually low-traffic, crawlers constitute heavy usage
2. During heavy usage periods, the site slows down, but not to the point of unusability
3. There is a performance problem

So, we know there is a performance problem, but we don't know what it is. I've thrown out some suggestions, and actually, I should be scolded for that. Only *you* know where the issues are coming from.

But you don't. That's why I suggest you profile. I'm not going to offer any more possible slowdowns: this random stop 'n swap is probably incredibly wasteful, and I'm glad you didn't do that. I'm not sure you know exactly what profiling does, so let me paste a sample profile output from one of my sites, generated by APD and processed by pprofp. Note that I did these profiles for fun, so there were no real slowdowns on the site yet.

(you may need to widen your screen)

Code:

Trace for C:\Documents and Settings\Edward\My Documents\My Webs\wikistatus\index.php
Total Elapsed Time = 0.17
Total System Time  = 0.03
Total User Time    = 0.11


         Real         User        System             secs/    cumm
%Time (excl/cumm)  (excl/cumm)  (excl/cumm) Calls    call    s/call Name
--------------------------------------------------------------------------------------
 42.9  0.06 0.06    0.05 0.05    0.00 0.00    11     0.0043  0.0043 defined
 14.3  0.01 0.01    0.02 0.02    0.00 0.00    19     0.0008  0.0008 define
 14.3  0.01 0.01    0.02 0.02    0.00 0.00     5     0.0031  0.0031 Mapper_Service->_getObject
 14.3  0.04 0.04    0.02 0.02    0.00 0.00     1     0.0156  0.0156 mysql_connect
 14.3  0.00 0.00    0.02 0.02    0.00 0.00    46     0.0003  0.0003 is_int
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     1     0.0000  0.0000 Plugin->getWebPath
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     4     0.0000  0.0000 is_array
  0.0  0.00 0.01    0.00 0.02    0.00 0.00     1     0.0000  0.0156 Mapper_Service->findAll
  0.0  0.00 0.00    0.00 0.00    0.00 0.00    10     0.0000  0.0000 mysql_fetch_array
  0.0  0.00 0.01    0.00 0.02    0.00 0.00     1     0.0000  0.0156 Mapper_Service->_loadAll
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     8     0.0000  0.0000 ADORecordSet_mysql->MoveNext
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     5     0.0000  0.0000 Service->Service
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     5     0.0000  0.0000 Mapper_Service->_doLoad
  0.0  0.00 0.01    0.00 0.02    0.00 0.00     5     0.0000  0.0031 Mapper_Service->_loadFromRow
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     2     0.0000  0.0000 ADORecordSet_mysql->_fetch
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     2     0.0000  0.0000 ADORecordSet_mysql->GetArray
  0.0  0.00 0.00    0.00 0.00    0.00 0.00     2     0.0000  0.0000 mysql_num_fields
  0.0  0.01 0.01    0.00 0.00    0.00 0.00     2     0.0000  0.0000 mysql_query
  0.0  0.00 0.01    0.00 0.00    0.00 0.00     2     0.0000  0.0000 ADODB_mysql->_query
  0.0  0.00 0.01    0.00 0.00    0.00 0.00     2     0.0000  0.0000 ADODB_mysql->_Execute
So it jumps out at me: defined() is taking up 42 percent of my execution time, and plus, I don't use it anywhere in my application! Further investigation reveals this (you can generate a call tree from the raw data too): these defined() calls come from ADOdb, in order to ensure compatibility. If I wished to, I could remove them and hard-code my customizations in the ADOdb file. However, since there is no major speed problem (0.1 seconds should be taken in context: it was measured on my 512MB, super-multitasking personal Windows machine), I chose not to. mysql_connect() also took a bit of time: calls to external resources always take a sizable chunk. Finally, the 46 calls to is_int() are a bit dubious, and I may want to look into that too.

By profiling, I find out so much about my application, and this effect is even magnified on sites where there are performance problems (which manifest themselves when the site is under a heavier load).

Now, to answer a bit of your implementation questions:
Maybe I worded it wrong. When I said I use sessions to detect if they're logged in, I meant that I check for a specific session variable's existence, which would indicate they are logged in. Same thing as what you said, googlebots won't have user session variables. That was my point, the session issue shouldn't be causing a bog down.
If you always call session_start(), PHP is creating sessions, even if it's a googlebot.
probably spend more than $79 a month to host their site
Hmm... hosting for my sites costs $5-$9 a month. Do you mean per year? Unless this is a colo...
I have links that perform searches. So, googlebots are actually doing searches. And because they constantly check the same pages over and over again.
Is the search a datamining query which takes the longest amount of time? You may want to consider: 1) hiding all searches behind forms to prevent googlebots from executing them (most sites are like that); 2) caching search results, at least for a few minutes or so (to help pagination).
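Suggestion 2 could be sketched like this (the cache path, TTL, and run_search_query() are all hypothetical stand-ins, not anything from the actual site):

```php
<?php
// Cache search results for a few minutes so crawlers hammering the same
// search links reuse one MySQL query instead of re-running it every time.
function run_search_query($query) {
    // Hypothetical stand-in for the real MySQL search.
    return array('results for ' . $query);
}

function cached_search($query, $ttl = 300) {
    $file = sys_get_temp_dir() . '/search_' . md5($query) . '.cache';
    if (file_exists($file) && time() - filemtime($file) < $ttl) {
        return unserialize(file_get_contents($file)); // cache hit: no query
    }
    $results = run_search_query($query); // cache miss: hit the database
    file_put_contents($file, serialize($results));
    return $results;
}
```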

Posted: Fri Dec 16, 2005 1:29 pm
by Swede78
All I'm doing is throwing out random ideas on what could be going wrong, and you dispute them with your own knowledge.
Sorry if I'm coming off as being argumentative.
I'm not sure you know exactly what profiling does, so let me paste a sample profile output from one of my sites, generated by APD and processed by pprofp.
No, I did not know what you meant by profiling. I've never seen that done for a page. I thought you were talking about detecting the googlebots. I really have no idea how to do that, and I don't expect you to explain. I'll google it.
If you always call session_start(), PHP is creating sessions, even if it's a googlebot.
True, I forgot about session_start(). I should not have forgotten about that, as it's one of the ways I check how many current users are on the site. But still, what can I do about that for pages that truly need sessions? I guess I could check whether the user is a googlebot before calling session_start(). Would 900 simultaneous sessions bog down the average server? Actually, the pages that call session_start() are pages that google doesn't need to be on. Google should obey the robots.txt file if I add those pages there. Maybe that'll make a difference.
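A crude sketch of that check (matching on the User-Agent string is a rough heuristic and nothing more, since UA strings can be spoofed):

```php
<?php
// Skip session creation for crawlers so 900 bots don't mean 900 session files.
function is_googlebot($userAgent) {
    return stripos($userAgent, 'Googlebot') !== false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_googlebot($ua)) {
    session_start(); // only real visitors get a session
}
```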
Hmm... hosting for my sites cost $5 - $9 dollars a month. Do you mean per year? Unless this is a colo...
This site is hosted on a dedicated machine, along with some other sites. So, it's technically being shared, but it's not like 100 unrelated people hosting their sites on it. But, I can tell from the stats that the google traffic belongs to the site I'm working on.
Is the search a datamining query which takes the longest amount of time? You may want to consider, 1. hiding all searches under forms to prevent googlebots from executing them (most sites are like that), 2. caching search results, at least for a few minutes or so (to help pagination)
The search page is form-based, but I have a page with links to commonly searched terms. That's how google is "performing searches". I could disallow this page to the bots as well; that should solve that too.
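Something like this in robots.txt would keep compliant crawlers off those pages (the paths are hypothetical examples, not the site's real URLs):

```
User-agent: *
Disallow: /search.php
Disallow: /common-searches.php
```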


Thanks again! I'm going to try some of these things mentioned, and see how that affects things.

Posted: Fri Dec 16, 2005 4:56 pm
by Ambush Commander
Profiling is the biggy. It will usually tell you exactly what you have to do.