
Robots and cookies

Posted: Sun Aug 05, 2007 8:05 am
by flycast
My site requires that a visitor select the state they reside in; the choice is stored in a cookie. The problem is the paranoid user who blocks cookies, and robots, which cannot select a state at all.

Has anyone run up against this before? What is the best practice?

My current thinking is to check for common robots in the $_SERVER['HTTP_USER_AGENT'] string and allow robots to index without redirecting them to the entry form.
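A minimal sketch of that idea (the helper name and the bot list are illustrative, not exhaustive, and as noted later in the thread the header can be spoofed):

```php
<?php
// Hypothetical helper: crude check of the User-Agent header against a
// few well-known crawler signatures. The list is illustrative only.
function is_known_robot(string $userAgent): bool
{
    $signatures = ['Googlebot', 'Slurp', 'msnbot', 'Teoma'];
    foreach ($signatures as $sig) {
        if (stripos($userAgent, $sig) !== false) {
            return true;
        }
    }
    return false;
}

// Illustrative gatekeeper, e.g. at the top of each page:
// if (!isset($_COOKIE['state']) && !is_known_robot($_SERVER['HTTP_USER_AGENT'])) {
//     header('Location: /select-state.php'); // hypothetical entry-form URL
//     exit;
// }
```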

Any other ideas? Surely someone has solved this before.

Posted: Sun Aug 05, 2007 12:14 pm
by John Cartwright
Better yet, have your site not rely on cookies at all, although you can use them as a fallback.

http://php.net/session might be of interest
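A sketch of the session-first, cookie-fallback idea. The `resolve_state` helper and the precedence order are assumptions for illustration; it is written as a pure function so the logic is easy to test:

```php
<?php
// Hypothetical helper: decide which state value the session should hold,
// preferring a fresh form submission, then the session itself, then a
// legacy cookie. Returns null if the visitor never chose a state.
function resolve_state(?string $posted, ?string $inSession, ?string $inCookie): ?string
{
    if ($posted !== null) {
        return $posted;
    }
    if ($inSession !== null) {
        return $inSession;
    }
    return $inCookie;
}

// Typical usage on each request (illustrative):
// session_start();
// $_SESSION['state'] = resolve_state(
//     $_POST['state'] ?? null,
//     $_SESSION['state'] ?? null,
//     $_COOKIE['state'] ?? null
// );
```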

Posted: Sun Aug 05, 2007 2:04 pm
by flycast
Yes. I have been looking at that. Here is the downside of sessions:

You need a strategy for transmitting the session id. Either the user has to accept a cookie containing the session id, or you need to append the session id to the end of the URL. If you place the session id at the end of the URL, it is said that Google does not like it: they believe the site is serving different content. Also, the session ids that were served to Google show up in the indexed links.

Is there a third way to use sessions?

Posted: Sun Aug 05, 2007 2:49 pm
by superdezign
Are you saying that you refuse to use sessions because of a rumor you heard about Google? How many websites that use sessions do you know of that Google refuses to index?

Posted: Sun Aug 05, 2007 3:55 pm
by flycast
I can't say that I see a lot of session ids appended to URLs. I enable cookies and use AdBlock to keep the ad trackers out. Agreed, there are a lot of rumors about Google and their black box. Some of them are pretty reasonable and some are pure wild conjecture from people rather challenged in logic and common sense.

In this case the customer is very concerned that they not lose search engine standings when they upgrade to their new CMS site.

I have seen session ids in URLs on Google in the past. It seems reasonable that session ids could interfere with the Google algorithm; I am just choosing a conservative approach. Anyway, putting session ids in the URLs would be a hassle, since it would take some kind of custom logic at every URL (a function would do the trick), and session ids make for some very, very ugly URLs.

Another reason is that when I look into session ids I get a lot of warnings that they are not a good idea because of injection attacks and session hijacking. Both are currently a low-priority risk on this site, but again: I am being conservative.

Anyway, I think we are getting off topic. The main question at hand is how to make sure robots can index the site when I am checking for the presence of a cookie that comes from a form selection on the entry page. A robot will be unable to select a meaningful form value and will be locked out of the rest of the site. My thought is to check for robots in the HTTP_USER_AGENT header and let them browse the site without redirecting them, even though they have not made a choice.

Is there a better way to do this?

Posted: Sun Aug 05, 2007 7:38 pm
by superdezign
flycast wrote:Anyway, I think we are getting off topic.
Hardly. You are refusing to take the optimal solution because of a rumor. The customer's concern with SEO seems to be a very inexperienced one, but your reasoning seems just as inexperienced. Any search engine worth being a part of can easily handle query strings, and the session id is no exception. As for the 'custom logic': PHP can append the session id to the end of all URLs automatically.
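For reference, the automatic rewriting mentioned here is PHP's session.use_trans_sid feature. A php.ini sketch (the directive names are the standard session settings; the chosen values are an assumption about the desired cookie-first policy):

```ini
; Let PHP rewrite links to carry the session id only when the
; client refuses the session cookie.
session.use_cookies = 1       ; prefer the cookie transport
session.use_only_cookies = 0  ; but permit the URL fallback
session.use_trans_sid = 1     ; rewrite relative URLs automatically
```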
flycast wrote:The main question at hand is how to make sure robots can index the site when I am checking for the presence of a cookie that comes from a form selection on the entry page. A robot will be unable to select a meaningful form value and will be locked out of the rest of the site. My thought is to check for robots in the HTTP_USER_AGENT header and let them browse the site without redirecting them, even though they have not made a choice.

Is there a better way to do this?
Spam bots complete forms; search engine bots do not. If you require that a form be filled out, you are essentially blocking all search engine bots from entering your website. Trying to hack it through the user agent is a bad idea because that setting is client-side and can be spoofed.

Posted: Sun Aug 05, 2007 7:45 pm
by iknownothing
I think I know what he's talking about with Google. Variables such as page=whatever in the URL, which is how the session id would be presented, can cause issues with search engines. However, if you Google for a while, perhaps for "GET variable search engine" or similar, you will find there is an easy workaround for your worries.

EDIT: Googled it myself: http://www.zend.com/zend/spotlight/searchengine.php

Posted: Mon Aug 06, 2007 7:01 am
by flycast
superdezign wrote:Trying to hack it through the user agent is a bad idea because that setting is client-side and can be spoofed.
That is why I posted. How do I keep from rejecting Google and the other legit robots, when any user who has not told us what state they live in gets redirected to the entry page?

Posted: Mon Aug 06, 2007 8:39 am
by superdezign
flycast wrote:How do I keep from rejecting Google and the other legit robots, when any user who has not told us what state they live in gets redirected to the entry page?
By not making it a requirement. Not everyone is willing to give up personal information anyway.
Give them a link to skip the step.
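A sketch of the skip-link approach, assuming a hypothetical `skip` query parameter and entry-form URL (both are illustrative names, not part of the original site):

```php
<?php
// Hypothetical "skip" handling: if the visitor (or a crawler) follows
// a "Skip this step" link, continue without a stored state instead of
// redirecting back to the entry form.
function needs_state_prompt(array $cookie, array $query): bool
{
    if (isset($cookie['state'])) {
        return false; // state already chosen
    }
    if (isset($query['skip'])) {
        return false; // visitor explicitly skipped the step
    }
    return true;
}

// Usage on each page (illustrative):
// if (needs_state_prompt($_COOKIE, $_GET)) {
//     header('Location: /select-state.php');
//     exit;
// }
// And on the entry form, offer: <a href="?skip=1">Skip this step</a>
```

This way search engine bots are treated like any other visitor who declines to choose, with no user-agent sniffing needed.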