Posted: Wed Apr 05, 2006 8:26 am
by Roja
JAB Creations wrote:Legit - Declares itself as only itself and not something else.
That is a flawed definition. All normal browsers (except two) declare themselves as something else as well. So by your definition, almost all browsers are not "legit".
JAB Creations wrote:We all know the problem (most won't admit it as a problem) of the Mozilla compatible string in almost every UA.
It's NOT a problem. It's entirely valid.

You are making the assumption that the UA is intended to tell you what the browser is, and nothing else. That's not at all accurate. The UA is intended to convey which browsing model a browser is using. That can be the Mozilla level, from the mid-90's and beyond; it can be the IE level, from the 90's and beyond; or it can be the browser itself, based on current practices.

Saying "Mozilla compatible" means many things. It means that it supports JavaScript/ECMAScript, for one (whether it is enabled is a different story). That came about when IE entered the market, and people needed an easy way to tell the two apart. When sites started detecting based on a simplistic regex and assumptions about the UA, Microsoft changed their browser's UA to include a "Mozilla compatible" tag, because they had added support for ECMAScript.

Of course, the war didn't stop there, and still hasn't. Opera had to include "IE"-like tags in their UA because people were feeding it incorrect browser-specific code based on the assumption that it couldn't handle what IE could (when in fact it could).

Which brings us back to the original point of this post. By saying "I want to make assumptions about the UA, and use a simplistic regex", you are causing the very problems we are trying to warn you about.
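To make that concrete, here is a rough Python sketch of what goes wrong. The two UA strings are typical examples I am supplying for illustration (the sort of thing IE 6 sends, and the sort of thing Opera sends when identifying as IE), and `naive_browser` is a made-up helper, not anyone's real detection code.

```python
import re

# Illustrative UA strings (assumptions, not taken from this thread).
IE6_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
OPERA_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50"


def naive_browser(ua: str) -> str:
    """Hypothetical 'simplistic regex' detection: the first matching token wins."""
    if re.search(r"Mozilla", ua):
        return "Mozilla"
    if re.search(r"MSIE", ua):
        return "IE"
    if re.search(r"Opera", ua):
        return "Opera"
    return "unknown"


# Both strings start with "Mozilla", so the first test matches for each of them.
print(naive_browser(IE6_UA))
print(naive_browser(OPERA_UA))
```

Both calls report "Mozilla", which is wrong for either browser. Reordering the tests only moves the mistake around: check "MSIE" first and Opera gets reported as IE.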
JAB Creations wrote:If you read the original useragent string I specified the exact spider name as the other string.
And doing THAT - checking for the spider name (not the other UA elements) - is an effective method for detecting spiders, and the one that most code uses.
JAB Creations wrote:If MSIE was detected as true but not the spider then nothing happens to any application with MSIE in its UA string. So it would only affect UAs I target.
You could (more reliably, more easily) simply detect for the spider name.
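Something along these lines would do it. This is only a sketch: the names in `SPIDER_NAMES` are examples I picked, and `spider_name` is a hypothetical helper, so substitute whatever spiders you actually care about.

```python
from typing import Optional

# Example spider names only (an assumption); replace with your own list.
SPIDER_NAMES = ["Googlebot", "Slurp", "msnbot"]


def spider_name(ua: str) -> Optional[str]:
    """Return the first listed spider name found in the UA, or None.

    Only the spider name is checked; "Mozilla", "MSIE" and the other
    UA elements are ignored entirely.
    """
    ua_lower = ua.lower()
    for name in SPIDER_NAMES:
        if name.lower() in ua_lower:
            return name
    return None
```

A UA containing "Googlebot" returns "Googlebot" whether or not it also says "Mozilla" or "MSIE", and an ordinary browser UA returns None.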
JAB Creations wrote:Now the concern would be valid if I was trying to block UAs because I did not like Mozilla in IE's UA string. But I'm not doing that.
According to your wording, and your definition of "Legit", you are. That's why we kept asking questions. That's why you continue to confuse and confound! :)
JAB Creations wrote:I define legit bots by those with their name (and not another bot/browser's name) and a valid and working URL. While the original UA does not have a URL in the UA it was easy to find online so it's legit.
Now you've chosen a different definition, and you are heading closer to a good choice (imho). The "not another bot/browser's name" part doesn't matter at all, and doesn't help determine anything. Once again, (almost) all browsers use another bot/browser's name in their UA, so that is a red herring. It will only confuse the logic.

Focus on the rest - define legit bots by those with their name and a valid and working URL.

With that definition, detection is easy. You check whether it has a URL (and check the URL), and check whether it matches one of a list of spider names.
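A minimal sketch of that check in Python, under a few assumptions: `KNOWN_SPIDERS` is an example list, `ua_url` and `is_legit_bot` are hypothetical names, and "check the URL" here only means the URL parses with a host. Confirming that the URL actually works would need a real HTTP request, which I have left out.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Example spider names only (an assumption); use your own list.
KNOWN_SPIDERS = ("Googlebot", "Slurp", "msnbot")

# Grab an http(s) URL, stopping at whitespace, ";" or ")" as found in UA strings.
URL_RE = re.compile(r"https?://[^\s;)]+")


def ua_url(ua: str) -> Optional[str]:
    """Return the first URL in the UA that parses with a host, else None."""
    match = URL_RE.search(ua)
    if match is None:
        return None
    url = match.group(0)
    return url if urlparse(url).netloc else None


def is_legit_bot(ua: str) -> bool:
    """True when the UA names a known spider AND carries a parseable URL."""
    ua_lower = ua.lower()
    has_name = any(name.lower() in ua_lower for name in KNOWN_SPIDERS)
    return has_name and ua_url(ua) is not None
```

With this, a UA such as "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" passes, while a bare "Googlebot/2.1" with no URL does not. If the URL is not in the UA itself, you would have to look it up separately, as you did for the original string.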

If you had had a clear definition of what you wanted when you first posted, and had been consistent in that definition (it has changed three times now), we could have helped you more easily.