ultimate email validation function

Posted: Thu Mar 04, 2004 1:56 pm
by Pyrite
Maybe this isn't the place to post this, and maybe this has already been asked somewhere, but I didn't see it and there was nothing in the code snippets forum about it. So forgive me if I am out of line, but here goes.

I need an ultimate email validation function. All the ones I have seen so far only do about half of what I need. So if any of you have a great email validation function, please let me know or direct me to where it is posted elsewhere.

I need it to validate only the "+.-_0-9a-z" as defined in:
http://www.remote.org/jochen/mail/info/chars.html
I've seen some that validate for 0-9, a-z, but I need it to check and only allow for those and period, underscore, hyphen and plus sign too.

It should also validate the format of the email address, allow for country codes at the end (e.g. bob@pacific.net.sg or bob@hcm.vnn.vn), and allow for all the types of domains like .info, .biz, .tv, .cx etc.

Also, it should allow only one @ in the address and make sure there is at least one period in the address. It should make sure two periods aren't used next to each other (e.g. bob@hotmail..com), and that only 7-bit ASCII characters are used (i.e. no Chinese or Japanese characters etc.).

These are a few of the situations the function should check for, if anyone has any others to add. Anybody have such a function?
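
For reference, the checks listed in this post can be sketched roughly as below. This is only an illustration of the stated requirements, not an RFC-compliant validator: the function name passes_basic_checks is hypothetical, preg_match is assumed to be available, and "at least one period" is interpreted here as "at least one period in the domain part".

```php
<?php
// Hypothetical helper: applies only the checks listed in the post above,
// which is deliberately stricter than what the RFC permits.
function passes_basic_checks($address)
{
    // Exactly one @ separating a local part from a domain.
    if (substr_count($address, '@') !== 1) {
        return false;
    }
    list($local, $domain) = explode('@', $address);

    // Only the characters + . - _ 0-9 a-z (case-insensitive). This also
    // rejects spaces and anything outside 7-bit ASCII.
    if (!preg_match('/^[+._0-9a-z-]+$/iD', $local)) {
        return false;
    }
    if (!preg_match('/^[.0-9a-z-]+$/iD', $domain)) {
        return false;
    }

    // Assumption: the required period must be in the domain part.
    // No two periods may sit next to each other anywhere in the address.
    if (strpos($domain, '.') === false || strpos($address, '..') !== false) {
        return false;
    }

    // A domain may not start or end with a period.
    if ($domain[0] === '.' || substr($domain, -1) === '.') {
        return false;
    }
    return true;
}
```

Keeping each rule as its own step makes it easy to add or drop a condition later, at the cost of being longer than a single expression.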

Posted: Thu Mar 04, 2004 2:05 pm
by Roja
What you are describing is NOT an RFC-compliant test for email validation.

The RFC also allows apostrophes, and multiple other complex-to-grep-for conditions (single quotes, double quotes, only one double quote, and much.. much more).

However - people much wiser than you or I have done the work. Jeffrey E.F. Friedl wrote "Mastering Regular Expressions" which has the definitive "best-in-the-world" attempt at getting REALLY DARNED CLOSE to RFC-compliant email validation.

More thankfully, a developer group (killersoft) even translated it to a PHP version. I've provided the link to that download. Multiple other sites had similar translations that were under less restrictive licenses. I found a site that had it BSDL'd, and another that had it GPL'd. I can't seem to find them easily at the moment, so killersoft's site will have to do. Be aware of their license. :(

At any rate, that function is as good as you can get in terms of validating the email FORMAT.

Posted: Thu Mar 04, 2004 2:13 pm
by Pyrite
Can you please tell me what part of what I was describing is not RFC compliant for email addresses?

Also, that function is copyrighted, and I need to be able to distribute the code for such a function in a pretty close to non-restrictive manner.

Posted: Thu Mar 04, 2004 2:58 pm
by Pyrite
Also, that script validates these addresses as valid/true. And we all know they aren't.

"betty@hotmail"
"betty&@hotmail"
" betty@hotmail.com"

Posted: Thu Mar 04, 2004 7:33 pm
by redmonkey
Unfortunately the one I came up with is part of a copyrighted app I'm developing so I am unable to give you that. However, I have converted it from PCRE to use the ereg function (doing so has lost a bit of robustness from the expression) but it still works quite well (I think, I haven't given it much testing). Feel free to try it out and use it if it suits your needs.

Code:

function is_valid_email($address)
{

	if (!ereg("^([-a-zA-Z0-9_\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$", $address)) {
		return false;
	}
	return true;
}
Example usage...

Code:

$email = 'bob@pacific.net.sg';

if (is_valid_email($email)) {
	echo "{$email} valid address.\n";
} else {
	echo "{$email} format not recognised.\n";
}

Posted: Thu Mar 04, 2004 9:12 pm
by Pyrite
Thanks redmonkey, but yours doesn't allow for + signs. How do I add that in there? The whole regexp thing confuses me.

Posted: Thu Mar 04, 2004 9:20 pm
by Pyrite
Ok I figured it out. Why are you allowing the backslash \ ?
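
For anyone with the same question, the change amounts to adding + to the first bracketed character class. The sketch below shows it as a PCRE port rather than the ereg original, because ereg() was deprecated in PHP 5.3 and removed in PHP 7.0. The function name is_valid_email_plus is hypothetical, and the rest of the pattern is meant to follow the posted one. One likely difference: this class contains no backslash, so the port should reject a backslash in the local part.

```php
<?php
// Hypothetical PCRE port of is_valid_email() with "+" added to the
// local-part character class; everything else follows the posted pattern.
function is_valid_email_plus($address)
{
    $pattern = '/^([-+a-zA-Z0-9_.]+)@'                      // local part, now with +
             . '((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)'  // start of a bracketed IP
             . '|(([a-zA-Z0-9-]+\.)+))'                     // or dotted domain labels
             . '([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$/D';        // TLD or last IP octet

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $address) === 1;
}

echo is_valid_email_plus('bob+lists@pacific.net.sg') ? "valid\n" : "not recognised\n";
```

The D modifier stops $ from matching before a trailing newline, which a form submission could otherwise slip through.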

Posted: Fri Mar 05, 2004 10:33 am
by Roja
Pyrite wrote: Can you please tell me what part of what I was describing is not RFC compliant for email addresses?
I already did - you didn't allow for single apostrophes - tim.o'malley@slashdot.org is a legal RFC-compliant email address. There are dozens of others. The RFC allows *huge* amounts of variation in emails. You were trying to simplify something that ISN'T easy to simplify due to the RFC.

A great page discussing the challenge is here.

* kevin@kbedell.com, or
* kev(you da man!)in@kbedell.com, or
* kevin@k(evin)bedell.com

Are all valid emails! So are addresses with apostrophes, and more. I mean it literally when I say that whole chapters of books have been written about trying to regex an RFC-compliant version. It's not simple, and you are attempting to make it so - more power to you - but the regex masters have been there, done that, and it's not simple.
Pyrite wrote: Also, that function is copyrighted, and I need to be able to distribute the code for such a function in a pretty close to non-restrictive manner.
All code published online is copyrighted - period. You meant to say that the license is too restrictive - and I already answered that. You can buy the book Mastering Regular Expressions, and then use the regex code from the book in your application. Or, you could look for the BSDL-licensed versions of the function I posted - numerous others have translated it from the book slightly differently.

The app I used it in is under the GPL, and the source I used had posted it under the GPL. It was clearly different than killersoft's version.

Point being - there is ONE definitive regex, I pointed it out, and you can use it and license it, as long as you give proper attribution - if you find a version licensed in a way that allows that.

You did follow up with a question --
Pyrite wrote: Also, that script validates these addresses as valid/true. And we all know they aren't.

"betty@hotmail"
"betty&@hotmail"
" betty@hotmail.com"
Actually - we don't all know that. The addresses themselves are in fact RFC valid. The first two aren't ROUTABLE because the TLD (top-level-domain) isn't real. (there is no . TLD that is routable, sorta). That's not email validation - that is DOMAIN validation, a very different problem set.

I won't spend a ton of time explaining why the last is in fact valid. The RFC itself could allow you to spend HOURS dissecting an address into "words, atoms, and values", and the allowed contents of each. However, I will simplify by saying once again - the TLD .com" doesn't exist - only .com does. That makes it a DOMAIN validation, not an email validation.

The RFC makes it very clear that DOMAIN validation is NOT email validation, and you are confusing them. :)

http://www.faqs.org/rfcs/rfc822.html

As to the question of why to allow the backslash - it is allowed. You have to use it to 'escape' other characters like white space (legal), and " characters.

Like I said, hideously complicated.
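
To make the format-versus-domain distinction concrete, here is a rough sketch that keeps the two questions in separate functions. Nothing here comes from the thread's posted code: the function names are hypothetical, filter_var() with FILTER_VALIDATE_EMAIL is PHP's built-in check (which the PHP manual describes as narrower than RFC 822, with no comments and no dotless domains), and checkdnsrr() needs network access, so the lookup is passed in as a callable that can be replaced.

```php
<?php
// Hypothetical split: one function answers "is the syntax acceptable?",
// the other answers "does the domain advertise a mail server?".
function has_acceptable_format($address)
{
    // Built-in syntax check; it does no DNS lookup of any kind.
    return filter_var($address, FILTER_VALIDATE_EMAIL) !== false;
}

function domain_accepts_mail($address, $lookup = 'checkdnsrr')
{
    // Take everything after the last @ as the domain.
    $at = strrpos($address, '@');
    if ($at === false) {
        return false;
    }
    $domain = (string) substr($address, $at + 1);
    if ($domain === '') {
        return false;
    }

    // $lookup defaults to checkdnsrr(), which asks DNS for an MX record.
    // Pass another callable (same arguments) to stub it out in tests.
    return (bool) call_user_func($lookup, $domain, 'MX');
}
```

An address would then be accepted only when both functions return true, and each check can be tightened or relaxed without touching the other.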

Posted: Wed Mar 24, 2004 1:25 am
by Pyrite
Late reply, but thank you Roja and redmonkey for your help, I understand a lot more. Last question: I am using a combination of two functions, one to validate domains in an email address and the other to validate syntax/format using redmonkey's posted function. But his function does not validate "bob@city.museum". Ideas?

Thanks again!!

Posted: Wed Mar 24, 2004 6:16 am
by redmonkey
If you are already validating the domain and just want to validate the format, then the quick way is to change {2,4} to {2,6}, which will then accept museum as valid.
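
A small sketch of what that change does, assuming a PCRE version of the pattern and leaving out the bracketed-IP branch for brevity. The helper matches_with_tld_max and its $max parameter are hypothetical and exist only so the two bounds can be compared side by side.

```php
<?php
// Hypothetical helper: same pattern shape as the posted function, with the
// upper bound on TLD length passed in so both settings can be compared.
function matches_with_tld_max($address, $max)
{
    $pattern = '/^[-a-zA-Z0-9_.]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,' . (int) $max . '}$/D';
    return preg_match($pattern, $address) === 1;
}

var_dump(matches_with_tld_max('bob@city.museum', 4)); // bool(false)
var_dump(matches_with_tld_max('bob@city.museum', 6)); // bool(true)
```

Any fixed upper bound will eventually reject some legitimate TLD, so this is best left to the separate domain check where one exists.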

Posted: Fri Mar 26, 2004 12:09 am
by Pyrite
Excellent, thanks!!