Regex: Email Format Validation

Jaxolotl
Forum Contributor
Posts: 137
Joined: Mon Nov 13, 2006 4:19 am
Location: Argentina and Italy

Post by Jaxolotl »

I use this regexp. It's a bit (bit? :) ) permissive, but it has worked fine so far.

 
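function validate_email($email_string) {
    // preg_match with the /i flag replaces eregi, which was removed in PHP 7.
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/i';

    if (preg_match($pattern, trim($email_string))) {
        return true;
    }
    else {
        echo "my error message";
        return false;
    }
}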
What do you think about it?
Is it necessary to implement a stricter one?
In which cases would you recommend improving it?
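
To make the "permissive or strict?" question concrete, here is a minimal sketch that runs the same pattern through preg_match against a few sample addresses. The function name and the sample addresses are illustrative assumptions, not part of the original post. With this pattern the first and last samples are accepted and the middle two are rejected, so it is loose in some places and too strict in others.

```php
<?php
// Same pattern as in the post above, run through preg_match
// (eregi was removed in PHP 7). The function name is hypothetical.
function matches_simple_pattern(string $email): bool
{
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/i';
    return preg_match($pattern, trim($email)) === 1;
}

// Illustrative sample addresses.
$samples = [
    'user@example.com',      // ordinary address
    'user+tag@example.com',  // "+" is legal in a local part
    'user@example.museum',   // TLD longer than four letters
    '-@-.aa',                // syntactically odd
];

foreach ($samples as $address) {
    $verdict = matches_simple_pattern($address) ? 'accepted' : 'rejected';
    printf("%-22s %s\n", $address, $verdict);
}
```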
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

If you're going to use regex, you may as well use the fully standards compliant one.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

It's fun to see people being pedantic about web standards, as it's a problem I run into all the time when trying to write high-quality code.

I have little doubt that the code supplied correctly verifies RFC 822 addresses: while a bunch of unit tests would be nice, I trust you guys to write the correct stuff.

Whether this code is practical to use in a real-world setting, however, is another matter. I think this is the point redmonkey brought up, without realizing that the discussion, up to this point, was purely theoretical.

I think it would be useful to also discuss how practical it would be to use such a monster regex. There were several things brought up already, and also some more topics:

1. Whether or not such complex processing is required for a validation process that will end up being further checked through, say, a verification email (on a similar tack, can you get away with no processing if you send out validation emails?)
2. What would a good, practical and concise email regex be that would adhere to both the RFC and real world usage of email addresses?
3. Under what circumstances would such strict checking be merited? Before you sent an email to the address? For inclusion in a mailto link?
4. How would one go about making the regex faster without sacrificing RFC-compliance? Also, could you make a regex that parses the email into its component parts so that you could do more fine-grained filtering?
5. Why doesn't PHP have a native RFC-compliant email validation function? (okay, maybe filter, but I don't know if it's standards compliant)
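
On point 4, a rough sketch of splitting an address into its parts with named capture groups might look like the following. The helper name split_email is hypothetical, and the pattern only separates the local part from the domain at the last "@"; it does not validate either part against the RFC.

```php
<?php
// Hypothetical helper: split an address into local part and domain.
// The greedy ".+" makes the split happen at the LAST "@", so a quoted
// local part that contains "@" stays in one piece.
function split_email(string $email): ?array
{
    if (preg_match('/^(?<local>.+)@(?<domain>[^@]+)$/', $email, $m) !== 1) {
        return null; // no "@", or an empty local part or domain
    }
    return ['local' => $m['local'], 'domain' => $m['domain']];
}

var_dump(split_email('user@example.com'));
```

Once the parts are separated, finer-grained filtering (for example, a domain allowlist) can be applied to each part on its own.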
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

There are any number of reasons.

1. I really want to validate the email syntax without requiring a verification mail (newsletter signup, temp notices, etc.)
2. To avoid using the common misconceived regexes which are not RFC compliant, but are also not permissive enough to at least allow RFC syntax.
3. The Zend Framework refuses to implement anything else...;)
4. Is the regex that costly for infrequent signups or email submissions?

On ext/filter, the source code implements an RFC822 compliant regex (note the one in this topic is a regex builder - not the actual string regex) which is already in PEAR. See:

http://cvs.php.net/viewvc.cgi/pear/HTML ... iew=markup

See also logical_filters.c in the filter source:
[c]
void php_filter_validate_email(PHP_INPUT_FILTER_PARAM_DECL) /* {{{ */
{
    /* From http://cvs.php.net/co.php/pear/HTML_Qui ... .php?r=1.4 */
    const char regexp[] = "/^((\\\"[^\\\"\\f\\n\\r\\t\\v\\b]+\\\")|([\\w\\!\\#\\$\\%\\&\\'\\*\\+\\-\\~\\/\\^\\`\\|\\{\\}]+(\\.[\\w\\!\\#\\$\\%\\&\\'\\*\\+\\-\\~\\/\\^\\`\\|\\{\\}]+)*))@((\\[(((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9])))\\])|(((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9])))|((([A-Za-z0-9\\-])+\\.)+[A-Za-z\\-]+))$/";

    pcre       *re = NULL;
    pcre_extra *pcre_extra = NULL;
    int preg_options = 0;
    int         ovector[150]; /* Needs to be a multiple of 3 */
    int         matches;

    re = pcre_get_compiled_regex((char *)regexp, &pcre_extra, &preg_options TSRMLS_CC);
    if (!re) {
        RETURN_VALIDATION_FAILED
    }
    matches = pcre_exec(re, NULL, Z_STRVAL_P(value), Z_STRLEN_P(value), 0, 0, ovector, 3);

    /* 0 means that the vector is too small to hold all the captured substring offsets */
    if (matches < 0) {
        RETURN_VALIDATION_FAILED
    }
}
[/c]
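
From userland PHP, this validator is reached through filter_var() with the FILTER_VALIDATE_EMAIL flag. The wrapper below is a minimal sketch; its name is an assumption, and the pattern your PHP build applies internally is not necessarily the one quoted above.

```php
<?php
// filter_var() with FILTER_VALIDATE_EMAIL returns the address on success
// and false on failure, so compare against false explicitly.
// The wrapper name is hypothetical.
function is_valid_email_format(string $email): bool
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email_format('user@example.com'));
var_dump(is_valid_email_format('not an email'));
```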
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Maugrim_The_Reaper wrote:I really want to validate the email syntax without requiring a verification mail (newsletter signup, temp notices, etc.)
But how do you prove that the email address belongs to the person?
Maugrim_The_Reaper wrote:To avoid using the common misconceived regexes which are not RFC compliant, but are also not permissive enough to at least allow RFC syntax.
Good reason. I still have not found such a regex. But once again: would it hurt not to validate at all?
Maugrim_The_Reaper wrote:The Zend Framework refuses to implement anything else...
Haha. They should at least provide a faster alternative, but that's cool. Do you have a mailing list discussion you can point me to so I can delve further?
Maugrim_The_Reaper wrote:Is the regex that costly for infrequent signups or email submissions?
Not so much for those procedures, but if you're filtering a document of HTML with potentially many mailtos, each regex call is precious.
Maugrim_The_Reaper wrote:See also logical_filters.c in the filter source:
When in doubt, check the source. But the regex seems very compact for an email validation regex. In contrast, once fully assembled, $mailbox is 7420 chars long; the other is only 604.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

No mailing list discussion, really. It was decided long ago for Zend_Filter; then there was the odd request from users (RFC822 is usually requested), which had the developers pointing out that they could only accept a solution from someone who wrote an original version and signed a CLA.
Ambush Commander wrote:But how do you prove that the email address belongs to the person?
That's a different question :). The Regex only validates syntax/format to RFC822; whether the address exists or not is beyond its scope.
Ambush Commander wrote:But once again: would it hurt not to validate at all?
Depends on the circumstances. If the email is being used as temporary identification (no/rare emailing of user) then it's probably useful to do so. If you have a long mailing list, then it's also useful to remove invalid addresses before starting a mass mail. Within reason, if the user must be sent a message, and they must validate from a link/address in that email message, then the point is probably moot - it's not strictly necessary then.

Another point is to provide user feedback - what if they submit an invalid email by accident?

Is it necessary? Depends. Is it useful? Definitely. Usage is optional.
Ambush Commander wrote:When in doubt, check the source. But the regex seems very compact for an email validation regex.
So true ;).

I might dig for this myself, but either it's a fantastic Regex or it's extra permissive. RFC822 is pretty complex. Once you get over the basics, there's a ton of detail. I won't rule it out as being a far more optimised regex until it's proven either way.
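
The mass-mail case mentioned above (dropping malformed addresses before a send) could be sketched roughly as follows. The helper name is hypothetical, and it uses filter_var() as the syntax check rather than the RFC822 regex discussed in this thread; any other format check could be substituted.

```php
<?php
// Hypothetical helper: keep only the entries that pass a syntax check.
// This says nothing about whether a mailbox actually exists.
function drop_malformed_addresses(array $addresses): array
{
    $clean = [];
    foreach ($addresses as $address) {
        $address = trim((string) $address);
        if (filter_var($address, FILTER_VALIDATE_EMAIL) !== false) {
            $clean[] = $address;
        }
    }
    return $clean;
}

print_r(drop_malformed_addresses(['a@example.com', 'broken@', ' b@example.org ']));
```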
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Maugrim_The_Reaper wrote:That's a different question :). The Regex only validates syntax/format to RFC822, whether it exists or not is beyond its scope.
But it's a question that comes along often enough to merit consideration. Once again: theory versus practice.
Maugrim_The_Reaper wrote:Depends on the circumstances. If the email is being used as temporary identification (no/rare emailing of user) then it's probably useful to do so. If you have a long mailing list, then it's also useful to remove invalid addresses before starting a mass mail. Within reason, if the user must be sent a message, and they must validate from a link/address in that email message, then the point is probably moot - it's not strictly necessary then.
Hmm... that's a very interesting assertion, and it's in line with what I've noticed other programs like browsers and email clients do. Firefox has no qualms about sending mailto:@@@ to your mail client.

If this is true, I do not have to validate the contents inside a mailto: link.
Maugrim_The_Reaper wrote:Another point is to provide user feedback - what if they submit an invalid email by accident?
This is usually addressed by requiring the user to enter the email twice.
Maugrim_The_Reaper wrote:I might dig for this myself, but either it's a fantastic Regex or it's extra permissive. RFC822 is pretty complex. Once you get over the basics, there's a ton of detail. I won't rule it out as being a far more optimised regex until it's proven either way.
Would love to see the results.
The Phoenix
Forum Contributor
Posts: 294
Joined: Fri Oct 06, 2006 8:12 pm

Post by The Phoenix »

Ambush Commander wrote:1. Whether or not such complex processing is required for a validation process that will end up being further checked through, say, a verification email (on a similar tack, can you get away with no processing if you send out validation emails?)
Whether it is required depends on the situation, I would imagine.

Earlier, the comments implied that regex checking ensures that the email is a valid format, not a valid destination, right? So, by checking the format first, you would eliminate emails being sent to malformed addresses. That could mean fewer attacks against the mail server, including a denial of service where bots sign up thousands of fake emails.
Ambush Commander wrote:2. What would a good, practical and concise email regex be that would adhere to both the RFC and real world usage of email addresses?
If it allows more than the RFC, then it is too loose. If it doesn't allow enough, then it is too strict. In either case, it's not RFC-compliant, I thought.
Ambush Commander wrote:4. How would one go about making the regex faster without sacrificing RFC-compliance? Also, could you make a regex that parses the email into its component parts so that you could do more fine-grained filtering?
Have you tested the regex and found it unacceptably slow? What's the timing?
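
A rough way to answer the timing question is to run the pattern in a loop and average. The sketch below is an assumption-laden illustration: time_regex is a hypothetical helper, and the pattern shown is the simple one from the first post, standing in for whichever regex you actually want to measure. Results will vary by machine and PHP build.

```php
<?php
// Rough timing sketch: average seconds per preg_match() call.
// The helper name is hypothetical.
function time_regex(string $pattern, string $subject, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return (microtime(true) - $start) / $iterations;
}

// Stand-in pattern; substitute the regex under test.
$pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/i';
printf("%.3f microseconds per call\n", time_regex($pattern, 'user@example.com') * 1e6);
```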
StrikerKP
Forum Newbie
Posts: 1
Joined: Fri Jul 20, 2007 10:17 am

Post by StrikerKP »

Well, that kind of email validation is one way to do it, but there's a great post here on how to check whether an email address actually exists. It's pretty kickass, check it out: http://www.static-chaos.net/viewtutoria ... Validation
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I have seen that technique before. As far as I can remember, some hosts don't support this method because it means that emails can be "discovered".
cade
Forum Commoner
Posts: 55
Joined: Tue Jul 03, 2007 8:18 pm

Post by cade »

Can I have validation like this with PHP:

* Format: "address@domain.xxx" or "domain.co.uk", ...
* Forbidden characters: ?, !, *, ...
* Valid domain: looser@red.mond is not valid
* Valid user: Verify that the user and mailbox really exist
mrkite
Forum Contributor
Posts: 104
Joined: Tue Sep 11, 2007 4:19 am

Post by mrkite »

Woah, old thread.

I've found that roll-your-own email validation is tricky. You don't want to reject someone's valid email address because your code is wrong.

However, there are plenty of places where you want to prevent someone from, say, submitting a comma-separated list of email addresses, or including headers that will allow them to send spam through you.

I have found that PEAR::Validate is the safest bet. It will reject spammer attempts to mass mail through you, but still allow every valid email address I've run into.
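
The two abuse cases described above (address lists and injected headers) can be guarded against with a small pre-check. This is a sketch under stated assumptions: the function name is hypothetical, it does not use or reproduce PEAR::Validate, and it is deliberately stricter than the RFC because it also rejects commas that would be legal inside a quoted local part.

```php
<?php
// Hypothetical guard for an address that will be placed in a mail header.
// It rejects line breaks (header injection) and list separators (one
// submission fanning out to many recipients), then checks the syntax.
function is_single_safe_address(string $email): bool
{
    if (preg_match('/[\r\n,;]/', $email) === 1) {
        return false;
    }
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_single_safe_address("a@example.com\r\nBcc: b@example.com"));
```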
The Phoenix
Forum Contributor
Posts: 294
Joined: Fri Oct 06, 2006 8:12 pm

Post by The Phoenix »

cade wrote:Can I have validation like this with PHP:
You can choose to write your validation code to allow/disallow whatever you choose. However, that doesn't make it 'valid' or 'invalid' email - it just means it doesn't pass your criteria.

cade wrote: * Format: "address@domain.xxx" or "domain.co.uk", ...
Most will allow those..
cade wrote: * Forbidden characters: ?, !, *, ...
Easy enough
cade wrote: * Valid domain: looser@red.mond is not valid
That's not true. It is a valid email address, especially if you are on the mond domain. Many small businesses do something similar for internal mail.
cade wrote: * Valid user: Verify that the user and mailbox really exist
That you cannot reliably do. It's not PHP's shortcoming - it's a protective feature of most mail servers to prevent spam searches.
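
If a domain check is wanted anyway, one common approach is a DNS lookup with PHP's checkdnsrr(). The sketch below is hedged accordingly: the helper name is hypothetical, the lookup is injectable only so the logic can be tried without a network, and a failed lookup does not prove an address is invalid (an internal-only domain like the one above would fail it).

```php
<?php
// Hypothetical helper: does the address's domain publish MX or A records?
// $lookup defaults to PHP's checkdnsrr(); pass a stub to test offline.
function domain_has_mail_records(string $email, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'checkdnsrr';
    $at = strrpos($email, '@');
    if ($at === false || $at === strlen($email) - 1) {
        return false; // no "@" or nothing after it
    }
    $domain = substr($email, $at + 1);
    return $lookup($domain, 'MX') || $lookup($domain, 'A');
}
```

Whether the mailbox itself exists still cannot be established this way; only a delivered verification message answers that.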
cade
Forum Commoner
Posts: 55
Joined: Tue Jul 03, 2007 8:18 pm

Post by cade »

But I have seen this work somewhere... I just don't know what engine they use to validate the existence of the mail user.
chandan
Forum Newbie
Posts: 1
Joined: Wed Apr 09, 2008 4:24 am

Re: Regex: Email Format Validation

Post by chandan »

Use this to validate ',' (comma) separated email addresses. This regular expression is not yet performance-tuned, but it should be good for validating emails.

Expression: ^(\w+(.|_)\w+@\w+\.\w+)(,(\w+(.|_)\w+@\w+\.\w+)|\S)+$
Don't forget to add escape sequences to suit your environment.

-Chandan Benjaram
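
An alternative to one large expression for comma-separated input is to split the list and check each entry on its own. This is a sketch, not a replacement for the expression above: the helper name is hypothetical, it relies on filter_var() for the per-address check, and it assumes no commas appear inside quoted local parts.

```php
<?php
// Hypothetical helper: validate a comma-separated list by splitting it
// and checking each entry separately. An empty entry fails the check.
function all_addresses_valid(string $list): bool
{
    $parts = array_map('trim', explode(',', $list));
    foreach ($parts as $part) {
        if (filter_var($part, FILTER_VALIDATE_EMAIL) === false) {
            return false;
        }
    }
    return true;
}

var_dump(all_addresses_valid('a@example.com, b@example.org'));
```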