Regex: Email Format Validation

Small, short code snippets that other people may find useful. Do you have a good regex that you would like to share? Share it! Even better, the code can be commented on, and improved.

Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Okie, Roja has a nifty RFC-compliant regex function to validate an email address. Unfortunately the sticky term "GPL" is mentioned a few times. The following is a non-GPL version, free to use under its copyright notice. It's a straightforward translation from the Perl source at http://examples.oreilly.com/regex/email-opt.pl

You can view Roja's GPL version at: http://svn.gna.org/viewcvs/blacknova/tr ... iew=markup


Code: Select all

<?php
/*
    The "Mastering Regular Expressions" Email Regex (from book on page 295 et seq)
 
    Based on optimised email regex in Perl at http://examples.oreilly.com/regex/email-opt.pl
    Copyright 1997 O'Reilly & Associates, Inc.
    Changes submitted includes this static class structure, and translation from Perl to PHP syntax
    Changes (c) 2005 Padraic Brady (this version only)
 
    Original file header below in EmailFormatValidator::isValid() method source code
    This static class tests compliance of the email format with RFC 822, the current definitive standard for
    email address formatting. Note: Compliance with RFC 2822 is not checked, since this RFC is Proposed
    and would reject addresses currently in use.
*/
 
/* Usage:
        EmailFormatValidator::isValid('myname@mydomain.com');
        returns (integer) 1 for a valid address, (integer) 0 otherwise, and false for empty input
*/
 
class EmailFormatValidator {
 
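    // PHP 5+ constructor; the class is only meant to be called statically.
    function __construct() {
        trigger_error('Static calling only to EmailFormatValidator::isValid()', E_USER_NOTICE);
    }
 
    // Declared static so that EmailFormatValidator::isValid() also works on PHP 8,
    // where calling a non-static method statically is an error.
    public static function isValid($email=null) {
        if(empty($email))   // empty() already covers null and ''
        {
            return false;
        }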
 
        //
        // Program to build a regex to match an internet email address,
        // from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
        // (http://www.ora.com/catalog/regexp/)
        //
        // Optimized version.
        //
        // Copyright 1997 O'Reilly & Associates, Inc.
        //
        
        // Some things for avoiding backslashitis later on.
        $esc        = '\\\\';               $Period      = '\.';
        $space      = '\040';               $tab         = '\t';
        $OpenBR     = '\[';                 $CloseBR     = '\]';
        $OpenParen  = '\(';                 $CloseParen  = '\)';
        $NonASCII   = '\x80-\xff';          $ctrl        = '\000-\037';
        $CRlist     = '\n\015';  // note: this should really be only \015.
 
        // Items 19, 20, 21
        $qtext = "[^$esc$NonASCII$CRlist\"]";               // for within "..."
        $dtext = "[^$esc$NonASCII$CRlist$OpenBR$CloseBR]";  // for within [...]
        $quoted_pair = " $esc [^$NonASCII] ";               // an escaped character
 
        //#############################################################################
        // Items 22 and 23, comment.
        // Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
        $ctext = " [^$esc$NonASCII$CRlist()] ";
 
        // $Cnested matches one non-nested comment.
        // It is unrolled, with normal of $ctext, special of $quoted_pair.
        $Cnested =
            "$OpenParen"                        //  (
            ."$ctext*"                          //     normal*
            ."(?: $quoted_pair $ctext* )*"      //     (special normal*)*
            ."$CloseParen"                      //                       )
        ;
 
        // $comment allows one level of nested parentheses
        // It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
        $comment =
            "$OpenParen"                        //  (
            ."$ctext*"                          //     normal*
            .'(?:'                              //       (
            ."(?: $quoted_pair | $Cnested )"    //         special
            ."$ctext*"                          //         normal*
            .')*'                               //            )*
            ."$CloseParen"                      //                )
        ;
 
        //#############################################################################
 
        // $X is optional whitespace/comments.
        $X =
            "[$space$tab]*"                     // Nab whitespace.
            ."(?: $comment [$space$tab]* )*"    // If comment found, allow more spaces.
        ;
 
        // Item 10: atom
        $atom_char   = "[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]";
        $atom =
            "$atom_char+"                       // some number of atom characters...
            ."(?!$atom_char)"                   // ..not followed by something that could be part of an atom
        ;
 
        // Item 11: doublequoted string, unrolled.
        $quoted_str =
            "\""                                // "
            ."$qtext *"                         //   normal
            ."(?: $quoted_pair $qtext * )*"     //   ( special normal* )*
            ."\""                               //        "
        ;
 
        // Item 7: word is an atom or quoted string
        $word =
            '(?:'
            ."$atom"                            // Atom
            .'|'                                //  or
            ."$quoted_str"                      // Quoted string
            .')'
        ;
 
        // Item 12: domain-ref is just an atom
        $domain_ref  = $atom;
 
        // Item 13: domain-literal is like a quoted string, but [...] instead of  "..."
        $domain_lit  =
            "$OpenBR"                           // [
            ."(?: $dtext | $quoted_pair )*"     //    stuff
            ."$CloseBR"                         //           ]
        ;
 
        // Item 9: sub-domain is a domain-ref or domain-literal
        $sub_domain  =
            '(?:'
            ."$domain_ref"
            .'|'
            ."$domain_lit"
            .')'
            ."$X"                               // optional trailing comments
        ;
 
        // Item 6: domain is a list of subdomains separated by dots.
        $domain =
             "$sub_domain"
             .'(?:'
                ."$Period $X $sub_domain"
             .')*'
        ;
 
        // Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
        $route =
            "\@ $X $domain"
            ."(?: , $X \@ $X $domain )*"        // additional domains
            .':'
            ."$X"                               // optional trailing comments
        ;
 
        // Item 6: local-part is a bunch of $word separated by periods
        $local_part =
            "$word $X"
            .'(?:'
            ."$Period $X $word $X"              // additional words
            .')*'
        ;
 
        // Item 2: addr-spec is local@domain
        $addr_spec  = "$local_part \@ $X $domain";
 
        // Item 4: route-addr is <route? addr-spec>
        $route_addr =
            "< $X"                              // <
            ."(?: $route )?"                    //       optional route
            ."$addr_spec"                       //       address spec
            .'>'                                //                 >
        ;
 
        // Item 3: phrase........
        $phrase_ctrl = '\000-\010\012-\037';    // like ctrl, but without tab
 
        // Like atom-char, but without listing space, and uses phrase_ctrl.
        // Since the class is negated, this matches the same as atom-char plus space and tab
        $phrase_char = "[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]";
 
        // We've worked it so that $word, $comment, and $quoted_str to not consume trailing $X
        // because we take care of it manually.
        $phrase =
            "$word"                             // leading word
            ."$phrase_char *"                   // "normal" atoms and/or spaces
            .'(?:'
            ."(?: $comment | $quoted_str )"     // "special" comment or quoted string
            ."$phrase_char *"                   //  more "normal"
            .")*"
        ;
 
        // Item #1: mailbox is an addr_spec or a phrase/route_addr
        $mailbox =
            "$X"                                // optional leading comment
            .'(?:'
            ."$addr_spec"                       // address
            .'|'                                //  or
            ."$phrase  $route_addr"             // name and address
            .')'
        ;
 
        // EOF Email RFC regex
 
        // perform the actual regex check against the received email address
        $isValid = preg_match("/^$mailbox$/xS",$email);
        return $isValid;
    }
    
}
 
?>
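The "unrolled" shape that the comments in the class keep referring to, normal* followed by (special normal*)*, is easier to see in isolation on a plain double-quoted string. The following is only a minimal sketch, assuming backslash escapes and nothing else from the RFC; the variable names are made up for illustration and are not part of the class.

```php
<?php
// Sketch only: two patterns that accept the same double-quoted strings.
// $alternation tries "ordinary char OR escaped char" at every position;
// $unrolled is the normal* (special normal*)* form used in the class.
$alternation = '/^"(?:[^"\\\\]|\\\\.)*"$/Ds';
$unrolled    = '/^"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"$/Ds';

$samples = array(
    '"plain text"',        // no escapes at all
    '"a \\" b"',           // contains an escaped quote
    '"a " b"',             // stray unescaped quote, should be rejected
);

foreach ($samples as $s) {
    $a = preg_match($alternation, $s);
    $u = preg_match($unrolled, $s);
    echo $s, ' => ', $a, ' / ', $u, "\n";
}
```

Both patterns should agree on every sample; the unrolled form just avoids trying an alternation at each character, which is presumably why the optimised version of the book's regex is written that way.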
Last edited by feyd on Fri Aug 01, 2008 4:26 pm, edited 3 times in total.
Reason: Fix tags.
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

Code: Select all

$email = '"thisis a test"."of validation"@c';
 
echo validateEmailFormat($email) ? 'Yep' : 'Nope';
 
echo "\x0a";
 
echo EmailValidator::isValid($email) ? 'Yep' : 'Nope';
 
Outputs....

Code: Select all

 
Yep
Yep
 
I wonder where these mails would end up?
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

"thisis a test"."of validation"@c

This is a valid email address format ;) Under the RFC, local names containing spaces must be quoted. Periods in the local name are valid. And the domain need not end in .com or similar. Therefore the regex Roja has implemented will indeed test true.

RFC 822 states...
For example, the address:

First.Last@Registry.Org

is legal and does not require the local-part to be surrounded
with quotation-marks. (However, "First Last" DOES require
quoting.)
Of course, the next logical test is to actually send a validation email to the address for confirmation, and your email address would ultimately fail that. The point is that email addresses of the form you tested WOULD be passed, whereas they would not be under other more commonly used regular expressions. Rejecting them is a pain to people with such addresses. I still have issues on some sites with my own address, which uses a common First.Last format for the local name...:(

Actually I think you already know all this... Why else such an obvious address that is valid? I assume you probably meant to point out that this is not the only step that should be used - that the email should be tested and verified.
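
To make the rules above concrete, here is a deliberately cut-down sketch of the same grammar: the local part is a series of words separated by periods, a word is either an atom or a quoted string, and the domain is allowed to be a single label. It assumes no comments, no folding whitespace and no domain literals, and the function name looksLikeAddrSpec() is hypothetical; it is not the posted class.

```php
<?php
// Cut-down sketch of RFC 822 addr-spec; NOT the full regex from the class.
// Assumptions: no comments, no folding whitespace, no domain literals.
function looksLikeAddrSpec(string $email): bool
{
    // atom: anything except controls, space, specials and 8-bit characters
    $atom   = '[^\x00-\x20()<>@,;:\\\\".\[\]\x80-\xff]+';
    // quoted string: "..." where a backslash escapes the next character
    $quoted = '"(?:[^"\\\\\r\n\x80-\xff]|\\\\[\x00-\x7f])*"';
    $word   = "(?:$atom|$quoted)";

    $localPart = "$word(?:\\.$word)*";   // words separated by periods
    $domain    = "$atom(?:\\.$atom)*";   // one label is enough, no TLD required

    return preg_match("/^$localPart@$domain\$/D", $email) === 1;
}

var_dump(looksLikeAddrSpec('"thisis a test"."of validation"@c'));  // quoted words, single-label domain
var_dump(looksLikeAddrSpec('First.Last@Registry.Org'));            // plain atoms
var_dump(looksLikeAddrSpec('First Last@Registry.Org'));            // unquoted space in the local part
```

Only the third call should come back false, since the space sits outside any quotes.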
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

redmonkey wrote:I wonder where these mails would end up?
The function I use isn't meant to test whether an email account is active, routable, and in use.

It is meant to test compliance TO the RFC. The subset of email addresses that are routable is (as you imply) smaller than the RFC allows. The subset of THOSE addresses that are in use or active is smaller still.

Both are things a regex cannot easily accomplish (imho). However, if you send a validation email, and the user doesn't receive it, you've tested the last two successfully.

The goal of the function was to ensure that an email address provided complies with the RFC, to reduce the number of incorrectly formatted emails the server sends.

But taking it a step further: Yes, you might also want to validate the latter two issues as well, but that's a separate problem.
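
The separation described above can be sketched as two independent steps: a format check that never touches the network, and a routing check that only runs once the format has passed. This is a hypothetical helper, not part of either posted function; checkAddress() and its two callback parameters are names invented here, and it assumes a bare local@domain address and PHP 7.4+ for the arrow functions.

```php
<?php
// Hypothetical sketch: keep "is the format valid?" apart from "is it routable?".
// $isValidFormat and $hasMailHost are injected so either step can be swapped.
function checkAddress(string $email, callable $isValidFormat, callable $hasMailHost): array
{
    if (!$isValidFormat($email)) {
        // Badly formatted: routing is never attempted.
        return array('format' => false, 'routable' => null);
    }

    $at = strrpos($email, '@');
    if ($at === false) {
        return array('format' => true, 'routable' => null);
    }
    $domain = substr($email, $at + 1);

    return array('format' => true, 'routable' => (bool) $hasMailHost($domain));
}

// Stub callbacks so the sketch runs without a network connection.
$formatStub = fn ($e) => strpos($e, '@') !== false;
$dnsStub    = fn ($d) => $d === 'mydomain.com';

print_r(checkAddress('myname@mydomain.com', $formatStub, $dnsStub));
print_r(checkAddress('not-an-address', $formatStub, $dnsStub));
```

In real use the first callback could be array('EmailFormatValidator', 'isValid') and the second something like fn ($d) => checkdnsrr($d, 'MX'). Neither proves the mailbox exists, which is what the validation email is for.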
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

Maugrim_The_Reaper wrote:RFC 822 states...
RFC 822 is obsolete
Roja wrote:The function I use isn't meant to test whether an email account is active, routable, and in use.

It is meant to test compliance TO the RFC.
I realise the purpose of the function, however, stating 'compliance TO the RFC.' in both these functions is a somewhat general statement (RFC what exactly?) and also inaccurate as both these functions 'appear' to have been written referencing obsolete guidelines (RFC 822 in this case).

Apart from the points I've noted above, I would also question the 'suitability for purpose'. By that I mean as there is no/little documentation and no examples or notes as to usage of this function I think the goals of this function may be misleading for the average visitor to this site (<-- that's just my assumption and is by no means to be taken as a condescending remark towards any visitor(s) to this site)

Just to give a clearer idea as to what you can expect from this function, the following email address formats also return a valid response from this function (expected behaviour)...

recipient@onelevel
John Smith <john.smith@domain.com>
A User <a.user@domain>


These are just a few examples; there are other formats which also pass as valid. So my assumption is that this function, while it does what it says on the tin, is probably not what most visitors to this site would be looking for (or indeed expect).
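
For a sign-up form that only wants plain local@domain.tld input, a stricter policy could be layered on top of the RFC check. The sketch below is one possible way to do that; the helper name isPlainSignupAddress() and the policy itself (no display names, no comments, at least one period in the domain) are assumptions for illustration, and it would wrongly reject a quoted local part that legitimately contains a bracket.

```php
<?php
// Hypothetical pre-filter for a sign-up form; the name and the policy are
// assumptions, not part of the posted class. Run it as well as, not instead
// of, the RFC format check.
function isPlainSignupAddress(string $input): bool
{
    // Reject the "phrase <addr>" mailbox form and (comments) outright.
    if (strpbrk($input, '<>()') !== false) {
        return false;
    }

    $at = strrpos($input, '@');
    if ($at === false) {
        return false;
    }
    $domain = substr($input, $at + 1);

    // Require at least two non-empty labels separated by periods.
    return preg_match('/^[^.\s]+(?:\.[^.\s]+)+$/D', $domain) === 1;
}

var_dump(isPlainSignupAddress('recipient@onelevel'));
var_dump(isPlainSignupAddress('John Smith <john.smith@domain.com>'));
var_dump(isPlainSignupAddress('john.smith@domain.com'));
```

Of the three calls, only the last one satisfies this policy.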
Maugrim_The_Reaper wrote:Of course, the next logical test is to actually send a validation email to the address for confirmation, and your email address would ultimately fail that.
Roja wrote:However, if you send a validation email, and the user doesn't receive it, you've tested the last two successfully.
Again, based on assumption of purpose, personally I would consider this function to be only marginally better than not checking the address format at all.
Roja wrote:The goal of the function was to ensure that an email address provided complies with the RFC, to reduce the number of incorrectly formatted emails the server sends.
And although it admittedly gets you closer to achieving that goal than not checking at all, I'm not convinced that checking compliance with RFC 822 would be the correct direction to pursue when determining the valid format of a user supplied email address.
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

redmonkey wrote:
Maugrim_The_Reaper wrote:RFC 822 states...
RFC 822 is obsolete
That's not totally accurate.

For address format, it is still the definitive RFC. It has been *updated* by 1123 and 2156, but neither obsoletes the address formatting portion of 822, so your statement doesn't matter for this discussion. Further, 1123 clarifies ("Updates") 822, and in fact specifically mandates:
RFC-1123 wrote:an implementation generally needs to recognize and correctly interpret all of the RFC-822 syntax.
redmonkey wrote:I realise the purpose of the function, however, stating 'compliance TO the RFC.' in both these functions is a somewhat general statement (RFC what exactly?) and also inaccurate as both these functions 'appear' to have been written referencing obsolete guidelines (RFC 822 in this case).
To the first question, the definitive current guideline for email address formatting: RFC-822. That portion of the standard is not obsolete. To the second, no, they are written to the current (not obsolete) guidelines.
redmonkey wrote:Just to give a clearer idea as to what you can expect from this function, the following email address formats also return a valid response from this function (expected behaviour)...

recipient@onelevel
John Smith <john.smith@domain.com>
A User <a.user@domain>
Great examples! Each in fact is BOTH valid, AND routable - in the right environment.

If I am running the onelevel domain as my localdomain, I can in fact send mail in my network to recipient@onelevel. The second address is perfectly usable in virtually any client. The third is again routable if my local domain is "domain".

This is why RFCs exist. To codify complex issues and make it possible for implementors to write compliant code. We've done so, and your misunderstanding about the usability and purpose in the real world is leading you to confusion and to (inaccurately) criticizing our work.
redmonkey wrote:Apart from the points I've noted above, I would also question the 'suitability for purpose'. By that I mean as there is no/little documentation and no examples or notes as to usage of this function I think the goals of this function may be misleading for the average visitor to this site (<-- that's just my assumption and is by no means to be taken as a condescending remark towards any visitor(s) to this site)
There is little documentation needed. You give it an email, and it either returns TRUE (valid), or FALSE (not valid).

If you have a competing function that you'd like to offer, please do.
redmonkey wrote: So my assumption is that this function, while it does what it says on the tin, is probably not what most visitors to this site would be looking for (or indeed expect).
I've already explained this issue. It tests the email *format*, not the deliverability.
redmonkey wrote:Again, based on assumption of purpose, personally I would consider this function to be only marginally better than not checking the address format at all.
Your opinion of its suitability to task is flawed. It solves the exact problem we said it does: It determines the validity of a user supplied email address, and does it to the definitive requirements.
redmonkey wrote:And although it admittedly gets you closer to achieving that goal than not checking at all, I'm not convinced that checking compliance with RFC 822 would be the correct direction to pursue when determining the valid format of a user supplied email address.
Using your wording, it is the definitive way to determine the valid format of a user supplied email address.

Just because you'd like it to also:

- Verify deliverability
- Verify domain
- Verify user exists

Doesn't change the fact that it is the definitive way to determine the valid format of a user supplied email address.

If you want to offer classes/functions/snippets that go beyond this class to do those things, great - please do. Until then, please don't insult and argue over the work we are offering that solves the problem we set out to solve.
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

Please note, my intention is not to insult or argue, merely to question/debate the relevance of this function in its implied usage.

RFC 2822 obsoletes RFC 822, the following example mail addresses which are RFC 822 compliant.....

Code: Select all

john.smith@test   . example
John Smith <@amachine.tld:john.smith@example.net>
....are now considered obsolete.
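
The two forms shown above (whitespace around a period, and a source route in front of the address) can be spotted with a rough heuristic. This is a sketch rather than a parser: the function name usesObsoleteForm() is hypothetical, and it assumes there are no quoted strings or comments that would confuse the simple patterns.

```php
<?php
// Heuristic sketch, not a parser: flags the two obsolete forms shown above.
// The function name is hypothetical; quoted strings and comments are ignored.
function usesObsoleteForm(string $mailbox): bool
{
    // Work on the part inside <...> when a display name is present.
    $addr = preg_match('/<([^<>]*)>/', $mailbox, $m) === 1 ? $m[1] : $mailbox;

    // Source route such as "@amachine.tld:" in front of the real address.
    $hasRoute = preg_match('/^\s*@[^:]*:/', $addr) === 1;

    // Whitespace on either side of a period, e.g. "test   . example".
    $hasSpacedDot = preg_match('/\s\.|\.\s/', $addr) === 1;

    return $hasRoute || $hasSpacedDot;
}

var_dump(usesObsoleteForm('john.smith@test   . example'));
var_dump(usesObsoleteForm('John Smith <@amachine.tld:john.smith@example.net>'));
var_dump(usesObsoleteForm('john.smith@example.net'));
```

The first two calls should be flagged and the third should not.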

Also, I wouldn't expect/want this function to also verify deliverability, domain or user existence; that is an assumption on your part.

My assumption (and I could be wrong) is that this function is being suggested/offered as part of a user registration/sign up process within a PHP web based application.

I do not dispute that the function achieves the goal of recognizing an RFC 822 formatted email address, but as stated in my previous post, I'm not convinced that it's the best approach in this type of application.

I do not have comparable code/classes/functions etc.. as I have no need to verify compliance of an email address against RFC 822 guidelines.
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

redmonkey wrote:Please note, my intention is not to insult or argue, merely to question/debate the relevance of this function in its implied usage.
The "implied usage" continues to be your assumption, as opposed to the stated purpose. Please leave your assumptions behind, and focus on the stated purpose.
redmonkey wrote:RFC 2822 obsoletes RFC 822, the following example mail addresses which are RFC 822 compliant.....
Please read the section in 2822 which discusses how a parser (that's what our function is) should handle the differences between 822 and 2822:
RFC-2822 wrote:Though some of these syntactic forms MUST NOT be generated according to the grammar in section 3, they MUST be accepted and parsed by a conformant receiver.
Thus, once more, we are following the definitive standard.
redmonkey wrote:My assumption (and I could be wrong) is that this function is being suggested/offered as part of a user registration/sign up process within a PHP web based application.
You are wrong. It is meant as a method to (one more time) determine the valid format of a user supplied email address. Other than looking to argue, I do not see how you can honestly argue that that was not clear in the previous postings.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Has anyone mentioned RFC 2822 has a status of "Proposed"?

Think my last reply vanished in Limbo - oh well...I'll make the main points again.

The only valid point I can see is that the title may be slightly misleading - therefore I will amend to EmailFormatValidator to avoid confusion where newbies may mis-interpret what a Regex is used for...

There's an interesting story about obsolete standards. We can refer to it as the "Look! It's still in use!" tale. It's a plot full of intrigue, misinformation, shadowy characters, and cloak and dagger debates in dark corners...usually Perl forums ;) You seem to be confused about what Roja's function (and my related class based on his source material) intend. Ridiculous statements like "RFC 822 is obsolete" are not accurate. It's still used in reality - and reality, not some theoretical perfect world, is what counts. Let's leave theory at home, and let reality have a little fresh air.

A strict RFC 2822 implementation would reject valid RFC 822 addresses WHICH ARE IN USE. Hence it's not logical to be so strict. It would negate the purpose of the regex. Which is to validate FORMAT. Since the class posted, and Roja's referenced function are successful in that respect they do exactly what they were intended to do.

It would be interesting to see if any RFC 2822 address formats exist which cannot be validated through the RFC 822 regex. Now that would be a useful complaint...
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

Either I am not making myself clear or you are failing to comprehend my comments. Either way I do not wish to waste any more of my time or yours. From your repetitive comments it's clear we are not getting anywhere with this, so I'll bid you fair day.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

As you wish - all I can say is that if you want to present an argument you require evidence in addition to statements of assumed fact. Since you have presented none, and Roja and myself have presented legitimate references to the RFCs, please don't imply we are being repetitive.

Now that IS insulting. :roll:

Fact: RFC 2822 is Proposed
Fact: RFC 822 addresses ARE in use
Fact: We don't want to reject RFC 822 addresses
Fact: We therefore match to RFC 822 regex (and see also 2822's ref. on differences)
Fact: The posted class works correctly and accurately
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

BIGGEST FACT: This topic is not worth all this hassle. The author presented a snippet 'to parse the valid email addresses'. If you like it, use it; otherwise there are hundreds of email regexes lying out there. There is no point in arguing endlessly when it will have no outcome......
And I fear the topic title really is misleading, for that matter.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Edited...

Can Email Format Validator be any more clear? I'm all for making its purpose as obvious as possible...
Afterlife(69)
Spammer :|
Posts: 14
Joined: Mon Oct 03, 2005 4:51 am

Post by Afterlife(69) »

Code: Select all

preg_match('/^[a-z0-9&\'\.\-_\+]+@[a-z0-9\-]+\.([a-z0-9\-]+\.)*?[a-z]+$/is', $email)
this validates fine for me.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Try maugrim@mydomain or "Maugrim Reaper"@mydomain.com

Your regex works in some cases by being extra permissive in what it accepts as a valid format...

The regex Roja posted (and I transferred from the Perl source) complies with the official RFC 822 (while RFC 2822 is Proposed at least). Being extra permissive, however, will allow your regex to cope better than most I've seen - at least it tries to address the obvious problems with validating an email format...;)
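
Running the short pattern posted above against a few addresses shows both sides of this. The snippet below is only a quick check; the test addresses other than the two suggested above are made up for illustration.

```php
<?php
// The short pattern exactly as posted earlier in the thread.
$pattern = '/^[a-z0-9&\'\.\-_\+]+@[a-z0-9\-]+\.([a-z0-9\-]+\.)*?[a-z]+$/is';

$tests = array(
    'myname@mydomain.com',
    'maugrim@mydomain',                 // no period in the domain
    '"Maugrim Reaper"@mydomain.com',    // quoted local part
    'a..b@mydomain.com',                // consecutive periods, not a valid RFC 822 local-part
);

foreach ($tests as $t) {
    echo $t, ' => ', preg_match($pattern, $t), "\n";
}
```

The two addresses suggested above are rejected although RFC 822 allows them, while the last one is accepted although RFC 822 does not (a local-part is a series of words separated by single periods).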