RFC-compliant email validation

Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

I'm using this function to validate email addresses, and it lets me@me through. Why? Is that RFC-compliant?

<?php
// Copyright (C) 2001 Ron Harwood and L. Patrick Smallwood
// This program is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License
// as published by the Free Software Foundation; either version 2
// of the License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
//
// File: functions/validateemailformat.php

function validateEmailFormat ($email)
{
    // This is based on page 295 of the book 'Mastering Regular Expressions' - the most 
    // definitive RFC-compliant email regex.

    // Some shortcuts for avoiding backslashitis
    $esc        = '\\\\';
    $Period      = '\.';
    $space      = '\040';
    $tab         = '\t';
    $OpenBR     = '\[';
    $CloseBR     = '\]';
    $OpenParen  = '\(';
    $CloseParen  = '\)';
    $NonASCII   = '\x80-\xff';
    $ctrl        = '\000-\037';
    $CRlist     = '\n\015';  // note: this should really be only \015.

    // Items 19, 20, 21 -- see table on page 295 of 'Mastering Regular Expressions'
    $qtext = "[^$esc$NonASCII$CRlist\"]";              // for within "..."
    $dtext = "[^$esc$NonASCII$CRlist$OpenBR$CloseBR]"; // for within [...]
    $quoted_pair = " $esc [^$NonASCII] ";              // an escaped character

    // Items 22 and 23, comment.
    // Impossible to do properly with a regex, I make do by allowing at most 
    // one level of nesting.
    $ctext = " [^$esc$NonASCII$CRlist()] ";

    // $Cnested matches one non-nested comment.
    // It is unrolled, with normal of $ctext, special of $quoted_pair.
    $Cnested = "";
    $Cnested .= "$OpenParen";                     // (
    $Cnested .= "$ctext*";                        //       normal*
    $Cnested .= "(?: $quoted_pair $ctext* )*";    //       (special normal*)*
    $Cnested .= "$CloseParen";                    //                         )
    
    // $comment allows one level of nested parentheses
    // It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
    $comment = "";
    $comment .= "$OpenParen";                     //  (
    $comment .= "$ctext*";                        //     normal*
    $comment .= "(?:";                            //       (
    $comment .= "(?: $quoted_pair | $Cnested )";  //         special
    $comment .= "$ctext*";                        //         normal*
    $comment .= ")*";                             //            )*
    $comment .= "$CloseParen";                    //                )
        
    // $X is optional whitespace/comments
    $X = "";
    $X .= "[$space$tab]*";                  // Nab whitespace
    $X .= "(?: $comment [$space$tab]* )*";  // If comment found, allow more spaces
        
        
    // Item 10: atom
    $atom_char = "[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]";
    $atom = "";
    $atom .= "$atom_char+";    // some number of atom characters ...
    $atom .= "(?!$atom_char)"; // ... not followed by something that 
                               //     could be part of an atom
                                    
    // Item 11: doublequoted string, unrolled.
    $quoted_str = "";
    $quoted_str .= "\"";                            // "
    $quoted_str .= "$qtext *";                      //   normal
    $quoted_str .= "(?: $quoted_pair $qtext * )*";  //   ( special normal* )*
    $quoted_str .= "\"";                            //        "
    
    
    // Item 7: word is an atom or quoted string
    $word = "";
    $word .= "(?:";
    $word .= "$atom";        // Atom
    $word .= "|";            // or
    $word .= "$quoted_str";  // Quoted string
    $word .= ")";
        
    // Item 12: domain-ref is just an atom
    $domain_ref = $atom;
    
    // Item 13: domain-literal is like a quoted string, but [...] instead of "..."
    $domain_lit = "";
    $domain_lit .= "$OpenBR";                        // [
    $domain_lit .= "(?: $dtext | $quoted_pair )*";   //   stuff
    $domain_lit .= "$CloseBR";                       //         ]

    // Item 9: sub-domain is a domain-ref or a domain-literal
    $sub_domain = "";
    $sub_domain .= "(?:";
    $sub_domain .= "$domain_ref";
    $sub_domain .= "|";
    $sub_domain .= "$domain_lit";
    $sub_domain .= ")";
    $sub_domain .= "$X"; // optional trailing comments
        
    // Item 6: domain is a list of subdomains separated by dots
    $domain = "";
    $domain .= "$sub_domain";
    $domain .= "(?:";
    $domain .= "$Period $X $sub_domain";
    $domain .= ")*";
        
    // Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
    $route = "";
    $route .= "\@ $X $domain";
    $route .= "(?: , $X \@ $X $domain )*"; // additional domains
    $route .= ":";
    $route .= "$X"; // optional trailing comments
        
    // Item 5: local-part is a bunch of $word separated by periods
    $local_part = "";
    $local_part .= "$word $X";
    $local_part .= "(?:";
    $local_part .= "$Period $X $word $X"; // additional words
    $local_part .= ")*";
        
    // Item 2: addr-spec is local@domain
    $addr_spec = "$local_part \@ $X $domain";

    // Item 4: route-addr is <route? addr-spec>
    $route_addr = "";
    $route_addr .= "< $X";
    $route_addr .= "(?: $route )?"; // optional route
    $route_addr .= "$addr_spec";    // address spec
    $route_addr .= ">";
        
    // Item 3: phrase........
    $phrase_ctrl = '\000-\010\012-\037'; // like ctrl, but without tab
    
    // Like atom-char, but without listing space, and uses phrase_ctrl.
    // Since the class is negated, this matches the same as atom-char plus space and tab
    $phrase_char = "[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]";

    // We've worked it so that $word, $comment, and $quoted_str do not consume trailing
    // $X because we take care of it manually.
    $phrase = "";
    $phrase .= "$word";                            // leading word
    $phrase .= "$phrase_char *";                   // "normal" atoms and/or spaces
    $phrase .= "(?:";
    $phrase .= "(?: $comment | $quoted_str )";     // "special" comment or quoted string
    $phrase .= "$phrase_char *";                   //  more "normal"
    $phrase .= ")*";

    // Item 1: mailbox is an addr_spec or a phrase/route_addr
    $mailbox = "";
    $mailbox .= "$X";                    // optional leading comment
    $mailbox .= "(?:";
    $mailbox .= "$addr_spec";            // address
    $mailbox .= "|";                     // or
    $mailbox .= "$phrase  $route_addr";  // name and address
    $mailbox .= ")";

    // test it and return results
    $isValid = preg_match("/^$mailbox$/xS",$email);

    return $isValid;
} // END validateEmailFormat
?>
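
One caveat about the final preg_match() call in the listing above: in PCRE, a plain `$` also matches just before a trailing newline, so a pattern anchored with `^...$` can accept input that ends in "\n". The sketch below isolates that behaviour with a deliberately tiny stand-in pattern (not the RFC regex from the thread) and shows two common ways to anchor at the true end of the string. The variable names are illustrative only.

```php
<?php
// Stand-in pattern: "something@something" with no whitespace. It is NOT the
// thread's RFC regex; it only isolates how the end anchor behaves.
$core  = '[^@\s]+@[^@\s]+';
$input = "me@me\n"; // note the trailing newline

// Plain "$" matches at the very end OR just before a final newline.
$withDollar = preg_match('/^' . $core . '$/', $input);

// The D (dollar-end-only) modifier restricts "$" to the very end.
$withDollarEndOnly = preg_match('/^' . $core . '$/D', $input);

// "\z" always means the very end of the subject string.
$withZ = preg_match('/\A' . $core . '\z/', $input);

printf("%d %d %d\n", $withDollar, $withDollarEndOnly, $withZ);
```

If that matters for your use case, adding the D modifier to the existing flags (or trimming the input first) would be one way to close the gap.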
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Email existed long before the Internet, my friend.

Yes, it is valid.
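
The reason the function accepts me@me is visible in its own grammar: $domain is one $sub_domain followed by zero or more dot-separated ones, so a single label after the @ is enough. The sketch below is a heavily simplified stand-in (atoms only: no comments, quoted strings, or domain literals) that shows the effect of that quantifier. The function name and the $requireDot flag are hypothetical; requiring a dot is a stricter local policy, not something the grammar asks for.

```php
<?php
// Simplified stand-in for the thread's atom-based grammar, used only to
// show how the quantifier on the domain's extra labels changes the result.
function looksLikeAddrSpec(string $email, bool $requireDot = false): bool
{
    // One or more characters that are not specials, whitespace,
    // control characters, or non-ASCII bytes.
    $atom = '[^()<>@,;:\\\\".\[\]\s\x00-\x1f\x7f-\xff]+';

    // "*" mirrors the grammar (zero or more ".label" parts);
    // "+" is a stricter policy that demands at least one dot.
    $extraLabels = $requireDot ? '+' : '*';

    $pattern = '/\A' . $atom . '(?:\.' . $atom . ')*'
             . '@' . $atom . '(?:\.' . $atom . ')' . $extraLabels . '\z/';

    return preg_match($pattern, $email) === 1;
}

var_dump(looksLikeAddrSpec('me@me'));           // grammar-style check
var_dump(looksLikeAddrSpec('me@me', true));     // stricter policy
var_dump(looksLikeAddrSpec('me@me.com', true)); // stricter policy, dotted domain
```

Whether to apply the stricter variant depends on where the addresses will be used: a dotless domain can be perfectly reachable on a local network.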

Post by Luke »

Oh yeah, I suppose you're right. Thanks!
The Phoenix
Forum Contributor
Posts: 294
Joined: Fri Oct 06, 2006 8:12 pm

Post by The Phoenix »

As a follow-up to this post, it turns out that the function at the top of the thread is *not* RFC-compliant.

Relatively soon, I will be posting a Google Code project with a list of tests to help clarify when a function is not RFC-compliant.

Of roughly a dozen email regexes, the one at the top of the thread was the second best by the criteria I used, so it is still relatively strong. I'll also be posting all the regex variations I found. I will update the forums when I've done so, and maybe we can collectively hunt for better tests and better regexes/functions to solve the problem.

It seems to be a difficult problem to solve well.
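
A list of tests like the one described above could be driven by a small table of addresses and expected results. The sketch below is only an assumption about what such a harness might look like, not the project mentioned in this post: runEmailCases(), the sample cases, and the deliberately naive validator are all hypothetical. Any validator callable, including validateEmailFormat() from the top of the thread, could be passed in instead.

```php
<?php
// Runs every case through the validator and returns the addresses whose
// result differs from the expected value.
function runEmailCases(callable $validator, array $cases): array
{
    $failures = [];
    foreach ($cases as $address => $expected) {
        $actual = (bool) $validator($address);
        if ($actual !== $expected) {
            $failures[] = $address;
        }
    }
    return $failures;
}

// Sample cases: address => whether it should be accepted.
$cases = [
    'me@me'             => true,  // dotless domain, as discussed in the thread
    'me@'               => false, // missing domain
    '@me'               => false, // missing local part
    '"a b"@example.com' => true,  // quoted string in the local part
];

// A deliberately naive validator, so the harness has something to catch.
$naive = static function (string $email): bool {
    return preg_match('/\A[^@\s]+@[^@\s]+\z/', $email) === 1;
};

$failures = runEmailCases($naive, $cases);
print_r($failures);
```

Here the naive validator fails only the quoted local part, which is the kind of gap such a test list is meant to expose.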
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Post by Benjamin »

The best validation is probably just to send them an email with a validation link :wink:
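
A minimal sketch of the validation-link idea, assuming the application stores the token next to the unconfirmed address and mails the link to the user. The function names and the example.invalid URL are placeholders, and storage and mail delivery are left out entirely.

```php
<?php
// Creates an unguessable token: 16 random bytes as 32 hex characters.
function createConfirmationToken(): string
{
    return bin2hex(random_bytes(16));
}

// Builds the link to mail out; $baseUrl is a placeholder for your own endpoint.
function buildConfirmationLink(string $baseUrl, string $email, string $token): string
{
    $query = http_build_query(['email' => $email, 'token' => $token], '', '&');
    return $baseUrl . '?' . $query;
}

// Compares the stored token with the submitted one in constant time.
function tokenMatches(string $storedToken, string $submittedToken): bool
{
    return hash_equals($storedToken, $submittedToken);
}

$token = createConfirmationToken();
echo buildConfirmationLink('https://example.invalid/confirm', 'me@me', $token), "\n";
```

The address counts as confirmed only once the user follows the link and tokenMatches() succeeds against the stored value.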
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

astions wrote: The best validation is probably just to send them an email with a validation link :wink:
Checking that an email address is possible (well-formed) and checking that it is real are two different things. Related, but different.

Post by Benjamin »

Well, say what you will, but if they get the email, it's valid.

EDIT: Never mind, I misread what you wrote.

Post by Luke »

feyd is not arguing that point. That is absolutely correct, but there are times when you need to check whether an email address "looks" valid without needing to know whether it exists.

EDIT: wrote this before you edited your post :)

Post by The Phoenix »

astions wrote: Well, say what you will, but if they get the email, it's valid.

EDIT: Never mind, I misread what you wrote.
Sadly, not true. There are mail servers (and mail clients!) that ignore the RFC and will deliver mail to addresses that a pure RFC regex would reject as invalid.

So no, receiving the email doesn't mean it's an RFC-valid address. It *does* mean it's a real/actual email address, which in many ways is a more worthwhile test.

But they are separate tests. There is a value and purpose for each.

Post by Benjamin »

The Phoenix wrote: Sadly, not true. There are mail servers (and mail clients!) that ignore the RFC and will deliver mail to addresses that a pure RFC regex would reject as invalid.

So no, receiving the email doesn't mean it's an RFC-valid address. It *does* mean it's a real/actual email address, which in many ways is a more worthwhile test.

But they are separate tests. There is a value and purpose for each.
I was thinking about this earlier tonight, and I came to the conclusion that the RFC is due for a good revision anyway. If it weren't such a PITA to validate, these issues... wouldn't be issues.

Who needs comments in an email address anyway?

Post by The Phoenix »

astions wrote: I was thinking about this earlier tonight, and I came to the conclusion that the RFC is due for a good revision anyway. If it weren't such a PITA to validate, these issues... wouldn't be issues.

Who needs comments in an email address anyway?
I won't disagree there. However, this is one of those interesting chicken/egg problems.

You need the RFC because email is popular, and widespread. But if you simplify the RFC (to make it easier to validate), you also break hundreds (thousands? millions?) of installed servers, clients, and other pieces of software - making 'email' less functional worldwide overnight (in theory).

The 'good enough' answer is to live with the complex RFC, and find a pretty solid regex that comes darn close while preventing many false negatives. That, coupled with sending an activation email? 95% of the world's email validation needs would be met, I suspect. (Especially since that seems to be the case across the board until now!)
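
One way to read the "good enough" approach is a deliberately permissive pre-check that only rejects input that cannot be an address at all, leaving the real proof to the activation email. The sketch below is an assumption about what such a pre-check might contain, not a recommendation from this thread. The function name is hypothetical, and the 64 and 255 octet limits are the local-part and domain size limits from RFC 5321.

```php
<?php
// Permissive pre-check: aims to avoid false negatives, so it only rejects
// input that is structurally hopeless. It does not prove RFC compliance.
function passesPermissivePrecheck(string $email): bool
{
    // Split on the LAST "@", since a quoted local part may contain one.
    $at = strrpos($email, '@');
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return false; // no "@", empty local part, or empty domain
    }

    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Size limits from RFC 5321, plus a ban on line breaks.
    return strlen($local) <= 64
        && strlen($domain) <= 255
        && !preg_match('/[\r\n]/', $email);
}

var_dump(passesPermissivePrecheck('me@me'));
var_dump(passesPermissivePrecheck('me@'));
```

Anything that passes would then go through the activation step, which settles whether the address actually receives mail.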
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

The Phoenix wrote: As a follow-up to this post, it turns out that the function at the top of the thread is *not* RFC-compliant.

Relatively soon, I will be posting a Google Code project with a list of tests to help clarify when a function is not RFC-compliant.

Of roughly a dozen email regexes, the one at the top of the thread was the second best by the criteria I used, so it is still relatively strong. I'll also be posting all the regex variations I found. I will update the forums when I've done so, and maybe we can collectively hunt for better tests and better regexes/functions to solve the problem.

It seems to be a difficult problem to solve well.
Very nice, Phoenix, I'll be waiting for your tests. Will you be recommending a code snippet then, or maybe even *fixing* one so it is fully(*) compliant?

(*) for a realistic value of "fully", of course ;)

Post by The Phoenix »

Mordred wrote: Very nice, Phoenix, I'll be waiting for your tests. Will you be recommending a code snippet then, or maybe even *fixing* one so it is fully(*) compliant?

(*) for a realistic value of "fully", of course ;)
I'll have all the functions/regexes I've found and worked with included, so you'll be able to pick the one that suits best. I haven't found a solution that is fully compliant yet.