Email matching problem

shiznatix
DevNet Master
Posts: 2745
Joined: Tue Dec 28, 2004 5:57 pm
Location: Tallinn, Estonia

Post by shiznatix »

I have been using the same email validation regex forever and haven't had any problems... until now. The regex I have been using is this:

/^[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}$/i
but for whatever reason it won't match scott_t_@hotmail.com

My regex abilities are very limited so I was wondering if someone could give me a bit of help or offer a better regex to check against.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Email matching problem

Post by VladSun »

CI uses this in its Validation class:

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
EDIT1: Although, I can see now they do not validate the domain name right (it can start with "-").
So ...

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]*\.)+[a-z]{2,6}$/ix"
EDIT2: the same applies to email name:

"/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
There are 10 types of people in this world, those who understand binary and those who don't

Post by shiznatix »

Using your edit2 I do this:

 
if (preg_match('/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix', '-scott_t_@-hotmail.com'))
{
    die('good');
}
else
{
    die('bad');
}
 
and it dies with "bad" but isn't that a valid email?

Post by VladSun »

I think it's not - that's why I did the "EDIT2" :)

See http://en.wikipedia.org/wiki/E-mail_address (I don't have time to read it right now).
If that turns out to be wrong, use EDIT1 :)
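
To see where the three variants differ, here is a small sketch that runs each of them against a few sample addresses. The patterns are copied verbatim from the posts above; the helper name `checkPatterns`, the array labels and the sample addresses are made up for this illustration.

```php
<?php
// Patterns copied verbatim from the thread; the array keys are labels chosen for this sketch.
$patterns = [
    'CI'    => '/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix',
    'EDIT1' => '/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]*\.)+[a-z]{2,6}$/ix',
    'EDIT2' => '/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix',
];

// Hypothetical helper: maps each label to true/false for one address.
function checkPatterns(array $patterns, string $address): array
{
    $result = [];
    foreach ($patterns as $label => $pattern) {
        $result[$label] = preg_match($pattern, $address) === 1;
    }
    return $result;
}

$samples = ['scott_t_@hotmail.com', '-scott_t_@hotmail.com', '-scott_t_@-hotmail.com', 'a@b.com'];
foreach ($samples as $address) {
    echo $address, ': ', json_encode(checkPatterns($patterns, $address)), "\n";
}
```

As far as I can tell, a leading hyphen in the local part is rejected only by EDIT2, while a leading hyphen in the domain is rejected by both EDIT1 and EDIT2. Hostname labels are not supposed to start with a hyphen, but I believe RFC 5322 does allow one at the start of the local part, so EDIT1 may be the closer fit. Also note that EDIT2 uses `+` where EDIT1 uses `*` in the domain group, so it seems to require at least two characters per domain label and rejects `a@b.com`.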

Post by shiznatix »

VladSun wrote:
Although, I can see now they do not validate the domain name right (it can start with "-").
Did you mean "can't"?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

shiznatix wrote:I have been using the same email validation regex forever and haven't had any problems... until now. The regex I have been using is this:

/^[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}$/i
but for whatever reason it won't match scott_t_@hotmail.com

My regex abilities are very limited so I was wondering if someone could give me a bit of help or offer a better regex to check against.
There are a couple of things wrong with the regex:
- you are escaping meta characters with a \\ while a single \ should be used (in the pattern itself, \\ matches a literal backslash);
- inside character classes (the stuff inside the square brackets) only the -, the ^ (not even always!), the backslash and of course the brackets themselves are meta characters, so there is no need to escape the . (dot).
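
Whether the doubled backslash actually causes trouble depends on where the pattern is written. Assuming it sits in a PHP string literal (the thread does not show the surrounding code), PHP itself collapses `\\` to a single `\` before PCRE ever sees it, so both spellings end up as the same pattern. The sketch below checks that, and also that the dot inside a character class is literal with or without the escape. The helper `classMatches` is a hypothetical name used only here.

```php
<?php
// In a single-quoted PHP string, \\ is an escape for one backslash,
// while \. is left untouched, so both literals hold the same two characters.
$doubled = '\\.';
$single  = '\.';
var_dump($doubled === $single);
var_dump(strlen($doubled));

// Hypothetical helper: does a one-character class match exactly this character?
function classMatches(string $class, string $char): bool
{
    return preg_match('/^' . $class . '$/', $char) === 1;
}

// Inside a character class the dot is already literal, escaped or not.
var_dump(classMatches('[_.-]', '.'));
var_dump(classMatches('[_\.-]', '.'));
var_dump(classMatches('[_.-]', 'x'));
```

So the single backslash is the cleaner way to write it, but under that assumption the doubled form is unlikely to be what broke the match.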

That said, this is your adjusted regex:

'/^[a-z0-9]+([_.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i'
Now, the reason your regex fails is that you defined the characters before the '@' as [a-z0-9]+ followed by ([_.-][a-z0-9]+)*, so every separator must be followed by at least one letter or digit. Since the e-mail address scott_t_@hotmail.com has an underscore directly before the '@', it fails to match (the local part can only end with [a-z0-9]).

Here's how your original regex can be rewritten so that it matches the address scott_t_@hotmail.com:

'/^[a-z0-9]+([-_.a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i'
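
A quick way to confirm the explanation is to run the adjusted and the rewritten pattern side by side. The first two patterns below are copied from this post; `$simplified` and the helper `accepts` are my own additions for illustration, not something from the thread.

```php
<?php
// $adjusted and $rewritten are taken from the post above.
$adjusted  = '/^[a-z0-9]+([_.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i';
$rewritten = '/^[a-z0-9]+([-_.a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i';
// Assumption of this sketch: the same local part written without a repeated group.
$simplified = '/^[a-z0-9][-_.a-z0-9]*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i';

// Hypothetical helper that turns preg_match's result into a boolean.
function accepts(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}

foreach (['scott_t@hotmail.com', 'scott_t_@hotmail.com'] as $address) {
    printf(
        "%s adjusted=%s rewritten=%s simplified=%s\n",
        $address,
        var_export(accepts($adjusted, $address), true),
        var_export(accepts($rewritten, $address), true),
        var_export(accepts($simplified, $address), true)
    );
}
```

The simplified form should accept the same local parts as the rewritten one, since both boil down to "a letter or digit, then any run of letters, digits, `-`, `_` or `.`", and it avoids repeating a group that already contains a `+`.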

Post by prometheuzz »

VladSun wrote:CI uses this in its Validation class:

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.

Example:

[-a-c]  // matches '-', 'a', 'b' or 'c'
[ABC-]  // matches 'A', 'B', 'C' or '-'
[^a]    // matches anything except 'a'
[a^+]   // matches 'a', '^' or '+'
So, the - has no special meaning at the start or at the end of a character class.
And the ^ only has a special meaning at the start of a character class.
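
The four classes above can be tried out directly in PHP. This is only a sketch: `matchedBy` is a hypothetical helper and the candidate string is arbitrary.

```php
<?php
// Hypothetical helper: returns the characters of $candidates that the class matches.
function matchedBy(string $class, string $candidates): string
{
    $hits = '';
    foreach (str_split($candidates) as $char) {
        // Anchor the class so that it has to match the single character exactly.
        if (preg_match('/^' . $class . '$/', $char) === 1) {
            $hits .= $char;
        }
    }
    return $hits;
}

$candidates = '-abcdABC^+';
foreach (['[-a-c]', '[ABC-]', '[^a]', '[a^+]'] as $class) {
    echo $class, ' => ', matchedBy($class, $candidates), "\n";
}
```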

Post by VladSun »

shiznatix wrote:
Although, I can see now they do not validate the domain name right (it can start with "-").
did you mean "can't"?
Oops...
Yes, I meant "can't"

Post by VladSun »

prometheuzz wrote:
VladSun wrote:CI uses this in its Validation class:

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.
That's true for sure :)
You could report it to them at http://codeigniter.com ;)

Post by prometheuzz »

VladSun wrote:
prometheuzz wrote:
VladSun wrote:CI uses this in its Validation class:

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.
That's true for sure :)
You could report it to them at http://codeigniter.com ;)
Well, I've never worked with that framework, so I wouldn't know what to say to the developers exactly. I mean, I don't know where they've written that regex.
But, since you are familiar with it, feel free to drop them a line if you like.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Post by GeertDD »

There is more "wrong" with that CodeIgniter regex. Remember that when a regex cannot reach a full match, the engine keeps backtracking until every possible combination has been tried. This can take a long time when a quantified group is itself repeated, like the (...+)* pattern before the at sign (@). This is also known as exponential matching. Prevent it by using possessive quantifiers.
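
For anyone who has not met possessive quantifiers yet, here is a minimal sketch of the difference, using a toy pattern rather than the email regex. A greedy `+` hands characters back when the rest of the pattern fails; a possessive `++` keeps everything it matched.

```php
<?php
// Greedy: a+ first takes all three a's, then gives one back so the final "a" can match.
$greedy = preg_match('/^a+a$/', 'aaa');

// Possessive: a++ takes all three a's and never gives any back, so the final "a" has nothing left.
$possessive = preg_match('/^a++a$/', 'aaa');

var_dump($greedy, $possessive);
```

The flip side is that a possessive quantifier saves the engine from retrying shorter matches that cannot succeed anyway, which is the point being made for the part before the "@".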

Post by shiznatix »

Ok so all together now, what would be the one regex for this to rule them all? I don't mind if some non-ok emails get through so it can be a bit loose.

Post by prometheuzz »

shiznatix wrote:Ok so all together now, what would be the one regex for this to rule them all? I don't mind if some non-ok emails get through so it can be a bit loose.
A loose match:

#!/usr/bin/php
<?php
// http://www.regular-expressions.info/email.html
$address = 'scott_t_@hotmail.com';
if(preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i', $address)) {
    print "$address is valid\n";
} else {
    print "$address is not valid\n";
}
?>
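
To get a feel for how loose this pattern is, the sketch below wraps it in a function and tries a few addresses. The pattern is the one from the script above; `LOOSE_PATTERN`, `looksLikeEmail` and the sample addresses are names and data invented for this example.

```php
<?php
// Pattern copied from the post above; the constant name is an assumption of this sketch.
const LOOSE_PATTERN = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i';

// Hypothetical wrapper returning a plain boolean.
function looksLikeEmail(string $address): bool
{
    return preg_match(LOOSE_PATTERN, $address) === 1;
}

$samples = ['scott_t_@hotmail.com', '.dot@example.com', 'user@example.museum', 'no.at.sign'];
foreach ($samples as $address) {
    echo $address, ' => ', looksLikeEmail($address) ? 'accepted' : 'rejected', "\n";
}
```

Two things stand out: a local part starting with a dot gets through, and the `{2,4}` at the end rejects top-level domains longer than four letters such as `.museum`. If that matters, widening the upper bound is probably the simplest fix.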

Post by GeertDD »

prometheuzz wrote:

/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i
Why not make the first part possessive? The character class cannot match "@", so there is no point in backtracking all the way to the beginning to recheck for an "@".

For example, suppose you provide an invalid email string like "no.ampersand". What will happen? At first the character class will match as much as possible because + is greedy, so it matches the whole string: "no.ampersand". Then it looks for "@", which is not found. "Okay, let's start backtracking," the regex thinks, because + only requires one character. So it will match "no.ampersan" and look for "@" again. Of course, there won't ever be an "@". It continues to match "no.ampersa", "no.ampers", ..., "no", "n". All of this is useless work, which you prevent by making the first character class possessive.

Note that you cannot make the second character class possessive. It would consume the whole domain, including the final dot and TLD, leaving nothing for \.[a-z]{2,4} to match, so the regex would always fail.

/^[a-z0-9._%+-]++@[a-z0-9.-]+\.[a-z]{2,4}$/i
The problem is that now the first part of the email can start and/or end with a dot. One way to prevent this is using lookaround.

/^(?!\.)[a-z0-9._%+-]++(?<!\.)@[a-z0-9.-]+\.[a-z]{2,4}$/i
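
The following sketch checks the claims in this post against a handful of addresses. The first two patterns are copied from above; `$bothPossessive` is a deliberately broken variant I built to show why the second class must stay greedy, and `ok` is a hypothetical helper.

```php
<?php
// Copied from the post above.
$possessive     = '/^[a-z0-9._%+-]++@[a-z0-9.-]+\.[a-z]{2,4}$/i';
$withLookaround = '/^(?!\.)[a-z0-9._%+-]++(?<!\.)@[a-z0-9.-]+\.[a-z]{2,4}$/i';
// Built for this sketch only: the domain class is possessive too, which should never match.
$bothPossessive = '/^[a-z0-9._%+-]++@[a-z0-9.-]++\.[a-z]{2,4}$/i';

// Hypothetical helper that turns preg_match's result into a boolean.
function ok(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}

foreach (['scott_t_@hotmail.com', '.scott@hotmail.com', 'scott.@hotmail.com'] as $address) {
    printf(
        "%s possessive=%s lookaround=%s bothPossessive=%s\n",
        $address,
        var_export(ok($possessive, $address), true),
        var_export(ok($withLookaround, $address), true),
        var_export(ok($bothPossessive, $address), true)
    );
}
```

The lookahead `(?!\.)` blocks a leading dot and the lookbehind `(?<!\.)` blocks a dot directly before the "@", while dots in the middle of the local part are still allowed.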

Post by prometheuzz »

GeertDD wrote:
prometheuzz wrote:

/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i
Why not make the first part possessive? The character class cannot match "@", so there is no point in backtracking all the way to the beginning to recheck for an "@".
Two reasons:
1 - For clarity. The OP does not seem very familiar with regex and asked for a loose pattern. And since you already mentioned possessive quantifiers in a previous reply, the OP can Google them and find out what they do to his regex.
2 - Since validating e-mail addresses will not be done on large strings*, the time wasted on backtracking when an invalid address is entered will be next to nothing.

* The text field where the user enters his/her e-mail address should be restricted: a single line with a fixed maximum length.
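
Since a restriction on the form field can be bypassed, the same limits could also be checked on the server before the regex runs. This is only a sketch under my own assumptions: `MAX_ADDRESS_LENGTH` and `isAcceptableInput` are hypothetical names, and 254 is a commonly cited upper bound for an address, not a value from this thread.

```php
<?php
// Assumption of this sketch: use whatever limit your form field actually enforces.
const MAX_ADDRESS_LENGTH = 254;

// Hypothetical helper: length and single-line checks first, then the loose pattern.
function isAcceptableInput(string $input): bool
{
    // Reject over-long or multi-line input before the regex runs at all.
    if (strlen($input) > MAX_ADDRESS_LENGTH || preg_match('/[\r\n]/', $input) === 1) {
        return false;
    }
    return preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i', $input) === 1;
}

var_dump(isAcceptableInput('scott_t_@hotmail.com'));
var_dump(isAcceptableInput("scott_t_@hotmail.com\n"));
var_dump(isAcceptableInput(str_repeat('a', 300) . '@example.com'));
```

The newline check does a second job: in PCRE, `$` also matches just before a trailing newline unless the `D` modifier is set, so without it an address followed by a line break would still pass the loose pattern.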