Email matching problem

Posted: Fri May 16, 2008 2:49 am
by shiznatix
I have been using the same email validation regex forever and haven't had any problems... until now. The regex I have been using is this:

Code: Select all

/^[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}$/i
but for whatever reason it won't match scott_t_@hotmail.com

My regex abilities are very limited so I was wondering if someone could give me a bit of help or offer a better regex to check against.

Re: Email matching problem

Posted: Fri May 16, 2008 2:53 am
by VladSun
CI uses this in its Validation class:

Code: Select all

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
EDIT1: Although, I can see now they do not validate the domain name right (it can start with "-").
So ...

Code: Select all

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]*\.)+[a-z]{2,6}$/ix"
EDIT2: the same applies to email name:

Code: Select all

"/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix"

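To see how the three variants above differ, here is a small sketch that runs each pattern against a few sample addresses. The patterns are copied from the post; the sample addresses other than scott_t_@hotmail.com and the variable names are made up for illustration.

```php
<?php
// The three patterns from the post above, copied verbatim.
$patterns = [
    'CI'    => '/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix',
    'EDIT1' => '/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]*\.)+[a-z]{2,6}$/ix',
    'EDIT2' => '/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix',
];

// Hypothetical sample addresses, each chosen to exercise one rule.
$samples = [
    'scott_t_@hotmail.com',  // underscore right before the @
    'scott@-hotmail.com',    // domain label starts with "-"
    '-scott@hotmail.com',    // local part starts with "-"
    'scott@a.com',           // one-character domain label
];

$results = [];
foreach ($patterns as $name => $pattern) {
    foreach ($samples as $address) {
        // preg_match() returns 1 on a match and 0 on no match.
        $results[$name][$address] = preg_match($pattern, $address);
        printf("%-5s %-22s %s\n", $name, $address,
            $results[$name][$address] ? 'match' : 'no match');
    }
}
```

Note that EDIT2 also rejects scott@a.com: its domain part uses [a-z0-9\-]+ where EDIT1 uses [a-z0-9\-]*, so every domain label needs at least two characters. That may or may not be intended.
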
Re: Email matching problem

Posted: Fri May 16, 2008 3:26 am
by shiznatix
Using your edit2 I do this:

Code: Select all

 
if (preg_match('/^([a-z0-9]+[a-z0-9\+_\-]*)(\.[a-z0-9\+_\-]+)*@([a-z0-9]+[a-z0-9\-]+\.)+[a-z]{2,6}$/ix', '-scott_t_@-hotmail.com'))
{
    die('good');
}
else
{
    die('bad');
}
 
and it dies with "bad" but isn't that a valid email?

Re: Email matching problem

Posted: Fri May 16, 2008 3:35 am
by VladSun
I think it's not - that's why I did the "EDIT2" :)

See http://en.wikipedia.org/wiki/E-mail_address - I don't have time to read it myself.
If that is not so, use EDIT1 :)

Re: Email matching problem

Posted: Fri May 16, 2008 3:38 am
by shiznatix
Although, I can see now they do not validate the domain name right (it can start with "-").
did you mean "can't"?

Re: Email matching problem

Posted: Fri May 16, 2008 4:53 am
by prometheuzz
shiznatix wrote:I have been using the same email validation regex forever and haven't had any problems... until now. The regex I have been using is this:

Code: Select all

/^[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}$/i
but for whatever reason it won't match scott_t_@hotmail.com

My regex abilities are very limited so I was wondering if someone could give me a bit of help or offer a better regex to check against.
There are a couple of things wrong with the regex:
- you are escaping meta characters with \\ where a single \ should be used;
- inside character classes (the stuff inside the square brackets) only the -, the ^ (and not even always!), the backslash and of course the closing bracket are meta characters, so there is no need to escape the . (dot).
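
A caveat on the first point, assuming the pattern is written as a PHP string literal (the original post does not show the surrounding quotes): inside a quoted PHP string, \\ is PHP's own escape for a single backslash, so the regex engine may end up seeing the same pattern either way. A quick check, with variable names chosen only for illustration:

```php
<?php
// In a PHP string literal, \\ collapses to one backslash, while \. is
// left untouched because it is not a PHP escape sequence.
$doubled = '[_\\.-]';
$single  = '[_\.-]';

var_dump($doubled === $single);  // are the stored strings identical?
var_dump(strlen($doubled));      // number of characters actually stored

// Both spellings hand the same pattern to preg_match().
$sameResult = preg_match('/^a\\.b$/', 'a.b') === preg_match('/^a\.b$/', 'a.b');

// The dot is still escaped: it does not match an arbitrary character.
$literalDotOnly = preg_match('/^a\\.b$/', 'axb');
```

If that assumption holds, the doubled backslash is harmless, although the single form is easier to read.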

That said, this is your adjusted regex:

Code: Select all

'/^[a-z0-9]+([_.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i'
Now, the reason your regex fails is that the characters before the '@' must match [a-z0-9]+([_.-][a-z0-9]+)*, so every _, . or - has to be followed by at least one character from [a-z0-9]. Since the e-mail address scott_t_@hotmail.com has an underscore directly before the '@', it fails to match (the part before the '@' can only end in [a-z0-9]).

Here's how your original regex can be rewritten so that it matches the address scott_t_@hotmail.com:

Code: Select all

'/^[a-z0-9]+([-_.a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i'
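
The difference between the two patterns can be checked with a short sketch. Both patterns are the ones from this post; the second address (without the trailing underscore) is a made-up variant added for comparison.

```php
<?php
// Adjusted original: every _, . or - must be followed by [a-z0-9]+.
$original  = '/^[a-z0-9]+([_.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i';
// Rewritten: _, . and - may appear anywhere after the first character.
$rewritten = '/^[a-z0-9]+([-_.a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}$/i';

$trailingUnderscore = 'scott_t_@hotmail.com';
$noTrailingUnderscore = 'scott_t@hotmail.com';  // hypothetical variant

// preg_match() returns 1 on a match and 0 on no match.
$originalOnTrailing  = preg_match($original, $trailingUnderscore);
$originalOnPlain     = preg_match($original, $noTrailingUnderscore);
$rewrittenOnTrailing = preg_match($rewritten, $trailingUnderscore);

var_dump($originalOnTrailing, $originalOnPlain, $rewrittenOnTrailing);
```

Only the underscore directly before the '@' trips up the original pattern; the underscore in the middle of the name is fine.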

Re: Email matching problem

Posted: Fri May 16, 2008 5:02 am
by prometheuzz
VladSun wrote:CI uses this in its Validation class:

Code: Select all

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.

Example:

Code: Select all

[-a-c]  // matches '-', 'a', 'b' or 'c'
[ABC-]  // matches 'A', 'B', 'C' or '-'
[^a]    // matches anything except 'a'
[a^+]   // matches 'a', '^' or '+'
So, the - has no special meaning at the start or at the end of a character class.
And the ^ only has a special meaning at the start of a character class.
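
These rules can be verified with a small table-driven sketch. The last two rows compare the escaped CodeIgniter character class with an unescaped spelling of it; the test subjects are made up for illustration.

```php
<?php
// Each row: pattern, subject, expected preg_match() result.
$cases = [
    ['/^[-a-c]$/', '-', 1],  // leading - is a literal
    ['/^[-a-c]$/', 'b', 1],  // a-c is still a range
    ['/^[-a-c]$/', 'd', 0],
    ['/^[ABC-]$/', '-', 1],  // trailing - is a literal
    ['/^[^a]$/',   'a', 0],  // leading ^ negates the class
    ['/^[^a]$/',   'b', 1],
    ['/^[a^+]$/',  '^', 1],  // ^ elsewhere is a literal
    ['/^[a^+]$/',  '+', 1],  // + is never special inside a class
    // Escaped and unescaped spellings accept the same characters.
    ['/^[a-z0-9\+_\-]+$/', 'a+b_c-d', 1],
    ['/^[a-z0-9+_-]+$/',   'a+b_c-d', 1],
];

$failures = [];
foreach ($cases as [$pattern, $subject, $expected]) {
    if (preg_match($pattern, $subject) !== $expected) {
        $failures[] = "$pattern on '$subject'";
    }
}

echo $failures ? 'Unexpected: ' . implode(', ', $failures) : 'All cases behave as described', "\n";
```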

Re: Email matching problem

Posted: Fri May 16, 2008 6:22 am
by VladSun
shiznatix wrote:
Although, I can see now they do not validate the domain name right (it can start with "-").
did you mean "can't"?
Oops...
Yes, I meant "can't"

Re: Email matching problem

Posted: Fri May 16, 2008 6:27 am
by VladSun
prometheuzz wrote:
VladSun wrote:CI uses this in its Validation class:

Code: Select all

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.
It's so for sure :)
You could report it at: http://codeigniter.com ;)

Re: Email matching problem

Posted: Fri May 16, 2008 6:45 am
by prometheuzz
VladSun wrote:
prometheuzz wrote:
VladSun wrote:CI uses this in its Validation class:

Code: Select all

"/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix"
...
There is no need to escape the + and in this case the - inside the character class.
It's so for sure :)
You could report it at: http://codeigniter.com ;)
Well, I've never worked with that framework, so I wouldn't know what to say to the developers exactly. I mean, I don't know where they've written that regex.
But, since you are familiar with it, feel free to drop them a line if you like.

Re: Email matching problem

Posted: Fri May 16, 2008 12:56 pm
by GeertDD
There is more "wrong" with that CodeIgniter regex. Remember that as long as the regex does not reach a full match, it keeps backtracking until every possible combination has been tried. This can take a long time because of the (.+)* style pattern before the at sign (@). This is also known as exponential matching, or catastrophic backtracking. Prevent it by using possessive quantifiers.
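
A minimal sketch of what that could look like for the CodeIgniter pattern, assuming a PCRE build that supports possessive quantifiers (++ and *+). This rewrite is only an illustration, not something taken from CodeIgniter, and the test addresses are made up.

```php
<?php
// Local part made possessive: once these characters are consumed,
// the engine never gives them back to try other combinations.
// Safe here because neither "@" nor "." is in the character class.
$possessive = '/^[a-z0-9+_-]++(?:\.[a-z0-9+_-]++)*+@(?:[a-z0-9-]+\.)+[a-z]{2,6}$/i';

$valid     = preg_match($possessive, 'scott_t_@hotmail.com');
$noAtSign  = preg_match($possessive, 'scott_t_.hotmail.com');  // fails without retrying shorter matches
$doubleDot = preg_match($possessive, 'scott..t@hotmail.com');  // empty segment between dots

var_dump($valid, $noAtSign, $doubleDot);
```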

Re: Email matching problem

Posted: Fri May 16, 2008 3:35 pm
by shiznatix
Ok so all together now, what would be the one regex for this to rule them all? I don't mind if some non-ok emails get through so it can be a bit loose.

Re: Email matching problem

Posted: Fri May 16, 2008 3:53 pm
by prometheuzz
shiznatix wrote:Ok so all together now, what would be the one regex for this to rule them all? I don't mind if some non-ok emails get through so it can be a bit loose.
A loose match:

Code: Select all

#!/usr/bin/php
<?php
// http://www.regular-expressions.info/email.html
$address = 'scott_t_@hotmail.com';
if(preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i', $address)) {
    print "$address is valid\n";
} else {
    print "$address is not valid\n";
}
?>
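
To get a feel for how loose this pattern is, it can be wrapped in a small helper and run against a few inputs. The function name is_loosely_valid_email and all sample addresses except the first are hypothetical.

```php
<?php
// Hypothetical helper wrapping the loose pattern from the post above.
function is_loosely_valid_email(string $address): bool
{
    return preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i', $address) === 1;
}

$samples = [
    'scott_t_@hotmail.com',    // accepted
    'no-at-sign.example.com',  // rejected: no @
    '.scott@hotmail.com',      // accepted: a leading dot is not checked
    'user@example.museum',     // rejected: TLD longer than four letters
];

foreach ($samples as $address) {
    printf("%-24s %s\n", $address, is_loosely_valid_email($address) ? 'valid' : 'not valid');
}
```

If longer top-level domains need to pass, the {2,4} bound would have to be widened.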

Re: Email matching problem

Posted: Sat May 17, 2008 1:30 am
by GeertDD
prometheuzz wrote:

Code: Select all

/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i
Why not make the first part possessive? The character class cannot match "@", so there is no point in backtracking all the way to the beginning to recheck for an "@".

For example, suppose you provide an invalid email string like "no.ampersand", which contains no "@". What will happen? At first the character class will match as much as possible because + is greedy, so it matches the whole string: "no.ampersand". Then it looks for "@", which is not found. Okay, let's start backtracking, the regex thinks, because + requires only one character. So it will match "no.ampersan" and look for "@" again. Of course, there won't ever be an "@". It continues to match "no.ampersa", "no.ampers", ..., "no", "n". All of this is useless work, which you prevent by making the first character class possessive.

Note that you cannot make the second character class possessive. It would consume the whole domain part, including the final dot and TLD, and the match would then fail immediately.

Code: Select all

/^[a-z0-9._%+-]++@[a-z0-9.-]+\.[a-z]{2,4}$/i
The problem is that the first part of the email can still start and/or end with a dot. One way to prevent this is to use lookaround.

Code: Select all

/^(?!\.)[a-z0-9._%+-]++(?<!\.)@[a-z0-9.-]+\.[a-z]{2,4}$/i
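
The points above can be checked with a short sketch. The first and third patterns are the ones from this post; the second one, with both classes possessive, is included only to show why that variant does not work. The dotted test addresses are made up.

```php
<?php
$firstPossessive = '/^[a-z0-9._%+-]++@[a-z0-9.-]+\.[a-z]{2,4}$/i';
// Illustration only: the second ++ swallows "hotmail.com" and never
// gives the final ".com" back, so \. can no longer match.
$bothPossessive  = '/^[a-z0-9._%+-]++@[a-z0-9.-]++\.[a-z]{2,4}$/i';
// (?!\.) rejects a leading dot, (?<!\.) rejects a dot right before the @.
$withLookaround  = '/^(?!\.)[a-z0-9._%+-]++(?<!\.)@[a-z0-9.-]+\.[a-z]{2,4}$/i';

$address = 'scott_t_@hotmail.com';

$first        = preg_match($firstPossessive, $address);
$both         = preg_match($bothPossessive, $address);
$lookaround   = preg_match($withLookaround, $address);
$leadingDot   = preg_match($withLookaround, '.scott@hotmail.com');
$trailingDot  = preg_match($withLookaround, 'scott.@hotmail.com');
$looseLeading = preg_match($firstPossessive, '.scott@hotmail.com');

var_dump($first, $both, $lookaround, $leadingDot, $trailingDot, $looseLeading);
```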

Re: Email matching problem

Posted: Sat May 17, 2008 2:22 am
by prometheuzz
GeertDD wrote:
prometheuzz wrote:

Code: Select all

/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$/i
Why not make the first part possessive? The character class cannot match "@", so there is no point in backtracking all the way to the beginning to recheck for an "@".
Two reasons:
1 - For clarity. The OP does not seem very familiar with regex and asked for a loose pattern. And since you already mentioned possessive quantifiers in a previous reply, the OP can Google them and find out what they do to his regex.
2 - Since e-mail validation will not be done on large strings*, the time wasted on backtracking when an invalid address is entered will be next to nothing.

* The text field where the user enters his/her e-mail address should be restricted: no multi-line input, and a fixed maximum length.