pregex matching for valid email

Posted: Fri Aug 08, 2003 4:15 am
by Heavy
I wrote a regex matching command to check for valid email addresses.
It works fine, but I'm not sure whether what I think is a valid email address is correct:

Code: Select all

<?php
$boolValid = preg_match('/(?i)^([a-z0-9.\-_]+@[a-z0-9.\-_]+)$/',$email);
?>
This matches any email address containing only (case-insensitive) alphanumerics, dots, hyphens and underscores. Are there more characters I should allow in this pattern?

Posted: Fri Aug 08, 2003 4:18 am
by Tubbietoeter
have a look here:
http://de3.php.net/manual/en/function.preg-match.php

look at the user comments

Posted: Fri Aug 08, 2003 8:17 am
by Heavy
I made some modifications to the example shown in the comments.

Here is the result:

Code: Select all

<?php
function valid_email_syntax($email){
return(preg_match(	'/(?i)'.
					'^(([a-z0-9\-_]\.?)+[a-z0-9\-_]@'.
					'('.
						'('.
							'(([a-z0-9\-_])+\.)*(ad|ae|aero|af|ag|ai|al|am|an|ao|aq|ar|arpa|as|at|au|aw|az|'.
							'ba|bb|bd|be|bf|bg|bh|bi|biz|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|'.
							'ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|com|coop|cr|cs|cu|cv|cx|cy|cz|'.
							'de|dj|dk|dm|do|dz|ec|edu|ee|eg|eh|er|es|et|eu|'.
							'fi|fj|fk|fm|fo|fr|'.
							'ga|gb|gd|ge|gf|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|'.
							'hk|hm|hn|hr|ht|hu|id|ie|il|in|info|int|io|iq|ir|is|it|'.
							'jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|'.
							'la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mil|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|museum|mv|mw|mx|my|mz|'.
							'na|name|nc|ne|net|nf|ng|ni|nl|no|np|nr|nt|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|pro|ps|pt|pw|py|qa|re|ro|ru|rw|'.
							'sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw'.
							')$'.
						')'.
					// the alternation sits inside the host group, so an IP address still needs a local part and "@"
					'|'.
					'(([0-9][0-9]?|[0-1][0-9][0-9]|[2][0-4][0-9]|[2][5][0-5])\.){3}([0-9][0-9]?|[0-1][0-9][0-9]|[2][0-4][0-9]|[2][5][0-5])))$/i',$email));
}

?>

Posted: Fri Aug 08, 2003 10:16 am
by m3rajk
i don't feel like searching for it. i posted a variation of one in a perl book. off the top of my head i remember the following: there's an rfc that has a list of all possible ending domains.

and this is what i used, formulated based on the fact i wanna weed out obvious bad e-mails... it only allows A-Za-z0-9_ (\w), dot (\.) and hyphen in the address, which has to end with 2 or 3 letters

Code: Select all

preg_match('/^[\w\.\-]+@[\w\.\-]+\.\w\w\w?$/', $email)
btw: i do allow one thing you don't: CAPITAL LETTERS

and for future use, these will be really helpful:

PERL SHORTS
\w = [A-Za-z0-9_]
\s = ALL WHITE SPACE
\d = [0-9]

the same thing but with capital letters will give you the opposite of those. case DOES matter with perl shorts
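
A minimal sketch of the shorthands listed above, assuming PCRE's default behaviour (no UTF-8 mode, default locale tables). The helper name `shorthand_matches` is made up for illustration.

```php
<?php
// Hypothetical helper: does a single character match the given shorthand class?
function shorthand_matches(string $class, string $char): bool
{
    // Anchor on both sides so exactly one character is tested.
    return preg_match('/^' . $class . '$/', $char) === 1;
}

var_dump(shorthand_matches('\w', '_'));  // true:  underscore counts as a word character
var_dump(shorthand_matches('\w', '-'));  // false: hyphen does not
var_dump(shorthand_matches('\W', '-'));  // true:  capital letter gives the opposite class
var_dump(shorthand_matches('\d', '7'));  // true
var_dump(shorthand_matches('\D', '7'));  // false
var_dump(shorthand_matches('\s', ' '));  // true
var_dump(shorthand_matches('\S', ' '));  // false
```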

Posted: Fri Aug 08, 2003 1:53 pm
by Heavy
In humble response to m3rajk:

The "(?i)" part of my pattern makes it case insensitive...

I wanted to keep all Scandinavian characters from being matched. That's why I didn't use \w or [:alpha:] or other locale-dependent things. However, I don't know how locale dependent \w is.

I started out doing regex one month ago, so I am not yet very guru with it. But I have tested my pattern thoroughly and it pleases me for the moment. I have not yet found any bug in it.
Tell me if you do...

My pattern disallows email names that start or end with a dot and that applies to the domain name as well. I think it is good enough for me, but I agree I don't HAVE to test for existing top domain names. As well, testing for existing top domain names makes me need to be updated on new top domain names when they arrive. I might remove it, but it works right now...
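
A small sketch of the "(?i)" point above, assuming the default locale tables and no UTF-8 mode. `is_ascii_word` is a made-up name. It suggests the inline option behaves like the trailing `i` modifier, while `[a-z]` still rejects characters such as å.

```php
<?php
// Hypothetical helper: letters a-z only, case-insensitive via the inline option.
function is_ascii_word(string $s): bool
{
    return preg_match('/(?i)^[a-z]+$/', $s) === 1;
}

var_dump(is_ascii_word('Heavy'));                    // true:  (?i) lets A-Z match as well
var_dump(preg_match('/^[a-z]+$/', 'Heavy') === 1);   // false: no option set, "H" is rejected
var_dump(preg_match('/^[a-z]+$/i', 'Heavy') === 1);  // true:  trailing modifier, same effect
var_dump(is_ascii_word('blåbær'));                   // false: å and æ are outside a-z
```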

Posted: Fri Aug 08, 2003 3:00 pm
by mikusan
While on topic could anyone help me with mine:
For some obscure reason i cannot figure out why it will not accept emails that are like abc@123.server.ca
I have added some trash but i still can't get it working...thanks for the help!!

Code: Select all

$str = ereg_replace( "^[0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*[0-9a-z]([-.]?[0-9a-z])*.[a-z]{3,4}$", "<a href=\"mailto:\\0\">\\0</a>", $str );

Posted: Fri Aug 08, 2003 3:01 pm
by mikusan
yes i feel foolish to say... i am trying to do exactly what...well PHPBB just did up top ;)

Posted: Fri Aug 08, 2003 3:03 pm
by Heavy
mikusan wrote:yes i feel foolish to say... i am trying to do exactly what...well PHPBB just did up top ;)
I don't understand that at all... What do you mean?

Posted: Fri Aug 08, 2003 3:37 pm
by Heavy
Doh! I didn't see your first post!

When I see this (though I haven't tested it myself):

Code: Select all

<?php
 "^[0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*[0-9a-z]([-.]?[0-9a-z])*.[a-z]{3,4}$"
?>
I think:
Shouldn't that dot be escaped?

Try:

Code: Select all

<?php
"^[0-9a-z]([-_\.]?[0-9a-z])*@[0-9a-z]([-\.]?[0-9a-z])*[0-9a-z]([-\.]?[0-9a-z])*\.?[a-z]{3,4}$"
?>
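
A minimal sketch to check the escaping question, assuming preg_match stands in for the ereg functions used in the post. `regex_matches` is a hypothetical helper. Outside a bracket expression an unescaped dot matches any character regardless of the quote style; inside brackets, as in [-_.], it is already literal.

```php
<?php
// Hypothetical helper wrapping preg_match into a boolean.
function regex_matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(regex_matches("/^a.c$/", 'abc'));    // true:  double quotes, "." matches "b"
var_dump(regex_matches('/^a.c$/', 'abc'));    // true:  single quotes, same result
var_dump(regex_matches('/^a\.c$/', 'abc'));   // false: escaped dot only matches a literal "."
var_dump(regex_matches('/^a\.c$/', 'a.c'));   // true
var_dump(regex_matches('/^a[.]c$/', 'abc'));  // false: inside [] the dot is literal anyway
```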

Posted: Fri Aug 08, 2003 4:44 pm
by mikusan
Nope dots are to be escaped only when you use single quotes, also my regex works fine with emails like me@123.com but not with me@123.whatever.co.uk... i would be happy with me@123.something.something....

Posted: Fri Aug 08, 2003 6:43 pm
by McGruff
I'm not too hot on regex but this tool might help with refining your expressions, by making it quicker to test ideas:

http://www.weitz.de/regex-coach/#install

Posted: Sat Aug 09, 2003 12:38 pm
by m3rajk
Heavy wrote:In humble response to m3rajk:

The "(?i)" part of my pattern makes it case insensitive...

I wanted to keep all Scandinavian characters from being matched. That's why I didn't use \w or [:alpha:] or other locale-dependent things. However, I don't know how locale dependent \w is.

I started out doing regex one month ago, so I am not yet very guru with it. But I have tested my pattern thoroughly and it pleases me for the moment. I have not yet found any bug in it.
Tell me if you do...

My pattern disallows email names that start or end with a dot and that applies to the domain name as well. I think it is good enough for me, but I agree I don't HAVE to test for existing top domain names. As well, testing for existing top domain names makes me need to be updated on new top domain names when they arrive. I might remove it, but it works right now...
[:alpha:] is purely posix. and \w is perl. perl is not locale dependent. the shorts pertain to ascii stretches. it will only get what i mentioned. and ? is not needed after the end delimiter, and i'm not sure that it'll be case insensitive. remember, perl and posix are completely different. what you use for posix may not work for perl and vice versa. for case insensitivity in perl, using | as a delimiter, with global search/replacement, it's |pattern|gi or |pattern|ig

mine does NEARLY the same as yours. it doesn't stop the name from starting with a . but it's legal to start an e-mail address with a . not common but legal, and you can prevent that and a hyphen by adding \w to mine to get

Code: Select all

preg_match('/^\w[\w\.\-]+@[\w\.\-]+\.\w\w\w?$/', $email)
giving you everything but the top level
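
A rough sketch comparing the earlier pattern with the \w-prefixed one above, copied as written. The constant names, the helper `email_matches` and the sample addresses are all made up for illustration.

```php
<?php
// The two patterns from the thread, unchanged.
const LOOSE  = '/^[\w\.\-]+@[\w\.\-]+\.\w\w\w?$/';
const STRICT = '/^\w[\w\.\-]+@[\w\.\-]+\.\w\w\w?$/';

// Hypothetical helper wrapping preg_match into a boolean.
function email_matches(string $pattern, string $email): bool
{
    return preg_match($pattern, $email) === 1;
}

var_dump(email_matches(LOOSE,  '.bob@example.com'));  // true:  leading dot allowed
var_dump(email_matches(STRICT, '.bob@example.com'));  // false: first character must be \w
var_dump(email_matches(STRICT, 'bob@example.com'));   // true
var_dump(email_matches(STRICT, 'a@example.com'));     // false: \w plus [..]+ needs two characters
var_dump(email_matches(LOOSE,  'bob@example.info'));  // false: ending is limited to 2 or 3 characters
var_dump(email_matches(LOOSE,  'bob@example.12'));    // true:  \w also accepts digits in the ending
```

Two side effects show up here: the stricter pattern needs at least two characters before the @, and because the ending uses \w rather than [a-z], digits and underscores pass as a "top level".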

mikusan: i'd suggest using perl for this. it's MUCH more elegant than posix.

just use my match check. it will allow that.
breaking down yours it's
any character 0-9 or a-z (not A-Z)
-_. optional but must have the same pattern as the first line 0 or more times
@
some weird thing based on the first half

and that's a replace... without capturing anything.

tell us what you're trying to do, until then the only good advice anyone can give you is to use my match

Posted: Sat Aug 09, 2003 3:49 pm
by nielsene
mikusan wrote:While on topic could anyone help me with mine:
For some obscure reason i cannot figure out why it will not accept emails that are like abc@123.server.ca
I have added some trash but i still can't get it working...thanks for the help!!

Code: Select all

$str = ereg_replace( "^[0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*[0-9a-z]([-.]?[0-9a-z])*.[a-z]{3,4}$", "<a href=\"mailto:\\0\">\\0</a>", $str );
Your {3,4} stops it from matching all the two character country codes. It only matches the org/com/mil/net/info/coop/edu type top levels.
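
A minimal sketch of that point, under two assumptions: the pattern is run through preg_match instead of ereg_replace, and the dot before the top level is escaped. `matches_with_tld` is a hypothetical helper that takes the length quantifier as a parameter.

```php
<?php
// Hypothetical helper: the pattern from the post with a configurable
// top-level length, e.g. '{3,4}' (as posted) or '{2,4}' (allows country codes).
function matches_with_tld(string $email, string $tldLength): bool
{
    $pattern = '/^[0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*'
             . '[0-9a-z]([-.]?[0-9a-z])*\.[a-z]' . $tldLength . '$/';
    return preg_match($pattern, $email) === 1;
}

var_dump(matches_with_tld('me@123.com', '{3,4}'));             // true
var_dump(matches_with_tld('abc@123.server.ca', '{3,4}'));      // false: "ca" is only two letters
var_dump(matches_with_tld('abc@123.server.ca', '{2,4}'));      // true
var_dump(matches_with_tld('me@123.whatever.co.uk', '{2,4}'));  // true
```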

Posted: Sat Aug 09, 2003 7:20 pm
by m3rajk
i figured i don't need to care about coop, info or museum because i've never seen anyone use them. and all the museums i know are .org.

Posted: Sun Aug 10, 2003 5:10 pm
by Heavy
m3rajk wrote:and i'm not sure that it'll be case insensitive
Hmm... But it works well...
I got it from here:
http://www.php.net/manual/en/pcre.pattern.syntax.php

Read about:
Internal option setting



I got the [:alpha:] from some tutorial, but won't use it again since it turned out to match Scandinavian characters differently depending on the installed locale. I will check out how locale dependent \w is.

Thanks for the tips though. :wink: I'm really newbie on regex, and have only used it with PHP so far.