Which functions for unicode matching/stripping (and more)?

UrbanFuturistic
Forum Newbie
Posts: 3
Joined: Thu Dec 24, 2009 7:08 am

Post by UrbanFuturistic »

OK, I apologise in advance if these questions have already been asked (I expect they have) but I've a lot on and my searching skills are failing me right now (yep, I've used the 'search' function of this board).

I'm a little confused as to whether preg with the /u switch or mb_ereg should be used for matching/validating/stripping input. Is one of these deprecated? Is one going to be kept and the other dropped? Are both going in favour of something new? The preg examples I've seen have a lot more granular control over which Unicode characters can be matched. Can this be done equally finely if I use something like Perl syntax in the mb_ereg family?

Additionally to this, and I don't know which would be the best forum for this second question, are there any resources which say what a valid unicode/IDN e-mail address looks like? The old ASCII version is so well documented even Wikipedia has all the info anyone needs on that (including the difference between the standard and what's actually accepted) but I can't find diddly on international. Has there in fact been no change other than what counts as a letter?

Finally (spot the noob, right?), as unlikely as I find it, I'd just like a little reassurance; is there any way a stored character in a string holding variable can interfere with the running of PHP code? I don't mean if it's passed to the command line or a database in/as a SQL command, just if it's being manipulated in PHP and maybe output to a text file or e-mailed.

Hopefully you're not all reeling from the avalanche of stupid :?
MichaelR
Forum Contributor
Posts: 148
Joined: Sat Jan 03, 2009 3:27 pm

Post by MichaelR »

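Use preg_match with the /u modifier. The preg functions use PCRE, whilst the ereg functions use POSIX regular expressions. PCRE is preferred over POSIX, and the plain ereg functions are deprecated as of PHP 5.3. Note that the mb_ereg family is a separate multibyte implementation, so it isn't covered by that deprecation.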
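
A minimal sketch of what matching and stripping with preg and /u might look like, assuming PHP 7 or later and UTF-8 input. The function names is_valid_name and strip_to_letters_digits are made up for illustration, and the character classes are just one plausible choice:

```php
<?php
// Hypothetical helpers for illustration; the names are not from the thread.

// Validate: one or more Unicode letters, optionally separated by single
// spaces, hyphens or apostrophes. The /u modifier makes PCRE treat both
// the pattern and the subject as UTF-8.
function is_valid_name(string $input): bool
{
    return preg_match('/^\p{L}+(?:[ \'-]\p{L}+)*\z/u', $input) === 1;
}

// Strip: remove everything that is not a letter, a digit or a space.
function strip_to_letters_digits(string $input): string
{
    // preg_replace() returns null when the subject is not valid UTF-8.
    $result = preg_replace('/[^\p{L}\p{N} ]+/u', '', $input);
    return $result ?? '';
}
```

If the input isn't valid UTF-8, preg_match() returns false and preg_replace() returns null, so the first helper rejects such input and the second returns an empty string rather than passing it through.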

UrbanFuturistic wrote: Additionally to this, and I don't know which would be the best forum for this second question, are there any resources which say what a valid unicode/IDN e-mail address looks like? The old ASCII version is so well documented even Wikipedia has all the info anyone needs on that (including the difference between the standard and what's actually accepted) but I can't find diddly on international. Has there in fact been no change other than what counts as a letter?

Internationalized domain names are converted to Punycode: each label containing non-ASCII characters is encoded and given the prefix xn--. For example, http://xn--tdali-d8a8w.lv/ is the converted form of http://tūdaliņ.lv.
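
As a rough illustration of the xn-- prefix, here is a small sketch that checks whether a host name contains a Punycode-encoded label. It assumes PHP 7 or later, and the helper name has_ace_label is hypothetical:

```php
<?php
// Hypothetical helper: reports whether any dot-separated label of a host
// name starts with the "xn--" prefix (compared case-insensitively).
function has_ace_label(string $host): bool
{
    foreach (explode('.', $host) as $label) {
        if (strncasecmp($label, 'xn--', 4) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(has_ace_label('xn--tdali-d8a8w.lv')); // encoded label present
var_dump(has_ace_label('example.lv'));         // plain ASCII labels only
```

This only detects the encoded form. For the conversion itself, the intl extension (if it is installed) provides idn_to_ascii() and idn_to_utf8().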

I actually go over this in an article on email address validation. The link's in my signature.
UrbanFuturistic
Forum Newbie
Posts: 3
Joined: Thu Dec 24, 2009 7:08 am

Post by UrbanFuturistic »

OK, thanks for replying. That's quite easy to follow, but I'm still left not knowing a few things. I don't necessarily expect you to answer these, but if you have any further reading that would be useful (apart from the obvious IDN, e-mail and international e-mail pages on Wikipedia), I'd be very grateful.

1) After a little more research I've found that the RFCs for international e-mails are unfinished so I'm guessing there's not yet any point in checking the local part for anything outside of US-ASCII?

2) I've noticed a lot (haven't checked this against your work yet) of regexp examples for this only select a subset of all possible word/letter characters. Is there a reason for this and is there a reason you've used hex designations rather than \p{L} (for example)?

3) Given that some quite major e-mail server setups refuse to handle addresses with much outside of hyphen, underscore or single periods (thereby telling the RFCs to go stuff their carets and curly brackets), how is this usually addressed?
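
To make question 2 concrete, here is a small sketch of how a hex range and \p{L} can differ in practice. Both patterns are assumptions chosen for illustration (the range U+00C0 to U+00FF is just the Latin-1 Supplement block) and are not taken from MichaelR's article. The constant and function names are hypothetical, and PHP 7 or later is assumed:

```php
<?php
// Illustrative patterns only; neither comes from the article mentioned above.
const PATTERN_HEX_SUBSET = '/^[a-zA-Z\x{00C0}-\x{00FF}]+\z/u';
const PATTERN_ANY_LETTER = '/^\p{L}+\z/u';

// Hypothetical wrapper: true only when the whole input matches the pattern.
function input_matches(string $pattern, string $input): bool
{
    return preg_match($pattern, $input) === 1;
}

$samples = [
    "caf\u{00E9}",               // e-acute lies inside the hex range
    "t\u{016B}dali\u{0146}",     // U+016B and U+0146 lie outside it
    "\u{00D7}",                  // multiplication sign: in the range, not a letter
];

foreach ($samples as $sample) {
    printf(
        "%s hex:%s letter:%s\n",
        $sample,
        var_export(input_matches(PATTERN_HEX_SUBSET, $sample), true),
        var_export(input_matches(PATTERN_ANY_LETTER, $sample), true)
    );
}
```

The hex range misses letters outside the chosen block and also lets through symbols that happen to sit inside it, whereas \p{L} follows the Unicode letter category regardless of block.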