Internationalized regex

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Internationalized regex

Post by alex.barylski »

Regarding this thread:

viewtopic.php?f=50&t=90557

It occurred to me that those functions would not be very internationalization-friendly, with the characters [0-9] hardcoded into the expression.

Sure, I could use \d (or whatever the digit class is in regex), but what about + or -, and periods or commas?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Internationalized regex

Post by prometheuzz »

PCSpectra wrote:...

Sure, I could use \d (or whatever the digit class is in regex), but what about + or -, and periods or commas?
\d is almost always the ASCII range [0-9]. I say "almost" because it depends on how the regex engine is compiled. See also this thread:
viewtopic.php?f=38&t=90050

So, IMO, if you want to be sure to match more than just [0-9], don't rely on \d; specify exactly what you want to match (using Unicode code points).
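
To illustrate how much \d depends on the engine and its settings, here is a minimal sketch in Python rather than PHP/PCRE (an assumption made only for illustration; the sample string of Arabic-Indic digits is likewise made up). In Python 3's standard re module, \d matches any Unicode decimal digit by default for str patterns and only [0-9] under re.ASCII, so the same pattern gives different results:

```python
import re

# Illustrative input: the Arabic-Indic digits for "123" (U+0661, U+0662, U+0663).
arabic_indic = "\u0661\u0662\u0663"

# Default behaviour for str patterns in Python 3: \d covers Unicode decimal digits.
unicode_match = re.fullmatch(r"\d+", arabic_indic)

# With re.ASCII, \d is restricted to [0-9].
ascii_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

# An explicit class does not change meaning with flags.
explicit_match = re.fullmatch(r"[0-9]+", arabic_indic)

print(unicode_match is not None)   # True
print(ascii_match is not None)     # False
print(explicit_match is not None)  # False
```

An explicit class such as [0-9] behaves the same whatever the flags are, which is the argument for spelling out exactly what you want to match.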
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Internationalized regex

Post by GeertDD »

So, don't forget to apply the Unicode modifier: u.

I recommend \p{Nd} (Number, decimal digit). Just using \p{N} (Number) will probably match too many characters. For example, it also includes many dingbats like ❶ ② ➌.

A very practical tool for looking up this kind of thing is UniView: http://people.w3.org/rishida/scripts/uniview/
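
A rough way to see the difference between the Nd category and the wider N group, sketched in Python purely as an illustration. Python's built-in re module has no \p{...} syntax, so this looks up each character's general category with unicodedata instead; the sample characters and the helper name is_decimal_digits are hypothetical.

```python
import unicodedata

# Illustrative sample: an ASCII digit, an Arabic-Indic digit, the three
# circled digits mentioned above, a vulgar fraction, and a Roman numeral.
samples = ["7", "\u0663", "\u2776", "\u2461", "\u278C", "\u00BD", "\u2167"]

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")

def is_decimal_digits(text: str) -> bool:
    """Return True if text is non-empty and every character is in category Nd."""
    return text != "" and all(unicodedata.category(c) == "Nd" for c in text)

print(is_decimal_digits("\u0664\u0662"))  # True: two Arabic-Indic digits
print(is_decimal_digits("4\u2777"))       # False: U+2777 is category No
```

Only the first two samples come out as Nd; the circled digits and the fraction are No and the Roman numeral is Nl, so all of them would match \p{N} but not \p{Nd}.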
alex.barylski

Re: Internationalized regex

Post by alex.barylski »

prometheuzz wrote:So, IMO, if you want to be sure to match more than just [0-9], don't rely on \d; specify exactly what you want to match (using Unicode code points).
That is what I was afraid of...
GeertDD wrote:So, don't forget to apply the Unicode modifier: u.

I recommend \p{Nd} (Number, decimal digit). Just using \p{N} (Number) will probably match too many characters. For example, it also includes many dingbats like ❶ ② ➌.

A very practical tool for looking up this kind of thing is UniView: http://people.w3.org/rishida/scripts/uniview/
Cool :)