
Internationalized regex

Posted: Sun Nov 09, 2008 1:19 am
by alex.barylski
In regard to this thread:

viewtopic.php?f=50&t=90557

It occurred to me that those functions would not be very internationally friendly with the characters [0-9] hardcoded into the expression.

Sure, I could use \d (or whatever the digit class is in RE), but what about + or - and periods or commas?

Re: Internationalized regex

Posted: Sun Nov 09, 2008 3:05 am
by prometheuzz
PCSpectra wrote:...

Sure, I could use \d (or whatever the digit class is in RE), but what about + or - and periods or commas?
\d is probably almost always the ASCII character set [0-9]. I say probably because it depends on how the regex engine is compiled. Also see this thread:
viewtopic.php?f=38&t=90050

So, IMO, if you want to be sure to match more than just [0-9], don't rely on \d, but specify exactly what you want to match (using Unicode code points).
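
To make "specify exactly what you want to match" concrete, here is a minimal sketch. It is written in Python rather than PHP/PCRE, so treat it as an illustration of the idea only. The choice of Arabic-Indic digits (U+0660 to U+0669) and the names EXPLICIT_DIGITS and find_digit_runs are my own assumptions, not something from this thread.

```python
import re

# Hypothetical pattern: ASCII digits plus Arabic-Indic digits (U+0660 to U+0669).
# The ranges are spelled out by code point, so nothing depends on what \d
# happens to mean in a particular engine or build.
EXPLICIT_DIGITS = re.compile("[0-9\u0660-\u0669]+")


def find_digit_runs(text):
    """Return every run of characters covered by the explicit class."""
    return EXPLICIT_DIGITS.findall(text)


# ASCII digits, Arabic-Indic digits, then Devanagari digits (not listed above).
sample = "12 \u0663\u0664 \u096b\u096c"
print(find_digit_runs(sample))
```

The Devanagari digits are skipped because their range was never listed, which is the trade-off of this approach: you get exactly what you wrote and nothing more.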

Re: Internationalized regex

Posted: Sun Nov 09, 2008 6:17 am
by GeertDD
So, don't forget to apply the Unicode modifier: u.

I recommend \p{Nd} (Number, decimal digit). Just using \p{N} (Number) will probably match too many characters. For example, it also includes many dingbats like ❶ ② ➌.

A very practical tool for looking up this kind of thing is UniView: http://people.w3.org/rishida/scripts/uniview/
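
If you would rather check the categories from code than look them up by hand, here is a rough sketch. It uses Python's unicodedata module instead of PCRE, on the assumption that the Unicode general categories are the same data that \p{Nd} and \p{N} refer to. The helper names is_decimal_digit and is_any_number are made up for the example.

```python
import unicodedata


def is_decimal_digit(ch):
    """True when the character is in general category Nd (what \\p{Nd} targets)."""
    return unicodedata.category(ch) == "Nd"


def is_any_number(ch):
    """True for any Number category: Nd, Nl or No (what \\p{N} targets)."""
    return unicodedata.category(ch).startswith("N")


# ASCII 7, Arabic-Indic 3, then the three circled digits mentioned above.
for ch in ["7", "\u0663", "\u2776", "\u2461", "\u278c"]:
    code_point = f"U+{ord(ch):04X}"
    print(code_point, unicodedata.category(ch), is_decimal_digit(ch), is_any_number(ch))
```

The circled digits should be reported as category No (Number, other), so they count as numbers in the broad sense but not as decimal digits.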

Re: Internationalized regex

Posted: Sun Nov 09, 2008 4:52 pm
by alex.barylski
prometheuzz wrote:So, IMO, if you want to be sure to match more than just [0-9], don't rely on \d, but specify exactly what you want to match (using Unicode code points).
That is what I was afraid of...
GeertDD wrote:So, don't forget to apply the Unicode modifier: u.

I recommend \p{Nd} (Number, decimal digit). Just using \p{N} (Number) will probably match too many characters. For example, it also includes many dingbats like ❶ ② ➌.

A very practical tool for looking up this kind of thing is UniView: http://people.w3.org/rishida/scripts/uniview/
Cool :)