Simple locale aware regex

Posted: Sun Nov 09, 2008 7:56 am
by alex.barylski
I have the following simple regex which I want to make locale aware:

Code:

class _Alpha
{
    public function filterMe($value)
    {
        // Strip everything except digits, '.', '+' and '-'
        return preg_replace('/[^0-9\.\+\-]/', '', $value);
    }
}
The problem is that the regex would not work on international numbers, since the format and symbols vary by locale. For example, a comma might be used as a separator in some parts of the world, whereas in others no separator is displayed at all.

I know I could use regex's built-in \w or \d to match a digit, but how would I express a numeric value (whether it be a float, a currency amount, etc.) in an international-friendly way?

Cheers,
Alex
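
One way to approach the question above is to build the character class from the locale's own separators instead of hard-coding them. The following is only a minimal sketch, not a complete answer: the function name filter_numeric and its parameters are hypothetical, and it assumes the separators are single-byte ASCII characters (a multi-byte separator such as a no-break space would need the /u flag and valid UTF-8 input).

```php
<?php
// Hypothetical helper (not from the thread): keeps digits, signs and the
// given separators, and removes every other character.
// Assumption: both separators are single-byte ASCII characters.
function filter_numeric($value, $decimalPoint = '.', $thousandsSep = '')
{
    // preg_quote() escapes characters that are special in a pattern,
    // so a '.' or '-' separator is treated literally.
    $separators = preg_quote($decimalPoint . $thousandsSep, '/');

    return preg_replace('/[^0-9+\-' . $separators . ']/', '', $value);
}

// The separators can be taken from the current locale via localeconv().
$conv = localeconv();
$clean = filter_numeric('EUR 1.234,56', $conv['decimal_point'], $conv['thousands_sep']);
```

Passing the separators as arguments keeps the filter independent of the process-wide locale, which makes it easier to test.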

Re: Simple locale aware regex

Posted: Sun Nov 09, 2008 11:04 am
by prometheuzz
Err, did you read the responses to your other topics? The regex you now posted still contains unnecessary escapes.
viewtopic.php?f=38&t=90567
viewtopic.php?f=50&t=90557

Re: Simple locale aware regex

Posted: Sun Nov 09, 2008 6:10 pm
by alex.barylski
After some Googling I found the following:

Code:

return preg_replace('/[\p{Z}]+$/u', '', $value);
While it works, it also seems to ignore newlines, so I guess newlines in Unicode are not considered whitespace.

--- END EDIT ---

I found the following article:

http://www.regular-expressions.info/unicode.html

Really helped clear things up.

The only problem I see now is that instead of using trim() I should probably use Regex to strip trailing whitespace using something like:

Code:

// Filter trailing whitespace
$subject = preg_replace('/\p{Z}/', '', $subject);
I believe this will strip *all* whitespace, though, and all I really want to do is trim trailing whitespace. So how do you start at the end of a string and strip backwards until a non-whitespace character is found?

Pardon all the questions, but Unicode and regex are two areas I really need to brush up on. The problem with regex is that you tend to write it once and reuse it over and over, so unless you use it constantly you forget it, which is what I am experiencing.

Cheers,
Alex

Re: Simple locale aware regex

Posted: Mon Nov 10, 2008 2:51 am
by prometheuzz
Hi Alex,
PCSpectra wrote:
While it works, it also seems to ignore newlines, so I guess newlines in Unicode are not considered whitespace.
With the /u flag added, the regex engine treats the string as UTF-8, not ASCII, so it appears that different whitespace characters are in play. Note that I know just a little regex and am no Unicode expert.
PCSpectra wrote:
The only problem I see now is that instead of using trim() I should probably use Regex to strip trailing whitespace using something like:
Why? The trim(...) function will do just fine, AFAIK.

Re: Simple locale aware regex

Posted: Mon Nov 10, 2008 5:57 am
by GeertDD
prometheuzz wrote:Why? The trim(...) function will do just fine, AFAIK.
In 99% of the cases. Here are the characters that trim() trims:
" " (ASCII 32 (0x20)), an ordinary space.
"\t" (ASCII 9 (0x09)), a tab.
"\n" (ASCII 10 (0x0A)), a new line (line feed).
"\r" (ASCII 13 (0x0D)), a carriage return.
"\0" (ASCII 0 (0x00)), the NUL-byte.
"\x0B" (ASCII 11 (0x0B)), a vertical tab.
Unicode contains more whitespace characters, for example:
0020: SPACE
00A0: NO-BREAK SPACE
1680: OGHAM SPACE MARK
180E: MONGOLIAN VOWEL SEPARATOR
2000: EN QUAD
2001: EM QUAD
2002: EN SPACE
2003: EM SPACE
2004: THREE-PER-EM SPACE
2005: FOUR-PER-EM SPACE
2006: SIX-PER-EM SPACE
2007: FIGURE SPACE
2008: PUNCTUATION SPACE
2009: THIN SPACE
200A: HAIR SPACE
202F: NARROW NO-BREAK SPACE
205F: MEDIUM MATHEMATICAL SPACE
3000: IDEOGRAPHIC SPACE
This trims trailing whitespace:

Code:

$subject = preg_replace('/\p{Z}+\z/u', '', $subject);
Edit: I just noticed that \pZ does not include all the ASCII whitespace characters matched by trim(), e.g. the tab character. This complicates things a bit more: you'll need to create a custom character class.
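
A custom character class along those lines might look like the sketch below. It combines \p{Z} with the ASCII characters from the trim() list that \p{Z} does not cover. The function name rtrim_unicode is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): strips trailing whitespace,
// covering Unicode separators plus the characters trim() removes.
function rtrim_unicode($subject)
{
    // \p{Z}             : Unicode separators (spaces, line/paragraph separators)
    // \t \n \x0B \r \x00: ASCII characters trim() removes but \p{Z} misses
    // \z                : anchors the match at the very end of the string
    // /u                : treats pattern and subject as UTF-8
    $result = preg_replace('/[\p{Z}\t\n\x0B\r\x00]+\z/u', '', $subject);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the unmodified string in that case.
    return $result === null ? $subject : $result;
}
```

For example, a string ending in a no-break space followed by a tab and a newline would come back with all three removed, while leading whitespace stays untouched.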

Re: Simple locale aware regex

Posted: Mon Nov 10, 2008 6:10 am
by prometheuzz
Thanks for the info Geert!