Simple locale aware regex

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Simple locale aware regex

Post by alex.barylski »

I have the following simple regex which I want to make locale aware:

Code: Select all

class _Alpha
{
    public function filterMe($value)
    {
        return preg_replace('/[^0-9\.\+\-]/', '', $value);
    }
}
The problem is that the regex would not work on international numbers, as the format and symbols might change. For example, a comma might be used as a separator in some parts of the world, whereas in others nothing might be displayed at all.

I know I could use built-in shorthand classes such as \w or \d to match word characters or digits, but how would I express a numeric value (whether it be a float, currency, etc.) in an internationally friendly way?

Cheers,
Alex
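
One possible approach to the question above, as a minimal sketch only: pass the separators in explicitly and keep any Unicode decimal digit (\p{Nd}) instead of just 0-9. The function name filterNumeric and its parameters are hypothetical, and the separators are assumed to be known by the caller rather than detected from a locale.

```php
<?php
// Hypothetical helper (not from the thread): the separators are supplied
// by the caller rather than detected from the current locale.
function filterNumeric(string $value, string $decimalPoint = '.', string $thousandsSep = ','): string
{
    // Drop the grouping separator first so it cannot be mistaken for a decimal point.
    if ($thousandsSep !== '') {
        $value = str_replace($thousandsSep, '', $value);
    }

    // Keep Unicode decimal digits (\p{Nd}), a sign, and the decimal separator.
    $pattern  = '/[^\p{Nd}+\-' . preg_quote($decimalPoint, '/') . ']/u';
    $filtered = preg_replace($pattern, '', $value);
    if ($filtered === null) {
        return ''; // preg_replace() returns null on error, e.g. invalid UTF-8 input
    }

    // Normalise the decimal separator to "." for further processing.
    return str_replace($decimalPoint, '.', $filtered);
}
```

Called as filterNumeric('1.234,56 €', ',', '.'), this strips the currency symbol and grouping dots and normalises the decimal comma. Non-ASCII digits are kept as they are, not converted. Where the intl extension is available, its NumberFormatter class is probably a better fit for real locale-aware parsing.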
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Simple locale aware regex

Post by prometheuzz »

Err, did you read the responses to your other topics? The regex you now posted still contains unnecessary escapes.
viewtopic.php?f=38&t=90567
viewtopic.php?f=50&t=90557
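
Presumably the escapes meant here are the backslashes inside the character class. As a small sketch (the sample input is made up), both patterns below behave the same, because "." and "+" are literal inside a character class and "-" is literal when it comes last:

```php
<?php
// Inside a character class, "." and "+" are literal, and "-" is literal
// when it is the last character, so none of them needs a backslash.
$escaped   = '/[^0-9\.\+\-]/';
$unescaped = '/[^0-9.+-]/';

$input = 'EUR +1.5-x'; // made-up sample value

$a = preg_replace($escaped, '', $input);
$b = preg_replace($unescaped, '', $input);
// Both $a and $b now hold the same filtered string.
```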
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Simple locale aware regex

Post by alex.barylski »

After some Googling I found the following:

Code: Select all

return preg_replace('/[\p{Z}]+$/u', '', $value);
While it works, it also seems to ignore newlines, so I guess newlines in Unicode are not considered whitespace.

--- END EDIT ---

I found the following article:

http://www.regular-expressions.info/unicode.html

Really helped clear things up.

The only problem I see now is that instead of using trim() I should probably use Regex to strip trailing whitespace using something like:

Code: Select all

// Filter trailing whitespace
$subject = preg_replace('/\p{Z}/', '', $subject);
I believe this will strip *all* whitespace though, and all I really want to do is trim trailing whitespace. So how do you start at the end of a string and strip backwards until a non-whitespace character is found?

Pardon all the questions, but Unicode and regex are two areas which I really need to brush up on. The problem with regex is that you write it once and reuse it over and over, so unless you use it constantly you forget it, which is what I experience.

Cheers,
Alex
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Simple locale aware regex

Post by prometheuzz »

Hi Alex,
PCSpectra wrote:
While it works, it also seems to ignore newlines, so I guess newlines in Unicode are not considered whitespace.
With the /u flag added, the regex engine will treat the string as UTF-8, not ASCII. So it appears that different whitespace characters are at play then. Note that I know just a little regex, and am no Unicode expert.
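
A quick check, offered as a sketch, suggests the cause is the \p{Z} property itself, not the /u flag: \p{Z} only covers the Unicode separator categories, while newline and tab are control characters. The variable names are illustrative, and the "\u{...}" string escape assumes PHP 7 or later.

```php
<?php
// "\n" and "\t" are control characters (category Cc), not separators (Z),
// so \p{Z} does not match them even with the /u flag.
$newlineIsZ = preg_match('/\p{Z}/u', "\n");        // 0: no match
$tabIsZ     = preg_match('/\p{Z}/u', "\t");        // 0: no match
$nbspIsZ    = preg_match('/\p{Z}/u', "\u{00A0}");  // 1: NO-BREAK SPACE is a separator
$newlineIsS = preg_match('/\s/u', "\n");           // 1: \s does cover the newline
```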
PCSpectra wrote:
The only problem I see now is that instead of using trim() I should probably use Regex to strip trailing whitespace using something like:
Why? The trim(...) function will do just fine, AFAIK.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Simple locale aware regex

Post by GeertDD »

prometheuzz wrote: Why? The trim(...) function will do just fine, AFAIK.
In 99% of cases, yes. Here are the characters that trim() strips:
" " (ASCII 32 (0x20)), an ordinary space.
"\t" (ASCII 9 (0x09)), a tab.
"\n" (ASCII 10 (0x0A)), a new line (line feed).
"\r" (ASCII 13 (0x0D)), a carriage return.
"\0" (ASCII 0 (0x00)), the NUL-byte.
"\x0B" (ASCII 11 (0x0B)), a vertical tab.
Unicode contains more whitespace characters, for example:
0020: SPACE
00A0: NO-BREAK SPACE
1680: OGHAM SPACE MARK
180E: MONGOLIAN VOWEL SEPARATOR
2000: EN QUAD
2001: EM QUAD
2002: EN SPACE
2003: EM SPACE
2004: THREE-PER-EM SPACE
2005: FOUR-PER-EM SPACE
2006: SIX-PER-EM SPACE
2007: FIGURE SPACE
2008: PUNCTUATION SPACE
2009: THIN SPACE
200A: HAIR SPACE
202F: NARROW NO-BREAK SPACE
205F: MEDIUM MATHEMATICAL SPACE
3000: IDEOGRAPHIC SPACE
This trims trailing whitespace:

Code: Select all

$subject = preg_replace('/\p{Z}+\z/u', '', $subject);
Edit: I just noticed that \pZ does not include all the ASCII whitespace characters matched by trim(), e.g. a tab character. This complicates things a bit more. You'll need to create a custom character class.
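
Such a custom character class might look like the following. This is a minimal sketch, assuming valid UTF-8 input and PHP 7 or later; the function name rtrimUnicode is hypothetical. It combines \p{Z} with \s and adds the NUL byte and vertical tab explicitly, so that it covers everything trim() strips.

```php
<?php
// Hypothetical helper: right-trim Unicode separators plus the ASCII
// characters that trim() removes.
function rtrimUnicode(string $subject): string
{
    // \p{Z}     Unicode separators (space, no-break space, ideographic space, ...)
    // \s        tab, line feed, carriage return, form feed
    // \x00\x0B  NUL byte and vertical tab, listed explicitly to mirror trim()
    // \z        anchors the match at the very end of the string
    $result = preg_replace('/[\p{Z}\s\x00\x0B]+\z/u', '', $subject);

    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $subject;
}
```

Leading whitespace is left alone because the pattern is anchored with \z. A left-trimming counterpart could anchor the same class with \A instead.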
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Simple locale aware regex

Post by prometheuzz »

Thanks for the info Geert!