use of \w with utf8 strings

dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Post by dml »

I can't get the metacharacter \w to match UTF-8-encoded letters (apart from [a-zA-Z]). Am I doing something wrong, or is \w simply not supposed to match everything that Unicode categorises as a letter?

function assert_equals($expected, $got, $message = ""){
  if($got!==$expected){
    var_dump($message, "EXPECTED:", $expected, "GOT:", $got);
    exit(1);
  }
}
function test($locale, $regex, $string, $expected_result){
  assert_equals($locale, setlocale(LC_CTYPE, $locale), "setlocale");
  assert_equals($expected_result, preg_match($regex, $string), array($locale, $regex, $string, bin2hex($string)));
}
 
$iso_8859_string = "\xe9"; // é
$utf8_string = "\xc3\xa9"; // é
$match_one_w = '/^\w$/';
 
// C Locale: doesn't match (as expected)
test("C", $match_one_w, $iso_8859_string, 0);
test("C", $match_one_w, $utf8_string, 0);
 
// fr_FR locale: only matches when iso-8859-1 encoded (as expected)
test("fr_FR", $match_one_w, $iso_8859_string, 1);
test("fr_FR", $match_one_w, $utf8_string, 0);
 
// fr_FR.UTF8 locale: neither encoding matches
test("fr_FR.UTF8", $match_one_w, $iso_8859_string, 0); // no match expected: a lone \xe9 byte is not a valid UTF-8 character
test("fr_FR.UTF8", $match_one_w, $utf8_string, 0); // I might have expected a match here
 
// Unicode regex always matches utf8 string, regardless of locale?
test("C", '/^\pL$/u', $utf8_string, 1);
test("fr_FR", '/^\pL$/u', $utf8_string, 1);
test("fr_FR.UTF8", '/^\pL$/u', $utf8_string, 1); 
 
 
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: use of \w with utf8 strings

Post by Weirdan »

phpwact.org wrote: Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won’t return portions of it). From the pcrepattern man page;

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.
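
For anyone who runs into the same limitation, here is a minimal sketch of one possible workaround, in line with the \pL tests earlier in the thread: build the "word" class from Unicode properties instead of relying on \w. The helper name is_unicode_word is hypothetical, and the sketch assumes PCRE was built with Unicode property support, which the \pL tests already depend on.

```php
<?php
// Hypothetical helper (not from the thread): a Unicode-aware stand-in for /^\w+$/.
// Assumes PCRE has Unicode property support, as the \pL tests above require.
function is_unicode_word($s)
{
    // \p{L} = any letter, \p{N} = any number, plus underscore to mirror \w.
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/^[\p{L}\p{N}_]+$/u', $s) === 1;
}

var_dump(is_unicode_word("caf\xc3\xa9"));    // UTF-8 "café"
var_dump(is_unicode_word("na\xc3\xafve_1")); // UTF-8 "naïve_1"
var_dump(is_unicode_word("\xe9"));           // ISO-8859-1 "é", not valid UTF-8
```

The first two calls should report true and the last one false, because a lone \xe9 byte is rejected as malformed UTF-8 when /u is in effect.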
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: use of \w with utf8 strings

Post by dml »

Thanks Weirdan, that's the information I was looking for. I wasn't able to find it in the php.net manual, whereas it's clearly and unambiguously specified in the PCRE man pages.

As a general rule, would I be correct in assuming that I can use the PCRE man pages as an authoritative specification of the behavior of preg_* regular expressions in PHP? If I understand correctly, PHP just passes the preg_* calls down to the PCRE library so that should be the case, shouldn't it?
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: use of \w with utf8 strings

Post by Weirdan »

dml wrote: As a general rule, would I be correct in assuming that I can use the PCRE man pages as an authoritative specification of the behavior of preg_* regular expressions in PHP? If I understand correctly, PHP just passes the preg_* calls down to the PCRE library so that should be the case, shouldn't it?
I think so - but make sure you're looking at the man page corresponding to the PCRE version used in PHP. PHP usually lags several versions behind.
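
One way to find out which PCRE man page applies is to ask PHP itself. The sketch below is an illustration rather than part of the original exchange: it prints the built-in PCRE_VERSION constant and then probes how \w behaves under /u on the running installation, since the result may differ between PCRE versions and builds. The variable names are arbitrary.

```php
<?php
// PCRE_VERSION is a built-in PHP constant describing the PCRE library in use
// (a version number followed by a release date).
echo "PCRE version: ", PCRE_VERSION, "\n";

// Probe the behaviour discussed in this thread instead of assuming it.
$utf8_e_acute = "\xc3\xa9"; // UTF-8 encoded e-acute
$w_matches  = preg_match('/^\w$/u', $utf8_e_acute);
$pl_matches = preg_match('/^\pL$/u', $utf8_e_acute);

// var_export() prints 0, 1 or false exactly as preg_match() returned it.
echo '\w with /u matches:  ', var_export($w_matches, true), "\n";
echo '\pL with /u matches: ', var_export($pl_matches, true), "\n";
```

No expected output is shown for the \w line on purpose: it is exactly the version-dependent detail that the matching man page should settle.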
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: use of \w with utf8 strings

Post by dml »

Thanks, that makes sense. phpinfo() shows PCRE version 6.7.7.4 for my system, and it looks like the latest version is 7.8.