use of \w with utf8 strings

Posted: Fri Sep 05, 2008 3:32 pm
by dml
I can't get the metacharacter \w to match utf8-encoded letters (apart from [a-zA-Z]). Am I doing something wrong, or is it the case that \w just isn't supposed to match everything that Unicode categorises as a letter?

Code:

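// Abort with a diagnostic dump unless $got is identical (===) to $expected.
function assert_equals($expected, $got, $message = ""){
  if($got!==$expected){
    var_dump($message, "EXPECTED:", $expected, "GOT:", $got);
    exit(1);
  }
}
// Switch LC_CTYPE to $locale (which must be installed), then check that
// preg_match() returns $expected_result (1 = match, 0 = no match).
function test($locale, $regex, $string, $expected_result){
  assert_equals($locale, setlocale(LC_CTYPE, $locale), "setlocale");
  assert_equals($expected_result, preg_match($regex, $string), array($locale, $regex, $string, bin2hex($string)));
}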
 
$iso_8859_string = "\xe9"; // é
$utf8_string = "\xc3\xa9"; // é
$match_one_w = '/^\w$/';
 
// C Locale: doesn't match (as expected)
test("C", $match_one_w, $iso_8859_string, 0);
test("C", $match_one_w, $utf8_string, 0);
 
// fr_FR locale: only matches when iso-8859-1 encoded (as expected)
test("fr_FR", $match_one_w, $iso_8859_string, 1);
test("fr_FR", $match_one_w, $utf8_string, 0);
 
// fr_FR.UTF8 locale
test("fr_FR.UTF8", $match_one_w, $iso_8859_string, 0); // since it's smart enough not to match here
test("fr_FR.UTF8", $match_one_w, $utf8_string, 0); // might have expected a match here
 
// Unicode regex always matches utf8 string, regardless of locale?
test("C", '/^\pL$/u', $utf8_string, 1);
test("fr_FR", '/^\pL$/u', $utf8_string, 1);
test("fr_FR.UTF8", '/^\pL$/u', $utf8_string, 1); 
 
 

Re: use of \w with utf8 strings

Posted: Fri Sep 05, 2008 5:45 pm
by Weirdan
phpwact.org wrote: Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won’t return portions of it). From the pcrepattern man page:

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.
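
If the goal is a "\w that understands Unicode", one workaround is to spell the class out with Unicode properties, in the same spirit as the \pL tests in the first post. The following is only a minimal sketch: the helper name is_unicode_word_char() is made up for illustration, it assumes a PCRE build with Unicode property support, and it needs PHP 7 or later because of the type declarations. Later PHP/PCRE builds may treat \w differently under /u, so check the man page for your version.

```php
<?php
// Hypothetical helper (not from the thread): returns true when $s is exactly
// one UTF-8 encoded letter, number, or underscore.
function is_unicode_word_char(string $s): bool
{
    // \p{L} = any letter, \p{N} = any number. The /u modifier makes PCRE
    // read both pattern and subject as UTF-8 rather than as single bytes.
    // preg_match() returns false for invalid UTF-8, which is not === 1.
    return preg_match('/^[\p{L}\p{N}_]$/u', $s) === 1;
}

var_dump(is_unicode_word_char("\xc3\xa9")); // UTF-8 encoded é
var_dump(is_unicode_word_char("a"));
var_dump(is_unicode_word_char("-"));
var_dump(is_unicode_word_char("\xe9"));     // ISO-8859-1 é, not valid UTF-8
```

Because the character class is explicit, the result does not depend on the LC_CTYPE locale, which matches what the \pL tests above showed.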

Re: use of \w with utf8 strings

Posted: Fri Sep 05, 2008 8:46 pm
by dml
Thanks Weirdan, that's the information I was looking for. I wasn't able to find it in the php.net manual, whereas it's clearly and unambiguously specified in the PCRE man pages.

As a general rule, would I be correct in assuming that I can use the PCRE man pages as an authoritative specification of the behavior of preg_* regular expressions in PHP? If I understand correctly, PHP just passes the preg_* calls down to the PCRE library so that should be the case, shouldn't it?

Re: use of \w with utf8 strings

Posted: Sat Sep 06, 2008 12:42 pm
by Weirdan
dml wrote: As a general rule, would I be correct in assuming that I can use the PCRE man pages as an authoritative specification of the behavior of preg_* regular expressions in PHP? If I understand correctly, PHP just passes the preg_* calls down to the PCRE library so that should be the case, shouldn't it?
I think so, but make sure you're looking at the man page that corresponds to the PCRE version used by your PHP build. PHP usually lags several versions behind.

Re: use of \w with utf8 strings

Posted: Sat Sep 06, 2008 1:08 pm
by dml
Thanks, that makes sense. phpinfo() shows PCRE version 6.7.7.4 for my system, and it looks like the latest version is 7.8.
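
For anyone who wants to check this from a script rather than from the phpinfo() page, here is a small sketch. It assumes a PHP build that defines the PCRE_VERSION constant; the variable names are only illustrative.

```php
<?php
// PCRE_VERSION holds the release number, usually followed by a space and
// the release date, so keep only the part before the first space.
$parts = explode(' ', PCRE_VERSION);
$pcre_release = $parts[0];

echo "PHP ", PHP_VERSION, " is using PCRE ", $pcre_release, "\n";
```

The release number printed here is the one to match against the PCRE man pages.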