What you need to know about \w
Posted: Tue Oct 28, 2008 1:59 pm
\w
I would like to start a discussion about what \w matches exactly. Go ahead.
A community of PHP developers offering assistance, advice, discussion, and friendship.
http://forums.devnetwork.net/
I'm too lazy to get off the couch and pick up Friedl from my bookshelf, but AFAIK, \w (alphanumerics) is LOCALE dependent. And it probably also depends on which regex implementation you're using.
GeertDD wrote: \w
I would like to start a discussion about what \w matches exactly. Go ahead.
It's the same with \d, if I'm not mistaken. It (almost) always matches the ASCII set [0-9], but might also match, for example, Arabic or Japanese numbers, depending on the LOCALE of the machine.
GeertDD wrote: You're going straight where I wanted to go, prometheuzz. Good stuff. I've started this topic because I'm still confused by the exact meaning of \w, so I avoid using it.
Too often I hear people say \w equals [a-zA-Z0-9_]. However, it is not as simple as that. There are different implementations in different regex flavors, but let's just focus on PCRE for starters.
Freaky! Could you post the test? I'd like to run it on my machine(s) as well. My tests with Java (1.5 and 1.6) match only [a-zA-Z0-9_].
GeertDD wrote: One thing that is clear is that \w matches the same regardless of whether PCRE has been compiled with Unicode support or not. \w never takes into account all Unicode alphanumerics.
That said, \w does match more letters than just a-z. Here is a test I did: http://pastie.textmate.org/302382. Now, why is that?
Well, it took some time, but I'm off my couch (it's an awfully comfortable couch!). Your little test and Friedl's remark at the end of page 120 amaze me!
GeertDD wrote: Oh, and when you get up off the couch, I'm not even sure whether Friedl gets it right (p. 120 at the bottom). He seems to argue that \w does equal [a-zA-Z0-9_] in PCRE.
Here's the test:
prometheuzz wrote: Freaky! Could you post the test? I'd like to run it on my machine(s) as well. My tests with Java (1.5 and 1.6) match only [a-zA-Z0-9_].
Code:
<?php
// This outputs the characters with code points 0 through 256 as HTML
// numeric character references (only 0-127 are actually ASCII).
// Note that from 257 and above \w matches nothing anymore.
for ($i = 0; $i < 257; $i++)
{
    echo "&#$i;";
}
// I copied the string generated above to this variable.
$ascii = '
!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€?‚ƒ„…†‡ˆ‰Š‹Œ?Ž??‘’“”•–—˜™š›œ?žŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ?';
// Now match all \w characters from that string and show the matches.
preg_match_all('~\w~', $ascii, $matches);
print_r($matches);
// And the same test with Unicode modifier.
preg_match_all('~\w~u', $ascii, $matches);
print_r($matches);
Code:
<?php
// I removed the non-printable chars...
$ascii = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€?‚ƒ„…†‡ˆ‰Š‹Œ?Ž??‘’“”•–—˜™š›œ?žŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ?';
preg_match_all('~\w~', $ascii, $matches);
print_r($matches);
preg_match_all('~\w~u', $ascii, $matches);
print_r($matches);
?>
Code:
Array
(
[0] => Array
(
[0] => 0
[1] => 1
[2] => 2
[3] => 3
[4] => 4
[5] => 5
[6] => 6
[7] => 7
[8] => 8
[9] => 9
[10] => A
[11] => B
[12] => C
[13] => D
[14] => E
[15] => F
[16] => G
[17] => H
[18] => I
[19] => J
[20] => K
[21] => L
[22] => M
[23] => N
[24] => O
[25] => P
[26] => Q
[27] => R
[28] => S
[29] => T
[30] => U
[31] => V
[32] => W
[33] => X
[34] => Y
[35] => Z
[36] => _
[37] => a
[38] => b
[39] => c
[40] => d
[41] => e
[42] => f
[43] => g
[44] => h
[45] => i
[46] => j
[47] => k
[48] => l
[49] => m
[50] => n
[51] => o
[52] => p
[53] => q
[54] => r
[55] => s
[56] => t
[57] => u
[58] => v
[59] => w
[60] => x
[61] => y
[62] => z
)
)
Code:
bart@kerberos:~$ php -version
PHP 5.2.4-2ubuntu5.3 with Suhosin-Patch 0.9.6.2 (cli) (built: Jul 23 2008 06:46:18)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
Code:
Multibyte Support => enabled
Multibyte string engine => libmbfl
Multibyte (japanese) regex support => enabled
Multibyte regex (oniguruma) version => 4.4.4
Multibyte regex (oniguruma) backtrack check => On
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 7.4 2007-09-21
My guess is that my local PCRE version has been compiled with some extra locale, bleh. Any way I can see that anywhere?
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character value. When running in UTF-8 mode, this applies only to characters with codes less than 128. Higher-valued codes never match escapes such as \w or \d, but can be tested with \p if PCRE is built with Unicode character property support. The use of locales with Unicode is discouraged. If you are handling characters with codes greater than 128, you should either use UTF-8 and Unicode, or use locales, but not try to mix the two.
PCRE contains an internal set of tables that are used when the final argument of pcre_compile() is NULL. These are sufficient for many applications. Normally, the internal tables recognize only ASCII characters. However, when PCRE is built, it is possible to cause the internal tables to be rebuilt in the default "C" locale of the local system, which may cause them to be different.
The internal tables can always be overridden by tables supplied by the application that calls PCRE. These may be created in a different locale from the default. As more and more applications change to using Unicode, the need for this locale support is expected to die away. [...]
From: http://pcre.org/pcre.txt
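One way to see what a given PHP process is actually working with is to print the PCRE version and the current LC_CTYPE locale, then count how many single-byte values \w accepts. This is only a minimal sketch and not something from the posts above: the helper name count_word_bytes is made up for illustration, and the result will differ between machines, which is exactly the point.

```php
<?php
// Hypothetical helper: count how many byte values 0-255 match \w.
// No "u" modifier is used, so every byte is treated as one character.
function count_word_bytes(): int
{
    $count = 0;
    for ($i = 0; $i < 256; $i++) {
        // \A and \z anchor to the absolute start and end of the subject.
        if (preg_match('~\A\w\z~', chr($i)) === 1) {
            $count++;
        }
    }
    return $count;
}

echo 'PCRE version: ', PCRE_VERSION, "\n";
// Passing "0" only reads the current setting; it does not change it.
echo 'LC_CTYPE:     ', setlocale(LC_CTYPE, '0'), "\n";
echo '\w matches:   ', count_word_bytes(), " of 256 byte values\n";
```

A count of 63 corresponds to plain [a-zA-Z0-9_]; a higher number suggests that locale-specific character tables are in play.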
And thank you too for the info!
GeertDD wrote: Thanks for your findings, prometheuzz. I'm on PHP 5.2.6 and PCRE 7.8. However, I don't think the different result is caused by the slightly different versions. I tried the same script on my external webhost (with PCRE 7.6) and the results are the same as yours.
The good news is that I found out in the PCRE manual that there is a locale setting available (see the LOCALE SUPPORT excerpt from http://pcre.org/pcre.txt quoted above).
My guess is that my local PCRE version has been compiled with some extra locale, bleh. I really am glad to read that locale support is being discouraged, because it is very vaguely documented and UTF-8 is the way to go.
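For text that is known to be UTF-8, the \p escapes mentioned in the PCRE manual excerpt quoted in this thread can be used instead of relying on \w. The following is a rough sketch under the assumption that PCRE was built with Unicode property support; the sample string and variable names are invented for illustration. It deliberately avoids \w, since what \w does in UTF-8 mode has varied between PCRE and PHP versions.

```php
<?php
// Sample input; assumed to be valid UTF-8.
$text = 'naïve café_42 Ωmega!';

// \p{L} = any Unicode letter, \p{N} = any Unicode number.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8.
$found = preg_match_all('~[\p{L}\p{N}_]+~u', $text, $matches);

if ($found === false) {
    // preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
    echo "Matching failed\n";
} else {
    print_r($matches[0]);
}
```

The explicit class states exactly which kinds of characters count as part of a word, rather than leaving that to the locale tables.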
I'm with you on that!
GeertDD wrote: The sad conclusion is... If you are using PCRE, do not use \w if you want your regexes to be portable. Right?
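If portability is the goal, one option is to spell out the character class instead of writing \w. The sketch below is only an illustration (the function name is_ascii_word is hypothetical). Because the class is written as byte ranges, it accepts exactly a-z, A-Z, 0-9 and underscore, independent of any locale tables.

```php
<?php
// Hypothetical helper: true if $s is non-empty and consists only of
// ASCII letters, digits, and underscores.
function is_ascii_word(string $s): bool
{
    // \A and \z anchor to the very start and end of the subject,
    // so a trailing newline is not silently accepted.
    return preg_match('~\A[a-zA-Z0-9_]+\z~', $s) === 1;
}

var_dump(is_ascii_word('user_42'));  // bool(true)
var_dump(is_ascii_word("caf\xE9"));  // bool(false): byte 0xE9 is outside the class
```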