\w
GeertDD wrote:\w
I would like to start a discussion about what \w matches exactly. Go ahead.
GeertDD wrote:You're going straight where I wanted to go, prometheuzz. Good stuff. I've started this topic because I'm still confused by the exact meaning of \w, so I avoid using it.
Too often I hear people say \w equals [a-zA-Z0-9_]. However, it is not as simple as that. There are different implementations in different regex flavors, but let's just focus on PCRE for starters.
GeertDD wrote:One thing that is clear is that \w matches the same regardless of whether PCRE has been compiled with Unicode support or not. \w never takes into account all Unicode alphanumerics.
That said, \w does match more letters than just a-z. Here is a test I did: http://pastie.textmate.org/302382. Now, why is that?
GeertDD wrote:Oh, and when you get up out of the couch, I'm not even sure whether Friedl gets it right (p.120 at the bottom). He seems to argue that \w does equal [a-zA-Z0-9_] in PCRE.
prometheuzz wrote:Freaky! Could you post the test? I'd like to run it on my machine(s) as well. My tests with Java (1.5 and 1.6) match only [a-zA-Z0-9_].
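For anyone who wants to repeat the Java side of this, here is a minimal sketch of such a probe. It is not the original test script: the class name WordCharProbe and the method matchedChars are made up for illustration, and it only checks the single-byte range 0..255 with the default java.util.regex settings.

```java
import java.util.regex.Pattern;

final class WordCharProbe {
    // Default java.util.regex behaviour, no flags set.
    static final Pattern WORD = Pattern.compile("\\w");

    /** Returns every character in the range 0..255 that \w matches on its own. */
    static String matchedChars() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c <= 255; c++) {
            if (WORD.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String matched = matchedChars();
        System.out.println(matched.length() + " characters matched: " + matched);
    }
}
```

With default settings this should list only the digits, the ASCII letters and the underscore, 63 characters in total, which agrees with the Java result above. Running the equivalent loop against PCRE is what shows the difference.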
GeertDD wrote:Thanks for your findings, prometheuzz. I'm on PHP 5.2.6 and PCRE 7.8. However, I don't think the different result is caused by the slightly different versions. I tried the same script on my external webhost (with PCRE 7.6) and the results are the same as yours.
The good news is that I found out in the PCRE manual that there is a locale setting available:
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. Higher-valued codes never match
escapes such as \w or \d, but can be tested with \p if PCRE is built
with Unicode character property support. The use of locales with Uni-
code is discouraged. If you are handling characters with codes greater
than 128, you should either use UTF-8 and Unicode, or use locales, but
not try to mix the two.
PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.
The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away. [...]
From: http://pcre.org/pcre.txt
My guess is that my local PCRE version has been compiled with some extra locale, bleh. I really am glad to read that locale support is being discouraged, because it is very vaguely documented and UTF-8 is the way to go.
GeertDD wrote:The sad conclusion is... If you are using PCRE, do not use \w if you want your regexes to be portable. Right?
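If that conclusion holds, one way around it is to spell the class out instead of relying on \w. A small sketch, written in Java only because Java already came up in this thread; the names PortableWord and isAsciiWord are hypothetical. As far as I know, an explicit range like [A-Za-z0-9_] is matched by character code and does not go through the locale tables, so the same class should behave identically in PCRE, but treat that as an assumption to verify on your own build.

```java
import java.util.regex.Pattern;

final class PortableWord {
    // Explicit ASCII class used in place of \w, so the result does not
    // depend on how the regex engine defines "word character".
    static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z0-9_]+");

    /** True if s is non-empty and consists only of ASCII letters, digits and underscores. */
    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAsciiWord("foo_42"));    // ASCII only
        System.out.println(isAsciiWord("caf\u00e9")); // contains e-acute (U+00E9)
    }
}
```

The trade-off is that accented letters are rejected everywhere, which is at least consistent across machines.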