What you need to know about \w

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

What you need to know about \w

Post by GeertDD »

\w

I would like to start a discussion about what \w matches exactly. Go ahead.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: What you need to know about \w

Post by prometheuzz »

GeertDD wrote:\w

I would like to start a discussion about what \w matches exactly. Go ahead.
I'm too lazy to get off the couch and pick up Friedl from my bookshelf, but AFAIK, \w (alphanumerics) is LOCALE dependent. It probably also depends on which regex implementation you're using.

What say you?
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: What you need to know about \w

Post by GeertDD »

You're going straight where I wanted to go, prometheuzz. Good stuff. I've started this topic because I'm still confused by the exact meaning of \w and I avoid using it.

Too often I hear people say \w equals [a-zA-Z0-9_]. However, it is not as simple as that. There are different implementations in different regex flavors, but let's just focus on PCRE for starters.

One thing that is clear is that \w matches the same regardless of whether PCRE has been compiled with Unicode support or not. \w never takes into account all Unicode alphanumerics.

That said, \w does match more letters than just a-z. Here is a test I did: http://pastie.textmate.org/302382. Now, why is that?

Oh, and when you get up off the couch ;), I'm not even sure whether Friedl gets it right (p.120 at the bottom). He seems to argue that \w does equal [a-zA-Z0-9_] in PCRE.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: What you need to know about \w

Post by prometheuzz »

GeertDD wrote:You're going straight where I wanted to go, prometheuzz. Good stuff. I've started this topic because I'm still confused by the exact meaning of \w and I avoid using it.

Too often I hear people say \w equals [a-zA-Z0-9_]. However, it is not as simple as that. There are different implementations in different regex flavors, but let's just focus on PCRE for starters.
It's the same with \d, if I'm not mistaken. It (almost) always matches the ASCII set [0-9], but might also match for example Arabic, or Japanese numbers, depending on the LOCALE of the machine.
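To make the difference between ASCII digits and "any digit" concrete, here is a minimal sketch. It deliberately avoids \d itself (what \d matches may differ between PCRE/PHP versions and modifiers) and compares the explicit class [0-9] with the Unicode property \p{Nd} instead. It assumes UTF-8 input and a PCRE build with Unicode property support; the variable name is just for illustration.

```php
<?php
// Arabic-Indic digits one, two, three (U+0661..U+0663), written as UTF-8 bytes.
$arabic = "\xD9\xA1\xD9\xA2\xD9\xA3";

// Explicit ASCII class: independent of locale and of how \d is defined.
var_dump(preg_match('~^[0-9]+$~D', '123'));     // int(1)
var_dump(preg_match('~^[0-9]+$~D', $arabic));   // int(0)

// Unicode decimal digits: needs the /u modifier and \p support.
var_dump(preg_match('~^\p{Nd}+$~uD', $arabic)); // int(1)
```

Spelling out [0-9] or \p{Nd} states the intent directly, so the result does not hinge on how a given build defines \d.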
GeertDD wrote:One thing that is clear is that \w matches the same regardless of whether PCRE has been compiled with Unicode support or not. \w never takes into account all Unicode alphanumerics.

That said, \w does match more letters than just a-z. Here is a test I did: http://pastie.textmate.org/302382. Now, why is that?
Freaky! Could you post the test? I'd like to run it on my machine(s) as well. My tests with Java (1.5 and 1.6) only match [a-zA-Z0-9_].
GeertDD wrote:Oh, and when you get up off the couch ;), I'm not even sure whether Friedl gets it right (p.120 at the bottom). He seems to argue that \w does equal [a-zA-Z0-9_] in PCRE.
Well, it took some time, but I'm off my couch (it's an awfully comfortable couch!). Your little test and Friedl's remark at the end of page 120 amaze me!
For those who don't own a copy... yet (go on, buy one!), here's what Friedl writes:

\w Part-of-word character Often the same as '[a-zA-Z0-9_]'. Some
tools omit the underscore, while others include all alphanumerics
in the current locale. If Unicode is supported, \w usually refers
to all alphanumerics; notable exceptions include java.util.regex
and PCRE (and by extension, PHP), whose \w are exactly '[a-zA-Z0-9_]'.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: What you need to know about \w

Post by GeertDD »

prometheuzz wrote:Freaky! Could you post the test? I'd like to run it on my machine(s) as well. My tests with Java (1.5 and 1.6) only match [a-zA-Z0-9_].
Here's the test:

Code: Select all

// This outputs numeric character references for code points 0 through 256
// (ASCII proper only covers 0-127; the rest depends on the page encoding).
// Note that from 257 upwards \w no longer matches anything.
for ($i = 0; $i < 257; $i++)
{
    echo "&#$i;";
}
 
// I copied the string generated above to this variable.
$ascii = '  

 !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€?‚ƒ„…†‡ˆ‰Š‹Œ?Ž??‘’“”•–—˜™š›œ?žŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ?';
 
// Now match all \w characters from that string and show the matches.
preg_match_all('~\w~', $ascii, $matches);
print_r($matches);
 
// And the same test with Unicode modifier.
preg_match_all('~\w~u', $ascii, $matches);
print_r($matches);
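
Copying rendered characters out of a browser makes the result depend on the page and editor encoding. As a rough alternative, here is a sketch that builds the 256 single-byte values directly with chr() and lists the bytes that \w matches but [a-zA-Z0-9_] does not. The function name word_bytes() is made up for this example, and the type declarations assume PHP 7 or later.

```php
<?php
// Return every single-byte value (0..255) that \w matches under the
// current LC_CTYPE locale. No /u modifier, so matching is byte-wise.
function word_bytes(): array
{
    $bytes = '';
    for ($i = 0; $i < 256; $i++) {
        $bytes .= chr($i);
    }
    preg_match_all('~\w~', $bytes, $m);
    return $m[0];
}

$matched = word_bytes();

// Keep only the bytes that the explicit ASCII class would NOT match.
$extra = array_filter($matched, function ($c) {
    return preg_match('~^[a-zA-Z0-9_]$~D', $c) !== 1;
});

printf("\\w matched %d bytes, %d of them outside [a-zA-Z0-9_]\n",
    count($matched), count($extra));
foreach ($extra as $c) {
    printf("  0x%02X\n", ord($c));
}
```

If the second number is greater than zero, \w is matching more than the ASCII set on that machine, and the hex values show exactly which bytes.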
 
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: What you need to know about \w

Post by prometheuzz »

Thanks Geert.

Well, when running the following code:

Code: Select all

<?php
// I removed the non-printable chars...
$ascii = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€?‚ƒ„…†‡ˆ‰Š‹Œ?Ž??‘’“”•–—˜™š›œ?žŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ?';
preg_match_all('~\w~', $ascii, $matches);
print_r($matches);
preg_match_all('~\w~u', $ascii, $matches);
print_r($matches);
?>  
The same output is printed twice:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => 0
            [1] => 1
            [2] => 2
            [3] => 3
            [4] => 4
            [5] => 5
            [6] => 6
            [7] => 7
            [8] => 8
            [9] => 9
            [10] => A
            [11] => B
            [12] => C
            [13] => D
            [14] => E
            [15] => F
            [16] => G
            [17] => H
            [18] => I
            [19] => J
            [20] => K
            [21] => L
            [22] => M
            [23] => N
            [24] => O
            [25] => P
            [26] => Q
            [27] => R
            [28] => S
            [29] => T
            [30] => U
            [31] => V
            [32] => W
            [33] => X
            [34] => Y
            [35] => Z
            [36] => _
            [37] => a
            [38] => b
            [39] => c
            [40] => d
            [41] => e
            [42] => f
            [43] => g
            [44] => h
            [45] => i
            [46] => j
            [47] => k
            [48] => l
            [49] => m
            [50] => n
            [51] => o
            [52] => p
            [53] => q
            [54] => r
            [55] => s
            [56] => t
            [57] => u
            [58] => v
            [59] => w
            [60] => x
            [61] => y
            [62] => z
        )
 
)
So it looks like my PHP interpreter does "think" \w equals [a-zA-Z0-9_].

For what it's worth, I test my regexes in PHP from the command line, and running 'php -version' gives the following output:

Code: Select all

bart@kerberos:~$ php -version
PHP 5.2.4-2ubuntu5.3 with Suhosin-Patch 0.9.6.2 (cli) (built: Jul 23 2008 06:46:18) 
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
Oh, and here is part of the output that 'php -i' produced:

Code: Select all

Multibyte Support => enabled
Multibyte string engine => libmbfl
Multibyte (japanese) regex support => enabled
Multibyte regex (oniguruma) version => 4.4.4
Multibyte regex (oniguruma) backtrack check => On
 
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 7.4 2007-09-21
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: What you need to know about \w

Post by GeertDD »

Thanks for your findings, prometheuzz. I'm on PHP 5.2.6 and PCRE 7.8. However, I don't think the different result is caused by the slightly different versions. I tried the same script on my external webhost (with PCRE 7.6) and the results are the same as yours.

The good news is that I found out in the PCRE manual that there is a locale setting available:
LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character value. When running in UTF-8 mode, this applies only to characters with codes less than 128. Higher-valued codes never match escapes such as \w or \d, but can be tested with \p if PCRE is built with Unicode character property support. The use of locales with Unicode is discouraged. If you are handling characters with codes greater than 128, you should either use UTF-8 and Unicode, or use locales, but not try to mix the two.

PCRE contains an internal set of tables that are used when the final argument of pcre_compile() is NULL. These are sufficient for many applications. Normally, the internal tables recognize only ASCII characters. However, when PCRE is built, it is possible to cause the internal tables to be rebuilt in the default "C" locale of the local system, which may cause them to be different.

The internal tables can always be overridden by tables supplied by the application that calls PCRE. These may be created in a different locale from the default. As more and more applications change to using Unicode, the need for this locale support is expected to die away. [...]

From: http://pcre.org/pcre.txt
My guess is that my local PCRE version has been compiled with some extra locale, bleh. Is there any way I can check that?
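
One thing that can at least be inspected from PHP is the runtime locale. As far as I know, PHP's preg functions can use locale-specific character tables when LC_CTYPE is not "C", so the sketch below prints the current LC_CTYPE setting and then tests a single high byte (0xE9, "é" in ISO-8859-1) under "C" and under a French locale. The locale names are guesses that vary per system, which is why the return value of setlocale() is checked. This shows the effect of the runtime locale only, not how the PCRE library itself was built.

```php
<?php
// Passing "0" only queries the current setting; it changes nothing.
$current = setlocale(LC_CTYPE, '0');
echo "LC_CTYPE is currently: $current\n";

// Byte 0xE9 is "é" in ISO-8859-1. Whether \w matches it may depend on the locale.
$byte = "\xE9";

setlocale(LC_CTYPE, 'C');
$inC = preg_match('~\w~', $byte);
echo "C locale: $inC\n";

// Assumed locale names; availability differs per system, so check the result.
$set = setlocale(LC_CTYPE, 'fr_FR.ISO-8859-1', 'fr_FR.ISO8859-1', 'fr_FR');
if ($set === false) {
    echo "No French locale installed; nothing to compare.\n";
} else {
    echo "$set: ", preg_match('~\w~', $byte), "\n";
}

// Restore whatever was active before the experiment.
setlocale(LC_CTYPE, $current);
```

If the two results differ, the extra \w matches come from the active locale rather than from the PCRE version.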

I really am glad to read that locale support is being discouraged, because it is very vaguely documented and UTF-8 is the way to go. But for now the sad conclusion is: if you are using PCRE, do not use \w if you want your regexes to be portable. Right?
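
A minimal sketch of what "not using \w" could look like in practice: spell out the ASCII class when ASCII is what you mean, and use Unicode properties with the /u modifier when you mean letters and digits in general. The helper names is_ascii_word() and is_unicode_word() are invented for this example, the type declarations assume PHP 7 or later, and the second function assumes valid UTF-8 input and a PCRE build with Unicode property support.

```php
<?php
// ASCII-only "word" check: the class is spelled out, so the result does not
// depend on how \w is defined or on the current locale.
function is_ascii_word(string $s): bool
{
    return preg_match('~^[a-zA-Z0-9_]+$~D', $s) === 1;
}

// Unicode-aware check: letters (\p{L}), numbers (\p{N}) and underscore.
function is_unicode_word(string $s): bool
{
    return preg_match('~^[\p{L}\p{N}_]+$~uD', $s) === 1;
}

$cafe = "caf\xC3\xA9"; // "café" encoded as UTF-8

var_dump(is_ascii_word('foo_42'));  // bool(true)
var_dump(is_ascii_word($cafe));     // bool(false)
var_dump(is_unicode_word($cafe));   // bool(true)
```

The D modifier keeps $ from matching before a trailing newline, so a value such as "foo\n" is rejected by both helpers.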
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: What you need to know about \w

Post by prometheuzz »

GeertDD wrote:Thanks for your findings, prometheuzz. I'm on PHP 5.2.6 and PCRE 7.8. However, I don't think the different result is caused by the slightly different versions. I tried the same script on my external webhost (with PCRE 7.6) and the results are the same as yours.

The good news is that I found out in the PCRE manual that there is a locale setting available: [...]
My guess is that my local PCRE version has been compiled with some extra locale, bleh. I really am glad to read that locale support is being discouraged, because it is very vaguely documented and UTF-8 is the way to go.
And thank you too for the info!
GeertDD wrote:The sad conclusion is: if you are using PCRE, do not use \w if you want your regexes to be portable. Right?
I'm with you on that!