Unicode regex -- accented 'e' causing issues in splitting

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Unicode regex -- accented 'e' causing issues in splitting

Post by alex.barylski »

This code works wonderfully:

Code:

print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Decor and Accessories'));
Results:

Code:

Array
(
    [0] => Housewares
    [1] => Home
    [2] => Decor
    [3] => and
    [4] => Accessories
)
However, when I try to use the same regex on a string with an accented character (or similar, I imagine):

Code:

print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Décor and Accessories'));
The string does not split into the expected tokens...

Do I need to enable Unicode support on the server? Is PHP not actually handling Unicode? What am I missing?

Cheers,
Alex
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Unicode regex -- accented 'e' causing issues in splitting

Post by prometheuzz »

I find it strange that you get no separate tokens from your second code snippet. On my system, both snippets produce the same number of tokens:

Code:

print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Decor and Accessories'));
print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Décor and Accessories'));
 
/* output:
 
Array
(
    [0] => Housewares
    [1] => Home
    [2] => Decor
    [3] => and
    [4] => Accessories
)
Array
(
    [0] => Housewares
    [1] => Home
    [2] => Décor
    [3] => and
    [4] => Accessories
)
*/
I have almost no knowledge of PHP; I only know a bit of regex trickery. If you want me to post some details of my PHP installation, you'll have to tell me exactly what to post.

Perhaps GeertDD has some insight into this? (you there Geert?)
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Unicode regex -- accented 'e' causing issues in splitting

Post by Weirdan »

However, when I try to use the same regex on a string with an accented character
Make sure the string is in UTF-8. For your example, that means making sure the file you put your test into is saved as UTF-8.
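
A minimal sketch of that check, assuming the input may arrive as Latin-1 instead of UTF-8. The helper name split_words and the Latin-1 fallback are assumptions made for illustration, not something from the thread. preg_match('//u', ...) is used here only as a UTF-8 validity test, and iconv() does the conversion.

```php
<?php
// Hypothetical helper (not from the thread): split on Unicode separators and
// punctuation, converting the input to UTF-8 first when it is not valid UTF-8.
// Assumption: any input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
function split_words(string $text): array
{
    // An empty pattern with the /u modifier returns 1 for a valid UTF-8
    // subject and false for an invalid one.
    if (preg_match('//u', $text) !== 1) {
        $text = iconv('ISO-8859-1', 'UTF-8', $text);
    }
    return preg_split('/[\p{Z}\p{P}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// "\xE9" is é as a single Latin-1 byte, as it would appear in a source file
// that was not saved as UTF-8.
$latin1 = "Housewares, Home D\xE9cor and Accessories";
$words = split_words($latin1);
print_r($words);
```

If the real source encoding is something other than Latin-1, the first argument to iconv() would need to change accordingly; guessing the encoding wrong produces valid but incorrect UTF-8.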
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Unicode regex -- accented 'e' causing issues in splitting

Post by prometheuzz »

Weirdan wrote:
However, when I try to use the same regex on a string with an accented character
Make sure the string is in UTF-8. For your example, that means making sure the file you put your test into is saved as UTF-8.
Just curious: in what encoding would it cause the text NOT to split? And how come?
I mean, the 'é' is nothing special: it's just part of the ASCII set, so AFAIK, the encoding doesn't matter much, right?
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Unicode regex -- accented 'e' causing issues in splitting

Post by Weirdan »

prometheuzz wrote: I mean, the 'é' is nothing special: it's just part of the ASCII set, so AFAIK, the encoding doesn't matter much, right?
Wrong. é belongs to 'extended ASCII' (Latin-1, to be precise). In both Unicode and Latin-1 its code point is 0xE9, but encoded in UTF-8 it becomes the two-byte sequence 0xC3 0xA9, whereas in Latin-1 it is stored as the single byte 0xE9. That holds for the fully composed Unicode form (NFC). In the fully decomposed form (NFD), é is represented as two characters (not to be confused with bytes): a literal 'e' followed by a combining acute accent (U+0301), which gives a UTF-8 sequence of three bytes: 0x65 0xCC 0x81.
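
The byte sequences above can be checked directly. The following is a sketch, not part of the original posts: the variable names are made up, and the strings are written with escapes so the result does not depend on how the file itself is saved (the "\u{...}" syntax assumes PHP 7 or later). It also shows the likely cause of the original problem: a lone 0xE9 byte is not valid UTF-8, so with the /u modifier preg_split() returns false instead of an array.

```php
<?php
// The same letter in three byte-level forms.
$latin1 = "\xE9";       // Latin-1: the single byte 0xE9
$nfc    = "\u{00E9}";   // UTF-8, composed (NFC): U+00E9
$nfd    = "e\u{0301}";  // UTF-8, decomposed (NFD): 'e' + combining acute accent

echo bin2hex($latin1), "\n"; // e9
echo bin2hex($nfc), "\n";    // c3a9
echo bin2hex($nfd), "\n";    // 65cc81

$pattern = '/[\p{Z}\p{P}]+/u';

// Latin-1 input: the subject is invalid UTF-8, so the call fails.
$bad = preg_split($pattern, "Home D\xE9cor");
$err = preg_last_error();
var_dump($bad);                         // bool(false)
var_dump($err === PREG_BAD_UTF8_ERROR); // bool(true)

// UTF-8 input, composed and decomposed: both split into two tokens, because
// U+0301 is a combining mark (\p{M}), not a separator or punctuation.
$good    = preg_split($pattern, "Home D\u{00E9}cor");
$goodNfd = preg_split($pattern, "Home De\u{0301}cor");
print_r($good);
print_r($goodNfd);
```

Checking preg_last_error() right after a failed call is a quick way to tell an encoding problem apart from a pattern problem.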
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Unicode regex -- accented 'e' causing issues in splitting

Post by prometheuzz »

Ah, I see, thanks for the info Weirdan.