Unicode regex -- accented 'e' causing issues in splitting
Posted: Tue May 26, 2009 2:58 pm
by alex.barylski
This code works wonderfully:
Code: Select all
print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Decor and Accessories'));
Results:
Code: Select all
Array
(
[0] => Housewares
[1] => Home
[2] => Decor
[3] => and
[4] => Accessories
)
However, when I try to use the same regex on a string with an accented character (or similar, I imagine):
Code: Select all
print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Décor and Accessories'));
The string does not split into the expected tokens...
Do I have to enable Unicode support on the server, maybe? Is PHP not actually handling Unicode? What am I missing?
Cheers,
Alex
Re: Unicode regex -- accented 'e' causing issues in splitting
Posted: Tue May 26, 2009 3:33 pm
by prometheuzz
I find it strange that you get no separate tokens from your second code snippet. On my system, both snippets produce the same number of tokens:
Code: Select all
print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Decor and Accessories'));
print_r(preg_split('/[\p{Z}\p{P}]+/u', 'Housewares, Home Décor and Accessories'));
/* output:
Array
(
[0] => Housewares
[1] => Home
[2] => Decor
[3] => and
[4] => Accessories
)
Array
(
[0] => Housewares
[1] => Home
[2] => Décor
[3] => and
[4] => Accessories
)
*/
I have almost no knowledge of PHP, I only know a bit of regex-trickery, but if you want me to post some details of my PHP installation, you'll have to tell me what to post exactly.
Perhaps GeertDD has some insight into this?
(you there Geert?)
Re: Unicode regex -- accented 'e' causing issues in splitting
Posted: Tue May 26, 2009 3:46 pm
by Weirdan
However, when I try to use the same regex on a string with an accented character
Make sure the string is in UTF-8. For your example, that means making sure the file containing your test is saved as UTF-8.
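A minimal sketch of that check, assuming the mbstring extension is available for the conversion step and that any input which is not valid UTF-8 is really ISO-8859-1 (Latin-1). The helper name `split_words` is made up for illustration. The test string is built with a byte escape so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: split on Unicode separators and punctuation,
// converting the input to UTF-8 first when it is not valid UTF-8.
function split_words(string $text, string $fallbackEncoding = 'ISO-8859-1'): array
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    if (preg_match('//u', $text) !== 1) {
        // Assumption: anything that is not valid UTF-8 is in $fallbackEncoding.
        $text = mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
    }

    $parts = preg_split('/[\p{Z}\p{P}]+/u', $text);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed, error code ' . preg_last_error());
    }
    return $parts;
}

// "\xE9" is 'é' as a single Latin-1 byte, as in a file not saved as UTF-8.
$latin1 = "Housewares, Home D\xE9cor and Accessories";
print_r(split_words($latin1));
```

Guessing the source encoding is only safe when you know where the data comes from; saving the file as UTF-8 in the first place avoids the conversion entirely.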
Re: Unicode regex -- accented 'e' causing issues in splitting
Posted: Tue May 26, 2009 3:55 pm
by prometheuzz
Weirdan wrote:
However, when I try to use the same regex on a string with an accented character
Make sure the string is in UTF-8. For your example, that means making sure the file containing your test is saved as UTF-8.
Just curious: which encoding would cause the text NOT to split, and why?
I mean, the 'é' is nothing special: it's just part of the ASCII set, so AFAIK, the encoding doesn't matter much, right?
Re: Unicode regex -- accented 'e' causing issues in splitting
Posted: Tue May 26, 2009 6:18 pm
by Weirdan
prometheuzz wrote:I mean, the 'é' is nothing special: it's just part of the ASCII set, so AFAIK, the encoding doesn't matter much, right?
Wrong. é is not ASCII; it belongs to 'extended ASCII' (Latin-1, to be precise). In both Unicode and Latin-1 its code point is 0xE9, but encoded as UTF-8 it becomes the two-byte sequence 0xC3 0xA9, while in Latin-1 it is stored as the single byte 0xE9. With the /u modifier the subject must be valid UTF-8, and a lone 0xE9 byte is not, so the split fails. All of this applies to the fully composed Unicode form (NFC). In the fully decomposed form (NFD), é is represented as two characters (not to be confused with bytes): a literal 'e' followed by a combining acute accent (U+0301), which gives a UTF-8 sequence of three bytes: 0x65 0xCC 0x81.
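The three byte sequences can be compared with a short script. This is only a sketch: the variable names are illustrative, and byte escapes are used instead of literal characters so that it behaves the same however the file is saved.

```php
<?php
$pattern = '/[\p{Z}\p{P}]+/u';

$latin1 = "D\xE9cor and";       // é as the single Latin-1 byte 0xE9
$nfc    = "D\xC3\xA9cor and";   // é as U+00E9 encoded in UTF-8
$nfd    = "De\xCC\x81cor and";  // 'e' followed by U+0301 encoded in UTF-8

// Invalid UTF-8 subject: preg_split() returns false instead of an array.
var_dump(preg_split($pattern, $latin1));              // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Both UTF-8 forms split on the space. U+0301 is a combining mark (\p{M}),
// not punctuation, so it stays attached to the preceding 'e'.
$nfcParts = preg_split($pattern, $nfc);
$nfdParts = preg_split($pattern, $nfd);

echo bin2hex($nfcParts[0]), "\n"; // 44c3a9636f72
echo bin2hex($nfdParts[0]), "\n"; // 4465cc81636f72
```

Because print_r(false) prints nothing, a failed split is easy to mistake for a string that "did not split"; checking preg_last_error() makes the cause visible.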
Re: Unicode regex -- accented 'e' causing issues in splitting
Posted: Wed May 27, 2009 2:30 am
by prometheuzz
Weirdan wrote:prometheuzz wrote:I mean, the 'é' is nothing special: it's just part of the ASCII set, so AFAIK, the encoding doesn't matter much, right?
Wrong. é belongs to 'extended ascii' (latin1 indeed). In both Unicode and latin1 it has codepoint value of 0xE9, but when encoded in utf-8 it becomes two-byte sequence 0xC3 0xA9, while in latin1 it's stored as literal 0xE9 byte. This is so if we're talking about fully composed unicode form (NFC). In fully decomposed form (NFD) é is represented as two characters (not to be confused with bytes), first being literal 'e' and second is a combining acute accent (U+0301), giving the overall utf-8 sequence of three bytes: 0x65 0xCC 0x81.
Ah, I see, thanks for the info Weirdan.