Matching whole words in Unicode

Posted: Fri Jun 13, 2008 3:25 pm
by lwc
I want to match a whole word, but I can't get it to work with either ereg or preg when the word is in Unicode.

Code: Select all

 
     $pattern = "pattern";
     $text = "a phrase that contains the word pattern as a whole word.";
 
     if ($pattern == utf8_encode($pattern)) {
         // The following patterns only work if $pattern is in pure Latin letters
        $ereg_pattern = "[[:<:]]{$pattern}[[:>:]]";
        $preg_pattern = "/\b$pattern\b/i";
     } else {
        $ereg_pattern = ??????????????????????
        $preg_pattern = ?????????????????????? //Note: "/\b$pattern\b/u" does NOT work - see below.
     }
 
// Now I can highlight the pattern
     $highlight_by_ereg = eregi_replace($ereg_pattern, '<font class="highlight">\\0</font>', $text);
// Or
     $highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
 // Note: if $pattern is a word in Unicode and $preg_pattern was set to "/\b$pattern\b/u", then $highlight_by_preg simply stays equal to the original $text (i.e. nothing is replaced).
 
 
What I need is something to replace either or both of those "?????????"... :D

Thanks!

Re: Matching whole words in Unicode

Posted: Fri Jun 13, 2008 3:27 pm
by Weirdan
You need the /u modifier in your preg regexp (lowercase; uppercase /U means ungreedy).

Re: Matching whole words in Unicode

Posted: Fri Jun 13, 2008 3:29 pm
by lwc
I've just updated my post to show you it just doesn't help if I use "/u" (BTW, if I want both /i and /u, should I use "/iu"?).

Re: Matching whole words in Unicode

Posted: Sat Jun 14, 2008 12:53 am
by GeertDD
lwc wrote:BTW, if I want both /i and /u, should I use "/iu"?).
Yes. The order of modifiers does not matter; /ui works as well.

Use PCRE (the preg_* functions). Make sure PCRE is compiled with UTF-8 support (--enable-utf8). Support for Unicode properties (--enable-unicode-properties) is highly recommended as well.

Here are two checks you can use for testing PCRE:

Code: Select all

if ( ! preg_match('/^.$/u', 'ñ'))   exit('No Unicode support at all');
if ( ! preg_match('/^\pL$/u', 'ñ')) exit('No support for Unicode properties');
This looks like a helpful page as well (though I have not read it yet): http://www.regular-expressions.info/unicode.html

Re: Matching whole words in Unicode

Posted: Sat Jun 14, 2008 12:09 pm
by lwc
I get "true" for both statements. I'm telling you, "\b" just doesn't catch Unicode.

I've managed to use "\X"

Code: Select all

"/\X{$pattern}\X/"
to catch Unicode, but then it demands spaces (i.e. it would catch " word " but not " word" [end of line] or "word " [start of line]).

Re: Matching whole words in Unicode

Posted: Sat Jun 14, 2008 2:35 pm
by GeertDD
\b does not "catch" text, it only matches a position.

\X does not match a position, but rather a Unicode character that could be made up out of multiple code points.
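To make the difference concrete, here is a minimal sketch (assuming PHP 7.0 or later for the "\u{...}" escape; the sample strings are made up for illustration). It shows that \b contributes nothing to the matched text, while \X consumes a whole character, which is why a pattern like "/\X{$pattern}\X/" swallows one character on each side of the word.

```php
<?php
// \b is zero-width: the match contains only the word itself.
preg_match('/\bcat\b/', 'a cat.', $m, PREG_OFFSET_CAPTURE);
$word   = $m[0][0]; // the matched text
$offset = $m[0][1]; // byte offset where the match starts

// \X consumes one extended grapheme cluster, which may span several code points.
$sample     = "e\u{0301}a"; // 'e' + combining acute accent, then 'a'
$clusters   = preg_match_all('/\X/u', $sample); // counts grapheme clusters
$codePoints = preg_match_all('/./u', $sample);  // counts individual code points

var_dump($word, $offset, $clusters, $codePoints);
```

Here the \b match is just the three letters of the word, while the three code points of the second sample group into two clusters.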

Re: Matching whole words in Unicode

Posted: Sun Jun 15, 2008 3:02 am
by lwc
With that said, what's the solution?

Re: Matching whole words in Unicode

Posted: Sun Jun 15, 2008 4:22 am
by GeertDD
I thought \b works with Unicode text as well. Could you post an example to see how it fails?

Re: Matching whole words in Unicode

Posted: Sun Jun 15, 2008 6:29 am
by lwc

Code: Select all

 
$pattern = "hallo";
echo preg_replace("/\b{$pattern}\b/", "hooray", "hallo halloah hallo hallo");
// Outputs:
// hooray halloah hooray hooray
// 3 replacements.
 
$pattern = "hallo"; // Now let's ASSUME "hallo" is a word in Unicode
echo preg_replace("/\b{$pattern}\b/", "hooray", "hallo halloah hallo hallo");
// Outputs:
// hallo halloah hallo hallo
// 0 replacements...
 
$pattern = "hallo"; // Again let's ASSUME "hallo" is a word in Unicode
echo preg_replace("/\X{$pattern}\X/", "hooray", "hallo halloah hallo hallo"); // \X instead of \b
// Outputs:
// hallo halloah hooray hallo
// This time we get 1 replacement (the one that had spaces before and after it).
// Although many things can go wrong because \X is simply not the same as \b
 

Re: Matching whole words in Unicode

Posted: Sun Jun 15, 2008 6:59 am
by GeertDD
I don't want to annoy you, but you actually posted only one real example. Assumptions are not examples. Give me some of the real Unicode strings you are using.

One thing is already clear: \X should not be used at all for matching word boundaries. Forget about \X.

Re: Matching whole words in Unicode

Posted: Mon Jun 16, 2008 4:37 am
by lwc
$pattern = "hallo";
echo preg_replace("/\b{$pattern}\b/", "hooray", "hallo halloah hallo hallo");
// Outputs:
// hooray halloah hooray hooray
// 3 replacements.
 
$pattern = "בדיקה";
echo preg_replace("/\b{$pattern}\b/", "ארוחה", "בדיקה בבדיקה בדיקה בדיקה");
// Outputs:
// בדיקה בבדיקה בדיקה בדיקה
// 0 replacements...
 
$pattern = "בדיקה";
echo preg_replace("/\X{$pattern}\X/", "ארוחה", "בדיקה בבדיקה בדיקה בדיקה");
// Outputs:
// בדיקה בבדיקה ארוחה בדיקה
// 1 replacement (only the word that had spaces on both sides).

Re: Matching whole words in Unicode

Posted: Mon Jun 16, 2008 6:10 am
by GeertDD
Alright, that is the real stuff. I tried it on my localhost and get similar results. My text editor (TextMate) does weird things when trying to edit the Hebrew text, though. It is Hebrew, right?

I was thinking maybe we could recode the first \b into (?:^|\s) which matches either the beginning of the string or whitespace in front. The ending \b would then become (?:\s|$)

Re: Matching whole words in Unicode

Posted: Mon Jul 28, 2008 2:39 pm
by lwc
GeertDD wrote:I was thinking maybe we could recode the first \b into (?:^|\s) which matches either the beginning of the string or whitespace in front. The ending \b would then become (?:\s|$)
I guess I should have mentioned that I had already tried this before posting. Whitespace is not enough: what if a comma, a period or a dash follows the word? Then it fails. If checking for whitespace were sufficient, no one would have invented \b in the first place.
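A small sketch of that limitation, assuming UTF-8 input and a sample sentence invented for this illustration. The sentence contains the word three times as a whole word, but the whitespace-based pattern finds only one of them: the first is followed by a comma, and the last loses its leading space because the previous match already consumed it.

```php
<?php
$word = 'בדיקה';
$text = 'בדיקה, בבדיקה בדיקה בדיקה'; // whole-word occurrences: 1st, 3rd and 4th word

// Whitespace-or-edge on both sides, as suggested above.
$regex = '/(?:^|\s)' . preg_quote($word, '/') . '(?:\s|$)/u';

// preg_match_all() returns the number of full pattern matches.
$found = preg_match_all($regex, $text);

var_dump($found);
```

Because (?:^|\s) and (?:\s|$) consume the surrounding whitespace, two neighbouring words can never both match, on top of the punctuation problem.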

Update: the solution is:

Code: Select all

 
     $pattern = "pattern";
     $text = "a phrase that contains the word pattern as a whole word.";
 
     if ($pattern == utf8_encode($pattern)) {
         // The following pattern only works if $pattern is in pure Latin letters
        $preg_pattern = "/(?!(?:[^<]+>|[^>]+<\/a>))\b$pattern\b/i"; // The code before the first \b is needed to avoid replacing HTML tags
     } else {
        // Lookarounds: no letter directly before or after the word
        $preg_pattern = "/(?<!\p{L})$pattern(?!\p{L})/u"; // Because "/\b$pattern\b/u" does not work
     }
 
// Now I can highlight the pattern whether it's Unicode or not
     $highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
 
Thanks to another forum for this solution.
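For reference, a self-contained sketch of the lookaround approach, assuming PHP 7+ and UTF-8 input. The function name highlightWholeWord is hypothetical and not part of the original solution, and preg_quote() is added here so that a search word containing regex metacharacters cannot break the pattern. On newer PHP versions \b under /u may already be Unicode-aware, but the lookaround version does not depend on that.

```php
<?php
// Hypothetical helper: wraps whole-word occurrences of $word in a highlight tag.
function highlightWholeWord(string $word, string $text): string
{
    $quoted = preg_quote($word, '/'); // escape regex metacharacters in the search word
    // No Unicode letter may stand directly before or after the word.
    $regex = '/(?<!\p{L})' . $quoted . '(?!\p{L})/iu';
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($regex, '<font class="highlight">$0</font>', $text) ?? $text;
}

$hebrew = highlightWholeWord('בדיקה', 'בדיקה, בבדיקה בדיקה.');
$latin  = highlightWholeWord('hallo', 'hallo halloah hallo');

echo $hebrew, "\n", $latin, "\n";
```

The longer word that merely contains the search word is left alone in both cases, while occurrences next to punctuation or at the string edges are highlighted.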

Re: Matching whole words in Unicode

Posted: Sun Jan 04, 2009 12:13 pm
by lwc
GeertDD wrote:Make sure PCRE is compiled with UTF-8 support (--enable-utf8). Also support for Unicode properties is highly recommended (--enable-unicode-properties).

Here are two checks you can use for testing PCRE:
Thanks, but your tests require saving the file in UTF-8 mode (which is sometimes not possible because the file is part of a CMS or something else that requires ANSI files). Also, if they come out as false, they simply crash the system before even reaching the exit.

Here are the improved tests:

Code: Select all

if ( ! @preg_match('/^.$/u', urldecode('%C3%B1')))   exit('No Unicode support at all');
if ( ! @preg_match('/^\pL$/u', urldecode('%C3%B1'))) exit('No support for Unicode properties');
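If you would rather get the result back than exit, the same two checks can be wrapped in a function. This is only a sketch: the name pcreUnicodeSupport is made up, and the test character is written as byte escapes instead of urldecode(), which likewise keeps the source file independent of its encoding.

```php
<?php
// Hypothetical helper: reports which Unicode features this PCRE build offers.
function pcreUnicodeSupport(): array
{
    $enye = "\xC3\xB1"; // UTF-8 bytes of U+00F1, safe in an ANSI-encoded file
    return [
        // @ suppresses the warning a build without the feature would raise
        'utf8'       => @preg_match('/^.$/u', $enye) === 1,
        'properties' => @preg_match('/^\pL$/u', $enye) === 1,
    ];
}

$support = pcreUnicodeSupport();
var_dump($support);
```

Each entry is a plain boolean, so the caller can decide whether to fall back to another pattern or to stop.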