$pattern = "pattern";
$text = "a phrase that contains the word pattern as a whole word.";
if ($pattern == utf8_encode($pattern)) {
// The following patterns only work if $pattern is in pure Latin letters
$ereg_pattern = "[[:<:]]{$pattern}[[:>:]]";
$preg_pattern = "/\b$pattern\b/i";
} else {
$ereg_pattern = ??????????????????????
$preg_pattern = ?????????????????????? //Note: "/\b$pattern\b/u" does NOT work - see below.
}
// Now I can highlight the pattern
$highlight_by_ereg = eregi_replace($ereg_pattern, '<font class="highlight">\\0</font>', $text);
// Or
$highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
// Note: if $pattern is a word in Unicode and $preg_pattern was set to "/\b$pattern\b/u", then $highlight_by_preg just remains equal to the original $text (i.e. nothing is replaced).
What I need is something to replace either or both of those "?????????"...
Thanks!
Last edited by lwc on Fri Jun 13, 2008 6:19 pm, edited 8 times in total.
lwc wrote: BTW, if I want both /i and /u, should I use "/iu"?
Yes. Note that the order of modifiers is not important, /ui is possible as well.
Use PCRE (the preg_* functions). Make sure PCRE is compiled with UTF-8 support (--enable-utf8). Support for Unicode properties is also highly recommended (--enable-unicode-properties).
$pattern = "hallo";
echo preg_replace("/\b{$pattern}\b/", "hooray", "hallo halloah hallo hallo");
// Outputs:
// [b]hooray[/b] halloah [b]hooray[/b] [b]hooray[/b]
// 3 replacements.
$pattern = "hallo"; // Now let's ASSUME "hallo" is a word in Unicode
echo preg_replace("/\b{$pattern}\b/", "hooray", "hallo halloah hallo hallo");
// Outputs:
// hallo halloah hallo hallo
// 0 replacements...
$pattern = "hallo"; // Again let's ASSUME "hallo" is a word in Unicode
echo preg_replace("/\X{$pattern}\X/", "hooray", "hallo halloah hallo hallo"); // \X instead of \b
// Outputs:
// hallo halloah [b]hooray[/b] hallo
// This time we get 1 replacement (the one that had spaces before and after it).
// Although many things can go wrong because \X is simply not the same as \b
I don't want to smurf you off, but actually you only posted one example. Assumptions are not examples. Give me some real Unicode strings you are using.
One thing that is clear already now, is that \X should not be used at all for matching word boundaries. Forget about \X.
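To make the point about \X concrete, here is a minimal sketch. It assumes a PCRE build with Unicode support; the sample strings are made up for illustration. \X consumes one whole character (an extended grapheme cluster), whereas \b is a zero-width assertion, so \X happily matches a letter glued to the word and fails when there is nothing next to the word to consume.

```php
<?php
// \X must consume a character on each side of "hallo".
$regex = '/\Xhallo\X/u';

// Letters glued to the word still satisfy \X, so this is not a whole-word test.
$glued = preg_match($regex, 'xhallox');

// At the very start and end of the string there is nothing for \X to consume.
$alone = preg_match($regex, 'hallo');

// When it does match, the neighbouring characters become part of the match.
preg_match($regex, ' hallo ', $m);

var_dump($glued, $alone, $m[0]);
```

Because the neighbouring characters are part of the match, a preg_replace with this pattern would also swallow them.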
Alright, that is the real stuff. I tried it on my localhost and get similar results. My text editor (TextMate) does weird things when trying to edit the Hebrew text, though. It is Hebrew, right?
I was thinking maybe we could recode the first \b into (?:^|\s) which matches either the beginning of the string or whitespace in front. The ending \b would then become (?:\s|$)
I guess I should have mentioned that I only posted this after already trying this. See, space is not enough. What if there are things like a comma, a dot or a dash after the word? It's no good then. Actually, if it was then no one would have invented \b in the first place because there would be no need for it.
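A small sketch of that objection, under the assumption that PCRE has UTF-8 support. The sample word ("ñu", written as raw UTF-8 bytes so the file encoding does not matter) and the sample text are invented for illustration.

```php
<?php
// Hypothetical sample word: "ñu" as raw UTF-8 bytes.
$word = "\xC3\xB1u";

// The word appears three times: before a comma, before a dot, and at the end.
$text = "$word, $word. $word";

// Whitespace-only "boundaries" as suggested: (?:^|\s) in front, (?:\s|$) behind.
$pattern = '/(?:^|\s)' . preg_quote($word, '/') . '(?:\s|$)/u';

// Only the last occurrence is found; the comma and the dot block the other two.
$count = preg_match_all($pattern, $text, $m);

var_dump($count, $m[0]);
```

Note that the one match that does succeed includes the leading space, so a replacement would swallow that space as well.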
$pattern = "pattern";
$text = "a phrase that contains the word pattern as a whole word.";
if ($pattern == utf8_encode($pattern)) {
// The following pattern only works if $pattern is in pure Latin letters
$preg_pattern = "/(?!(?:[^<]+>|[^>]+<\/a>))\b$pattern\b/i"; // The code before the first \b is needed to avoid replacing HTML tags
} else {
$preg_pattern = "/(?<!\p{L})$pattern(?!\p{L})/u"; // Because "/\b$pattern\b/u" does not work
}
// Now I can highlight the pattern whether it's Unicode or not
$highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
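For reference, the lookaround idea can be wrapped in a small helper. This is only a sketch: the function name highlight_word is made up, it assumes PCRE was built with UTF-8 and Unicode property support, and it adds preg_quote() (not in the code above) so that special characters in the search word cannot break the pattern. It does not include the HTML tag guard from the Latin branch.

```php
<?php
// Hypothetical helper: highlight $word in $text only where it is not
// directly preceded or followed by another letter (\p{L}).
function highlight_word($word, $text)
{
    // Escape regex metacharacters and the "/" delimiter in the search word.
    $quoted = preg_quote($word, '/');

    // Lookarounds are zero-width, so the neighbouring characters are not replaced.
    $pattern = '/(?<!\p{L})' . $quoted . '(?!\p{L})/iu';

    return preg_replace($pattern, '<font class="highlight">\\0</font>', $text);
}

// Sample word "ñu" as raw UTF-8 bytes, so the file itself can stay ANSI.
$word = "\xC3\xB1u";
echo highlight_word($word, "$word, {$word}s $word.");
```

In this sample the first and last occurrences are wrapped, while the middle one is left alone because it is followed by the letter "s".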
GeertDD wrote: Make sure PCRE is compiled with UTF-8 support (--enable-utf8). Also support for Unicode properties is highly recommended (--enable-unicode-properties).
Here are two checks you can use for testing PCRE:
Thanks, but your tests require saving the file in UTF-8 mode (which is sometimes not possible because the file is part of a CMS or something that requires ANSI files). Also, they simply crash the system before even reaching the exit (if they come out as false).
if ( ! @preg_match('/^.$/u', urldecode('%C3%B1'))) exit('No Unicode support at all');
if ( ! @preg_match('/^\pL$/u', urldecode('%C3%B1'))) exit('No support for Unicode properties');
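If exiting is too drastic, the same two probes could be turned into a function that just reports the result, so the caller can fall back to the Latin-only pattern. This is a sketch only; the function name pcre_unicode_support and the array keys are invented here. The test character is written as raw bytes, so the file does not need to be saved as UTF-8.

```php
<?php
// Hypothetical helper: report PCRE capabilities instead of calling exit().
function pcre_unicode_support()
{
    // "ñ" as raw UTF-8 bytes; works regardless of the file's own encoding.
    $sample = "\xC3\xB1";

    // With /u the two bytes must count as exactly one character.
    $utf8 = @preg_match('/^.$/u', $sample) === 1;

    // \pL (any letter) is only available with Unicode property support.
    $properties = $utf8 && @preg_match('/^\pL$/u', $sample) === 1;

    return array('utf8' => $utf8, 'properties' => $properties);
}

$support = pcre_unicode_support();
var_dump($support);
```

A caller could then pick the \p{L} lookaround pattern only when $support['properties'] is true.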