Strange results when parsing accented chars

Posted: Fri Feb 17, 2006 6:32 pm
by stsr11
Can anyone tell me why the following matches 'fiancé' as 'fianc' (wrong) and 'idée' as 'idée' (correct) in PHP?

Bizarrely, it works perfectly in RegexBuddy!!!

Code:

\b([a-zéä]+\-?[a-zéä]*){3,}\b
I have tried using both ASCII and Unicode alternatives for the accented chars, with no difference.

I am pretty new to regex, but I think I am asking for any words (including French ones) at least 3 characters long, which may contain a hyphen...

Oh, I also use the /i (case-insensitive) modifier...

Thanks

Seppo

Posted: Fri Feb 17, 2006 6:42 pm
by feyd
Your pattern matches, in plain English: the beginning of a "word", then one or more letters, é, or ä, possibly followed by a hyphen and then any number of letters, é, or ä, with that whole group repeated three or more times until an end of word is found. The problem is that é is not being treated as a word character, so \b finds a word boundary between the 'c' and the 'é' in 'fiancé'.

This may work better:

Code:

$p = '#\b([a-zéä]{3,}(?:-[a-zéä]*)*)#i';
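
To make the word-boundary explanation concrete, here is a minimal sketch that runs the original pattern twice, once as posted and once with the /u modifier added. It assumes the script is saved as UTF-8 and that PCRE was built with Unicode support; first_match() is a hypothetical helper, not something from this thread. Without /u the first line should reproduce the 'fianc' result described above, while with /u both words should be matched whole.

```php
<?php
// Sketch only: first_match() is a hypothetical helper, not part of the thread.
// Assumes this file is saved as UTF-8 and PCRE was built with Unicode support.

/** Return the first full match of $pattern in $subject, or null if none. */
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// The original pattern. Without /u, PCRE works on bytes, and by default \b
// only counts ASCII letters, digits and underscore as word characters.
$byteMode = '/\b([a-zéä]+\-?[a-zéä]*){3,}\b/i';

// The same pattern with /u, so pattern and subject are read as UTF-8 text.
$utf8Mode = '/\b([a-zéä]+\-?[a-zéä]*){3,}\b/iu';

foreach (['fiancé', 'idée'] as $word) {
    printf(
        "%s: byte mode => %s, /u mode => %s\n",
        $word,
        first_match($byteMode, $word) ?? '(none)',
        first_match($utf8Mode, $word) ?? '(none)'
    );
}
```

Whether /u is an option depends on the encoding of the input; on Latin-1 data it would reject the subject, so dropping the trailing \b as in the pattern above is the more portable fix.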

Posted: Fri Feb 17, 2006 7:06 pm
by stsr11
That worked great.

Thanks :)

Posted: Sat Feb 18, 2006 4:31 am
by raghavan20
But will it work for these?

a-éä
az-éä
a-é

Your pattern requires three characters before a hyphen can occur, but the original poster asked for the whole string to have a minimum length of three characters.

Untested:

Code:

'#\b([a-zéä]+(?:-[a-zéä]*)*){3,}\b#i';
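
Note that the trailing \b in this pattern would probably bring back the original 'fianc' problem when /u is not used. As an alternative, here is a rough sketch that checks the length of the whole word with a lookahead and replaces \b with explicit lookarounds. It assumes UTF-8 input, assumes the hyphen counts toward the three-character minimum, and find_words() is a hypothetical helper name.

```php
<?php
// Sketch only: find_words() is a hypothetical helper, not code from the thread.
// Assumes UTF-8 input and that a hyphen counts toward the minimum length.

/** Return every word of at least three characters found in $text. */
function find_words(string $text): array
{
    $pattern = '/(?<![a-zéä-])     # not preceded by a letter or hyphen
                 (?=[a-zéä-]{3,})   # at least three letters or hyphens ahead
                 [a-zéä]+           # a run of letters
                 (?:-[a-zéä]+)*     # optional hyphenated parts
                 (?![a-zéä-])       # not followed by a letter or hyphen
                /iux';

    // preg_match_all() returns false on error (for example invalid UTF-8).
    return preg_match_all($pattern, $text, $m) ? $m[0] : [];
}

print_r(find_words('a-éä az-éä a-é fiancé idée ab'));
```

Because the lookarounds name the allowed characters themselves, the result does not depend on what \b considers a word character. The /x modifier is only there to allow the inline comments.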