Strange results when parsing accented chars

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

stsr11
Forum Newbie
Posts: 17
Joined: Thu Jul 15, 2004 6:57 pm

Post by stsr11 »

Can anyone tell me why the following matches 'fiancé' as 'fianc' (wrong) and 'idée' as 'idée' (correct) in PHP?

Bizarrely, it works perfectly in RegexBuddy!!!

\b([a-zéä]+\-?[a-zéä]*){3,}\b

I have tried using both ASCII and Unicode alternatives for the accented chars - no difference.

I am pretty new to regex, but I think I am asking for any words (incl. French), min 3 chars in length, which may contain a hyphen...

Oh, I also use the /i (case insensitive) modifier...
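
For anyone who wants to reproduce this, here is a minimal sketch. It assumes the script file and the input are UTF-8 and that PHP runs in its default "C" locale; `first_match()` is a hypothetical helper written for this example, not a PHP function.

```php
<?php
// Assumption: this file is saved as UTF-8 and PHP uses the default "C" locale.
// first_match() is a hypothetical helper for this example, not part of PHP.
function first_match(string $pattern, string $subject): ?string
{
    // Return the full match, or null when nothing matches.
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// The pattern from this post, wrapped in delimiters with the /i modifier.
$pattern = '/\b([a-zéä]+\-?[a-zéä]*){3,}\b/i';

foreach (['fiancé', 'idée'] as $word) {
    // Without /u, PCRE compares bytes, so 'é' is seen as 0xC3 0xA9.
    echo $word, ' => ', var_export(first_match($pattern, $word), true), "\n";
}
```

Under those assumptions this should show the same thing I am seeing: the trailing 'é' of 'fiancé' is dropped, while 'idée', which ends in a plain 'e', comes back whole.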

Thanks

Seppo
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Your pattern matches, in plain English: the beginning of a "word", then letters, é, or ä one or more times, possibly followed by a hyphen, then any number of letters, é, or ä; all of that repeated three or more times until an end of word is found. The problem is that é is not treated as a word character, so there is no word boundary after a trailing é and the match backs off to the last plain letter.

This may work better:

$p = '#\b([a-zéä]{3,}(?:-[a-zéä]*)*)#i';
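
To make the diagnosis concrete: without the `/u` modifier PCRE matches bytes, and in UTF-8 'é' is the two bytes 0xC3 0xA9, neither of which counts as a word character in the default locale. The sketch below assumes a UTF-8 script file and the default "C" locale; the variable names other than `$p` are only illustrative.

```php
<?php
// Assumption: UTF-8 source file, default "C" locale, no /u modifier.

// \b needs a word character on exactly one side. After a trailing 'é'
// there is a non-word byte on the left and end-of-string on the right.
$trailing = preg_match('/é\b/', 'fiancé');

// The suggested pattern has no closing \b, so the whole word is consumed.
$p = '#\b([a-zéä]{3,}(?:-[a-zéä]*)*)#i';
preg_match($p, 'fiancé', $whole);

// A leading 'é' still trips the opening \b: the match can only start
// at the first ASCII letter.
preg_match($p, 'éclair', $leading);

var_dump($trailing, $whole[0], $leading[0]);
```

So dropping the closing `\b` fixes words that end in an accented letter, but words that begin with one can still lose their first character. If the input really is UTF-8, adding the `/u` modifier is one option worth testing, since it makes PCRE read code points rather than bytes.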
stsr11
Forum Newbie
Posts: 17
Joined: Thu Jul 15, 2004 6:57 pm

Post by stsr11 »

That worked great.

Thanks :)
raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK

Post by raghavan20 »

But will it work for these?

a-éä
az-éä
a-é

Your pattern requires three characters before a hyphen can occur, but the original poster asked for a minimum length of three characters for the whole string.

Untested:

$p = '#\b([a-zéä]+(?:-[a-zéä]*)*){3,}\b#i';
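
For the three sample strings, another possible approach (not either of the patterns posted in this thread) is to check the total length with a lookahead and describe the hyphen structure separately. This is only a sketch: it assumes UTF-8 input, the `/u` modifier, and PCRE's `\p{L}` letter class in place of the explicit `[a-zéä]` list, and `$word`, `$samples` and `$results` are illustrative names.

```php
<?php
// Assumption: UTF-8 source file and input; /u makes PCRE work on code points.
// The /x modifier allows whitespace and comments inside the pattern.
$word = '/(?<![\p{L}-])      # not preceded by a letter or hyphen
         (?=[\p{L}-]{3})     # at least three characters in total
         \p{L}+(?:-\p{L}+)*  # letter runs joined by single hyphens
         (?![\p{L}-])        # not followed by a letter or hyphen
        /xu';

$samples = ['a-éä', 'az-éä', 'a-é', 'fiancé', 'ab'];
$results = [];
foreach ($samples as $s) {
    // Store the full match, or null when the string is rejected.
    $results[$s] = preg_match($word, $s, $m) === 1 ? $m[0] : null;
}
var_export($results);
```

The lookarounds on `\p{L}` stand in for `\b`, so the result does not depend on whether accented letters count as word characters. Under the stated assumptions the four longer samples should match in full and the two-letter 'ab' should be rejected.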