Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 3:50 am
by Hurreman
Hey everyone! Is it just me, or did the forums just get a facelift? Looking great :)

I'm having a bit of a headache ( :banghead: ) at the moment from trying to wrap my head around Unicode (UTF-8) instead of the old ISO-8859-1, and from using regular expressions to support a wide array of languages.

The project I'm working on is a World of Warcraft "fansite" ( a new community site for guilds ), where users will be able to make lists of their characters, join and organize guilds, and more.

My head starts to hurt when I think about all the different character names that will pop up. As far as I can tell, the only things prohibited in character names are special characters (!@'"\/& etc.), numbers and whitespace. After going through the crash course posted here, I thought I had come up with the solution by using:

"\^[\W\d\s]$\"
, to match any non-alphanumeric characters, numbers and whitespace, but without success. Also, I'm not sure how "\W" will treat Unicode characters like "é, ü, û", or names like "Meèn'ame" and so on.
Can anyone tell me what I'm doing wrong, and give me a few pointers to get me back on track?
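
A side note on why the pattern above is unlikely to flag a full name: with ^ and $ wrapped around a single character class and no quantifier, it can only match a string that is exactly one character long. The sketch below uses Python purely for illustration (the thread does not say which language or engine is in use), and it assumes the intended pattern was ^[\W\d\s]$ with the stray backslashes dropped. The names ANCHORED, UNANCHORED and has_forbidden are made up for the example.

```python
import re

# Assumed intent of the pattern in the post, without the stray backslashes.
ANCHORED = re.compile(r"^[\W\d\s]$")

# Same character class without anchors: finds a forbidden character anywhere.
UNANCHORED = re.compile(r"[\W\d\s]")


def has_forbidden(name: str) -> bool:
    """Return True if any character is a non-word character, digit, or whitespace."""
    return UNANCHORED.search(name) is not None


# The anchored form only matches one-character strings.
print(ANCHORED.search("Bad Name1"))          # None
print(ANCHORED.search("1") is not None)      # a lone digit does match

# The unanchored form reports the problem in a full name.
print(has_forbidden("Bad Name1"))
print(has_forbidden("Hurreman"))
```

Dropping the anchors turns the check into "does the name contain anything forbidden", which is what a blacklist needs.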

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 4:33 am
by VladSun
It's much easier to define the allowed char set:

/^[a-zA-Z_]+$/

or even better (min. 6 chars, max 20):

/^[a-zA-Z_]{6,20}$/

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 4:47 am
by Hurreman
I need to match characters like "éåäöüø" and so on, which [a-zA-Z] won't cover. I suppose I could add each allowed character, but that's a whole lot of characters, which means it'll probably be easier to match the non-allowed characters instead. I guess something like /["!#¤%&\/\\\-\+_~:;.,|<>*\d\s]/ could be a start, since I noticed that \W also matches åäö, etc..
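
The observation that \W also matches åäö fits an engine that is not running in a Unicode-aware mode, although the thread does not say which engine or settings are involved. As a rough illustration only, Python's re module can show both behaviours: the re.ASCII flag stands in for byte/ASCII-style matching, and the default mode for str patterns is Unicode-aware.

```python
import re

name = "\u00e5\u00e4\u00f6"  # the string "åäö", written with escapes to avoid encoding issues

# ASCII-style matching: \w is only [a-zA-Z0-9_], so \W hits every accented letter.
ascii_hits = re.findall(r"\W", name, flags=re.ASCII)

# Unicode-aware matching (Python's default for str patterns):
# accented letters count as word characters, so \W finds nothing.
unicode_hits = re.findall(r"\W", name)

print(ascii_hits)
print(unicode_hits)
```

So whether \W is usable for this job depends on how the engine is configured, which is one more argument for naming the allowed set explicitly.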

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 4:57 am
by VladSun
Try this:
/^\p{L}+$/
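
For anyone wanting to see what \p{L} covers: it is the Unicode "Letter" general category (Lu, Ll, Lt, Lm, Lo). Python's built-in re module does not support \p{...}, so the sketch below approximates /^\p{L}+$/ with unicodedata instead. The function name only_letters is made up for this example, and this is an approximation of the pattern's intent, not the exact engine behaviour.

```python
import unicodedata


def only_letters(name: str) -> bool:
    # Rough equivalent of ^\p{L}+$ : the name is non-empty and every
    # code point has a general category starting with "L" (a letter).
    return bool(name) and all(
        unicodedata.category(ch).startswith("L") for ch in name
    )


print(only_letters("Hurreman"))
print(only_letters("\u00e9\u00e5\u00e4\u00f6\u00fc\u00f8"))  # "éåäöüø"
print(only_letters("Me\u00e8n'ame"))  # contains an apostrophe
print(only_letters("Name1"))          # contains a digit
```

Accented letters pass, while the apostrophe and the digit are rejected, so a name like "Meèn'ame" needs the apostrophe allowed separately.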

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 6:23 am
by Hurreman
Ah yes... That looks like it could do the trick with some modification to allow the ' separator...

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 6:29 am
by Hurreman
I think /^[\p{L}\p{Po}]+$/ may do it!

Thanks for the help! :)
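
One thing worth checking with \p{Po}: it is the whole "Other Punctuation" category, not just the apostrophe. A quick look at the Unicode categories (again a Python sketch for illustration; the variable names are made up) suggests that most of the characters listed as prohibited in the first post fall into that same category, while the grave and acute accents do not.

```python
import unicodedata

# Apostrophe plus several characters the first post lists as prohibited.
candidates = "'!@&\"/\\"

for ch in candidates:
    print(repr(ch), unicodedata.category(ch))

# Characters from the sample that \p{Po} would let through.
allowed_by_po = [ch for ch in candidates if unicodedata.category(ch) == "Po"]

# The accent characters belong to a different category (Sk, modifier symbol).
accent_categories = [unicodedata.category(ch) for ch in "\u0060\u00b4"]
print(accent_categories)
```

In other words, a class like [\p{L}\p{Po}] admits !, @, & and friends along with the apostrophe, so listing the extra characters one by one is the tighter option.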

Re: Regexp for invalid characters, unicode

Posted: Fri Jan 18, 2008 2:28 pm
by Hurreman
The previous regexp didn't do the trick, but now I've got this.. /^[\p{L}^\x{0027}^\x{0060}^\x{00B4}]+$/.

Allowing only letters and apostrophe/accents. I think I'll settle with this for now until I get some people to test out the application.
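
A caveat on the pattern above: inside a character class, ^ only means "not" when it is the very first character. Anywhere else it is a literal caret, so the class as written also accepts ^ in names. The Python sketch below shows that behaviour on a small ASCII-only class, then gives an approximation of the intended rule (letters plus U+0027, U+0060 and U+00B4) using unicodedata, since Python's re has no \p{L}. EXTRA and valid_name are hypothetical names for this example.

```python
import re
import unicodedata

# A caret that is not first in the class is matched literally.
caret_is_literal = re.fullmatch(r"[a-z^']+", "a^b") is not None
print(caret_is_literal)

# Extra characters allowed besides letters:
# apostrophe (U+0027), grave accent (U+0060), acute accent (U+00B4).
EXTRA = {"\u0027", "\u0060", "\u00b4"}


def valid_name(name: str) -> bool:
    # Non-empty, and every code point is either a Unicode letter
    # or one of the explicitly allowed extra characters.
    return bool(name) and all(
        ch in EXTRA or unicodedata.category(ch).startswith("L") for ch in name
    )


print(valid_name("Me\u00e8n'ame"))  # "Meèn'ame"
print(valid_name("Me^name"))        # caret is rejected here
print(valid_name("Name 1"))         # whitespace and digit are rejected
```

Assuming the carets were not meant to be allowed, the regex form of the same rule would simply drop them from the class: /^[\p{L}\x{0027}\x{0060}\x{00B4}]+$/.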