Regexp for invalid characters, unicode

Hurreman
Forum Commoner
Posts: 61
Joined: Sat Apr 29, 2006 8:42 am

Regexp for invalid characters, unicode

Post by Hurreman »

Hey everyone! Is it just me, or did the forums just get a facelift? Looking great :)

I'm having a bit of a headache ( :banghead: ) at the moment from trying to wrap my head around Unicode (UTF-8) instead of the old ISO-8859-1, and from using regular expressions to support a wide array of languages.

The project I'm working on is a World of Warcraft "fansite" ( a new community site for guilds ), where users will be able to make lists of their characters, join and organize guilds, and more.

My head starts to hurt when I think about all the different character names that will pop up. As far as I can tell, the only things prohibited in character names are special characters (!@'"\/& etc.), numbers and whitespace. After going through the crash course posted here, I thought I had come up with the solution by using:

"\^[\W\d\s]$\"

to match any non-alphanumeric characters, numbers and whitespace, but without success. Also, I'm not sure how "\W" will treat Unicode characters like "é, ü, û", or names like "Meèn'ame" and so on.
Can anyone tell me what I'm doing wrong, and give me a few pointers to get me back on track?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Regexp for invalid characters, unicode

Post by VladSun »

It's much easier to define the allowed char set:

/^[a-zA-Z_]+$/

or even better (min. 6 chars, max 20):

/^[a-zA-Z_]{6,20}$/
There are 10 types of people in this world, those who understand binary and those who don't
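
For reference, a minimal Python sketch of the whitelist approach above. The thread does not name a language, so Python and the helper name `is_valid_ascii_name` are assumptions made for illustration; only the pattern comes from the post.

```python
import re

# Pattern taken from the post above: ASCII letters and underscore, 6 to 20 chars.
ASCII_NAME = re.compile(r"[a-zA-Z_]{6,20}")


def is_valid_ascii_name(name: str) -> bool:
    """Return True if the whole name matches the ASCII whitelist."""
    # fullmatch anchors both ends, playing the role of ^ and $ in the post.
    return ASCII_NAME.fullmatch(name) is not None


print(is_valid_ascii_name("Hurreman"))   # True
print(is_valid_ascii_name("Vlad"))       # False (shorter than 6 characters)
print(is_valid_ascii_name("Meèn'ame"))   # False (è and ' are outside the set)
```

The last call shows the limit of this whitelist: any accented letter is rejected.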
Hurreman
Forum Commoner
Posts: 61
Joined: Sat Apr 29, 2006 8:42 am

Re: Regexp for invalid characters, unicode

Post by Hurreman »

I need to match characters like "éåäöüø" and so on, which [a-zA-Z] won't cover. I suppose I could add each allowed character, but that's a whole lot of characters, which means it'll probably be easier to match the non-allowed characters instead. I guess something like /["!#¤%&\/\\\-\+_~:;.,|<>*\d\s]/ could be a start, since I noticed that \W also matches åäö, etc.
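
A rough Python sketch of this blacklist idea, to show how it behaves. Python and the name `has_forbidden_char` are assumptions for illustration; the character class is the one from the post, rewritten in Python's `re` syntax.

```python
import re

# Same blacklist as in the post: listed punctuation, digits and whitespace.
FORBIDDEN = re.compile(r'["!#¤%&/\\\-+_~:;.,|<>*\d\s]')


def has_forbidden_char(name: str) -> bool:
    """Return True if any blacklisted character occurs anywhere in the name."""
    return FORBIDDEN.search(name) is not None


print(has_forbidden_char("Léa"))       # False
print(has_forbidden_char("Thrall 2"))  # True (space and digit)
print(has_forbidden_char("Gul@dan"))   # False (@ is not in the list)
```

The last call illustrates the weakness of blacklisting: any character that was not thought of in advance slips through.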
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Regexp for invalid characters, unicode

Post by VladSun »

Try this:
/^\p{L}+$/
There are 10 types of people in this world, those who understand binary and those who don't
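
\p{L} matches any character whose Unicode general category is a letter. Python's standard `re` module does not support \p{...}, so the sketch below swaps in `str.isalpha()`, which tests the same set of categories (Lu, Ll, Lt, Lm, Lo). Python and the helper name `is_all_letters` are assumptions for illustration, not something from the thread.

```python
def is_all_letters(name: str) -> bool:
    """Rough equivalent of /^\\p{L}+$/ using str.isalpha()."""
    # isalpha() is False for the empty string, which mirrors the + quantifier,
    # and True only if every character is in a Unicode letter category.
    return name.isalpha()


print(is_all_letters("éåäöüø"))    # True
print(is_all_letters("Meèn'ame"))  # False (the apostrophe is not a letter)
print(is_all_letters("Thrall2"))   # False (digit)
```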
Hurreman
Forum Commoner
Posts: 61
Joined: Sat Apr 29, 2006 8:42 am

Re: Regexp for invalid characters, unicode

Post by Hurreman »

Ah yes... That looks like it could do the trick with some modification to allow the ' separator...
Hurreman
Forum Commoner
Posts: 61
Joined: Sat Apr 29, 2006 8:42 am

Re: Regexp for invalid characters, unicode

Post by Hurreman »

I think /^[\p{L}\p{Po}]+$/ may do it!

Thanks for the help! :)
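
A note on \p{Po}: the "other punctuation" category is much wider than the apostrophe alone. A small Python sketch (Python is an assumption here, the thread names no language) that prints the general category of a few characters:

```python
import unicodedata

# Look up the Unicode general category of some candidate characters.
SAMPLE = "'!\"#&/\\`\u00b4"
categories = {ch: unicodedata.category(ch) for ch in SAMPLE}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The apostrophe is reported as Po, but so are `!`, `"`, `#`, `&`, `/` and `\`, so a class containing \p{Po} lets all of those through. The grave accent (U+0060) and acute accent (U+00B4) are reported as Sk, so \p{Po} does not cover them.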
Hurreman
Forum Commoner
Posts: 61
Joined: Sat Apr 29, 2006 8:42 am

Re: Regexp for invalid characters, unicode

Post by Hurreman »

The previous regexp didn't do the trick, but now I've got this: /^[\p{L}^\x{0027}^\x{0060}^\x{00B4}]+$/

It allows only letters and apostrophes/accents. I think I'll settle on this for now until I get some people to test out the application.
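
One thing to watch in that pattern: in PCRE, a `^` inside a character class negates only when it is the first character; anywhere else it is a literal caret. As posted, the class therefore also accepts `^` in names. Without the extra carets it would read /^[\p{L}\x{0027}\x{0060}\x{00B4}]+$/ (if this is PHP's preg functions, which the thread does not state, the `u` modifier is also needed so the subject is treated as UTF-8). A minimal Python sketch of the intended rule, with Python and the names `EXTRA_ALLOWED` and `is_valid_name` being assumptions for illustration:

```python
# U+0027 apostrophe, U+0060 grave accent, U+00B4 acute accent.
EXTRA_ALLOWED = {"\u0027", "\u0060", "\u00b4"}


def is_valid_name(name: str) -> bool:
    """Accept names made only of Unicode letters and the three extra characters."""
    if not name:
        # Reject the empty string, mirroring the + quantifier.
        return False
    # For a single character, isalpha() is True exactly for letter categories.
    return all(ch.isalpha() or ch in EXTRA_ALLOWED for ch in name)


print(is_valid_name("Meèn'ame"))  # True
print(is_valid_name("Me^name"))   # False (caret is not allowed here)
print(is_valid_name("Thrall 2"))  # False (space and digit)
```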