Unicode Property and Character Class

veleshanas
Forum Newbie
Posts: 10
Joined: Sat May 24, 2008 10:58 pm

Post by veleshanas »

I am using the PHP preg_match function to validate Hebrew input to a Web form. Since the target match is a Hebrew word, the simplest regex would be:

preg_match("/^\p{Hebrew}+$/u",$var);

This is not always sufficient, because Hebrew words can include two characters that have the punctuation Unicode property. For example, "United States" is written ארה״ב (the doublequote-like character does not have the \p{Hebrew} property).

I expected that I could make a kind of user-defined character class by combining a Unicode property with a character class.

preg_match("/^(\p{Hebrew}|[׳״])+$/u",$var);

The above construction, however, fails to match ארה״ב, while ארהב and ״ on their own both match.

Could anyone help me understand why it does not work?
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Unicode Property and Character Class

Post by GeertDD »

You are telling the regex to either match one or more characters in the Hebrew Unicode property, or one or more punctuation marks.

Combine the Unicode property and the two punctuation characters in a single character class and things should be fixed.

/^[\p{Hebrew}׳״]+$/u
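
A minimal PHP sketch of the combined character class above, for anyone who wants to try it. The function name is_hebrew_word is hypothetical, and the geresh and gershayim are written as \x{05F3} and \x{05F4} escapes (an assumption made here so that look-alike characters cannot slip into the pattern); the literal characters should behave the same way. It assumes PHP 7 or later for the "\u{...}" string escapes and type declarations.

```php
<?php
// Hypothetical helper, not from the thread: returns true when $word consists
// only of Hebrew-script characters plus geresh (U+05F3) or gershayim (U+05F4).
function is_hebrew_word(string $word): bool
{
    // One character class holds both the Unicode property and the two
    // punctuation marks; the /u modifier makes PCRE treat the input as UTF-8.
    return preg_match('/^[\p{Hebrew}\x{05F3}\x{05F4}]+$/u', $word) === 1;
}

// Test words spelled out by code point to avoid look-alike characters.
$usa        = "\u{05D0}\u{05E8}\u{05D4}\u{05F4}\u{05D1}"; // abbreviation with gershayim
$plain      = "\u{05D0}\u{05E8}\u{05D4}\u{05D1}";         // same letters, no punctuation
$asciiQuote = "\u{05D0}\u{05E8}\u{05D4}\"\u{05D1}";       // ASCII " in place of U+05F4

var_dump(is_hebrew_word($usa));
var_dump(is_hebrew_word($plain));
var_dump(is_hebrew_word($asciiQuote));
```

The first two calls should report true and the third false. The third case is worth checking whenever a pattern like this rejects input unexpectedly: an ASCII double quote typed in place of the gershayim is a different code point and is not covered by the class.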
veleshanas
Forum Newbie
Posts: 10
Joined: Sat May 24, 2008 10:58 pm

Re: Unicode Property and Character Class

Post by veleshanas »

GeertDD wrote: Combine both properties in a character class and things should be fixed.

/^[\p{Hebrew}׳״]+$/u

Hello GeertDD,
I didn't know that I could write a Unicode property within a character class. Good to know that not everything within a [ ] is literal. :banghead: <-- I was like this, wasn't I?!