
Unicode Property and Character Class

Posted: Sat May 24, 2008 11:01 pm
by veleshanas
I am using the PHP preg_match function to validate Hebrew input to a web form. Since the target match is a Hebrew word, the simplest regex would be:

preg_match("/^\p{Hebrew}+$/u",$var);

This is not always sufficient, because Hebrew words can include two characters that have the punctuation Unicode property: ׳ (geresh) and ״ (gershayim). For example, "United States" is ארה״ב (the double-quote-like character does not have the \p{Hebrew} property).

I expected that I could make a kind of user-defined character class by combining a Unicode property with a character class.

preg_match("/^(\p{Hebrew}|[׳״])+$/u",$var);

The above pattern, however, fails to match ארה״ב, while ארהב and ״ on its own both match.

Could anyone help me understand why it does not work?

Re: Unicode Property and Character Class

Posted: Sun May 25, 2008 12:21 am
by GeertDD
You are telling the regex to either match one or more characters in the Hebrew Unicode property, or one or more punctuation marks.

Combine both properties in a character class and things should be fixed.

/^[\p{Hebrew}׳״]+$/u
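A minimal sketch of how that character class could be used with preg_match, assuming PHP 7 or later, a PCRE build with Unicode support, and a UTF-8 encoded source file. The function name isHebrewWord is made up for illustration, and the escapes \x{05F3} and \x{05F4} stand in for the literal ׳ and ״ so the pattern itself does not depend on how the file is saved.

```php
<?php
// Hypothetical helper (not from the thread): returns true when $input consists
// only of Hebrew-script characters, geresh (U+05F3) and gershayim (U+05F4).
function isHebrewWord(string $input): bool
{
    // The Unicode property and the two punctuation marks sit in ONE character
    // class, so "+" repeats over any mix of them. The "u" modifier makes PCRE
    // treat both pattern and subject as UTF-8.
    return preg_match('/^[\p{Hebrew}\x{05F3}\x{05F4}]+$/u', $input) === 1;
}

var_dump(isHebrewWord('ארה״ב')); // bool(true)
var_dump(isHebrewWord('ארהב'));  // bool(true)
var_dump(isHebrewWord('USA'));   // bool(false)
```

Comparing the return value with === 1 also treats a preg_match error (which returns false, for example on invalid UTF-8 input) as a failed validation.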

Re: Unicode Property and Character Class

Posted: Sun May 25, 2008 6:42 am
by veleshanas
GeertDD wrote: Combine both properties in a character class and things should be fixed.

/^[\p{Hebrew}׳״]+$/u
Hello GeertDD,
I didn't know that I could write a Unicode property inside a character class. Good to know that not everything within [ ] is literal. :banghead: <-- I was like this, wasn't I?!