Unicode Property and Character Class
Posted: Sat May 24, 2008 11:01 pm
I am using the PHP preg_match function to validate Hebrew input to a Web form. Since the target match is a Hebrew word, the simplest regex would be:
preg_match("/^\p{Hebrew}+$/u",$var);
This is not always sufficient, because Hebrew words can include two characters that have the punctuation Unicode property. For example, "United States" is ארה״ב (the double-quote-like character is not matched by \p{Hebrew}).
I expected that I could build a kind of user-defined character class by combining a Unicode property with a bracketed character class:
preg_match("/^(\p{Hebrew}|[׳״])+$/u",$var);
The above construction, however, fails to match ארה״ב, although ארהב and ״ on their own both match.
Could anyone help me understand why it does not work?
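For comparison, a minimal sketch of the same idea written as a single character class. It assumes the two punctuation marks are HEBREW PUNCTUATION GERESH (U+05F3) and GERSHAYIM (U+05F4), and it spells them as \x{...} escapes so that the pattern does not depend on the encoding of the script file. The function name is_hebrew_word is hypothetical, and this is not a confirmed fix for the failure described above.

```php
<?php
// Hypothetical helper: true if $s consists only of Hebrew-script characters
// plus geresh (U+05F3) and gershayim (U+05F4).
// Assumption: $s is valid UTF-8; with the /u modifier preg_match returns
// false on invalid UTF-8, which this function also reports as "no match".
function is_hebrew_word(string $s): bool
{
    // The pattern is single-quoted so \p{...} and \x{...} reach PCRE unchanged.
    return preg_match('/^[\p{Hebrew}\x{05F3}\x{05F4}]+$/u', $s) === 1;
}

// Test strings are built from code-point escapes (PHP 7.0+) so it is
// unambiguous which quote-like character is being tested.
$withGershayim  = "\u{05D0}\u{05E8}\u{05D4}\u{05F4}\u{05D1}"; // alef resh he gershayim bet
$withAsciiQuote = "\u{05D0}\u{05E8}\u{05D4}\"\u{05D1}";       // same word with ASCII U+0022

var_dump(is_hebrew_word($withGershayim));  // bool(true)
var_dump(is_hebrew_word($withAsciiQuote)); // bool(false)
```

If the form actually submits the ASCII double quote (U+0022) rather than U+05F4, the second case shows the kind of mismatch that would result, so checking the raw bytes of $var with bin2hex() may be worthwhile.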