UTF8 - Foreign characters not matching

rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

I am using preg_match to filter user input going to my database. Some users will enter Chinese, Vietnamese, or Japanese characters. Everything is UTF-8, so I figured the u modifier would be the ticket, but even with it the pattern does not match any of the foreign characters I have tried. Here is my regex:

Code:

/^[a-zA-Z0-9\s\;\.\-\,\']*$/u

I'm guessing that this is a pretty common issue, but believe me, I've been hunting for a solution, so any thoughts are appreciated.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Post by Apollo »

Exactly what are you trying to accomplish? Obviously a string with e.g. Japanese characters will not match this expression, because a Japanese char is not in the A-Z range.
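
To make the point above concrete: the u modifier only tells PCRE to treat the pattern and the subject as UTF-8. It does not widen the a-zA-Z class. Here is a minimal sketch. The sample names and the alternative pattern built on Unicode properties (\p{L}, \p{M}, \p{N}) are illustrative assumptions, not something posted in this thread.

```php
<?php
// The original pattern from the first post: ASCII letters, digits and a few symbols.
$asciiOnly = '/^[a-zA-Z0-9\s\;\.\-\,\']*$/u';

// Hypothetical alternative: \p{L} = any letter, \p{M} = combining marks, \p{N} = any digit.
$anyScript = '/^[\p{L}\p{M}\p{N}\s;.,\'\-]*$/u';

// Sample inputs chosen for illustration only.
$samples = ['John Smith', '山田太郎', 'Nguyễn Văn An'];

foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    printf("%s: ascii=%d any=%d\n", $s, preg_match($asciiOnly, $s), preg_match($anyScript, $s));
}
```

Only the first sample passes the original pattern, while all three pass the property-based one. Whether such a broad class is an acceptable filter is a separate question.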
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

Apollo wrote:...Japanese is not in the A-Z range.
Good point! But I don't understand how to define a range that will encompass Japanese, Chinese, Latin and Vietnamese.
Apollo wrote:Exactly what are you trying to accomplish?
On my forms, which are course application forms, applicants may enter Chinese, Vietnamese or Latin characters in the name field, for instance. The forms themselves are in multiple languages.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Post by Apollo »

You are getting yourself into a wasps' nest of ambiguities if you want to strictly allow "normal characters" only, in several (especially non-western) languages.

You consider a-z to be Latin. How about á and ã? Or ß (used in German), ĳ (one character, used in Dutch), or Ð, œ, æ, or the Icelandic þ?
In Spanish, questions and exclamations start with an upside-down question mark and exclamation mark: ¿ ¡. Do you want to allow them?

Then there are Eastern-European chars like Ť, Ł, Ŗ, ġ, ǆ (one char!), ő (as opposed to ö), ṧ, ǯ, ą, ĕ, ā, ṉ, which are essentially normal Latin chars (A-Z) but with non-western diacritical marks.

Vietnamese is even more fun, consisting of Latin characters with bizarre combinations of multiple diacritical marks: ằ ề ự ữ etc. These characters may also be built from several Unicode "combining diacritical marks", which are not standalone characters but code points that add a mark to the preceding character.
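
A small sketch of the combining-mark issue, assuming PHP 7 or later for the \u{...} escapes. The variable names and both patterns are made up for illustration. The same visible letter can arrive as one code point or as a base letter plus marks, and a letters-only class accepts only the first form.

```php
<?php
// Two encodings of the Vietnamese letter "ằ" (a with breve and grave).
$precomposed = "\u{1EB1}";           // a single code point
$decomposed  = "a\u{0306}\u{0300}";  // base letter + combining breve + combining grave

// Letters only: rejects the combining marks in the decomposed form.
$lettersOnly = '/^\p{L}+$/u';
// Letters plus combining marks (\p{M}): accepts both forms.
$lettersAndMarks = '/^[\p{L}\p{M}]+$/u';

var_dump(preg_match($lettersOnly, $precomposed));     // int(1)
var_dump(preg_match($lettersOnly, $decomposed));      // int(0)
var_dump(preg_match($lettersAndMarks, $decomposed));  // int(1)
var_dump($precomposed === $decomposed);               // bool(false): the bytes differ
```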

And this is just Latin. You don't even want to begin listing the pitfalls in Cyrillic, Hindi, Chinese, Japanese (with its three different writing systems), Right-to-Left alphabets like Hebrew, Farsi, or Arabic (including "half chars" for the beginning/end of words/sentences) etc.

There's a good reason you didn't find a regex or function like this out of the box :)
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

Apollo wrote:There's a good reason you didn't find a regex or function like this out of the box
Yes, but I'm sure this is a problem many other developers have faced.

Fact is, I have to build a form that allows for Chinese, Vietnamese and English input, and I have to filter it.

I appreciate the fact that you have pointed out the complexity, but one way or another I need a solution. I don't have any idea how to write a regex that allows for Chinese and Vietnamese.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Post by Apollo »

rhecker wrote:Yes, but I'm sure this is a problem many other developers have faced.
Sure, but due to the nature of all these ambiguous language differences and exceptions, by definition there simply won't be a satisfactory, one-size-fits-all solution for this.
rhecker wrote:Fact is, I have to build a form that allows for Chinese, Vietnamese and English input, and I have to filter it.
How would you like to distinguish English and Vietnamese from, for example, Italian? Or Welsh (Cymraeg), or Afrikaans?
rhecker wrote:I don't have any idea how to write a regex that allows for chinese and vietnamese.
You'll probably be able to find the character (code point) ranges for these languages on unicode.org. You can make a regexp allowing only characters from those ranges, plus some basic punctuation symbols, but you will:

- not be able to effectively block other languages using Latin characters (such as French, Spanish, Finnish, Esperanto, Turkish, Ido, etc)

- certainly confuse, frustrate, and/or piss off visitors who aren't able to enter normal text on your site (even in languages that you do intend to allow), for no obvious reason (the underlying reason being you forgot exception no.3174 for their particular language)

Similar issues for Asian languages. For example, Japanese uses 3 character sets, one of which is in fact Chinese. Considering this, you see how pointless it is to insist on allowing Chinese but blocking Japanese?
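
The overlap between Chinese and Japanese can be seen with PCRE's Unicode script properties, which are one way to express "characters from those ranges" without copying code point tables by hand. This is only a sketch, and the sample strings are assumptions picked for illustration.

```php
<?php
// Unicode script properties, available in PCRE when the u modifier is set.
$han      = '/^\p{Han}+$/u';       // Chinese characters, also used as kanji in Japanese
$hiragana = '/^\p{Hiragana}+$/u';  // a syllabary used only for Japanese

$chineseName  = '王小明';    // a Chinese name
$japaneseName = '田中';      // a Japanese surname written entirely in kanji
$japaneseWord = 'ひらがな';  // a Japanese word written in hiragana

var_dump(preg_match($han, $chineseName));        // int(1)
var_dump(preg_match($han, $japaneseName));       // int(1): same script as the Chinese name
var_dump(preg_match($han, $japaneseWord));       // int(0)
var_dump(preg_match($hiragana, $japaneseWord));  // int(1)
```

A pattern that allows \p{Han} therefore admits a good deal of Japanese as well, which is the point made above about not being able to separate the two by character set.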
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

I'm not trying to block any particular language. I just want to make sure I allow the languages I am most likely to get input from while still having a viable filter.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Post by Eran »

Maybe it would be better if you explained what you are trying to filter out using this method.
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

I'm just trying to filter out the possibility of SQL injection and other threats to the website and database. Input also goes through mysql_real_escape_string.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Post by Apollo »

When input strings are used in SQL queries, then mysql_real_escape_string is all you need. When you're using them in HTML (i.e. if you print/echo them from your PHP, optionally after storing and retrieving them from a database), then htmlspecialchars is all you need.

Additionally, if your server uses magic quotes (which it shouldn't, magic quotes is really outdated), apply stripslashes on any input string first.
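
As a sketch of the output side, htmlspecialchars leaves non-Latin text untouched and only neutralizes the characters that are special in HTML. The sample input is hypothetical, and passing ENT_QUOTES and 'UTF-8' explicitly is a choice made here, not something specified in the thread.

```php
<?php
// Hypothetical user input: a Vietnamese name followed by an attempted script injection.
$input = 'Nguyễn <script>alert("x")</script>';

// ENT_QUOTES also escapes single quotes; the charset is stated explicitly as UTF-8.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// Nguyễn &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```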
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Post by Eran »

As Apollo said, you are using the wrong approach to prevent SQL injection. I'd suggest you read up on it and apply the proper solutions: using mysql_real_escape_string() and proper quoting.
http://www.webappsec.org/projects/articles/091007.shtml
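
For readers on current PHP: mysql_real_escape_string() belongs to the old mysql extension, which was removed in PHP 7.0. The sketch below swaps in a different technique, PDO prepared statements, which is not what the posters here used. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database as a stand-in for MySQL. The table name and sample inputs are made up.

```php
<?php
// Assumption: pdo_sqlite is installed; an in-memory database stands in for MySQL.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE applicants (id INTEGER PRIMARY KEY, name TEXT)');

// Hypothetical inputs: a Chinese name and a classic injection attempt.
$names = ['王小明', "x'); DROP TABLE applicants; --"];

// The ? placeholder keeps each value out of the SQL text entirely.
$insert = $pdo->prepare('INSERT INTO applicants (name) VALUES (?)');
foreach ($names as $name) {
    $insert->execute([$name]);
}

// Both values are stored verbatim and the table still exists.
$stored = $pdo->query('SELECT name FROM applicants ORDER BY id')->fetchAll(PDO::FETCH_COLUMN);
print_r($stored);
```

With bound parameters no character whitelist is needed for SQL safety, so names in any script can be stored as entered.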
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

Apollo and Pytrin, thanks. Your comments are useful.

I have read in several places that one should filter all user input, so that's why I'm trying to do this in addition to using mysql_real_escape_string. Also, I get a lot of attempts to hack my database through the forms. I'd rather that those bad records never even got into the database.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Post by Eran »

Escaping and quoting are used to protect against SQL injection. Filtering is used to protect against XSS attacks and other client-side vulnerabilities. Be clear about what you are trying to achieve. If you want protection against XSS attacks, look into HTML Purifier, which does just that. Or, if you have no interest in allowing HTML in user input, use strip_tags() without the optional allowed-tags argument to remove all HTML tags from the input.

http://htmlpurifier.org/
http://php.net/manual/en/function.strip-tags.php
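
A quick sketch of what strip_tags() does to mixed input, using a hypothetical sample string. Note that it removes the tags themselves but keeps the text between them, including the text inside a script element.

```php
<?php
// Hypothetical input mixing a Vietnamese name with markup.
$input = '<b>Nguyễn</b> Văn <script>alert(1)</script>';

// Called without the allowed-tags argument, every tag is removed.
$clean = strip_tags($input);

echo $clean, "\n"; // Nguyễn Văn alert(1)
```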
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Post by Apollo »

rhecker wrote:I'd rather that those bad records never even got into the database.
Well perhaps you could use something very simplistic like

Code:

preg_match('/((select|delete).+from|update.+set|(alter|truncate|drop).+table|<[a-z])/i',$input)
although it's not 100% safe (it is prone to both false negatives and false positives).

Of course, this is just to keep the most obvious crap from entering your database at all. For actual security measures, stick to our comments above.

Oh and Pytrin, thanks for the htmlpurifier link, never heard of it before and it seems very useful!
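
To show what the false positives and false negatives mentioned above look like, here is a sketch that runs the same pattern over a few hypothetical inputs. The sample strings are assumptions chosen for illustration.

```php
<?php
// The keyword pattern from this post, unchanged.
$pattern = '/((select|delete).+from|update.+set|(alter|truncate|drop).+table|<[a-z])/i';

$samples = [
    'DROP TABLE applicants',                 // caught, as intended
    'Please select a course from the list',  // false positive: ordinary English
    "' OR '1'='1",                           // false negative: injection without those keywords
    '山田太郎',                               // passes: no keyword and no tag
];

foreach ($samples as $s) {
    // 1 means the input would be rejected, 0 means it would be let through.
    echo preg_match($pattern, $s), ' ', $s, "\n";
}
```

Here preg_match returns 1 for the first two samples and 0 for the last two, so a legitimate sentence is rejected while a simple injection string is let through.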
rhecker
Forum Contributor
Posts: 178
Joined: Fri Jul 11, 2008 5:49 pm

Post by rhecker »

Thanks for your script, Apollo. My knowledge of regex is too simple to understand it. When I try to run it, it returns zeroes no matter what I feed it. Here's the little script I used to test it:

Code:

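<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
input: <input type="text" name="input" size="50"><br/>
<input value="submit" name="submit" type="submit"/>
</form>
<?php
// Run the tests only after the form has been submitted.
if (isset($_POST['submit'])) {
    $input = $_POST['input'];
    // 1 if the input looks like SQL or an HTML tag, 0 otherwise.
    $input2 = preg_match('/((select|delete).+from|update.+set|(alter|truncate|drop).+table|<[a-z])/i', $input);
    // 1 if the input contains only the listed ASCII characters, 0 otherwise.
    $input3 = preg_match('/^[a-zA-Z0-9\s\;\.\-\,\']*$/u', $input);
    // Escape the raw input before echoing it back into the page.
    echo "Unfiltered: " . htmlspecialchars($input) . "  <br/>";
    echo "Apollo's script: $input2 <br/>";
    echo "Original script: $input3 <br/>";
}
?>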