need help with decoding a html entity

Posted: Thu Feb 18, 2010 11:28 pm
by siko
I am trying to allow only English characters in my system, so I'm running something like this
(there is no need to inform the user that they have invalid characters)

Code:

 
function english_only ($value){
    return ereg_replace("[^a-zA-Z0-9\n\t\r\s!@#$%^&*()-_=+`~{}|\:;\"']", "", $value);
}
 
$name = english_only($_POST['name']);
 
//simple sql to store into DB
 
It serves its purpose to a certain extent, but some funny characters still get through. After some tests, I realised that some invalid characters in $_POST['name'] are coming through not as single characters, but as HTML codes such as &#65533;

So the $name string might be something like 'John &#65533; Smith', and the regex replace will not work since it reads each character separately. Hence I tried

Code:

 
$name = html_entity_decode(english_only($_POST['name']));
 
However, it seems that html_entity_decode is not converting the HTML code back to a character. I experimented with

Code:

 
$string = "&amp;";
echo strlen($string); // prints 5
$string = html_entity_decode($string);
echo strlen($string); // prints 1
 
That works, but...

Code:

 
$string = "&#65533;";
echo strlen($string); // prints 8
$string = html_entity_decode($string);
echo strlen($string); // prints 8
 
fails, even though there is indeed such a character.

Does anyone know why this is happening?

Thanks!

Re: need help with decoding a html entity

Posted: Thu Feb 18, 2010 11:46 pm
by josh
Doesn't look like an HTML entity. Looks like a hexadecimal color prepended to an ampersand.

Re: need help with decoding a html entity

Posted: Thu Feb 18, 2010 11:47 pm
by requinix
josh wrote:Doesn't look like an HTML entity. Looks like a hexadecimal color prepended to an ampersand.
Congratulations on not knowing HTML entities.


Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
htmlentities. Use that and you won't have any entity problems.

...So what do you have against "foreign" languages?

Re: need help with decoding a html entity

Posted: Fri Feb 19, 2010 1:10 am
by siko
Tasairis,

To put it simply,

1. I am trying to remove non-English characters from the user's input before storing it in the database. My site/database is running iso-8859-1/latin1, and correct me if I am wrong, but I foresee funny characters appearing if a user types in Asian languages such as Chinese/Japanese/Korean etc.

2. When the user submits the form and the $_POST data reaches the PHP page, the $_POST variable holds the character in HTML code form and not as a single character. For example, $_POST['name'] does not store ?, but rather the string &#65533;

I am using a regular expression to remove these characters. If a character is expressed as a string of HTML code, the regex replace fails because it reads each character separately, and every individual character in '&#65533;' is valid in my regex.

Following is my regex code:

Code:

 
function english_only ($value){
     return ereg_replace("[^a-zA-Z0-9\n\t\r\s!@#$%^&*()-_=+`~{}|\:;\"']", "", $value);
}
 
$name = english_only($_POST['name']);
 
//simple sql to store into DB
 
3. My problem is, I cannot convert the html code back to single character form.

htmlentities actually takes the complex character and codes it into html code string, not the other way round. So if I do htmlentities('�'); I believe I will get something like '&#65533', further expanding the string.

Any thoughts?

Re: need help with decoding a html entity

Posted: Fri Feb 19, 2010 1:17 am
by requinix
Okay. One at a time.
siko wrote:I am trying to remove non-English characters from the user's input, before storing them into database. My site/database is running iso-8859-1/latin1, and correct me if I am wrong, I foresee funny characters appearing if user types in Asian languages such as Chinese/Japanese/Korean etc.
So how about this crazy idea: instead of censoring input, why not do something so that you don't see "funny characters"? It's text encoding. Pick a good encoding - UTF-8 is awesome - and make sure everything uses that. If you do it right everybody's input will display properly.
siko wrote:When user submits the form and the $_POST data comes to the php page, the $_POST variable has a character that is stored in html code form and not in a single character. For example, the $_POST['name'] does not store �, but rather a string of &#65533;
Yeah... And the only way that'll be a problem is if you directly output the data. Which is a bad idea, regardless of pretty much everything. If you run the output through htmlentities first then you won't see the entities rendered as characters: you'll literally see "&#65533;".
siko wrote:My problem is, I cannot convert this code back to single character form.

htmlentities actually takes the complex character and codes it into html code string, not the other way round. So if I do htmlentities('�'); I believe I will get something like '&#65533', further expanding the string.
Yeah... So when you output that you'll get the entity as a string, not as the character it represents.
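
A minimal sketch of the output escaping described in this post, assuming PHP 7 or later and a page served as UTF-8. The helper name e() and the sample value are hypothetical and only there for illustration; htmlentities() is the function named in the post.

```php
<?php
// Hypothetical helper: escape a stored value at output time only.
function e(string $text): string
{
    // ENT_QUOTES also escapes single quotes; 'UTF-8' should match the page charset.
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// Sample value standing in for $_POST['name'] (an assumption for this sketch).
$name = 'John &#65533; <b>Smith</b>';

// The browser then shows the text exactly as typed, entity string included.
echo '<p>' . e($name) . '</p>', "\n";
// prints <p>John &amp;#65533; &lt;b&gt;Smith&lt;/b&gt;</p>
```

The idea is to store the input untouched and escape it only where it is written into HTML, so the markup in the sample value never reaches the browser as markup.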

Re: need help with decoding a html entity

Posted: Fri Feb 19, 2010 3:53 am
by josh
tasairis wrote:Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
Congratulations on missing the point of the problem :crazy:

Here's the solution:
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.
- http://us2.php.net/html_entity_decode Matt Robinson
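
A small sketch of what the quoted note means in practice, using the entity from this thread. It assumes nothing beyond html_entity_decode() itself. The lengths are byte counts from strlen(), so the decoded UTF-8 character counts as 3 bytes rather than 1.

```php
<?php
$entity = '&#65533;'; // 8 bytes as typed

// ISO-8859-1 cannot represent U+FFFD, so the entity is left untouched.
$latin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');
echo strlen($latin1), "\n"; // 8

// UTF-8 can represent it, so it decodes to one character (3 bytes).
$utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
echo strlen($utf8), "\n"; // 3
```

Before PHP 5.4 the default third argument was ISO-8859-1, which is consistent with the unchanged length of 8 reported in the first post.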

Re: need help with decoding a html entity

Posted: Fri Feb 19, 2010 4:29 am
by requinix
josh wrote:
tasairis wrote:Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
Congratulations on missing the point of the problem :crazy:
Oh no, I understood. But using html_entity_decode is not a solution: it's a workaround to a larger problem of text encoding and security hazards.
My apologies for caring.

Re: need help with decoding a html entity

Posted: Fri Feb 19, 2010 7:15 am
by josh
tasairis wrote: But using html_entity_decode is not a solution.
Yes it is. He wants to decode the entity to its UTF-8 symbol so his regex will filter the input.
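
A rough sketch of the decode-then-filter order described here, assuming PHP 7 or later. The function name ascii_only() and its whitelist (tab, newline, carriage return and printable ASCII) are assumptions for illustration, not siko's exact character class. preg_replace() is swapped in because ereg_replace() was removed in PHP 7.0.

```php
<?php
// Hypothetical stand-in for english_only(); the whitelist is an assumption.
function ascii_only(string $value): string
{
    // Decode entities first so "&#65533;" becomes a single multi-byte character.
    $decoded = html_entity_decode($value, ENT_QUOTES, 'UTF-8');

    // Without the "u" modifier the pattern works byte by byte, so every byte
    // outside tab, newline, carriage return and printable ASCII is removed.
    return preg_replace('/[^\t\n\r\x20-\x7E]/', '', $decoded);
}

echo ascii_only('John &#65533; Smith'), "\n"; // prints "John  Smith" (two spaces)
```

Note that in the original character class the sequence `)-_` is read as a range from `)` to `_`, so it admits more characters than the ones spelled out. A literal hyphen would need to sit at the start or end of the class, or be escaped.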

Re: need help with decoding a html entity

Posted: Sat Feb 20, 2010 2:15 am
by siko
Ahh, you were a great help.

I have changed the charset to UTF-8 on my pages and to utf8_general_ci for my database variables, and removed the input censoring. You are right, I think in the long run it will be better for the site.

And

Code:

html_entity_decode($string, ENT_QUOTES, 'UTF-8');
is working like a charm too.

Thanks both of you for your replies! :D