need help with decoding a html entity

siko
Forum Commoner
Posts: 37
Joined: Tue Feb 16, 2010 11:28 pm

need help with decoding a html entity

Post by siko »

I am trying to allow only English characters in my system, so I'm running something like this (there is no need to inform users that their input contains invalid characters):

Code: Select all

 
function english_only ($value){
    return ereg_replace("[^a-zA-Z0-9\n\t\r\s!@#$%^&*()-_=+`~{}|\:;\"']", "", $value);
}
 
$name = english_only($_POST['name']);
 
//simple sql to store into DB
 
It serves its purpose to a certain extent, but some funny characters still get through. After some tests, I realised that some invalid characters in $_POST['name'] are coming through not as a single character, but as HTML codes such as &#65533;

So the $name string might be something like 'John &#65533; Smith', and the regex replace will not work since it reads each character separately. Hence I tried

Code: Select all

 
$name = html_entity_decode(english_only($_POST['name']));
 
However it seems that html_entity_decode is not converting the html code back to a character. I experimented with

Code: Select all

 
$string = "&amp;";
echo strlen($string); // prints 5
$string = html_entity_decode($string);
echo strlen($string); // prints 1
 
That works, but..

Code: Select all

 
$string = "&#65533;";
echo strlen($string); // prints 8
$string = html_entity_decode($string);
echo strlen($string); // prints 8
 
is failing, but there is indeed such a character.

Does anyone know why this is happening?

Thanks!
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: need help with decoding a html entity

Post by josh »

Doesn't look like an HTML entity. Looks like a hexadecimal color prepended to an ampersand.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: need help with decoding a html entity

Post by requinix »

josh wrote: Doesn't look like an HTML entity. Looks like a hexadecimal color prepended to an ampersand.
Congratulations on not knowing HTML entities.


Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
htmlentities. Use that and you won't have any entity problems.

...So what do you have against "foreign" languages?
siko
Forum Commoner
Posts: 37
Joined: Tue Feb 16, 2010 11:28 pm

Re: need help with decoding a html entity

Post by siko »

Tasairis,

To put it simply,

1. I am trying to remove non-English characters from the user's input before storing it in the database. My site/database is running ISO-8859-1/latin1, and correct me if I am wrong, but I foresee funny characters appearing if a user types in Asian languages such as Chinese, Japanese or Korean.

2. When the user submits the form and the $_POST data reaches the PHP page, the $_POST variable holds the character in HTML code form rather than as a single character. For example, $_POST['name'] does not store �, but rather the string &#65533;

I am using a regular expression to remove these characters, so if a character is expressed as a string of HTML code, the regex replace fails: it reads each character separately, and every individual character in '&#65533;' is valid in my regex.

Following is my regex code:

Code: Select all

 
function english_only ($value){
     return ereg_replace("[^a-zA-Z0-9\n\t\r\s!@#$%^&*()-_=+`~{}|\:;\"']", "", $value);
}
 
$name = english_only($_POST['name']);
 
//simple sql to store into DB
 
3. My problem is, I cannot convert the html code back to single character form.

htmlentities actually takes the complex character and encodes it into an HTML code string, not the other way round. So if I do htmlentities('&#65533;'); I believe I will get something like '&amp;#65533;', further expanding the string.

Any thoughts?
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: need help with decoding a html entity

Post by requinix »

Okay. One at a time.
siko wrote: I am trying to remove non-English characters from the user's input before storing it in the database. My site/database is running ISO-8859-1/latin1, and correct me if I am wrong, but I foresee funny characters appearing if a user types in Asian languages such as Chinese, Japanese or Korean.
So how about this crazy idea: instead of censoring input, why not do something so that you don't see "funny characters"? It's text encoding. Pick a good encoding - UTF-8 is awesome - and make sure everything uses that. If you do it right everybody's input will display properly.
siko wrote: When the user submits the form and the $_POST data reaches the PHP page, the $_POST variable holds the character in HTML code form rather than as a single character. For example, $_POST['name'] does not store �, but rather the string &#65533;
Yeah... And the only way that'll be a problem is if you directly output the data. Which is a bad idea, regardless of pretty much everything. If you run the output through htmlentities first then you won't see the entities: you'll literally see "&#65533;".
siko wrote: My problem is, I cannot convert this code back to single character form.

htmlentities actually takes the complex character and encodes it into an HTML code string, not the other way round. So if I do htmlentities('&#65533;'); I believe I will get something like '&amp;#65533;', further expanding the string.
Yeah... So when you output that you'll get the entity as a string, not as the character it represents.
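
To make the escaping point concrete, here is a minimal sketch of output escaping with htmlentities. The helper name escape_for_html and the sample strings are made up for illustration, and UTF-8 is assumed as the page encoding.

```php
<?php
// Hypothetical helper (not from the thread): escape a value before it is
// written into an HTML page. UTF-8 is an assumed page encoding.
function escape_for_html(string $value): string
{
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

// The ampersand itself is escaped, so the browser shows the literal text
// "&#65533;" instead of interpreting it as a character reference.
echo escape_for_html('John &#65533; Smith'), "\n"; // John &amp;#65533; Smith

// Markup in user input is neutralised the same way.
echo escape_for_html('<b>John</b>'), "\n";         // &lt;b&gt;John&lt;/b&gt;
```

Escaping like this happens at output time only; the stored value stays exactly as the user submitted it.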
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: need help with decoding a html entity

Post by josh »

tasairis wrote:Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
Congratulations on missing the point of the problem :crazy:

Here's the solution:
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.
- http://us2.php.net/html_entity_decode Matt Robinson
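
A small sketch of what that manual comment describes, reusing the &#65533; string from the question as sample input. As far as I know, PHP versions before 5.4 defaulted the third argument to ISO-8859-1, which would explain the unchanged length of 8 seen earlier; treat that as an assumption about the original setup.

```php
<?php
// Sample input: the numeric character reference from the question.
$entity = '&#65533;';

// Target charset ISO-8859-1: the code point cannot be represented in a
// single-byte charset, so the reference is left untouched.
$latin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');

// Target charset UTF-8: the reference is decoded to one character,
// which occupies three bytes.
$utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

echo strlen($latin1), "\n"; // 8
echo strlen($utf8), "\n";   // 3
```

strlen counts bytes, not characters, which is why the decoded UTF-8 result reports 3 rather than 1.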
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: need help with decoding a html entity

Post by requinix »

josh wrote:
tasairis wrote:Lemme get this straight, siko: you're just outputting user input without any kind of escaping?
Congratulations on missing the point of the problem :crazy:
Oh no, I understood. But using html_entity_decode is not a solution: it's a workaround to a larger problem of text encoding and security hazards.
My apologies for caring.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: need help with decoding a html entity

Post by josh »

tasairis wrote: But using html_entity_decode is not a solution.
Yes it is. He wants to decode the entity to its UTF-8 symbol so his regex will filter the input.
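
A rough sketch of that decode-then-filter order, under a few assumptions: the function name decode_then_filter is made up, preg_replace stands in for ereg_replace (which was removed in PHP 7), and the character class is simplified to printable ASCII plus tab, CR and LF rather than siko's exact list. Note that the decode has to run before the filter, not after it.

```php
<?php
// Hypothetical replacement for english_only(): decode character references
// first, then drop everything outside printable ASCII, tab, CR and LF.
function decode_then_filter(string $value): string
{
    // Step 1: turn references such as &#65533; into real UTF-8 characters.
    $decoded = html_entity_decode($value, ENT_QUOTES, 'UTF-8');

    // Step 2: filter. The /u modifier makes the pattern work on whole UTF-8
    // characters; preg_replace() returns null on invalid UTF-8 input, which
    // is mapped to an empty string here.
    return preg_replace('/[^\x20-\x7E\t\r\n]/u', '', $decoded) ?? '';
}

echo decode_then_filter('John &#65533; Smith'), "\n"; // John  Smith
echo decode_then_filter('Tom &amp; Jerry'), "\n";     // Tom & Jerry
```

Decoding also turns &lt; and &gt; back into real angle brackets, so the result still needs escaping whenever it is printed into a page.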
siko
Forum Commoner
Posts: 37
Joined: Tue Feb 16, 2010 11:28 pm

Re: need help with decoding a html entity

Post by siko »

Ahh, you were both a great help.

I have changed the charset to UTF-8 on my pages and to utf8_general_ci for my database variables, and removed the input censoring. You are right, I think in the long run it will be better for the site.

And

Code: Select all

html_entity_decode($string, ENT_QUOTES, 'UTF-8');
is working like a charm too.

Thanks both of you for your replies! :D