check if string contains html entities

Posted: Thu Oct 09, 2014 6:23 pm
by cjkeane
Hi Everyone.
I am starting to have an issue which I'm not sure how to resolve.
I have a database which is all set using utf8.
Up until now, I have run htmlspecialchars on all content when it was inserted into the db, so I always need to use html_entity_decode to decode it properly.
I changed my script recently and I no longer convert any text to html entities when it's submitted to the db. I leave it up to utf8 to display correctly, which it does for all new records.
My issue is that previously entered data may contain html entities which I need to decode to display properly, but if I run html_entity_decode, it mangles the display of some newly entered data.

My thought is to do something like this: if html entities are detected in the string, then run html_entity_decode on it, otherwise do nothing.
I'm just not sure how to code for that. Any help would be appreciated. Thanks.

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 6:44 pm
by requinix
How do you tell the difference between HTML entities that are supposed to be there and those that are not?

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 8:15 pm
by cjkeane
I can only identify the entities by looking either at the html source code or in the actual db to see why they don't display correctly when viewed on the website.
For example:
1. A previous entry inserted into the db looks like this: &lt;strong&gt;testing&lt;/strong&gt;
When viewed on the website, it displays as the literal text <strong>testing</strong>
2. A new entry is saved to the db like so: <strong>testing</strong> and displays on the website as "testing", actually bolded.

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 8:24 pm
by Celauran
I guess the obvious takeaways here are 1.
cjkeane wrote:Up until now, I have run htmlspecialchars on all content when it was inserted into the db
don't do that, and 2.
cjkeane wrote:I changed my script recently and I no longer convert any text to html entities when it's submitted to the db.
don't do that. Worry about escaping your HTML when it comes out, not when it goes in, but be consistent.
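
A minimal sketch of "escape when it comes out", assuming the text is stored raw in the database and should appear literally on the page. The helper name `e` is made up for illustration, and content that is meant to render as markup (like the bolded example above) would need a different policy.

```php
<?php
// Hypothetical helper: escape a raw UTF-8 string at output time, not at insert time.
function e($raw)
{
    // ENT_QUOTES also escapes single quotes; 'UTF-8' keeps multibyte text intact.
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Stored exactly as submitted, with no entities in the db.
$stored = '<strong>testing</strong> 中文';

echo e($stored), "\n";
```

With this approach the database only ever holds raw text, so nothing needs decoding when it is read back.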

That said, how bad is the damage? Can you fix it to be consistent either one way or the other, or is there just too much?

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 8:42 pm
by cjkeane
I just changed my script yesterday to just use mysql_real_escape_string to get the data in.
Up until then, I was using a function which applied htmlspecialchars. I then found I had to decode it when viewing it on the site, which was working for the most part.
Occasionally there would be an issue decoding it, but yesterday I had an inquiry about whether Chinese characters could be saved into the db. When I tested it (and because I had enabled html_entity_decode), Chinese characters were mangled.
Which is why I asked if it was possible to detect whether htmlspecialchars had been applied to a string: if it was, then decode it, otherwise do nothing.
There are close to 300,000 records, but as far as I can tell only about 2500 records have htmlspecialchars applied.

What's the best way to accommodate both issues?
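
The "detect, then decode" idea could be sketched roughly like this in PHP. The helper names `looksEncoded` and `decodeIfEncoded` are made up for illustration. The regex only matches text shaped like an entity, so it cannot tell whether an entity was meant to be there, and 'UTF-8' is passed explicitly because the default charset of html_entity_decode has varied between PHP versions.

```php
<?php
// Hypothetical helper: true if $text contains something shaped like an HTML entity
// (named such as &lt;, decimal such as &#039;, or hex such as &#x27;).
function looksEncoded($text)
{
    return preg_match('/&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);/', $text) === 1;
}

// Hypothetical helper: decode only strings that appear to contain entities.
function decodeIfEncoded($text)
{
    if (!looksEncoded($text)) {
        return $text; // no entity-shaped text: return unchanged
    }
    // Explicit charset so multibyte characters are handled as UTF-8.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo decodeIfEncoded('&lt;strong&gt;testing&lt;/strong&gt;'), "\n"; // legacy-style row
echo decodeIfEncoded('<strong>testing</strong> 中文'), "\n";       // new-style row
```

Since html_entity_decode already leaves a string without entities unchanged, the check is probably most useful for finding which legacy rows need attention rather than as a permanent display-time step.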

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 8:45 pm
by Celauran
So 298,000 new records since yesterday? Wow. The good news is there are only 2,500 or so that need fixing. Are you using auto-incrementing primary keys? Can you easily identify the cutoff point and update only those records?
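
A rough sketch of the cutoff idea, assuming integer auto-increment ids and a known last id written by the old script. The function name `planLegacyFixes`, the `$rows` array (id => stored content) and `$cutoffId` are all hypothetical, and the database reads and writes are deliberately left out.

```php
<?php
// Hypothetical helper: given rows keyed by primary key, return only the
// legacy rows (id <= $cutoffId) whose content changes when decoded.
function planLegacyFixes(array $rows, $cutoffId)
{
    $updates = [];
    foreach ($rows as $id => $content) {
        if ($id > $cutoffId) {
            continue; // written by the new script, stored raw: leave alone
        }
        // Inverse of htmlspecialchars(); ENT_QUOTES also decodes quote entities.
        $decoded = htmlspecialchars_decode($content, ENT_QUOTES);
        if ($decoded !== $content) {
            $updates[$id] = $decoded; // only rows that actually need an UPDATE
        }
    }
    return $updates;
}

$rows = [
    1 => '&lt;strong&gt;testing&lt;/strong&gt;', // old script, encoded
    2 => 'plain text',                            // old script, nothing to decode
    3 => '<strong>testing</strong>',              // new script, stored raw
];

print_r(planLegacyFixes($rows, 2));
```

Restricting the fix to ids at or below the cutoff means newer rows are never touched, even if they happen to contain entity-like text on purpose.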

Re: check if string contains html entities

Posted: Thu Oct 09, 2014 9:02 pm
by cjkeane
No, no. 300,000 records in the last two years. That's approx. the number of records in the entire db.
I just did a quick search, and of the 2500 records which had some variant of htmlspecialchars, some involved accented characters and others html formatting. That being said, I did another quick search in the db, and I'm down to 14 records which I actually need to fix, so it's not as bad as I thought.