HTML entities on UTF-8 site, best practice

Posted: Fri Jun 26, 2009 7:12 am
by batfastad
Hi guys

I've always wondered what the best practice is here.
Back in my HTML learning days (a frightening 12 years ago) I always thought you should encode accented chars and special chars as entities... either the named entity or the numeric code, with the numeric code preferred. IIRC the W3C validator checked that characters were properly entity-ised back in those days.

Recently I ran the validator over a UTF-8 site which had many accented chars non-entityised (just pasted into the HTML as raw text characters) and the validator didn't flag those up. They all displayed correctly, to me anyway. I was under the impression that they should be converted to entities.

Obviously you still need to entity-ise HTML special chars (& " < >), but should you still entity-ise other characters? Accents, symbols etc?

The reason I ask is I'm building a CMS for our website on our intranet where select users will be able to type HTML code directly into our websites.
I want to know whether to advise them to always entity-encode accents/symbols/anything... or just use entities for HTML special chars?

Cheers, B

Re: HTML entities on UTF-8 site, best practice

Posted: Sat Jun 27, 2009 9:21 am
by kaszu
Characters like 'āēūīķļņšž' (and their equivalents in other languages) don't need to be converted to entities.
From http://www.w3.org/TR/xhtml1/#a_dtd_Special_characters, "Entity Sets":
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent

Re: HTML entities on UTF-8 site, best practice

Posted: Sat Jun 27, 2009 11:15 am
by batfastad
Ah ok
But anything with an official entity equivalent... euro, copyright, accented western euro chars etc... should all be done using entities?

Cheers, B

Re: HTML entities on UTF-8 site, best practice

Posted: Wed Jul 01, 2009 11:35 am
by batfastad
Right, after reading plenty of articles and being in IRC all day, I've got my plan of attack.

Only HTML special chars... < > & " should be represented as entities
Apostrophes I only absolutely need to encode when outputting XML (eg: RSS) or when using apostrophes to enclose attribute values eg:

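<a title='dave's computer'> <!-- ERROR: the apostrophe ends the attribute value early -->
<a title='dave&#39;s computer'> <!-- CORRECT: &#39; works in HTML and XML; &apos; is only defined for XML/XHTML -->
<a title="dave's computer"> <!-- CORRECT -->

Everything else should be stored as the plain text UTF-8 character in the DB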

At least doing it this way I can make it consistent, so it's easy to convert at a later stage
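
For anyone who wants to see the plan in code, here is a minimal sketch. It is in Python rather than whatever the CMS is written in, and the helper name escape_for_html is made up for illustration. It relies on the standard library's html.escape, which touches only & < > and (with quote=True) the two quote characters, and leaves accents and symbols as plain UTF-8 text.

```python
import html


def escape_for_html(text: str) -> str:
    """Escape only the HTML special characters (hypothetical helper)."""
    # html.escape replaces & < > and, with quote=True, " and ' as well.
    # Accented characters and symbols such as € pass through unchanged.
    return html.escape(text, quote=True)


sample = 'Café "Zoë" <b>dave\'s</b> & co €5'
escaped = escape_for_html(sample)
print(escaped)
# Café &quot;Zoë&quot; &lt;b&gt;dave&#x27;s&lt;/b&gt; &amp; co €5
```

Note that the apostrophe comes out as the numeric reference &#x27; rather than &apos;, so the same output is safe in both HTML and XML attribute values.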

Hope this helps someone out :)

Re: HTML entities on UTF-8 site, best practice

Posted: Thu Jul 16, 2009 1:20 pm
by DaiLaughing
That works as long as your server is set up properly. Ubuntu's isn't, for one: the PHP-generated content loses its UTF-8 encoding unless you manually change php.ini.
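
A rough illustration of what goes wrong, sketched in Python rather than PHP. It assumes the failure mode is the page's UTF-8 bytes being interpreted as ISO-8859-1 (Latin-1) because the wrong charset, or none, is declared; the thread does not confirm the exact cause on Ubuntu.

```python
text = "café"

# The bytes that are actually sent when the page is stored as UTF-8.
utf8_bytes = text.encode("utf-8")

# What a browser shows if it reads those bytes as Latin-1 instead.
misread = utf8_bytes.decode("latin-1")
print(misread)  # cafÃ©

# Declaring the charset in the response header avoids the guesswork.
content_type_header = "Content-Type: text/html; charset=utf-8"
print(content_type_header)
```

If raw accented characters turn into pairs like Ã©, the mismatch is in the declared encoding, not in the decision to skip entities.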