Working on a tutorial for UTF-8

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I know, I know, there's already a profusion of UTF-8 related advice out there on the web. What I'm trying to do, however, is glue it all together into a monster tutorial that will talk about first what character encodings are, how to fix the easy mistakes, then why it's bad not to UTF-8, and then tips to keep in mind if you decide to migrate.

Here's my first draft: http://hp.jpsband.org/live/docs/enduser-utf8.html Soliciting comments!
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

I'm not sure I totally understand multiple language support in web applications - but from what I've read it was a good start.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The above is, I assume, the catch-all charset declaration? This would work with any language?

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1256" />
The latter is setting the page for rendering using the Arabic alphabet - so what would happen if I used that charset but entered everything in English? Would everything get scrambled?

Just so I understand the process...

Your browser loads a page and checks for that META tag and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?

A = 65 (in ASCII, from what I remember - assume it's correct for the sake of argument)
B = 66
C = 67

So when a browser encounters those three bytes consecutively by default it renders using ASCII so a web page would output:

ABC

What if your system locale is set to Thai? would the characters appear as gibberish little boxes?

When I browse the web and cross paths with a Japanese-encoded document I'm prompted to download language files. Those are the tables which tell the font rendering engine what "bytes" correspond to what "character".

I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc? or does it depend on the font?

Sorry if my questions are trivial but I hate not fully understanding a subject - especially when it's an important one like this. :)

Cheers :)
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

It's not the META tag that's most important. Even if you have set the META tag to UTF-8, for example, if your server sends different headers you still end up with a different encoding.
Note, however, that any real HTTP header will override a META element, so it's imperative that you set up the web server correctly. For Apache, you can do so by editing the configuration file (/etc/httpd.conf on most *nix systems). The directive should look something like this:

AddDefaultCharset UTF-8
http://www.sitepoint.com/article/guide- ... r-encoding
This article just got published, very long and good read.
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

Nice tutorial! I'm definitely no expert on encoding, but good idea listing the "IE's Descriptions". If I were to read this tutorial as an IE user, without that list, I would be very frustrated. A few recommendations:
  • Explain the HTML form accept-charset attribute, and maybe get into database and other storage encoding information.
  • Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
  • I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
  • I would put a heading on the explanation of formatting... something along the lines of:
    article wrote:How to read this overview/tutorial
    Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.
That's all I can think of for now. Nice work, I really think there needs to be more detailed and clear explanations of character encoding on the internet and you are definitely on the right track!
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

Yeah, I forgot to thank you for your efforts Ambush. It's a good tutorial.

Before I knew a bit about encoding I used a text editor on my Windows machine which output ISO-8859-something or Windows-something, I can't remember which, while I assumed I was using UTF-8 because I had set the META tags to UTF-8. Was I wrong! No wonder I had some problems with strange characters in webpages. So The Ninja is right to say/ask:
Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
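
That kind of mismatch can be reproduced in a few lines. This is only a rough sketch in Python (chosen for brevity); the sample word and the choice of Windows-1252 as the editor's encoding are assumptions for illustration:

```python
# Hypothetical sample: the editor saved the file as Windows-1252,
# while the page's META tag claims UTF-8.
text = "café"

saved_bytes = text.encode("cp1252")   # what the editor wrote to disk
declared = "utf-8"                    # what the META tag claims

# Decoding with the declared charset: the lone 0xE9 byte is not valid UTF-8,
# so it is replaced by U+FFFD, which browsers show as a box or a "?".
shown = saved_bytes.decode(declared, errors="replace")
print(repr(shown))

# The reverse mismatch (UTF-8 bytes read as Windows-1252) yields two
# unrelated characters in place of the single accented letter.
reverse = text.encode("utf-8").decode("cp1252")
print(repr(reverse))
```

Either way the bytes on disk are fine; only the label attached to them is wrong.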
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

I'm partially agreeing with matthijs. The meta tag is important, so is the Content-Type header and you should also specify the accept-charset attribute on the form tag.

AC: Nice job, esp. with the IE -> real table, but I have a few points. I can't help feeling by not discussing UTF-8's relationship with forms your tutorial is incomplete, since HTML Purifier is likely to be used on input from forms. And also:
AC's tutorial wrote:For now, take note if your META tag claims that either:

1. The character encoding is the same as the one reported by the browser,
2. The character encoding is different from the browser's, or
3. There is no META tag at all! (horror, horror!)
I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Mismatch between actual encoding and the META/Content-Type is one of the most common problems - even for major websites. I'd cover the issue a little more. Another common problem is that editors open files which contain only the characters common to ASCII and UTF-8, and call them ASCII (even if previously saved as UTF-8). This happens largely because the text contains no characters above 127, so the bytes are identical in both encodings and the editor has to guess.
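
To illustrate why an editor cannot tell the two apart, here is a minimal sketch in Python. The helper name `looks_like_ascii` and the sample strings are made up for the example; real editors use more elaborate detection:

```python
def looks_like_ascii(data: bytes) -> bool:
    """Return True when every byte is below 128, i.e. the data is plain ASCII."""
    return all(b < 0x80 for b in data)


plain = "Hello, world".encode("utf-8")
accented = "Héllo, world".encode("utf-8")

# With ASCII-only text, the UTF-8 bytes are exactly the ASCII bytes,
# so nothing in the file records which encoding was intended.
print(looks_like_ascii(plain))
print(plain == "Hello, world".encode("ascii"))

# One accented letter adds bytes >= 0x80, which gives the encoding away.
print(looks_like_ascii(accented))
```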

As an introductory tutorial, it needs a bit more body - but it's a good entry point for people unfamiliar or even unaware of the problem.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Ach, it's not done yet! :-) A lot of the things you guys mentioned I'm planning on adding, but you also mentioned some things that I wasn't thinking about and will toss in now.
The above is, I assume, the catch-all charset declaration? This would work with any language?
Technically speaking, yes. UTF-8, however, will not magically make other character encodings work.
Your browser loads a page and checks for that META tag and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?
The browser first checks the Content-Type header sent by the server (thus the distinction between real and embedded encoding), and only falls back to the META tag if the header doesn't specify a charset.
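
As a rough sketch of that precedence, in Python: the function names and the `windows-1252` fallback are assumptions for illustration, and real browsers follow a longer algorithm (BOM sniffing, user overrides, heuristics) that is left out here.

```python
import re
from typing import Optional

# Matches charset=... in a header value or inside a META tag.
_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META = re.compile(r"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header, if any."""
    if not content_type:
        return None
    match = _CHARSET.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html_head: str) -> Optional[str]:
    """Extract the charset declared in a META tag, if any."""
    match = _META.search(html_head)
    return match.group(1).lower() if match else None


def effective_charset(content_type: Optional[str], html_head: str,
                      fallback: str = "windows-1252") -> str:
    """The HTTP header wins; the META tag is consulted only if the header is silent."""
    return (charset_from_header(content_type)
            or charset_from_meta(html_head)
            or fallback)


head = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(effective_charset("text/html; charset=ISO-8859-1", head))
print(effective_charset("text/html", head))
```

In the first call the META tag is ignored entirely, which is exactly the situation where a page "set to UTF-8" still renders wrong.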
What if your system locale is set to Thai? would the characters appear as gibberish little boxes?
ASCII characters usually turn out okay because most character encodings preserve 0-127 as the regular ASCII characters. For the more funky ones, yes, the characters would become gibberish.
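
A quick way to check this, sketched in Python. The list of encodings and the sample Thai character are arbitrary picks for the example:

```python
encodings = ["ascii", "utf-8", "latin-1", "cp1256", "cp874", "shift_jis"]

# "ABC" comes out as the same three bytes (65, 66, 67) in every one of these,
# which is why English text survives a wrong charset declaration.
abc_bytes = {enc: "ABC".encode(enc) for enc in encodings}
print(abc_bytes)

# A Thai character, on the other hand, is encoded differently per encoding...
thai = "ก"  # U+0E01 THAI CHARACTER KO KAI
as_cp874 = thai.encode("cp874")
as_utf8 = thai.encode("utf-8")
print(as_cp874, as_utf8)

# ...so UTF-8 bytes read back as Windows-874 no longer give the original text.
garbled = as_utf8.decode("cp874", errors="replace")
print(garbled != thai)
```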
I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc? or does it depend on the font?
Usually not. If a browser is smart, it'll automatically switch to another font that does support Japanese or Thai glyphs, but Internet Explorer doesn't do this, so you have to use Unicode-friendly fonts.
http://www.sitepoint.com/article/guide- ... r-encoding
This article just got published, very long and good read.
I'll look into it.

@The Ninja Space Goat: Your suggestions are good.
I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
Well, the point is to get people to move to UTF-8. I can change it around if necessary.
I can't help feeling by not discussing UTF-8's relationship with forms your tutorial is incomplete.
Forms actually become quite simple as long as the containing page is in UTF-8. They get really weird when it's not. But yeah, I'll go over that.
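
A small sketch of the "really weird" case, in Python. It assumes the browser submits form data in the page's own encoding and that the server always decodes as UTF-8; the field value is made up:

```python
from urllib.parse import quote, unquote

value = "é"  # hypothetical form field value

# Percent-encoded as submitted from a UTF-8 page...
from_utf8_page = quote(value, encoding="utf-8")
# ...and as submitted from a Windows-1252 page.
from_cp1252_page = quote(value, encoding="cp1252")
print(from_utf8_page, from_cp1252_page)

# A server that assumes UTF-8 reads the first correctly,
# while the second decodes to the U+FFFD replacement character.
ok = unquote(from_utf8_page, encoding="utf-8")
broken = unquote(from_cp1252_page, encoding="utf-8", errors="replace")
print(repr(ok), repr(broken))
```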
I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
Hmm... maybe I can massage that sentence into something more appropriate.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Okay, I've finished the portion on figuring out the encodings of your pages and making sure they are consistent. This is a self-contained section in-and-of-itself, even without discussion of UTF-8. Will be continuing on later.

@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.

I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

That sentence I spoke about is much better now :P
Also I'd like to commend you on the excellent use of layout and typography and superb use of semantic HTML, neither of which I noticed first time. Then again I'd expect no less from you AC :D. One thing which could improve findability is one of those Wikipedia-style article contents things.

Hint: You might want to also mention

php_value default_charset "UTF-8"
Oh and thanks for the section on XML. It taught me lots I didn't know.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

Discuss everything, from every angle...the more detail the better... :)
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

One thing which could improve findability is one of those Wikipedia-style article contents things.
I'll add it when I finish. They're a pain to generate, a pain to style semantically correctly, and I haven't created a neat script that does it automatically.
php_value default_charset "UTF-8"
Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it's a dependency on two systems: Apache and PHP, that while common I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
Oh and thanks for the section on XML. It taught me lots I didn't know.
Good to hear.
the more detail the better...
Well, you can only put in so much info before people's brains explode...
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

Discuss everything, from every angle...the more detail the better...
Well for certain audiences, this is a good idea, but I think Ambush Commander was going for a highly effective, yet quick-and-dirty explanation of the subject.

One of the largest problems with designing content for the internet is that there is already so much of it out there. It is so easy for your reader to push the little X on the top right and find another source of information. You want to be sure not to scare them off with a novel on the subject.

AC, I think you've done a nice job of finding that balance between giving a lot of detail, and not boring the reader. :wink:
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it's a dependency on two systems: Apache and PHP, that while common I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
You can just talk about default_charset in php.ini instead then :)
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

Ambush Commander wrote:@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.
Well, I was just mentioning this because I found it a bit misleading in your (original) text. It came across as if just setting the META tag would be enough, while that obviously isn't the case. But as I read the current text that's totally clear, so either I didn't read the original text well or you changed it. Either way it's ok now :)
Ambush Commander wrote:I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
I would definitely mention it. Because many people (?) will use a text editor to edit HTML, set their META tag to UTF-8 thinking they do the right thing, not knowing their HTML files are saved as some other encoding... oops.