Working on a tutorial for UTF-8
Posted: Fri Jan 12, 2007 10:00 pm
by Ambush Commander
I know, I know, there's already a profusion of UTF-8 related advice out there on the web. What I'm trying to do, however, is glue it all together into a monster tutorial that will talk about first what character encodings are, how to fix the easy mistakes, then why it's bad not to UTF-8, and then tips to keep in mind if you decide to migrate.
Here's my first draft:
http://hp.jpsband.org/live/docs/enduser-utf8.html Soliciting comments!
Posted: Fri Jan 12, 2007 10:26 pm
by alex.barylski
I'm not sure I totally understand multiple-language support in web applications - but from what I've read, it's a good start.
Code:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The above is, I assume, the catch-all charset declaration? This would work with any language?
Code:
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1256" />
The latter is setting the page for rendering using the Arabic alphabet - so what would happen if I used that charset but entered everything in English? Would everything get scrambled?
Just so I understand the process...
Your browser loads a page and checks for that META tag and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?
A = 65 (in ASCII from what I remember - assume it's correct for the sake of argument)
B = 66
C = 67
So when a browser encounters those three bytes consecutively by default it renders using ASCII so a web page would output:
ABC
What if your system locale is set to Thai? Would the characters appear as gibberish little boxes?
When I browse the web and cross paths with a Japanese-encoded document, I'm prompted to download language files. Those are the tables which tell the font rendering engine which "bytes" correspond to which "character".
I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc.? Or does it depend on the font?
Sorry if my questions are trivial but I hate not fully understanding a subject - especially when it's an important one like this.
Cheers

Posted: Sat Jan 13, 2007 1:24 am
by matthijs
It's not the META tag that's most important. Even if you have set the META tag to UTF-8, for example, if your server sends a different charset in its headers you still end up with a different encoding.
Note, however, that any real HTTP header will override a META element, so it's imperative that you set up the web server correctly. For Apache, you can do so by editing the configuration file (/etc/httpd.conf on most *nix systems). The directive should look something like this:
AddDefaultCharset UTF-8
http://www.sitepoint.com/article/guide- ... r-encoding
This article just got published, very long and good read.
Posted: Sat Jan 13, 2007 3:57 am
by Luke
Nice tutorial! I'm definitely no expert on encoding, but good idea listing the "IE's Descriptions". If I were to read this tutorial as an IE user, without that list, I would be very frustrated. A few recommendations:
- Explain the HTML form accept-charset attribute and maybe get into database and other storage encoding information
Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
I would put a heading on the explanation of formatting... something along the lines of:
article wrote:How to read this overview/tutorial
Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.
That's all I can think of for now. Nice work, I really think there needs to be more detailed and clear explanations of character encoding on the internet and you are definitely on the right track!
Posted: Sat Jan 13, 2007 4:05 am
by matthijs
Yeah, I forgot to thank you for your efforts Ambush. It's a good tutorial.
Before I knew a bit about encoding I used a text editor on my windows machine which outputted ISO-8859-something or Windows-something, I can't remember, while I was assuming I used UTF-8 by setting the META tags to UTF-8. Was I wrong! No wonder I had some problems with strange characters in webpages. So The Ninja is right to say/ask:
Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
Posted: Sat Jan 13, 2007 4:27 am
by Ollie Saunders
I'm partially agreeing with matthijs. The meta tag is important, so is the Content-Type header and you should also specify the accept-charset attribute on the form tag.
AC: Nice job, esp. with the IE -> real table, but I have a few points. I can't help feeling that by not discussing UTF-8's relationship with forms your tutorial is incomplete, since HTML Purifier is likely to be used on input from forms. And also:
AC's tutorial wrote:For now, take note if your META tag claims that either:
1. The character encoding is the same as the one reported by the browser,
2. The character encoding is different from the browser's, or
3. There is no META tag at all! (horror, horror!)
I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
Posted: Sat Jan 13, 2007 4:44 am
by Maugrim_The_Reaper
Mismatch between the actual encoding and the META/Content-Type declaration is one of the most common problems - even for major websites. I'd cover the issue a little more. Another common problem is that editors open files which contain only the characters common to ASCII and UTF-8, and call them ASCII (even if they were previously saved as UTF-8). Largely the problem is that, because the text contains no actual characters above 127, the editor has to guess the encoding.
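To illustrate the editor problem, here's a rough Python sketch (the helper name is made up for this example). It only shows that pure-ASCII bytes decode to the same text under several encodings, so the bytes alone can't tell an editor which one was intended:

```python
# Pure-ASCII bytes decode to the same text under ASCII, UTF-8, and common
# legacy 8-bit encodings, so the bytes alone cannot reveal the intended one.
ascii_only = "Hello, world".encode("utf-8")
candidates = ["ascii", "utf-8", "cp1252", "iso-8859-1"]
decoded = {name: ascii_only.decode(name) for name in candidates}


def is_plain_ascii(data: bytes) -> bool:
    """Hypothetical helper: True when every byte is below 0x80."""
    return all(byte < 0x80 for byte in data)


# As soon as one accented character appears, UTF-8 uses bytes >= 0x80
# and the ambiguity goes away.
with_accent = "Héllo".encode("utf-8")

print(decoded)
print(is_plain_ascii(ascii_only), is_plain_ascii(with_accent))
```

So an editor that reopens such a file has nothing to go on except its own default.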
As an introductory tutorial, it needs a bit more body - but it's a good entry point for people unfamiliar or even unaware of the problem.
Posted: Sat Jan 13, 2007 6:26 am
by Ambush Commander
Ach, it's not done yet!

A lot of the things you guys mentioned I'm planning on adding, but you also mentioned some things that I wasn't thinking about and will toss in now.
The above is, I assume, the catch-all charset declaration? This would work with any language?
Technically speaking, yes. UTF-8, however, will not magically make other character encodings work.
Your browser loads a page and checks for that META tag and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?
The browser first checks the server's Content-Type header (thus the distinction between real and embedded encoding), then falls back to the META tag if the header did not specify an encoding.
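Roughly, that lookup order could be sketched like this in Python. It's a simplification, not what any real browser does: the function names and the regular expression are made up for the example, and real browsers consider more signals than these two.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for a charset inside a META tag (an assumption for
# this sketch; real HTML parsing is considerably more involved).
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html: bytes) -> Optional[str]:
    """Look for a META charset declaration near the top of the document."""
    match = META_CHARSET.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: Optional[str], html: bytes) -> Optional[str]:
    """The HTTP header wins; the META tag is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(html)


page = b'<meta http-equiv="Content-Type" content="text/html; charset=Windows-1256" />'
print(effective_charset("text/html; charset=UTF-8", page))  # header wins
print(effective_charset("text/html", page))                 # falls back to META
```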
What if your system locale is set to Thai? would the characters appear as gibberish little boxes?
ASCII characters usually turn out okay because most character encodings preserve bytes 0-127 as the regular ASCII characters. For the more funky ones, yes, the characters would become gibberish.
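A quick way to see both halves of that answer, sketched in Python. The sample word is just an illustration; cp1256 and cp874 are the Windows Arabic and Thai code pages.

```python
# Bytes below 128 mean the same thing in all of these encodings.
abc = bytes([65, 66, 67])  # "ABC" in ASCII
encodings = ["ascii", "utf-8", "cp1252", "cp1256", "cp874"]
ascii_views = [abc.decode(name) for name in encodings]

# Outside that range the encodings disagree. UTF-8 stores "é" as the two
# bytes 0xC3 0xA9; read as Windows-1252, those become two separate characters.
utf8_bytes = "café".encode("utf-8")
misread = utf8_bytes.decode("cp1252")

print(ascii_views)
print(utf8_bytes, misread)
```

So English text under a Windows-1256 declaration stays readable; it's only the non-ASCII characters that get scrambled.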
I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc? or does it depend on the font?
Usually not. If a browser is smart, it'll automatically switch to another font that does support Japanese or Thai glyphs, but Internet Explorer doesn't do this, so you have to use Unicode-friendly fonts.
I'll look into it.
@The Ninja Space Goat: Your suggestions are good
I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
Well, the point is to get people to move to UTF-8. I can change it around if necessary.
I can't help feeling by not discussing UTF-8's relationship with forms your tutorial is incomplete.
Forms actually become quite simple as long as the containing page is in UTF-8. They get really weird when it's not. But yeah, I'll go over that.
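Here's a small Python sketch of why forms get weird when the encodings don't line up. It assumes the browser submits the field in the page's encoding and that the server decodes it as UTF-8; the field name and value are made up.

```python
from urllib.parse import parse_qs, urlencode

field = {"name": "café"}

# What gets sent if the page (and so the submission) is UTF-8 vs. Windows-1252.
sent_as_utf8 = urlencode(field, encoding="utf-8")
sent_as_cp1252 = urlencode(field, encoding="cp1252")

# A server that assumes UTF-8 reads the first correctly and mangles the second.
read_ok = parse_qs(sent_as_utf8, encoding="utf-8")["name"][0]
read_wrong = parse_qs(sent_as_cp1252, encoding="utf-8")["name"][0]

print(sent_as_utf8, sent_as_cp1252)
print(read_ok, repr(read_wrong))
```

When everything is UTF-8 end to end, the round trip just works.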
I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
Hmm... maybe I can massage that sentence into something more appropriate.
Posted: Sat Jan 13, 2007 8:36 pm
by Ambush Commander
Okay, I've finished the portion on figuring out the encodings of your pages and making sure they are consistent. This is a self-contained section in-and-of-itself, even without discussion of UTF-8. Will be continuing on later.
@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.
I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
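For reference on the BOM point, a minimal Python sketch of what the UTF-8 byte order mark looks like at the start of a file and how it can be detected or stripped. The sample content is made up.

```python
import codecs

# Some editors prepend the three-byte UTF-8 BOM (EF BB BF) when saving.
with_bom = codecs.BOM_UTF8 + "<html>".encode("utf-8")

has_bom = with_bom.startswith(codecs.BOM_UTF8)

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF character,
# while the "utf-8-sig" codec drops it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

print(has_bom, repr(kept[0]), stripped)
```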
Posted: Sat Jan 13, 2007 8:59 pm
by Ollie Saunders
That sentence I spoke about is much better now

Also I'd like to commend you on the excellent use of layout and typography and superb use of semantic HTML, neither of which I noticed first time. Then again I'd expect no less from you AC. One thing which could improve findability is one of those Wikipedia-style article contents things.
Hint: You might want to also mention
php_value default_charset "UTF-8"
Oh and thanks for the section on XML. It taught me lots I didn't know.
Posted: Sat Jan 13, 2007 9:01 pm
by alex.barylski
Discuss everything, from every angle... the more detail the better...

Posted: Sat Jan 13, 2007 9:16 pm
by Ambush Commander
One thing which could improve findability is one of those Wikipedia-style article contents things.
I'll add it when I finish. They're a pain to generate, a pain to style semantically correctly, and I haven't created a neat script that does it automatically.
php_value default_charset "UTF-8"
Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it's a dependency on two systems: Apache and PHP, that while common I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
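The "set the header from application code" idea isn't specific to PHP. As a rough analogue only (this is Python's WSGI interface, not PHP's header(), and the app and the fake start_response are made up for the example), it could look like this:

```python
def app(environ, start_response):
    """Minimal WSGI app that declares its own charset in the HTTP header."""
    body = "<p>café</p>".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Call the app directly with a stand-in for the server's start_response,
# just to inspect the headers it would send.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = app({}, fake_start_response)
print(captured["headers"]["Content-Type"])
```

The point is the same as with a common include file: the declaration lives in one place in the application and doesn't depend on how the web server is configured.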
Oh and thanks for the section on XML. It taught me lots I didn't know.
Good to hear.
the more detail the better...
Well, you can only put in so much info before people's brains explode...
Posted: Sat Jan 13, 2007 9:47 pm
by Luke
Discuss everything, from every angle... the more detail the better...
Well for certain audiences, this is a good idea, but I think Ambush Commander was going for a highly effective, yet quick-and-dirty explanation of the subject.
One of the largest problems with designing content for the internet is that there is already so much of it out there. It is so easy for your reader to push the little X on the top right and find another source of information. You want to be sure not to scare them off with a novel on the subject.
AC, I think you've done a nice job of finding that balance between giving a lot of detail, and not boring the reader.

Posted: Sun Jan 14, 2007 4:23 am
by Ollie Saunders
Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it's a dependency on two systems: Apache and PHP, that while common I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
You can just talk about default_charset in php.ini instead then

Posted: Sun Jan 14, 2007 6:15 am
by matthijs
Ambush Commander wrote:@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.
Well, I was just mentioning this because I found it a bit misleading in your (original) text. It came across as if just setting the META tag would be enough, while that obviously isn't the case. But as I read the current text that's totally clear, so either I didn't read the original text well or you changed it. Either way it's ok now.
Ambush Commander wrote:I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
I would definitely mention it. Because many people (?) will use a text editor to edit HTML, set their META tag to UTF-8 thinking they do the right thing, not knowing their HTML files are saved as some other encoding.. oops.
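A minimal Python sketch of that exact trap: the same page, with a META tag claiming UTF-8, saved once as UTF-8 and once as Windows-1252. The helper name and sample markup are made up; a strict UTF-8 decode is used here as a simple validity check.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><p>café</p>'

saved_as_utf8 = page.encode("utf-8")      # editor set to UTF-8
saved_as_cp1252 = page.encode("cp1252")   # editor left on a Windows default

print(is_valid_utf8(saved_as_utf8), is_valid_utf8(saved_as_cp1252))
```

The META tag is identical in both files; only the bytes the editor wrote differ, and that is what the browser ends up decoding.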