Working on a tutorial for UTF-8
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
I know, I know, there's already a profusion of UTF-8-related advice out there on the web. What I'm trying to do, however, is glue it all together into a monster tutorial that first explains what character encodings are and how to fix the easy mistakes, then why it's bad not to use UTF-8, and finally gives tips to keep in mind if you decide to migrate.
Here's my first draft: http://hp.jpsband.org/live/docs/enduser-utf8.html Soliciting comments!
- alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
I'm not sure I totally understand multiple language support in web applications, but from what I've read it was a good start.
Code: Select all
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The above is, I assume, the catch-all charset declaration? This would work with any language?
Code: Select all
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1256" />
The latter sets the page to render using the Arabic alphabet, so what would happen if I used that charset but entered everything in English? Would everything get scrambled?
Just so I understand the process...
Your browser loads a page, checks for that META tag, and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?
A = 65 (in ASCII, from what I remember)
B = 66
C = 67
So when a browser encounters those three bytes consecutively, by default it renders using ASCII, so a web page would output:
ABC
What if your system locale is set to Thai? Would the characters appear as gibberish little boxes?
When I browse the web and cross paths with a Japanese-encoded document I'm prompted to download language files. Those are the tables which tell the font rendering engine what "bytes" correspond to what "character".
I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc.? Or does it depend on the font?
Sorry if my questions are trivial, but I hate not fully understanding a subject, especially when it's an important one like this.
Cheers
It's not the META tag that's most important. Even if you have set the META tag to UTF-8, for example, if your server sends different headers you still end up with a different encoding.
http://www.sitepoint.com/article/guide- ... r-encoding
Note, however, that any real HTTP header will override a META element, so it's imperative that you set up the web server correctly. For Apache, you can do so by editing the configuration file (/etc/httpd.conf on most *nix systems). The directive should look something like this:
AddDefaultCharset UTF-8
This article just got published, very long and good read.
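To make the "header overrides META" rule concrete, here is a rough Python sketch of the lookup order. The function names (`charset_from_header`, `charset_from_meta`, `effective_charset`) are made up for illustration, and the META regex is deliberately simplistic; real browsers also sniff for byte order marks and apply other heuristics, so treat this as an approximation only.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when no charset is given


# Simplified pattern (an assumption): matches charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_meta(html_bytes):
    """Return the charset named in the first META declaration, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """The real HTTP header wins; the META tag is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)


page = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(effective_charset("text/html; charset=ISO-8859-1", page))  # header wins
print(effective_charset("text/html", page))                      # falls back to META
```

So a page that claims UTF-8 in its META tag is still read as ISO-8859-1 if the server says so in the header.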
Nice tutorial! I'm definitely no expert on encoding, but good idea listing the "IE's Descriptions". If I were to read this tutorial as an IE user, without that list, I would be very frustrated. A few recommendations:
- Explain the HTML form accept-charset attribute, and maybe get into database and other storage encoding information
Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
I would put a heading on the explanation of formatting... something along the lines of:
article wrote:How to read this overview/tutorial
Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.
Yeah, I forgot to thank you for your efforts Ambush. It's a good tutorial.
Before I knew a bit about encoding, I used a text editor on my Windows machine which output ISO-8859-something or Windows-something (I can't remember which), while I was assuming I used UTF-8 because I had set the META tags to UTF-8. Was I wrong! No wonder I had some problems with strange characters in web pages. So The Ninja is right to say/ask:
Don't you also have to make sure that your editor is set to utf-8 (or whatever encoding you're using) as well?
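A small Python sketch of what goes wrong in that situation. It assumes, purely as an example, that the editor saved the file as Windows-1252 (`cp1252` in Python) while the page claims UTF-8; the sample word is arbitrary.

```python
text = "caf\u00e9"  # the word "café", written with an escape so the source stays ASCII

# What the editor actually wrote to disk (assumed here to be Windows-1252).
saved_bytes = text.encode("cp1252")  # b'caf\xe9'

# What a reader gets when the page claims UTF-8: the lone 0xE9 byte is not
# valid UTF-8, so it turns into the U+FFFD replacement character.
as_utf8 = saved_bytes.decode("utf-8", errors="replace")

# The reverse mismatch: real UTF-8 bytes read as Windows-1252 give the
# familiar two-character garbage in place of the accented letter.
as_cp1252 = text.encode("utf-8").decode("cp1252")

print(ascii(as_utf8))    # 'caf\ufffd'
print(ascii(as_cp1252))  # 'caf\xc3\xa9'
```

Either way the declaration and the bytes on disk have to agree; the META tag alone changes nothing about how the file was saved.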
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
I'm partially agreeing with matthijs. The meta tag is important, so is the Content-Type header and you should also specify the accept-charset attribute on the form tag.
AC: Nice job, especially with the IE -> real table, but I have a few points. I can't help feeling that by not discussing UTF-8's relationship with forms your tutorial is incomplete, since HTML Purifier is likely to be used on input from forms. And also:
AC's tutorial wrote:For now, take note if your META tag claims that either:
1. The character encoding is the same as the one reported by the browser,
2. The character encoding is different from the browser's, or
3. There is no META tag at all! (horror, horror!)
I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
- Maugrim_The_Reaper
- DevNet Master
- Posts: 2704
- Joined: Tue Nov 02, 2004 5:43 am
- Location: Ireland
A mismatch between the actual encoding and the META/Content-Type declaration is one of the most common problems, even for major websites. I'd cover the issue a little more. Another common problem is that editors open files which contain only the characters common to ASCII and UTF-8 and call them ASCII (even if they were previously saved as UTF-8). This happens largely because the text contains no characters above 127, so the editor has to assume the encoding.
As an introductory tutorial, it needs a bit more body - but it's a good entry point for people unfamiliar or even unaware of the problem.
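The editor problem above comes down to the fact that ASCII-only text produces exactly the same bytes in ASCII and in UTF-8, so there is nothing in the file to detect. A quick Python illustration (the helper `looks_like_ascii` is a made-up name, not something any particular editor uses):

```python
def looks_like_ascii(data: bytes) -> bool:
    """True when no byte is above 0x7F, i.e. the file could be ASCII or UTF-8."""
    return all(byte < 0x80 for byte in data)


plain = "Hello, world"
accented = "na\u00efve"  # "naïve"

# Identical bytes: an editor cannot tell which encoding was intended.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# One non-ASCII character is enough to make the UTF-8 bytes distinguishable.
print(same_bytes)                                   # True
print(looks_like_ascii(plain.encode("utf-8")))      # True
print(looks_like_ascii(accented.encode("utf-8")))   # False
```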
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Ach, it's not done yet!
A lot of the things you guys mentioned I'm planning on adding, but you also mentioned some things that I wasn't thinking about and will toss in now.
alex.barylski wrote:The above is, I assume, the catch-all charset declaration? This would work with any language?
Technically speaking, yes. UTF-8, however, will not magically make other character encodings work.
alex.barylski wrote:Your browser loads a page and checks for that META tag and based on that charset declaration it consumes the bytes of the document (minus the tags?) and renders them accordingly?
The browser first checks the server's Content-Type header (thus the distinction between real and embedded encoding), then falls back to the META tag if that was unsuccessful.
alex.barylski wrote:What if your system locale is set to Thai? would the characters appear as gibberish little boxes?
ASCII characters usually turn out okay because most character encodings preserve 0-127 as the regular characters. For the more funky ones, yes, the characters would become gibberish.
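To illustrate, here is a short Python sketch that decodes the same bytes under a few encodings. The choice of encodings and of the byte 0xC7 is arbitrary; `cp1256` and `cp874` are Python's names for the Windows Arabic and Thai code pages.

```python
# The three bytes from the question: 65, 66, 67.
abc = bytes([65, 66, 67])

# Every encoding listed here keeps the ASCII range, so each one gives "ABC".
encodings = ["ascii", "utf-8", "latin-1", "cp1256", "cp874"]
decoded_abc = {enc: abc.decode(enc) for enc in encodings}

# A single byte above 127 means something different in each legacy encoding.
high = bytes([0xC7])
decoded_high = {enc: high.decode(enc) for enc in ["latin-1", "cp1256", "cp874"]}

for enc, char in decoded_high.items():
    print(enc, ascii(char))  # three different code points for the same byte
```

The English text survives any of these declarations; it is only the bytes above 127 that change meaning.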
alex.barylski wrote:I'm guessing fonts like "Tahoma" don't render in Japanese or Thai, etc? or does it depend on the font?
Usually not. If a browser is smart, it'll automatically switch to another font that does support Japanese or Thai glyphs, but Internet Explorer doesn't do this, so you have to use Unicode-friendly fonts.
matthijs wrote:http://www.sitepoint.com/article/guide- ... r-encoding
This article just got published, very long and good read.
I'll look into it.
@The Ninja Space Goat: Your suggestions are good
The Ninja Space Goat wrote:I'm not sure why it's entitled "UTF-8" because it is more of a general encoding overview as far as I can tell
Well, the point is to get people to move to UTF-8. I can change it around if necessary.
Ollie Saunders wrote:I can't help feeling by not discussing UTF-8's relationship with forms your tutorial is incomplete.
Forms actually become quite simple as long as the containing page is in UTF-8. They get really weird when it's not. But yeah, I'll go over that.
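A hedged Python illustration of why forms are simple when everything is UTF-8: browsers normally submit form data in the encoding of the containing page, so the server only has to decode with the same encoding. The field name and value below are invented for the example.

```python
from urllib.parse import parse_qs, urlencode

# Roughly what a browser sends from a UTF-8 page: percent-encoded UTF-8 bytes.
body = urlencode({"name": "caf\u00e9"}, encoding="utf-8")  # 'name=caf%C3%A9'

# Decoding with the matching encoding gives the original text back.
good = parse_qs(body, encoding="utf-8")

# Decoding with the wrong encoding silently produces garbage instead.
bad = parse_qs(body, encoding="latin-1")

print(body)
print(ascii(good["name"][0]))  # 'caf\xe9'
print(ascii(bad["name"][0]))   # 'caf\xc3\xa9'
```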
Ollie Saunders wrote:I'm not sure the preceding sentence to that list introduces it correctly. The sentence sounds more interesting than that list turns out to be.
Hmm... maybe I can massage that sentence into something more appropriate.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Okay, I've finished the portion on figuring out the encodings of your pages and making sure they are consistent. This is a self-contained section in and of itself, even without the discussion of UTF-8. I'll be continuing later.
@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.
I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
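On the BOM point, a minimal Python sketch of what the byte order mark looks like and how it can be detected or dropped. The helper name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file as some editors save it: three BOM bytes, then the content.
with_bom = codecs.BOM_UTF8 + "<html>".encode("utf-8")

print(with_bom[:3])                        # b'\xef\xbb\xbf'
print(ascii(with_bom.decode("utf-8")))     # BOM survives as '\ufeff' at the start
print(with_bom.decode("utf-8-sig"))        # the -sig codec drops it
print(strip_utf8_bom(with_bom))            # b'<html>'
```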
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
That sentence I spoke about is much better now.
Also I'd like to commend you on the excellent use of layout and typography and superb use of semantic HTML, neither of which I noticed the first time. Then again, I'd expect no less from you, AC. One thing which could improve findability is one of those Wikipedia-style article contents things.
Hint: You might want to also mention
Code: Select all
php_value default_charset "UTF-8"
Oh and thanks for the section on XML. It taught me lots I didn't know.
- alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Ollie Saunders wrote:One thing which could improve findability is one of those Wikipedia-style article contents things.
I'll add it when I finish. They're a pain to generate, a pain to style semantically correctly, and I haven't created a neat script that does it automatically.
Ollie Saunders wrote:php_value default_charset "UTF-8"
Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it depends on two systems, Apache and PHP, a combination that, while common, I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
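The same idea of sending the charset from application code instead of server configuration, sketched in Python rather than PHP. This is only an analogue of calling header() from a front controller: a minimal WSGI app (the name `app` and the page content are invented) that puts the charset in the real HTTP header.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares its charset in the real HTTP header."""
    body = "<p>caf\u00e9</p>".encode("utf-8")
    headers = [
        # The charset here must match how the body was actually encoded.
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Because the header is set in one place for every response, it does not depend on how the web server in front of it is configured.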
Ollie Saunders wrote:Oh and thanks for the section on XML. It taught me lots I didn't know.
Good to hear.
alex.barylski wrote:the more detail the better...
Well, you can only put in so much info before people's brains explode...
alex.barylski wrote:Discuss everything, from every angle...the more detail the better...
Well, for certain audiences this is a good idea, but I think Ambush Commander was going for a highly effective, yet quick-and-dirty explanation of the subject.
One of the largest problems with designing content for the internet is that there is already so much of it out there. It is so easy for your reader to push the little X on the top right and find another source of information. You want to be sure not to scare them off with a novel on the subject.
AC, I think you've done a nice job of finding that balance between giving a lot of detail, and not boring the reader.
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
Ambush Commander wrote:Hmm... I'll throw it in for the sake of comprehensiveness, but it requires PHP to be running as an Apache module (usually not the case). Plus, it's a dependency on two systems: Apache and PHP, that while common I'd like to shy away from. A technique can depend solely on PHP, or it can depend solely on Apache, but depending on both is asking for trouble. Plus, header() is usually a reasonable proposition if you're using a front controller architecture or an old-fashioned common include file.
You can just talk about default_charset in php.ini instead, then.
Ambush Commander wrote:@matthijs: I've incorporated your suggestion, but I'd like to tell you that AddDefaultCharset is a very blunt weapon, and you may want to let META tags take precedence by setting it Off.
Well, I was just mentioning this because I found it a bit misleading in your (original) text. It came across as if just setting the META tag would be enough, while that obviously isn't the case. But as I read the current text that's totally clear, so either I didn't read the original text well or you changed it. Either way it's OK now.
Ambush Commander wrote:I'm wondering whether or not it's in the scope of this tutorial to also discuss editors and setting it to UTF-8. It's definitely a major pitfall, but I'm going off the heuristic: "Are users complaining about garbled text?" I am assuming that they've already got something working, and would like to fix it. Maybe when I talk about migrating to UTF-8. Same goes with BOM.
I would definitely mention it, because many people (?) will use a text editor to edit HTML and set their META tag to UTF-8 thinking they are doing the right thing, not knowing their HTML files are saved in some other encoding... oops.