Non-SGML characters
Posted: Thu Aug 17, 2006 5:07 pm
by Ambush Commander
I'm all about how HTMLPurifier will always produce standards-compliant code etc etc etc but when I started feeding it interesting stuff (such as null bytes and control characters) the validator started yelling at me...
Error Line 21 column 67: non SGML character number 0.
Aghast at this gaping hole, I did some research, and it seems that while a null byte is permitted in terms of UTF-8, it is not permitted in terms of SGML (neither are 0 to 31 inclusive and 127 to 159 inclusive).
However, people do use these characters as, well, characters, and in the back of my mind I'm thinking that these may be used as surrogate pairs for UTF-8 (I'm probably wrong).
For this reason, I'm not really sure what I should do about these characters after I've made sure the UTF-8 is well formed. Should I:
1. Remove all non-SGML characters
2. Replace all non-SGML characters with their corresponding numeric entities
3. Something else?
And, I also wonder, how should I go about doing these things?
Posted: Thu Aug 17, 2006 5:16 pm
by feyd
Replace for the printable, but remove for the non-printable.
Don't know if it'd help or not but a while ago I posted a UTF-8 building toy:
viewtopic.php?p=191404#191404
Obviously that was before I did Unit Testing in PHP.

Posted: Thu Aug 17, 2006 7:02 pm
by Ambush Commander
feyd wrote: Replace for the printable, but remove for the non-printable.
Okay, the specific SGML exclusions work out nicely since they're all non-printable. Well, make an exception for 9 (tab), 10 (line feed) and 13 (carriage return).
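The rule above (drop 0 to 31 and 127 to 159, but keep tab, line feed and carriage return) can be sketched roughly like this. It is written in Python rather than PHP purely for illustration, `strip_non_sgml` is a made-up name, and it assumes the input has already been decoded into a well-formed Unicode string, so it works on code points and not raw bytes:

```python
import re

# Code points 0-31 and 127-159, minus tab (9), line feed (10)
# and carriage return (13), which are kept.
NON_SGML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_non_sgml(text: str) -> str:
    """Remove non-SGML control characters from an already-decoded string.

    Hypothetical helper: assumes `text` is a Unicode string, so the
    ranges match code points, never bytes inside a multi-byte sequence.
    """
    return NON_SGML.sub("", text)
```

Calling `strip_non_sgml("a\x00b\tc")` would drop the null and keep the tab. Anything above 159 is left alone.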
Bytes higher than 159 might be troublesome, since they may be valid or invalid UTF-8 depending on context, but as long as I use iconv, mbstring, some utf8 validator function, or DOM (which seems to handle those gracefully), I think I'll be okay.
I hope.
:-/
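For the "some utf8 validator function" part, one possible shape is to lean on a strict decoder and treat any failure as malformed input. This is only a sketch in Python (not iconv or mbstring), and `is_well_formed_utf8` is a hypothetical name:

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Note that a check like this only says whether the byte sequence is well formed. A null byte is perfectly good UTF-8, so it passes, and the non-SGML filtering still has to happen as a separate step.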
Posted: Thu Aug 17, 2006 7:08 pm
by feyd
Taking the function in my linked post, one could reverse it to convert UTF-8 characters into HTML entity forms. I don't know how difficult that would be, considering it's been a while since I've even looked at the code there, but it may be something to consider.
Posted: Thu Aug 17, 2006 7:10 pm
by Ambush Commander
Well, the idea is to output as much as possible in pure UTF-8, skimping on the numeric entities.
Posted: Thu Aug 17, 2006 7:16 pm
by feyd
Ambush Commander wrote: Well, the idea is to output as much as possible in pure UTF-8, skimping on the numeric entities.
Hmm, what about taking the UTF-8 back to entity (i.e. integer) form and deciding from there whether to keep it or throw it out? I haven't taken the time to dive through mbstring and iconv enough to know if there's a function that can perform that already.
Hmm, something to definitely think about while I'm trying to build more (unrelated) tests.
Posted: Thu Aug 17, 2006 8:12 pm
by Ambush Commander
feyd wrote: Hmm, what about taking the UTF-8 back to entity (i.e. integer) form and deciding from there whether to keep it or throw it out? I haven't taken the time to dive through mbstring and iconv enough to know if there's a function that can perform that already.
That's precisely what many Unicode compatibility libraries do. Convert a string into an array of integers. Of course, it sounds and probably is slow, so if I can get iconv or mbstring to make it well-formed first, that would be delightful.
Hmm... the thing is that those libraries won't mind a null byte or two in a UTF-8 stream, so I might have to decompose it anyway (or I could do a massive str_replace)... choices, choices...
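The "string into an array of integers" approach might look something like the following. Again this is a Python sketch under assumptions: the function names are invented for illustration, the input is assumed to be well-formed UTF-8 already, and the allowed set is the one discussed earlier in the thread (tab, line feed and carriage return survive):

```python
def to_codepoints(data: bytes) -> list:
    """Decode well-formed UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def is_allowed(codepoint: int) -> bool:
    """Keep tab/LF/CR; reject the rest of 0-31 and all of 127-159."""
    if codepoint in (9, 10, 13):
        return True
    return not (codepoint <= 31 or 127 <= codepoint <= 159)


def filter_codepoints(data: bytes) -> bytes:
    """Drop disallowed code points and re-encode the rest as UTF-8."""
    kept = [cp for cp in to_codepoints(data) if is_allowed(cp)]
    return "".join(chr(cp) for cp in kept).encode("utf-8")
```

It is slower than a straight byte replace, but because the decision is made per code point it cannot damage a multi-byte character.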
Posted: Thu Aug 17, 2006 11:22 pm
by Ambush Commander
Hmph, so the byte values of these disallowed characters CAN be found later on inside multi-byte UTF-8 characters (128 to 159 are valid continuation bytes). Looks like I'll be using your function again, Feyd.
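A small illustration of the problem, as a Python sketch with made-up helper names: U+00C0 encodes to the two bytes C3 80, and 0x80 falls in the 128 to 159 range, so a byte-level strip of that range breaks the character.

```python
def bytes_in_c1_range(data: bytes) -> bytes:
    """Return the bytes of `data` whose value is in 128..159."""
    return bytes(b for b in data if 0x80 <= b <= 0x9F)


def naive_byte_strip(data: bytes) -> bytes:
    """Byte-level removal of 0..31 and 127..159 (keeps 9, 10, 13).

    Shown only to demonstrate the failure mode: it is NOT safe
    to run on UTF-8, because it also eats continuation bytes.
    """
    return bytes(
        b for b in data
        if b in (9, 10, 13) or not (b <= 31 or 127 <= b <= 159)
    )


encoded = "\u00c0".encode("utf-8")   # the two bytes C3 80
damaged = naive_byte_strip(encoded)  # the lone lead byte C3 is left behind
```

The leftover lead byte is no longer well-formed UTF-8, which is why the filtering has to happen on decoded code points.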

Posted: Thu Aug 17, 2006 11:26 pm
by feyd
Posted: Thu Aug 17, 2006 11:36 pm
by Ambush Commander
Hmm... while this'll be great for parsing numeric entities, it doesn't look like it's suited for filtering out actual Unicode characters that aren't escaped. Is that correct?
Posted: Thu Aug 17, 2006 11:41 pm
by feyd
The current state is to convert entities to UTF8 octets, however the math involved can be reversed to convert an octet-set to integer.
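Reversing the math, i.e. going from an octet-set back to an integer, could be sketched like this. It is not the function from the linked post, just a generic illustration in Python with a hypothetical name. It assumes it is handed exactly one character's worth of octets and does not check for overlong forms or surrogates:

```python
def utf8_octets_to_codepoint(octets: bytes) -> int:
    """Decode the octets of ONE UTF-8 character into its integer code point."""
    first = octets[0]
    if first < 0x80:               # 0xxxxxxx: plain ASCII
        value, expected = first, 1
    elif first >> 5 == 0b110:      # 110xxxxx: 2-octet sequence
        value, expected = first & 0x1F, 2
    elif first >> 4 == 0b1110:     # 1110xxxx: 3-octet sequence
        value, expected = first & 0x0F, 3
    elif first >> 3 == 0b11110:    # 11110xxx: 4-octet sequence
        value, expected = first & 0x07, 4
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    if len(octets) != expected:
        raise ValueError("wrong number of octets for this lead byte")
    for byte in octets[1:]:
        if byte >> 6 != 0b10:      # continuation bytes look like 10xxxxxx
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)  # append the low 6 bits
    return value
```

Each continuation byte contributes six bits, so the lead byte's payload is shifted left by six for every byte that follows.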
Posted: Fri Aug 18, 2006 3:07 pm
by Ambush Commander
I was never good at bit-wise math.

I've gotten my hands on another UTF8 library that converts a UTF8 string into an array of Unicode codepoints.
Posted: Fri Aug 18, 2006 3:13 pm
by feyd
Fair enough.
