Non-SGML characters

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I'm all about how HTMLPurifier will always produce standards-compliant code, etc., but when I started feeding it interesting stuff (such as null bytes and control characters), the validator started yelling at me:
Error Line 21 column 67: non SGML character number 0.
Aghast at this gaping hole, I did some research, and it seems that while a null byte is permitted in UTF-8, it is not permitted in SGML (nor are characters 0 to 31 inclusive and 127 to 159 inclusive).

However, people do use these characters as, well, characters, and in the back of my mind I'm thinking that these may be used as surrogate pairs for UTF-8 (I'm probably wrong).

For this reason, I'm not really sure what I should do about these characters after I've made sure the UTF-8 is well formed. Should I:

1. Remove all non-SGML characters
2. Replace all non-SGML characters with their corresponding numeric entities
3. Something else?

And, I also wonder, how should I go about doing these things?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Replace for the printable, but remove for the non-printable.

Don't know if it'd help or not but a while ago I posted a UTF-8 building toy: viewtopic.php?p=191404#191404

Obviously that was before I did Unit Testing in PHP. :)

Post by Ambush Commander »

feyd wrote: Replace for the printable, but remove for the non-printable.
Okay, the specific SGML exclusions work out nicely since they're all non-printable. Well, with an exception for 9 (tab), 10 (line feed) and 13 (carriage return), which are allowed.
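
A minimal sketch of that removal rule, assuming the input is already well-formed UTF-8. The function name `stripNonSgml` is hypothetical and is not HTMLPurifier's actual API; the ranges are the ones named above (0 to 31 and 127 to 159, minus tab, line feed and carriage return).

```php
<?php
// Hypothetical helper for illustration only.
function stripNonSgml(string $utf8): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so
    // \x{7F}-\x{9F} matches code points U+007F..U+009F rather than raw bytes.
    // 0x09 (tab), 0x0A (line feed) and 0x0D (carriage return) are left alone.
    $pattern = '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u';
    $clean = preg_replace($pattern, '', $utf8);

    // With /u, preg_replace() returns null if the subject is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $clean;
}
```

Because the match is done on code points, a character such as U+00C0 (bytes C3 80) is left untouched even though one of its bytes is 0x80.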

Bytes higher than 159 might be troublesome, since they may be valid or invalid UTF-8 depending on context, but as long as I use iconv, mbstring, some utf8 validator function, or DOM (which seems to handle those gracefully), I think I'll be okay.
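
A rough sketch of the "make sure it's well formed first" step. `isWellFormedUtf8` is a hypothetical name; `mb_check_encoding()` is used when the mbstring extension is loaded, with a PCRE-based check as an assumed fallback.

```php
<?php
// Hypothetical helper; the underlying calls are standard PHP functions.
function isWellFormedUtf8(string $bytes): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($bytes, 'UTF-8');
    }
    // Fallback: with the /u modifier, preg_match() returns false
    // (not 0 or 1) when the subject is malformed UTF-8.
    return preg_match('//u', $bytes) === 1;
}
```

Note that a check like this only answers whether the bytes are valid UTF-8; a string containing a null byte still passes, so the control characters need separate handling.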

I hope.

:-/

Post by feyd »

Taking the function in my linked post, one could reverse it to convert UTF-8 characters into HTML entity forms. I don't know how difficult that would be considering it's been a while since I have even looked at the code there, but it may be something to consider.

Post by Ambush Commander »

Well, the idea is to output as much as possible in pure UTF-8, skimping on the numeric entities.

Post by feyd »

Ambush Commander wrote: Well, the idea is to output as much as possible in pure UTF-8, skimping on the numeric entities.
Hmm, what about taking the UTF-8 back to its entity (i.e. integer) form and deciding from there whether to keep it or throw it out? I haven't taken the time to dive through mbstring and iconv enough to know if there's a function that can perform that already.

Hmm, something to definitely think about while I'm trying to build more (unrelated) tests.

Post by Ambush Commander »

feyd wrote: Hmm, what about taking the UTF-8 back to its entity (i.e. integer) form and deciding from there whether to keep it or throw it out? I haven't taken the time to dive through mbstring and iconv enough to know if there's a function that can perform that already.
That's precisely what many Unicode compatibility libraries do: convert a string into an array of integers. Of course, it sounds slow and probably is, so if I can get iconv or mbstring to make it well-formed first, that would be delightful.
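
One way to get that array of integers without hand-written bit math is to let iconv re-encode the string as fixed-width UCS-4 and unpack it. This is only a sketch under the assumption that the iconv extension is available; `utf8ToCodepoints` is a hypothetical name, not a function from any of the libraries mentioned.

```php
<?php
// Hypothetical helper: UTF-8 string in, list of Unicode code points out.
function utf8ToCodepoints(string $utf8): array
{
    if ($utf8 === '') {
        return [];
    }
    // UCS-4BE stores every character as 4 big-endian bytes.
    // iconv() returns false (and raises a notice, silenced here) on malformed input.
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $utf8);
    if ($ucs4 === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    // 'N*' reads unsigned 32-bit big-endian integers; unpack() keys start
    // at 1, so array_values() re-indexes the result from 0.
    return array_values(unpack('N*', $ucs4));
}
```

For example, `utf8ToCodepoints("A\x00\xC3\xA9")` yields the integers 65, 0 and 233, which can then be filtered against the disallowed ranges before re-encoding.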

Hmm... the thing is that those libraries won't mind a null byte or two in a UTF-8 stream, so I might have to decompose it anyway (or I could do a massive str_replace)... choices, choices...

Post by Ambush Commander »

Hmph, so these disallowed characters CAN be found later on in UTF-8 characters. Looks like I'll be using your function again Feyd. ;-)
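
One reading of this: byte values 128 to 159 also occur as continuation bytes inside perfectly valid multi-byte characters, so a byte-level str_replace can corrupt them. A small sketch of that pitfall (the variable names are only for illustration):

```php
<?php
$aGrave  = "\xC3\x80"; // U+00C0 (À): a valid character whose second byte is 0x80
$control = "\xC2\x80"; // U+0080: one of the disallowed control characters

// Byte-level removal of 0x80 chops the valid character in half,
// leaving a lone lead byte 0xC3 behind.
$broken = str_replace("\x80", '', $aGrave);

// Code-point-level removal (note the /u modifier) drops only U+0080..U+009F
// and keeps U+00C0 intact.
$kept = preg_replace('/[\x{80}-\x{9F}]/u', '', $aGrave . $control);
```

After this runs, `$broken` is no longer valid UTF-8, while `$kept` still holds the original two bytes of U+00C0.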

Post by Ambush Commander »

Hmm... while this'll be great for parsing numeric entities, it doesn't look like it's suited for filtering out actual Unicode characters that aren't escaped. Is that correct?

Post by feyd »

In its current state it converts entities to UTF-8 octets; however, the math involved can be reversed to convert an octet set to an integer.
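
As a rough sketch of that reversal (not feyd's actual function, which is not reproduced in this thread), a single UTF-8 sequence can be folded back into an integer with a few masks and shifts. `utf8CharToInt` is a hypothetical name, and the sketch assumes the input is exactly one character; it does not reject overlong encodings or surrogates.

```php
<?php
// Hypothetical helper: one UTF-8 encoded character in, its code point out.
function utf8CharToInt(string $octets): int
{
    if ($octets === '') {
        throw new InvalidArgumentException('Empty input');
    }
    $lead = ord($octets[0]);
    if ($lead < 0x80) {                      // 0xxxxxxx: plain ASCII
        $cp = $lead;
        $extra = 0;
    } elseif (($lead & 0xE0) === 0xC0) {     // 110xxxxx: 1 continuation byte
        $cp = $lead & 0x1F;
        $extra = 1;
    } elseif (($lead & 0xF0) === 0xE0) {     // 1110xxxx: 2 continuation bytes
        $cp = $lead & 0x0F;
        $extra = 2;
    } elseif (($lead & 0xF8) === 0xF0) {     // 11110xxx: 3 continuation bytes
        $cp = $lead & 0x07;
        $extra = 3;
    } else {
        throw new InvalidArgumentException('Invalid lead byte');
    }
    if (strlen($octets) !== $extra + 1) {
        throw new InvalidArgumentException('Wrong sequence length');
    }
    for ($i = 1; $i <= $extra; $i++) {
        $byte = ord($octets[$i]);
        if (($byte & 0xC0) !== 0x80) {       // continuation bytes are 10xxxxxx
            throw new InvalidArgumentException('Invalid continuation byte');
        }
        // Make room for 6 more bits and append the payload of this byte.
        $cp = ($cp << 6) | ($byte & 0x3F);
    }
    return $cp;
}
```

With the integer in hand, deciding whether to keep or drop the character is a plain range comparison.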

Post by Ambush Commander »

I was never good at bit-wise math. :-( I've gotten my hands on another UTF-8 library that converts a UTF-8 string into an array of Unicode code points.

Post by feyd »

Fair enough. :)