Aghast at this gaping hole, I did some research, and it seems that while a null byte is permitted in terms of UTF-8, it is not permitted in terms of SGML (nor are code points 0 to 31 inclusive and 127 to 159 inclusive). The validator reports:
Error Line 21 column 67: non SGML character number 0.
However, people do use these characters as, well, characters, and in the back of my mind I'm thinking that these may be used as surrogate pairs for UTF-8 (I'm probably wrong).
For this reason, I'm not really sure what I should do about these characters after I've made sure the UTF-8 is well formed. Should I:
1. Remove all non-SGML characters
2. Replace all non-SGML characters with their corresponding numeric entities
3. Something else?
I also wonder how I should go about doing any of these things.
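To make options 1 and 2 concrete, here is a rough sketch of what each might look like in Python, run on text that has already been decoded from well-formed UTF-8. The names `NON_SGML`, `strip_non_sgml` and `escape_non_sgml` are made up for illustration. The character class is my own assumption: it covers the ranges mentioned above but deliberately keeps tab (9), line feed (10) and carriage return (13), since ordinary documents depend on them. I have not checked whether a validator accepts numeric references such as `&#0;`, so option 2 may not actually solve the problem.

```python
import re

# Code points treated as non-SGML here: 0-31 and 127-159,
# minus tab (9), LF (10) and CR (13), which are kept by assumption.
NON_SGML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_non_sgml(text: str) -> str:
    """Option 1: drop every matching character."""
    return NON_SGML.sub("", text)


def escape_non_sgml(text: str) -> str:
    """Option 2: replace every matching character with a decimal numeric reference."""
    return NON_SGML.sub(lambda m: f"&#{ord(m.group())};", text)


if __name__ == "__main__":
    sample = "a\x00b\tc\x85d"
    print(repr(strip_non_sgml(sample)))
    print(repr(escape_non_sgml(sample)))
```

Both functions operate on decoded strings rather than raw bytes, so they only make sense after the well-formedness check has passed.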