Therefore, even though this may be overstepping its bounds as a filter, I would like to report back to the user precisely what errors were fixed during the filtering process. If you haven't already figured it out, this is for HTML Purifier.
After doing a little investigation, I've concluded that it is impossible to get the DOM extension to assign line numbers to the nodes it parses out. For example: if I have a random DOMElement that I realize has an invalid attribute attached to it, there is no way I can determine where in the original document the invalid attribute was. Such labeling might be possible with a pure-PHP HTML parser, but as of right now, such a feature is vaporware.
This means that the usual format like:

    Error on line 13: invalid URI with javascript: protocol

...cannot be done. We'd know that the error had happened, but we wouldn't know where.
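To show what line-aware reporting would look like if the parser did track positions, here is a minimal sketch in Python (not PHP, and not part of HTML Purifier). It relies on the standard library's html.parser, whose getpos() method returns the current line number and offset. The class name LineTrackingChecker and the helper check() are made up for this illustration, and the javascript: test is deliberately simplistic.

```python
from html.parser import HTMLParser


class LineTrackingChecker(HTMLParser):
    """Hypothetical checker that records a line number with each error."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, offset) for the tag currently being handled.
        line, _offset = self.getpos()
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            # Simplistic check, for illustration only.
            if value.strip().lower().startswith("javascript:"):
                self.errors.append(
                    f"Error on line {line}: invalid URI with javascript: protocol"
                )


def check(html):
    """Parse the document and return the list of error messages."""
    checker = LineTrackingChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

Calling check() on a document whose second line contains a link with a javascript: href would yield a single message pointing at line 2. The point is only that the parser has to expose positions while it is tokenizing; once the tree is built without them, the information is gone.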
Possible solutions:
Omit the line number entirely
The error now looks like:

    Invalid URI with javascript: protocol

If this happens multiple times, a naive implementation would list the same error multiple times, but if we wanted to be smart we could stack them up:

    Invalid URI with javascript: protocol, occurs 2 times

The trouble with this approach is that it doesn't scale well to large documents. Suppose, for example, we had a twenty page doc. We tried to put it in, and we get:

    Stray ampersand without valid trailing identifier, occurs 89 times

Hmm... that could be troublesome.
There would be some argument over the granularity of the errors: you wouldn't want to output every little thing lest the user be flooded with a deluge of warnings and notices, so should the error reporting system tell the user that a non-SGML character was detected and fixed?
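The stacking idea is simple enough to sketch. The following is a Python illustration under my own assumptions (the function name stack_errors is hypothetical, and errors are assumed to arrive as plain strings); it is not how HTML Purifier does or would do it.

```python
from collections import Counter


def stack_errors(messages):
    """Collapse repeated messages into one line each, in first-seen order."""
    counts = Counter(messages)  # a dict subclass, so insertion order is kept
    lines = []
    for message, count in counts.items():
        if count == 1:
            lines.append(message)
        else:
            lines.append(f"{message}, occurs {count} times")
    return lines
```

This keeps the report short, but it also makes the scaling problem obvious: a count of 89 tells the user nothing about where to look.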
Diff the original and the new version
This lets the user see precisely what the filter changed before letting it go through. Doing it like this, however, is very heavyweight: you'll need to bundle a diff library and corresponding HTML components to make it work. It is also oriented toward power users: WYSIWYG editors beware!
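For a sense of what the diff approach produces, here is a rough sketch using Python's standard difflib module. The function name diff_report is my own, and a real implementation would still need the HTML presentation layer mentioned above.

```python
import difflib


def diff_report(original, purified):
    """Return a unified diff of the markup before and after filtering."""
    diff = difflib.unified_diff(
        original.splitlines(),
        purified.splitlines(),
        fromfile="original",
        tofile="purified",
        lineterm="",  # input lines carry no newlines, so headers shouldn't either
    )
    return "\n".join(diff)
```

If the filter changed nothing, the report is an empty string. Otherwise the removed lines are prefixed with - and the added ones with +, which is exactly the kind of output a non-technical user would struggle to read.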
Don't implement error reporting at all
This is definitely the proposal that needs the least labor. It raises the question: is it really helpful to notify the user that their <center> tag was turned into a <div style="text-align:center;">? The filter is acting in the most graceful way possible, and what it's really doing is swallowing up as many errors as possible.
The point of error reporting, I suppose, is to prevent HTML Purifier from silently discarding important information from the document. If the user omits a closing angle bracket, they should not be surprised when half their document disappears. However, this is more a matter of well-formedness, and these really dangerous typos are the ones that are the hardest to detect!
Plus, there are hidden costs. The moment you create messages that are displayed to the user, you run into I18N issues. A French website doesn't want to be displaying English error messages.
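To make the I18N cost concrete: every message needs a stable identifier and a per-language catalog. Below is a minimal Python sketch under my own assumptions. The MESSAGES table, the "uri-javascript" identifier, the French wording, and the translate() function are all hypothetical and only meant to show the shape of the problem.

```python
# Hypothetical message catalog: error identifier -> text, per locale.
MESSAGES = {
    "en": {"uri-javascript": "Invalid URI with javascript: protocol"},
    "fr": {"uri-javascript": "URI invalide avec le protocole javascript:"},
}


def translate(error_id, locale, fallback="en"):
    """Look up a message, falling back to English, then to the raw identifier."""
    catalog = MESSAGES.get(locale, {})
    if error_id in catalog:
        return catalog[error_id]
    return MESSAGES[fallback].get(error_id, error_id)
```

Even this toy version implies that someone has to maintain a translation for every error the filter can emit, in every language a site might be served in.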
You'd even wonder: "Wait a second, are people even using this feature at all?"
Circumstances
I have made the initial investment of creating a registry object named Context, which gets passed as a parameter to essentially every class out there. I have not started on the error logging code, though. Does anyone have sage advice on where to go from here?