Reporting errors in a document without line numbers


Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

A filter, to some degree, changes user input. When the user input is small, this is not that much trouble, but it still can be surprising to end-users. It's more obvious on large inputs: for instance, if you perform strip_tags() on user input, people may be surprised when <email@example.com> gets mysteriously removed.
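
To make the surprise concrete, here is a minimal sketch of the strip_tags() case. The input string is made up for illustration; the only assumption is that the author typed an address inside angle brackets.

```php
<?php
// A made-up comment in which the author wrote an address in angle brackets.
$input = 'Write to me at <email@example.com> or see <b>my profile</b>.';

// strip_tags() drops anything shaped like <...>, whether or not it is real HTML.
$filtered = strip_tags($input);

echo $filtered, "\n";
```

The address vanishes along with the genuine <b> tag, and nothing tells the user why.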

Therefore, even though this may be overstepping the bounds of a filter, I would like to report back to the user precisely what errors were fixed during the filtering process. If you haven't already figured it out, this is for HTML Purifier.

After doing a little investigation, I've concluded that it is impossible to get the DOM extension to assign line numbers to the nodes it parses out. For example: if I have a random DOMElement that I realize has an invalid attribute attached to it, there is no way I can determine where in the original document the invalid attribute was. Such labeling might be possible with a pure-PHP HTML parser, but as of right now, such a feature is vaporware.

This means that the usual format like:
Error on line 13: invalid URI with javascript: protocol
...cannot be done. We'd know that the error had happened, but we wouldn't know where.

Possible solutions:

Omit the line number entirely

The error now looks like:
Invalid URI with javascript: protocol
If this happens multiple times, a naive implementation would list the same error multiple times, but if we wanted to be smart we could stack them up:
Invalid URI with javascript: protocol, occurs 2 times
The trouble with this approach is that it doesn't scale well to large documents. Suppose, for example, we had a twenty-page document. We try to submit it, and we get:
Stray ampersand without valid trailing identifier, occurs 89 times
Hmm... that could be troublesome.
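
The "stacking" idea could be sketched roughly like this. stack_errors() is a hypothetical helper, not something that exists in HTML Purifier, and it assumes errors arrive as plain strings.

```php
<?php
/**
 * Collapse repeated messages into one line each, keeping first-seen order.
 * Hypothetical helper, not part of HTML Purifier.
 */
function stack_errors(array $messages)
{
    $lines = array();
    // array_count_values() maps each distinct message to how often it occurred.
    foreach (array_count_values($messages) as $message => $count) {
        $lines[] = $count > 1 ? "$message, occurs $count times" : $message;
    }
    return $lines;
}

$errors = array(
    'Invalid URI with javascript: protocol',
    'Stray ampersand without valid trailing identifier',
    'Invalid URI with javascript: protocol',
);
print_r(stack_errors($errors));
```

This keeps the report short, but as noted above it says nothing about where each occurrence is.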

There would be some argument over the granularity of the errors: you wouldn't want to output every little thing lest the user be flooded with a deluge of warnings and notices, so should the error reporting system tell the user that a non-SGML character was detected and fixed?

Diff the original and the new version

This lets the user see what precisely the filter changed before letting it go through. Doing it like this, however, is very heavyweight: you'll need to bundle a diff library and corresponding HTML components to make it work. Similarly, it is oriented to power-users: WYSIWYG editors beware!
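
For a sense of what the diff approach involves, here is a naive line-based diff built on a longest-common-subsequence table. It is only a sketch: line_diff() is a hypothetical function, a real diff library would be far more efficient, and the HTML rendering of the result is left out entirely.

```php
<?php
/**
 * Naive line diff built on a longest-common-subsequence table.
 * Returns lines prefixed with "  " (kept), "- " (removed) or "+ " (added).
 */
function line_diff($old, $new)
{
    $a = explode("\n", $old);
    $b = explode("\n", $new);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] holds the LCS length of $a[$i..] and $b[$j..].
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table from the top-left corner, emitting one line per step.
    $out = array();
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $out[] = '  ' . $a[$i];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $out[] = '- ' . $a[$i];
            $i++;
        } else {
            $out[] = '+ ' . $b[$j];
            $j++;
        }
    }
    while ($i < $n) { $out[] = '- ' . $a[$i++]; }
    while ($j < $m) { $out[] = '+ ' . $b[$j++]; }
    return $out;
}

$before = "<center>Hello</center>\n<p>World</p>";
$after  = "<div style=\"text-align:center;\">Hello</div>\n<p>World</p>";
echo implode("\n", line_diff($before, $after)), "\n";
```

Even this toy version needs a quadratic table, which hints at why the approach is heavyweight for a twenty-page document.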

Don't implement error reporting at all

This is definitely the proposal that needs the least labor, and it prompts the question: is it really helpful to notify the user that their <center> tag was turned into a <div style="text-align:center;">? The filter is acting in the most graceful way possible, and what it's really doing is swallowing up as many errors as possible.

The point of error reporting, I suppose, is to prevent HTML Purifier from silently discarding important information from the document. If the user omits a closing angle bracket, they should not be surprised when half their document disappears. However, this is more a matter of well-formedness, and these really dangerous typos are the ones that are the hardest to detect!

Plus, there are hidden costs. The moment you create messages that are displayed to the user, you run into I18N issues. A French website doesn't want to be displaying English error messages.
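
One common way to contain the I18N cost is to have the filter emit stable message identifiers and look the text up per locale. The sketch below assumes that design; the catalog layout, the identifier 'uri.bad_scheme', the French wording, and format_error() are all illustrative inventions.

```php
<?php
// Hypothetical catalogs keyed by locale, then by a stable error identifier.
$catalogs = array(
    'en' => array('uri.bad_scheme' => 'Invalid URI with %s protocol'),
    'fr' => array('uri.bad_scheme' => 'URI invalide avec le protocole %s'),
);

/**
 * Look up a message by identifier, falling back to English, then to the raw id.
 */
function format_error(array $catalogs, $locale, $id, array $args = array())
{
    if (isset($catalogs[$locale][$id])) {
        $template = $catalogs[$locale][$id];
    } elseif (isset($catalogs['en'][$id])) {
        $template = $catalogs['en'][$id];
    } else {
        return $id;
    }
    // vsprintf() fills the %s placeholders from the argument array.
    return vsprintf($template, $args);
}

echo format_error($catalogs, 'fr', 'uri.bad_scheme', array('javascript:')), "\n";
```

Every new error message then implies a new catalog entry per supported language, which is the hidden cost in question.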

You'd even wonder: "Wait a second, are people even using this feature at all?"

Circumstances

I have made the initial investment of creating a registry object named Context which gets passed by parameter to essentially every class out there. I have not started the error logging code though. Anyone have sage advice on where to go on from here?
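
As a starting point for discussion, error logging could hang off that registry roughly as follows. This is a sketch under assumptions: SketchContext, SketchErrorCollector and validate_uri() are hypothetical names, not HTML Purifier's actual classes or API.

```php
<?php
// Hypothetical registry; the names are illustrative, not HTML Purifier's API.
class SketchContext
{
    private $items = array();

    public function register($name, $value)
    {
        $this->items[$name] = $value;
    }

    public function get($name)
    {
        return isset($this->items[$name]) ? $this->items[$name] : null;
    }
}

// Hypothetical collector that filter steps push messages into.
class SketchErrorCollector
{
    private $errors = array();

    public function send($message)
    {
        $this->errors[] = $message;
    }

    public function all()
    {
        return $this->errors;
    }
}

// A stand-in for any filter step that receives the context by parameter.
function validate_uri($uri, SketchContext $context)
{
    if (stripos($uri, 'javascript:') === 0) {
        $collector = $context->get('ErrorCollector');
        if ($collector !== null) { // reporting stays optional
            $collector->send('Invalid URI with javascript: protocol');
        }
        return false;
    }
    return true;
}

$context = new SketchContext();
$context->register('ErrorCollector', new SketchErrorCollector());
validate_uri('javascript:alert(1)', $context);
print_r($context->get('ErrorCollector')->all());
```

Because the collector is looked up at the point of use, callers who never register one pay almost nothing for the feature.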
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

I think that showing detailed information could be useful. Perhaps another option would be to have an alternate, much lower-performance parser that is only loaded and run when an error occurs or when the information is requested. It seems that the problem is DOM, and the functionality is only needed when information is requested -- so performance can be sacrificed.

Post by Ambush Commander »

Indeed, I do have a lower performance parser that could be adapted to work with this (although getting it to work could be a little hairy). It's good to know that someone would find this helpful, so I'll try to make it on-demand.
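
To illustrate why a pure-PHP scan can recover positions where DOM cannot, here is a minimal sketch that converts a byte offset into a line number. It is not the existing parser: find_stray_ampersands() is a hypothetical function, and its entity pattern is a deliberate simplification.

```php
<?php
/**
 * Report the line of every ampersand that does not start an entity.
 * The pattern is a deliberate simplification of real entity syntax.
 */
function find_stray_ampersands($html)
{
    $errors = array();
    // PREG_OFFSET_CAPTURE makes each match an array of (text, byte offset).
    preg_match_all('/&(?!#?[A-Za-z0-9]+;)/', $html, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as $match) {
        $offset = $match[1];
        // Line number = newlines before the offset, plus one.
        $line = substr_count(substr($html, 0, $offset), "\n") + 1;
        $errors[] = "Error on line $line: stray ampersand without valid trailing identifier";
    }
    return $errors;
}

$html = "<p>Fish &amp; chips</p>\n<p>Salt & vinegar</p>\n<p>R&D</p>";
print_r(find_stray_ampersands($html));
```

Counting newlines on every match is slow for large inputs, which fits the idea of running this path only on demand.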

If multiple configuration directives influence which Lexer to use, how should I specify their precedence?

Post by Christopher »

Ambush Commander wrote: If multiple configuration directives influence which Lexer to use, how should I specify their precedence?
Is that a ball of snakes or a kettle of fish!?! ;) It sounds like you might want to add some hinting, but I don't know the system well enough to know where would be the best spot.

Post by Ambush Commander »

Well, up until this point, I thought that there'd really be no legitimate reason for casual end-users to need to be switching what basically is the HTML parser implementation. Currently, what you'd need to do is:

Code:

// Swap in the DirectLex implementation as the lexer instance.
$prototype = new HTMLPurifier_Lexer_DirectLex();
HTMLPurifier_Lexer::instance($prototype);

If we decide to tie a configurable feature to an extra feature of one of these implementations, we need to transparently configure the choice. I'll think about it. For now, though, I'm going to offer non-line-numbered errors.