Hairy Percent Encoding
Posted: Tue Nov 07, 2006 10:12 am
Okay, so I've done research, come to conclusions, but I don't think they're very good conclusions, so I'm asking for a second opinion.
Percent encoding in URIs. The idea is quite simple: if you want to use a character that has special meaning in the URI, you have to encode it. But when do you decode these percent encodings?
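To make the basic idea concrete, here is a minimal sketch using Python's standard `urllib.parse` (the language and the sample value `rock&roll` are illustrative assumptions, not anything from my actual code). It shows a literal "&" being encoded so that it cannot be mistaken for a query delimiter, and decoded again once the value has been extracted.

```python
from urllib.parse import quote, unquote

# Hypothetical value that contains "&", which would otherwise
# be read as a separator between query parameters.
value = "rock&roll"

# safe="" asks quote() to encode every reserved character.
encoded = quote(value, safe="")
query = "q=" + encoded
print(query)

# Decode only after the value has been separated from its key.
key, raw = query.split("=", 1)
print(key, unquote(raw))
```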
The RFC has a dedicated section (2.4) on "When to Encode or Decode", but it isn't really helpful. It says:
"When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded..."
Yes, yes, but do I have to decode the percent-encoded octets? And when am I allowed to decode the octets? The question is "When is a significant character not significant anymore?" That question is scheme-specific: we'll get different answers according to the different schemes we defer to. Which means we can't do anything for path or query, which, as fate would have it, are the most important ones.
But we can, as the generic URI syntax specifies the meaning of "/" in path. What this means is that I should already be parsing the path for the scheme validators, and those validators don't have to check up on the slashes.
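A small sketch of why the order matters for paths, assuming Python's `urllib.parse.unquote` and a made-up path `/docs/a%2Fb/c`: splitting on the literal "/" first keeps the encoded slash inside its segment, while decoding first turns it into an extra delimiter.

```python
from urllib.parse import unquote

def split_then_decode(path):
    """Split on literal "/" first, then decode each segment."""
    return [unquote(segment) for segment in path.split("/")]

def decode_then_split(path):
    """The unsafe order: decoding first turns %2F into a real delimiter."""
    return unquote(path).split("/")

path = "/docs/a%2Fb/c"  # hypothetical example path
print(split_then_decode(path))  # ['', 'docs', 'a/b', 'c']
print(decode_then_split(path))  # ['', 'docs', 'a', 'b', 'c']
```

The first result has three real segments, one of which contains a slash as data; the second has four, so the structure of the path has changed.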
So now I've got two completely inconsistent processing mechanisms for percent-encoding in path and query. Makes me wonder whether I should just not bother and throw the whole caboodle at the lower-level people to handle. Which is what I'm doing right now, except they're not handling it.
Perhaps we should resolve all unreserved characters and illegal characters while keeping reserved characters intact for the sub-schemes to handle. Illegal characters include spaces, non-ASCII characters (raw UTF-8 bytes), backslashes, carets, etc. Good idea?
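Here is one possible reading of that proposal as a sketch, not a settled design. It assumes "resolve" means: decode percent-encoded unreserved characters, percent-encode raw illegal characters, and leave everything else (including encoded reserved characters) alone apart from uppercasing the hex digits. The function name `normalize_percent_encoding` is hypothetical, and the character sets are the unreserved and reserved sets from RFC 3986.

```python
import re

# RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# RFC 3986 reserved set: gen-delims plus sub-delims
RESERVED = set(":/?#[]@!$&'()*+,;=")

PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_percent_encoding(component):
    """Hypothetical normalizer for one URI component (assumed semantics)."""
    out = []
    i = 0
    while i < len(component):
        match = PCT.match(component, i)
        if match:
            decoded = chr(int(match.group(1), 16))
            if decoded in UNRESERVED:
                # Encoded unreserved character: safe to decode.
                out.append(decoded)
            else:
                # Reserved or other octet: keep encoded, uppercase the hex.
                out.append("%" + match.group(1).upper())
            i += 3
            continue
        ch = component[i]
        if ch in UNRESERVED or ch in RESERVED:
            # Legal literal character: pass through for the scheme to judge.
            out.append(ch)
        else:
            # Illegal literal (space, backslash, caret, non-ASCII, stray "%"):
            # encode each of its UTF-8 bytes.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        i += 1
    return "".join(out)

print(normalize_percent_encoding("%7Euser/a%2fb c^d"))
```

With this reading, `%7E` becomes `~`, `%2f` stays encoded as `%2F` so the scheme-specific code can still tell it apart from a real slash, and the space and caret come out as `%20` and `%5E`.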