Hairy Percent Encoding

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Okay, so I've done research, come to conclusions, but I don't think they're very good conclusions, so I'm asking for a second opinion.

Percent encoding in URIs. The idea is quite simple: if you want to use a character that has special meaning in the URI as plain data, you have to encode it. But when do you decode these percent-encodings?

The RFC (RFC 3986) has a dedicated section (2.4) on "When to Encode or Decode", but it isn't really helpful. It says:
"When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded..."
Yes, yes, but do I have to decode the percent-encoded octets? And when am I allowed to decode them? The question is "When is a significant character not significant anymore?" That question is scheme specific: we'll get different answers depending on which scheme we defer to. Which means we can't do anything for path or query, which, as fate would have it, are the most important ones.

But we can do something for path, as the generic URI syntax specifies the meaning of "/" in path. What this means is that I should already be splitting the path on slashes before handing it to the scheme validators, and those validators don't have to check up on the slashes.
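
To make the slash point concrete, here is a minimal sketch (in Python rather than PHP, only to keep it short). The function name split_path_segments is made up for illustration; unquote is the standard library's percent-decoder. It splits on literal "/" first and decodes each segment afterwards, so an encoded %2F stays data:

```python
from urllib.parse import unquote

def split_path_segments(raw_path):
    """Split on literal "/" first, then percent-decode each segment.

    Hypothetical helper: it only illustrates the ordering, not a full
    path validator.
    """
    return [unquote(segment) for segment in raw_path.split("/")]

raw = "/docs/a%2Fb/report.txt"

# Parse, then decode: the encoded slash stays inside one segment.
print(split_path_segments(raw))   # ['', 'docs', 'a/b', 'report.txt']

# Decode, then parse: the encoded slash is mistaken for a delimiter.
print(unquote(raw).split("/"))    # ['', 'docs', 'a', 'b', 'report.txt']
```

The second call shows the failure mode the RFC warns about: once decoded, the data slash is indistinguishable from a path delimiter.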

So now I've got two completely inconsistent processing mechanisms for percent-encoding in path and query. Makes me wonder whether I should just not bother and throw the whole caboodle at the lower-level people to handle. Which is what I'm doing right now, except they're not handling it.

:?: :?: :?:

Perhaps we should normalize all unreserved characters and illegal characters (decode the percent-encoded unreserved ones, percent-encode the illegal ones) while keeping reserved characters intact for the sub-schemes to handle. Illegal characters include spaces, non-ASCII (UTF-8) characters, backslashes, carets, etc. Good idea?
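
A rough sketch of that proposal, again in Python for brevity. The name normalize_component is hypothetical, and the character sets are the unreserved and reserved sets from RFC 3986 sections 2.3 and 2.2. It is one reading of the idea, not a finished implementation: percent-encoded unreserved characters get decoded, illegal characters get encoded as UTF-8 octets, and reserved characters (literal or encoded) are left exactly as they are.

```python
import re

# Unreserved and reserved sets as listed in RFC 3986 (2.3 and 2.2).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
RESERVED = set(":/?#[]@!$&'()*+,;=")
PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_component(text):
    """Hypothetical normalizer for one URI component."""
    out = []
    i = 0
    while i < len(text):
        m = PCT.match(text, i)
        if m:
            ch = chr(int(m.group(1), 16))
            if ch in UNRESERVED:
                out.append(ch)                        # %7E -> ~
            else:
                out.append("%" + m.group(1).upper())  # keep encoded, e.g. %2f -> %2F
            i = m.end()
            continue
        ch = text[i]
        if ch in UNRESERVED or ch in RESERVED:
            out.append(ch)                            # legal as-is
        else:
            # Illegal here: spaces, backslashes, carets, non-ASCII, stray "%".
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        i += 1
    return "".join(out)

print(normalize_component("a%7Eb%2fc d\\^\u00e9"))  # a~b%2Fc%20d%5C%5E%C3%A9
```

Because an encoded reserved character such as %2F is never decoded, the sub-scheme can still tell a data slash from a delimiter afterwards.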
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

I am not sure if I understand your question completely (and certainly not very deeply), but it seems like the definition is simply saying that the scheme delimiters (if they exist) must be identified and the URI separated into its parts before the parts are decoded. It sounds like the issue is precedence.
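
The same precedence issue shows up in the query. A small sketch in Python, assuming the common "&" and "=" form-style delimiters (that convention is an assumption here, since the generic URI syntax does not define it) and a hypothetical parse_query helper:

```python
from urllib.parse import unquote

def parse_query(raw_query):
    """Separate on "&" and "=" first, then decode each name and value."""
    pairs = []
    for field in raw_query.split("&"):
        name, _, value = field.partition("=")
        pairs.append((unquote(name), unquote(value)))
    return pairs

raw = "title=Tom%26Jerry&year=1940"

# Separate, then decode: two fields, and the "&" in the title is data.
print(parse_query(raw))          # [('title', 'Tom&Jerry'), ('year', '1940')]

# Decode, then separate: the title is cut in half and a bogus field appears.
print(unquote(raw).split("&"))   # ['title=Tom', 'Jerry', 'year=1940']
```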

Post by Ambush Commander »

Yep. I'm trying to figure out how to get all the precedence nuances correct without writing an exorbitant amount of code or crippling performance. I'm already running without the ability to use urlencode() or rawurlencode(), which is a big problem.

Post by Christopher »

Ah yes ... in this corner the reigning champion Standards Nuances and in the other corner the tag team of Exorbitant Amount of Code and Crippling Performance... :)

It sounds like you need to come up with some way to define the scheme patterns (along with their nuances) so that you can isolate the parsing and separating.
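
One possible shape for that, sketched in Python with made-up names (SCHEME_SPLITTERS, decode_path and the two splitter functions are all hypothetical): each scheme registers the function that knows its own delimiters, and anything without a registered scheme is left encoded. The mailto rule here is a simplification that only handles comma-separated addresses.

```python
from urllib.parse import unquote

def split_http_path(path):
    # "/" delimits segments; decode only after splitting.
    return [unquote(segment) for segment in path.split("/")]

def split_mailto_path(path):
    # Simplified: "," delimits addresses; decode only after splitting.
    return [unquote(address) for address in path.split(",")]

# Hypothetical registry: scheme name -> function that separates and decodes.
SCHEME_SPLITTERS = {
    "http": split_http_path,
    "mailto": split_mailto_path,
}

def decode_path(scheme, path):
    splitter = SCHEME_SPLITTERS.get(scheme.lower())
    if splitter is None:
        # Unknown scheme: we cannot know which characters are significant,
        # so the safe choice is to leave the path encoded.
        return [path]
    return splitter(path)

print(decode_path("http", "/a%2Fb/c"))          # ['', 'a/b', 'c']
print(decode_path("news", "comp.lang%2Fphp"))   # ['comp.lang%2Fphp']
```

The generic code never decodes anything itself; it only dispatches, which keeps the per-scheme nuances in one place each.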