Parsing URL only according to URI : Security risk?

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

A URI is a generalization of a URL (such as one using the http scheme), and as such has a generic syntax. After checking the scheme against a set of allowed schemes and validating the rest of the URI against the ABNF specified in the RFC (RFC 3986), can the URI still present a security risk, even if it is invalid under the rules of its particular scheme?

Example: http://.........../ -- Valid according to the URI RFC; however, DNS (RFC 1034 and RFC 1123) places additional constraints on the host. It's obviously wrong, but note that Firefox only complains "Firefox can't find the server at ............" whereas a URI like "745://asdf" is marked invalid because a scheme cannot start with a digit.

Example: mailto:bob@example.com -- Valid according to the URI RFC, but only because bob@example.com becomes the path (WTF?). Obviously, there are also restrictions on the format of the email address.
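
To make the two examples concrete, here is a minimal sketch in Python (the logic ports directly to PHP). It splits a URI with the component regex from RFC 3986, Appendix B, and checks the scheme against the scheme ABNF and an allow-list. The names `split_uri`, `scheme_allowed` and `ALLOWED_SCHEMES` are made up for illustration, and the allow-list contents are an assumption.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.\-]*')

# Hypothetical allow-list; a real application would choose its own.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def split_uri(uri):
    """Split a URI into its five generic components (None when absent)."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def scheme_allowed(uri):
    """True if the URI has a syntactically valid scheme that is on the allow-list."""
    scheme = split_uri(uri)["scheme"]
    if scheme is None or SCHEME_RE.fullmatch(scheme) is None:
        return False
    return scheme.lower() in ALLOWED_SCHEMES
```

With this, `scheme_allowed("745://asdf")` is rejected because of the leading digit, while a host made only of dots passes untouched: the generic syntax has no opinion on it. For `mailto:bob@example.com`, `split_uri` reports no authority and puts the whole address in `path`.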

My question is: can I get away with not further defining valid formats for all the different schemes? And if I can't get away with it, is there any point in letting the user define more schemes without also defining validation routines for them?

One last note: in section 7, Security Considerations, the RFC lays out a whole bunch of possible considerations, but notes that "A URI does not in itself pose a security threat." javascript:, anyone?

Perhaps I should rephrase my question: how do YOU validate your URIs?

And yes, this is about HTMLPurifier. I'd rather not have any dependencies on third-party libraries.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Ambush Commander wrote: My question is can I get away with not further defining valid formats for all the different schemes? And if I can't get away with it, is there any point in letting the user define more schemes without also defining the validation routine for it?

Personally, I would take a half-hearted approach: define scheme-specific checks for the most common schemes, while allowing users to add more schemes (with required validation routines). If the scheme is unknown, fall back to the general URI syntax check; otherwise, perform the scheme-specific checks.

Post by Ambush Commander »

That makes sense... but now I need to create a URIValidatorRegistry...
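
A rough sketch of what such a registry could look like, in Python for brevity. Only the name URIValidatorRegistry comes from this thread; the methods, the two placeholder validators and the loose character-level generic check are assumptions, not a full implementation of the RFC 3986 ABNF.

```python
import re

# scheme ":" rest, with the scheme following the RFC 3986 scheme ABNF.
SCHEME_RE = re.compile(r'([A-Za-z][A-Za-z0-9+.\-]*):(.*)')

# Loose generic check: only characters that may appear somewhere in a URI
# (unreserved, reserved, or a well-formed percent-encoded triplet).
GENERIC_REST_RE = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")

# Hostname made of dot-separated letter/digit/hyphen labels (RFC 1123 style).
LABEL = r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?'
HOST_RE = re.compile(r'(?:%s\.)*%s' % (LABEL, LABEL))


def generic_check(rest):
    return GENERIC_REST_RE.fullmatch(rest) is not None


def validate_http(rest):
    # Placeholder: require "//" plus a DNS-style host (no IP literals here).
    if not rest.startswith("//"):
        return False
    authority = re.split(r'[/?#]', rest[2:], maxsplit=1)[0]
    host = authority.rsplit("@", 1)[-1].split(":", 1)[0]
    return HOST_RE.fullmatch(host) is not None


def validate_mailto(rest):
    # Placeholder: deliberately crude "something@something" check.
    return re.fullmatch(r'[^@\s]+@[^@\s]+', rest) is not None


class URIValidatorRegistry:
    def __init__(self):
        self._validators = {}

    def register(self, scheme, validator):
        """Attach a scheme-specific validation routine."""
        self._validators[scheme.lower()] = validator

    def validate(self, uri):
        m = SCHEME_RE.fullmatch(uri)
        if m is None:
            return False  # no syntactically valid scheme
        scheme, rest = m.group(1).lower(), m.group(2)
        validator = self._validators.get(scheme)
        if validator is None:
            return generic_check(rest)  # unknown scheme: generic syntax only
        return validator(rest)


registry = URIValidatorRegistry()
registry.register("http", validate_http)
registry.register("mailto", validate_mailto)
```

Known schemes get their own routine, so `http://..../` is rejected by the hostname check, while an unregistered scheme such as `gopher` only has to pass the generic check.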

Post by Ambush Commander »

Okay, I've actually started writing the URI class, and I really want to get lazy. After all, web browsers already handle most of this stuff. Is there any need for me to collapse the dot segments ("." and "..") in paths or fix malformed percent encodings?
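
For reference, the dot collapsing in question is the remove_dot_segments routine from RFC 3986, section 5.2.4. A minimal Python sketch that follows the RFC's steps (the function name mirrors the RFC; nothing here is taken from an existing library):

```python
def remove_dot_segments(path):
    """Collapse "." and ".." segments as described in RFC 3986, section 5.2.4."""
    output = []  # completed segments, each kept with its leading "/"
    while path:
        if path.startswith("../"):        # step A
            path = path[3:]
        elif path.startswith("./"):       # step A
            path = path[2:]
        elif path.startswith("/./"):      # step B: "/./" becomes "/"
            path = path[2:]
        elif path == "/.":                # step B
            path = "/"
        elif path.startswith("/../"):     # step C: "/../" becomes "/", drop last segment
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":               # step C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # step D
            path = ""
        else:                             # step E: move the first segment to the output
            idx = path.find("/", 1)
            if idx == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:idx])
                path = path[idx:]
    return "".join(output)
```

The RFC's own examples make a handy check: `/a/b/c/./../../g` becomes `/a/g`, and `mid/content=5/../6` becomes `mid/6`.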

Post by Weirdan »

Dot collapsing seems like overkill, while fixing percent encodings... that would be really nice.

Post by Ambush Commander »

Hmm, while I'm at it, let me throw up a few more "extra features". I have to decide what I'm going to implement and what I'm not.

* Dot collapsing
* Fixing malformed percent encodings
* Optional stripping of userinfo part of authority
* Normalizing percent encodings to uppercase form
* Validate IPv4 addresses
* Validate IPv6 addresses
* Validate IPvFuture addresses
* Punycode international domain names to adhere to the spec (probably not a good idea)
* Translate percent encodings in host to their respective UTF-8 characters for IDNA
* Validate ports: 1 to 65535
* Remove port if it's the common one: i.e. remove port 80 from http requests

It'll take too long to implement all of these. Are there any concrete risks associated with not doing them? And which features would you like to see implemented?
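
For a sense of scale, the two percent-encoding items are small. A minimal Python sketch, assuming that "fixing" means escaping any stray "%" as "%25" (one possible interpretation; rejecting the URI outright is another). The function name is made up:

```python
import re


def fix_percent_encoding(s):
    """Escape stray "%" signs and uppercase the hex digits of valid triplets."""
    # A "%" not followed by two hex digits cannot be a percent-encoding,
    # so treat it as a literal percent sign and encode it as %25.
    s = re.sub(r'%(?![0-9A-Fa-f]{2})', '%25', s)
    # RFC 3986, section 6.2.2.1: hex digits in percent-encodings are
    # normalized to uppercase.
    return re.sub(r'%[0-9A-Fa-f]{2}', lambda m: m.group(0).upper(), s)
```

So `100%` becomes `100%25`, `%zz` becomes `%25zz`, and `%3a` becomes `%3A`.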
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

How much of that does Apache/PHP already do before you get the values?

Post by Ambush Commander »

Since this URI validation is directed to user submitted input, PHP and Apache don't touch it.

However, it's interesting to note that Apache doesn't accept incorrect percent-encodings: it just says that the request was malformed. This implies that I shouldn't fix them. And a lot of this stuff gets done by browsers, so as long as the browser doesn't do anything strange, things should work fine.

Some of these are normalization, which the RFC recommends be done by anything that outputs URIs. Others are user-configurable features if they want to restrict the possible URIs.

As Weirdan noted earlier, you probably should go a little further than just checking the scheme at the beginning of the URI. But how far...

Post by Weirdan »

This is how I see it:
  • Dot collapsing [optional]
  • Fixing malformed percent encodings [do]
  • Optional stripping of userinfo part of authority [only optional]
  • Normalizing percent encodings to uppercase form [optional]
  • Validate IPv4 addresses [do]
  • Validate IPv6 addresses [do]
  • Validate IPvFuture addresses [optional]
  • Punycode international domain names to adhere to the spec (probably not a good idea) [not sure at the moment]
  • Translate percent encodings in host to their respective UTF-8 characters for IDNA [ditto]
  • Validate ports: 1 to 65535 [easy part, why not?]
  • Remove port if it's the common one: i.e. remove port 80 from http requests [DON'T]
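
The address and port checks marked [do] above are indeed the easy part. A minimal Python sketch that leans on the standard `ipaddress` module; the function names are made up, IPvFuture is not handled, and the module's idea of a valid address may differ in small details from the RFC 3986 ABNF. Ports are 16-bit, so the upper bound is 65535.

```python
import ipaddress
import re


def is_valid_ipv4(host):
    """True if host is a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False


def is_valid_ip_literal(host):
    """True if host is a bracketed IPv6 address such as "[::1]"."""
    if not (host.startswith("[") and host.endswith("]")):
        return False
    try:
        ipaddress.IPv6Address(host[1:-1])
        return True
    except ValueError:
        return False


def is_valid_port(port):
    """True if port is a decimal string in the range 1 to 65535."""
    if re.fullmatch(r'[0-9]{1,5}', port) is None:
        return False
    return 1 <= int(port) <= 65535
```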

Post by Ambush Commander »

Hmm... very good recommendations. I have a few questions though:
Weirdan wrote: Remove port if it's the common one: i.e. remove port 80 from http requests [DON'T]

Does the "DON'T" mean that I shouldn't do it in the generic URI class (which I shouldn't), or that I shouldn't do it at all? According to RFC 3986, section 6.2.3, "Scheme-Based Normalization", when a scheme has specific rules regarding equivalence, such as an empty path being equivalent to "/" or a default port of 80, the corresponding normalizations can be applied.
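
As a sketch of what that scheme-based step could look like if it lived outside the generic URI class (Python; the function name and the `DEFAULT_PORTS` table are assumptions for illustration):

```python
import re

# Hypothetical table of default ports per scheme.
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}


def strip_default_port(scheme, authority):
    """Drop an empty port or the scheme's default port from an authority."""
    host, sep, port = authority.rpartition(":")
    # No colon, or the text after the last colon is not a port
    # (e.g. the tail of an IPv6 literal such as "[::1]"): leave it alone.
    if not sep or re.fullmatch(r'[0-9]*', port) is None:
        return authority
    if port == "" or port == DEFAULT_PORTS.get(scheme.lower()):
        return host
    return authority
```

For example, `strip_default_port("http", "example.com:80")` gives `example.com`, while `example.com:8080` and a bare `[::1]` come back unchanged.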
Weirdan wrote: Punycode international domain names to adhere to the spec (probably not a good idea) [not sure at the moment]

The main reason I said "not a good idea" is that the transformation algorithms are complicated, and I'd probably be better off using an external library.
Weirdan wrote: Translate percent encodings in host to their respective UTF-8 characters for IDNA [ditto]

The RFC allows percent-encoded values in the host. At the moment (and I've only tried this in Firefox), browsers don't support this: after all, DNS doesn't support those characters! RFC-wise, this is the correct way to specify international characters, and in theory they should be decoded into real characters and then converted to the ASCII form (punycode), but no one seems to do that.

Interestingly, a percent encoded host will display correctly in the status bar but won't load, while a non-encoded host will display punycode in the status bar and will load.
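
The theoretical pipeline (percent-decode the host as UTF-8, then convert to the ASCII form) is short when a library does the heavy lifting. A minimal Python sketch using the standard library's built-in `idna` codec, which implements the older IDNA 2003 rules; the function name is made up:

```python
from urllib.parse import unquote


def host_to_ascii(host):
    """Percent-decode a host as UTF-8, then convert it to its punycode form."""
    # errors="strict" makes invalid UTF-8 sequences raise instead of being
    # silently replaced.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The "idna" codec applies nameprep and punycode label by label.
    return decoded.encode("idna").decode("ascii")
```

For example, `host_to_ascii("b%C3%BCcher.example")` yields `xn--bcher-kva.example`, and a plain ASCII host passes through unchanged.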

I'm probably going to take your other recommendations.