Parsing URL only according to URI : Security risk?
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
A URI is a generalization of a URL (http being just one scheme), and as such has a generic syntax. After checking the scheme against a set of allowed schemes, and validating the rest of the URI against the ABNF specified in the RFC, can the URI still present a security risk, even if it is invalid under the rules of its own scheme?
Example: http://.........../ -- Valid according to the URI RFC, however, DNS (RFC1034 and RFC1123) place additional constraints on the value. It's obviously wrong, but note that Firefox only complains "Firefox can't find the server at ............" whereas a URI like "745://asdf" is marked invalid because a scheme cannot start with a digit.
Example: mailto:bob@example.com -- Valid according to the URI RFC, but only because bob@example.com becomes the path (WTF?). Obviously, there are also restrictions on the format of the email address.
My question is: can I get away with not further defining valid formats for all the different schemes? And if I can't get away with it, is there any point in letting the user define more schemes without also defining a validation routine for each?
One last note: in section 7, Security Considerations, the RFC lays out a whole bunch of possible considerations, but notes that "A URI does not in itself pose a security threat." javascript:, anyone?
Perhaps I should rephrase my question: how do YOU validate your URIs?
And yes, this is about HTMLPurifier. I'd rather not have any dependencies on third-party libraries.
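To make the two examples above concrete, here is a minimal sketch in Python (not PHP, and not HTMLPurifier code). The names `scheme_of`, `is_allowed`, `split_generic` and the `ALLOWED_SCHEMES` set are made up for illustration. It only shows what a generic, scheme-agnostic parse sees: the dotted host passes, the digit-led scheme does not, and the mailto address ends up in the path.

```python
import re
from urllib.parse import urlsplit

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Hypothetical whitelist, purely for illustration.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def scheme_of(uri):
    """Return the lowercased scheme, or None if there is no well-formed one."""
    head, sep, _ = uri.partition(":")
    if not sep or not SCHEME_RE.fullmatch(head):
        return None
    return head.lower()


def is_allowed(uri):
    """True only if the URI has a well-formed scheme that is on the whitelist."""
    return scheme_of(uri) in ALLOWED_SCHEMES


def split_generic(uri):
    """Generic split into (scheme, authority, path), ignoring scheme rules."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path
```

With this, `scheme_of("745://asdf")` is `None`, `is_allowed("javascript:alert(1)")` is false, and `split_generic("mailto:bob@example.com")` puts `bob@example.com` in the path slot. Nothing here looks at DNS rules or email syntax, which is exactly the gap being asked about.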
Personally, I would take a halfway approach: define scheme-specific checks for the most common schemes, while allowing users to add more schemes (with required validation routines). If the scheme is unknown, fall back to the general URI syntax check; otherwise perform the scheme-specific checks.
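A rough sketch of that approach, in Python for illustration only. Everything here (`VALIDATORS`, `register_scheme`, `generic_check`, `check_mailto`) is a hypothetical name, the generic check is far looser than the real RFC 3986 grammar, and the mailto check is deliberately simplistic.

```python
import re

GENERIC_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def generic_check(uri):
    """Loose stand-in for the generic syntax check: a well-formed scheme,
    a colon, and no whitespace or control characters after it."""
    head, sep, rest = uri.partition(":")
    if not sep or not GENERIC_SCHEME.fullmatch(head):
        return False
    return not any(ch.isspace() or ord(ch) < 0x20 or ord(ch) == 0x7F for ch in rest)


def check_mailto(uri):
    """Simplistic mailto check: exactly one '@' with text on both sides."""
    addr = uri.partition(":")[2].partition("?")[0]
    local, at, domain = addr.partition("@")
    return bool(local and at and domain) and "@" not in domain


# Scheme name -> validation routine. Only one built-in entry in this sketch.
VALIDATORS = {"mailto": check_mailto}


def register_scheme(name, validator):
    """Let the user add a scheme together with its validation routine."""
    VALIDATORS[name.lower()] = validator


def validate(uri):
    """Generic check first, then the scheme-specific check if one is registered."""
    if not generic_check(uri):
        return False
    scheme = uri.partition(":")[0].lower()
    specific = VALIDATORS.get(scheme)
    return specific(uri) if specific else True
```

An unknown scheme such as `foo:anything` passes on the generic check alone until someone calls `register_scheme("foo", ...)`, at which point the registered routine decides.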
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Hmm, while I'm at it, let me throw up a few more "extra features". I have to decide what I'm going to implement and what I'm not.
* Dot collapsing
* Fixing malformed percent encodings
* Optional stripping of userinfo part of authority
* Normalizing percent encodings to uppercase form
* Validate IPv4 addresses
* Validate IPv6 addresses
* Validate IPvFuture addresses
* Convert international domain names to Punycode to adhere to the spec (probably not a good idea)
* Translate percent encodings in host to their respective UTF-8 characters for IDNA
* Validate ports: 1 to 65535
* Remove port if it's the common one: i.e. remove port 80 from http requests
It'll take too long to implement all of these. Are there any concrete risks associated with not doing them? And which features would you like to see implemented?
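For a sense of scale, two of the normalization items are small. Below is a sketch in Python (illustrative only, function names are mine): dot collapsing following the remove_dot_segments procedure in RFC 3986 section 5.2.4, and uppercasing the hex digits of percent-encodings. It assumes the path has already been split out of the URI.

```python
import re


def remove_dot_segments(path):
    """Collapse '.' and '..' segments, following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" -> "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" -> "/x", dropping the last segment
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            idx = path.find("/", 1)
            if idx == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:idx])
                path = path[idx:]
    return "".join(output)


def uppercase_percent_encodings(text):
    """Rewrite well-formed %xx triplets with uppercase hex digits."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), text)
```

The RFC's own examples work as expected: `/a/b/c/./../../g` collapses to `/a/g`, and `mid/content=5/../6` to `mid/6`.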
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Since this URI validation is directed to user submitted input, PHP and Apache don't touch it.
However, it's interesting to note that Apache doesn't accept incorrect percent-encodings: it just says that the request was malformed. This implies that I shouldn't fix them. And a lot of this stuff gets done by browsers, so as long as the browser doesn't do anything strange, things should work fine.
Some of these are normalization, which the RFC recommends be done by anything that outputs URIs. Others are user-configurable features if they want to restrict the possible URIs.
As Weirdan noted earlier, you probably should go a little further than just checking the scheme at the beginning of the URI. But how far...
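On the percent-encoding point, the choice is between rejecting (what Apache appears to do) and repairing. A small Python sketch of both, with made-up function names; "repair" here just means escaping a stray `%` as `%25`, which is one possible policy and not something the RFC mandates.

```python
import re

# A "%" that is not followed by exactly two hex digits.
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_malformed_percent(text):
    """Reject-style check: True if any '%' is not a valid %xx triplet."""
    return BAD_PERCENT.search(text) is not None


def escape_stray_percent(text):
    """Repair-style alternative: turn each stray '%' into '%25'."""
    return BAD_PERCENT.sub("%25", text)
```

Either way the valid triplets are left alone, so `/a%2Fb` is untouched while `/100%` is flagged or becomes `/100%25`.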
This is how I see it:
- Dot collapsing [optional]
- Fixing malformed percent encodings [do]
- Optional stripping of userinfo part of authority [only optional]
- Normalizing percent encodings to uppercase form [optional]
- Validate IPv4 addresses [do]
- Validate IPv6 addresses [do]
- Validate IPvFuture addresses [optional]
- Convert international domain names to Punycode to adhere to the spec (probably not a good idea) [not sure at the moment]
- Translate percent encodings in host to their respective UTF-8 characters for IDNA [ditto]
- Validate ports: 1 to 65535 [easy part, why not?]
- Remove port if it's the common one: i.e. remove port 80 from http requests [DON'T]
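The items marked [do] and the port check are cheap if the language ships an address parser. A sketch using Python's standard `ipaddress` module, for illustration only (function names are mine). It assumes the host and port have already been split out of the authority, and that an IPv6 literal arrives in its bracketed URI form. Ports are 16-bit, so the upper bound is 65535.

```python
import ipaddress


def is_ipv4(host):
    """True if host is a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def is_ipv6_literal(host):
    """True if host is a bracketed IPv6 literal such as '[::1]'."""
    if not (host.startswith("[") and host.endswith("]")):
        return False
    try:
        ipaddress.IPv6Address(host[1:-1])
        return True
    except ValueError:
        return False


def is_valid_port(text):
    """True if text is a decimal port number in the range 1..65535."""
    return text.isascii() and text.isdigit() and 1 <= int(text) <= 65535
```

IPvFuture is left out on purpose, matching its [optional] rating; it would need its own pattern.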
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Hmm... very good recommendations. I have a few questions though:
Re "Remove port if it's the common one: i.e. remove port 80 from http requests [DON'T]":
Does the "Don't" mean that I shouldn't do it in the generic URI code (which I shouldn't), or that I shouldn't do it at all? According to RFC 3986 section 6.2.3, "Scheme-Based Normalization", when a scheme has specific rules about equivalence, such as an empty path being equivalent to "/" or a default port of 80, normalization can be applied.
Re "Punycode international domain names to adhere with spec (probably not a good idea) [not sure at the moment]":
The main reason I said "not a good idea" is that the transformation algorithms are complicated and I'd probably be better off using an external library.
Re "Translate percent encodings in host to their respective UTF-8 characters for IDNA":
The RFC allows percent-encoded values in the host. Currently (and I've only tried this in Firefox), this isn't supported by browsers: after all, DNS doesn't support those characters! RFC-wise, this is the correct way to specify international characters, and in theory they should be translated into real characters and then to the ASCII version (Punycode), but no one seems to do that.
Interestingly, a percent encoded host will display correctly in the status bar but won't load, while a non-encoded host will display punycode in the status bar and will load.
I'm probably going to take your other recommendations.
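For reference, the "theoretical" pipeline (percent-decode the host as UTF-8, then convert to the ASCII form) looks roughly like this in Python. This is only an illustration of the two steps, not a suggestion for how HTMLPurifier should do it: `host_to_ascii` is a made-up name, and Python's built-in `idna` codec implements the older IDNA 2003 rules.

```python
from urllib.parse import unquote


def host_to_ascii(host):
    """Percent-decode a host as UTF-8, then apply IDNA ToASCII per label.

    Raises UnicodeError on undecodable bytes or labels IDNA rejects.
    """
    decoded = unquote(host, encoding="utf-8", errors="strict")
    return decoded.encode("idna").decode("ascii")
```

For example, `host_to_ascii("b%C3%BCcher.example")` gives `xn--bcher-kva.example`, while a plain ASCII host comes back unchanged.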