kaisellgren wrote:Man, that's what I've been planning (get out of my head). Creating a library that comes with real life code examples that tell you exactly how, when and where to use what. From credit cards to allowance of CSS, everything even slightly general would be handled by a single library. Something similar to ESAPI perhaps.
Well, either provide information here or go build it.
Lugubriousness
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
(#10850)
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
Weirdan wrote:Not that sanitize_* has much use to me - I believe incorrect input should be rejected and not attempted to deal with (GIGO principle will bite you sooner or later, and I don't like to be bitten by that GO part).
Sanitize has two goals. One is to whitelist what can be input, especially with regard to character set. Maybe you trust that your validation is 100% bulletproof. I don't, and I whitelist as well.
The second goal is to be nice to users, who often make typos. Otherwise you are redisplaying forms (slowing down shopping, etc.) for petty, correctable errors.
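To make the "sanitize first, then validate" idea concrete, here is a minimal sketch. The helper names and the 13 to 19 digit rule are assumptions made up for illustration, not something taken from the filter extension or from any library mentioned in this thread.

```php
<?php
// Hypothetical helper: whitelist digits, dropping spaces, dashes and anything else.
function sanitize_digits(string $input): string
{
    return preg_replace('/[^0-9]/', '', $input);
}

// Hypothetical helper: accept only 13 to 19 digits (an assumed rule for this sketch).
function is_plausible_card_number(string $digits): bool
{
    return preg_match('/^[0-9]{13,19}$/', $digits) === 1;
}

$typed = '4111 1111-1111 1111';      // what a user might type
$clean = sanitize_digits($typed);    // separators removed, digits kept

var_dump(is_plausible_card_number($typed));  // raw input fails the check
var_dump(is_plausible_card_number($clean));  // sanitized input passes it
```

The user is not sent back to the form for a stray space or dash, yet the validation step still decides whether the cleaned value is acceptable.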
(#10850)
Re: Lugubriousness
One problem I have with the filter extension is its terminology.
Quote:
There are two main types of filtering: validation and sanitization.
Validation is used to validate or check if the data meets certain qualifications. For example, passing in FILTER_VALIDATE_EMAIL will determine if the data is a valid email address, but will not change the data itself.
Sanitization will sanitize the data, so it may alter it by removing undesired characters. For example, passing in FILTER_SANITIZE_EMAIL will remove characters that are inappropriate for an email address to contain. That said, it does not validate the data.
In my view, you have:
a) Filtering. That is something which works exactly as a real, physical filter would work. Take your coffee filter: liquid goes through, coffee stuff stays behind. A very easy concept. Applied to PHP: I start with a telephone number being entered like "(023)-765432". After filtering, you end up with "023765432".
b) Validating. Is something valid? In the real world: is it what it is supposed to be? Is "(023)-765432" a valid telephone number, yes or no?
So, the thing that I find annoying about the filter extension is that "filtering" is called "sanitize", while validation becomes a subset of filtering. That's not clear to me. You should just have two sets of functions: FILTER_xxx and VALIDATE_xxx.
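For reference, this is roughly how the two kinds of filter behave side by side. It is only a sketch: the sample email address and the telephone pattern are assumptions chosen for the example, and the phone handling uses plain regular expressions because the filter extension has no telephone filter.

```php
<?php
$email = 'john doe@example.com';   // note the space

// "Validate": yes/no answer, the data is not changed (returns the value or false).
$validated = filter_var($email, FILTER_VALIDATE_EMAIL);

// "Sanitize": works like the coffee filter, unwanted characters stay behind.
$sanitized = filter_var($email, FILTER_SANITIZE_EMAIL);

// The telephone example from the post, done by hand.
$phone         = '(023)-765432';
$filteredPhone = preg_replace('/[^0-9]/', '', $phone);                 // filtering
$phoneIsValid  = preg_match('/^\(\d{3}\)-\d{6}$/', $phone) === 1;      // validating (assumed format)

var_dump($validated);      // false, because of the space
var_dump($sanitized);      // the address with the space removed
var_dump($filteredPhone);  // digits only
var_dump($phoneIsValid);
```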
- kaisellgren
- DevNet Resident
- Posts: 1675
- Joined: Sat Jan 07, 2006 5:52 am
- Location: Lahti, Finland.
Re: Lugubriousness
This is what I hate most: people use different terms for the same thing. Some people say "Output Escaping" while others say "Output Encoding" or "Output Filtering". To some people "Filtering" and "Validating" are not the same, but to others "Validation" is part of "Filtering". I guess there is no clear "standard" terminology. However, these are the terms I use from the perspective of a PHP application:
Input
Data that comes from the outside. For example, input is anything that comes from files (fopen), database (queries), user (GPC(FES)), server (FES) and so forth. Internal memory is excluded, but outside memory sources such as APC, Memcached, etc are included.
Output
Data that leaves your application and enters another context. For example, a database query is output. A destroyed variable is not output, because it does not enter any other context. The same thing applies to anything that is added into the output buffer (both header and body), writing to files is output, shell commands, talking to another server with cURL is output, and so on.
Filtering (=Sanitizing)
Removal or modification of variables/data regardless of whether it is just a matter of stripping out a few bytes or changing the entire data type.
Validation
Checking whether data is what it is expected to be. This includes checking of types, lengths, possible data ranges and correct forms of data. Invalid data is rejected. Validation may also include some sort of expiration checking (time/age).
Business Validation
In addition to Validation, it also makes sure the data is acceptable to the business logic; e.g., no one who registers on your website may have a birthday of 10th July 2009 (because he would have been born in the future; it's 7th July now). In fact, there's no clear way to draw a line: can a 6-month-old baby register on your site? What about a 6-year-old kid? For security, typical Validation is usually enough.
Escaping
Not altering the meaning of data in any way, but making sure data does not cause unexpected behaviors or misunderstanding in the target context. Usually means dealing with meta characters of the target context.
Cleansing/Cleaning
Data that is escaped/filtered and is safe to use, is cleaned. Cleansing data is to take proper precautions to stay secure.
Integrity
Making sure data has not changed since it was last seen, by the use of an HMAC, for instance.
Compressing
Representing data with fewer bits. Usually used to compress files to reduce download times, bandwidth consumption, disk usage, etc.
Encoding
Representing data losslessly in a different way so that it can then be decoded back to the original form without losing anything. Usually used on the web (URIs, emails, etc). There is no big difference between Encoding and Escaping. However, Escaping usually strives to keep the data as close to the original as possible, maintaining readability. And when it comes to Escaping and Encoding, it is usually reverted automatically by the outside source (e.g. the web browser decodes &amp; to a visible & character for the user, a database server treats \' as ', and so forth), thus making things simpler to implement.
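A few of these definitions can be sketched in code. This is only an illustration of the terms as defined above: the function names, the 0 to 150 age range and the demo key are assumptions made up for the example.

```php
<?php
// Filtering: the data is modified, here even its type changes.
$filtered = (int) '42abc';

// Validation: check type and range, reject instead of repairing.
function validate_age($raw): bool
{
    // FILTER_VALIDATE_INT returns the integer, or false if it is not an integer in range.
    $options = ['options' => ['min_range' => 0, 'max_range' => 150]];
    return filter_var($raw, FILTER_VALIDATE_INT, $options) !== false;
}

// Integrity: an HMAC lets the application detect that data changed since it was signed.
$key  = 'demo-key-not-for-production';  // assumption: a secret only the application knows
$data = 'user_id=42';
$mac  = hash_hmac('sha256', $data, $key);

function is_untampered(string $data, string $mac, string $key): bool
{
    // hash_equals compares the two strings in constant time.
    return hash_equals(hash_hmac('sha256', $data, $key), $mac);
}

var_dump($filtered);                                 // int(42)
var_dump(validate_age('42'), validate_age('abc'));   // valid, then invalid
var_dump(is_untampered($data, $mac, $key));          // unchanged data
var_dump(is_untampered('user_id=43', $mac, $key));   // modified data
```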
Last edited by kaisellgren on Wed Jul 08, 2009 1:23 pm, edited 1 time in total.
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
Those are great definitions and I agree that it is essential to agree on the terms before discussing the subject. Your definitions are very close to my understanding of the terms. I have a couple of comments on your list and I am sure others will too.
Validation vs Business Validation - I certainly understand the difference, I have just not heard the term "Business Validation." Are there other, more common terms for this, such as "Business Logic" or "Business Rules"?
Escaping vs Encoding - This is a fuzzy one. Can we clarify the distinction between:
1) things done to distinguish delimiter characters used between delimiters (e.g. quotes "a\"b" or 'a\'b'),
2) escape sequences which are traditional ways to type non-printing characters using printing characters (e.g. tab, newline, return),
3) character sequences that are used to express a character that has special meaning (e.g. &lt;, &gt;, &amp;),
4) a character that cannot be typed so a character sequence is used to express the character
5) conversion of all characters into a non-UTF/ASCII representation, such as Base64 where the intent is not compression
6) ?
(#10850)
- kaisellgren
- DevNet Resident
- Posts: 1675
- Joined: Sat Jan 07, 2006 5:52 am
- Location: Lahti, Finland.
Re: Lugubriousness
arborint wrote:Are there other more common terms for this "Business Logic", "Business Rules" ?
"Business Rules" (or sometimes called "Domain Rules") are those that define what the data should be, while "Business Validation" is the one that applies them in practice. Sometimes also called "Business Rules Validation" or "Domain Rules Validation". I use those terms personally for more complex situations (e.g. checking whether a username is empty or an email is in the form of xxx@yyy.zzz is simply a "Validation" process). This is just another area of misconception: some people use "Business Validation" for much simpler processes, whereas some people think "Business Validation" can only be used for complex multi-level checks. To some people "Business Validation" has nothing to do with security, and to some people security may be involved.
arborint wrote:1) things done to distinguish delimiter characters used between delimiters (.e.g quotes "a\"b" or 'a\'b'),
2) escape sequences which are traditional ways to type non-printing characters using printing characters (e.g. tab, newline, return),
3) character sequences that are used to express a character that has special meaning (e.g. &lt;, &gt;, &amp;),
4) a character that cannot be typed so a character sequence is used to express the character
5) conversion of all characters into a non-UTF/ASCII representation, such as Base64 where the intent is not compression
6) ?
The difference between Escaping and Encoding is indeed fuzzy. Often you will see people using "Output Escaping" instead of "Output Encoding" and vice versa. Even the security experts at OWASP alternate these terms, and I guess the meanings of these two words are pretty much the same. There's one thing I can think of: Escaping methods are always simpler than Encoding methods. Take a look at the mysql_real_escape_string() source code and compare it to the base64_encode() source code. In the case of Escaping, one usually has to take care of a few particular bytes or sequences of bytes. Have you ever thought that you could actually use Encoding (like Base64) for database input?
Imagine the string "SELECT 'let's select';". Now, in the case of Escaping, all we need to do is to take care of one character. However, with Encoding usually the whole message is encoded. Not so fast... Encoding does not have to encode the whole message (encoding HTML entities)... or does it? I guess there's no single definite answer to whether "& -> &amp;" is actually Encoding or Escaping. Or is there?
Think about these:
(I had to put a space in front of ' otherwise my \ was omitted...)
Code: Select all
SELECT 'let\ 's select something & enjoy the weather';
SELECT 'let&#039;s select something &amp; enjoy the weather';
Wouldn't they both work as well? The former is usually called Escaping while the latter is called Encoding, but is there really a difference? Maybe we could say that Escaping is simpler and is usually about placing a character in front of a meta character to let the meta character "escape", while Encoding is more of a radical transformation of some bytes?
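The two forms of that string can be produced from PHP as follows. This is a sketch meant only to show the terminology; it is not a suggestion that either function is the right way to build a query.

```php
<?php
$text = "let's select something & enjoy the weather";

// Escaping: a backslash is placed in front of the quote, the rest is untouched.
$escaped = addslashes($text);

// Encoding: the special characters are replaced by HTML entities.
$encoded = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
echo $encoded, "\n";

// Both forms can be turned back into the original text.
var_dump(stripslashes($escaped) === $text);
var_dump(htmlspecialchars_decode($encoded, ENT_QUOTES) === $text);
```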
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
kaisellgren wrote:"Business Rules" (or sometimes called "Domain Rules") are those that define what the data should be, while "Business Validation" is the one that applies them in practice. Sometimes also called "Business Rules Validation" or "Domain Rules Validation". I use those terms personally for more complex situations (e.g. checking whether a username is empty or email is in form of xxx@yyy.zzz is simply a "Validation" process). This is just another area of misconception, some people use "Business Validation" for much simpler processes whereas some people think "Business Validation" can be only used for complex multi-level checks. To some people "Business Validation" has nothing to do with security, and to some people security may be involved.
I agree. I just have not heard the term "Business Validation", so I wonder what others call it, and what either the clearest or the most commonly used term is. It may be "Business Validation".
kaisellgren wrote:The difference between Escaping and Encoding is indeed fuzzy. Often you will see people using "Output Escaping" instead of "Output Encoding" and vice versa. Even the security experts at OWASP alternate these terms and I guess the meaning of these two words are pretty much the same. There's one thing I can think of. Escaping methods are always simpler than Encoding methods. Take a look at mysql_real_escape_string() source code and compare it to base64_encode() source code. In case of Escaping, one usually has to take care of a few particular bytes or sequences of bytes. Have you ever thought that you could actually use Encoding (like Base64) for database input?
Is "escaping" ever anything other than adding backslashes before characters that would cause a parsing problem between delimiters? Encoding clearly is changing either some offending characters or all characters to a different form.
kaisellgren wrote:Imagine the string "SELECT 'let's select';". Now, in case of Escaping, all we need to do is to take care of one character. However, with Encoding usually the whole message is encoded. Not so fast... Encoding does not have to encode the whole message (encoding HTML entities)... or does it? I guess there's no single definite answer to whether "& -> &amp;" is actually Encoding or Escaping. Or is there?
I don't know the answer, but I think you could consider Escaping to be about delimiters, to make things parse properly, and Encoding to be about expressing entities that are beyond the character set. For example, in HTML "&" may not cause a parse error, so "& -> &amp;" is Encoding. But "<" will cause a parse error, so "< -> &lt;" would be both Encoding and Escaping, or maybe it is better to say Encoding for the purpose of Escaping.
kaisellgren wrote:Think about these:
(I had to put a space in front of ' otherwise my \ was omitted...)
Code: Select all
SELECT 'let\ 's select something & enjoy the weather';
SELECT 'let&#039;s select something &amp; enjoy the weather';
Wouldn't they both work as well? The former is usually called Escaping while the latter is called Encoding, but is there really a difference? Maybe we could say that Escaping is simpler and it is usually about placing a character in front of a meta character to let the meta character "escape" while Encoding is more of a radical transformation of some bytes?
No, they would not work the same in SQL, but they would display the same if echo'd as HTML. That points out that the target language is a key criterion in deciding what to do.
(#10850)
- kaisellgren
- DevNet Resident
- Posts: 1675
- Joined: Sat Jan 07, 2006 5:52 am
- Location: Lahti, Finland.
Re: Lugubriousness
arborint wrote:No, they would not work the same in SQL.
Actually, I meant to say that they could work the same in SQL provided that the SQL engine supports that. Imagine MySQL supporting HTML entity encoding on the fly, so that &#039; would be stored as ' in the database.
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
kaisellgren wrote:Actually I meant to say that they could work the same in SQL provided that the SQL engine supports that. Imagine MySQL supporting an HTML entity encoding on the fly, so, &#039; would be stored as ' on the database.
OK ... understood.
I do think that the target is critical to this discussion though. The main targets are PHP strings, HTML, Javascript and SQL, so those should be a focus of helping people do the right thing.
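As a rough sketch of what "the target decides" can look like, here is one value prepared for three of those targets. The sample value is made up, and the choice of functions is my assumption about reasonable defaults, not an agreed recommendation from this thread. SQL is left as a comment because it needs a live connection; prepared statements (for example through PDO) are the usual route there.

```php
<?php
$value = "O'Reilly <b>& Sons</b>";

// Target: HTML. Entities keep <, >, & and quotes from being read as markup.
$forHtml = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

// Target: JavaScript. json_encode yields a quoted string literal; the JSON_HEX_*
// flags also hide characters that matter to the surrounding HTML.
$forJs = json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

// Target: PHP string literal. var_export returns a parsable single-quoted literal.
$forPhp = var_export($value, true);

// Target: SQL. Not shown: bind $value as a parameter of a prepared statement.

echo $forHtml, "\n", $forJs, "\n", $forPhp, "\n";
```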
(#10850)
- kaisellgren
- DevNet Resident
- Posts: 1675
- Joined: Sat Jan 07, 2006 5:52 am
- Location: Lahti, Finland.
Re: Lugubriousness
I took a shower and thought about this. I thought about how one would code a parser for these two situations and I realized that a parser that uses Escaping does not "unescape". When it hits the escape character, it will flag the next (possible) meta character to be disarmed; i.e., nothing to be "decoded/unescaped". However, in case of Encoding, there would always be a decoding process. Does that make sense? I think that explains it pretty well and if that is right, then anyone talking about "Output Escaping" would be using these terms improperly. I also think this is related to your thought about "targets". When it comes to HTML, encoding/decoding simply seems to make more sense, but with LDAP/SQL among others the target context is simpler to handle with Escaping.
So, I'm saying that the target defines which method to use (Encoding or Escaping), and the difference between the two is basically the fact that there is no "revert" process in Escaping.
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
Yes, that is getting to the core of the difference and is an excellent description.
There are exceptions to this with hex "\xFF" and octal "\77" escape sequences, which are actually decoded in a sense. You could think of \t, \r and \n as being decoded too, in that unlike \" or \' the output stream for these sequences is different from the input without the backslashes.
I think maybe the only definitive definition is that Escaping is an Encoding scheme that uses the \ character followed by a character sequence.
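The difference between the two kinds of backslash sequence can be seen with stripcslashes(). A small sketch, with input strings chosen only for the demonstration:

```php
<?php
// Single-quoted PHP strings keep these backslash sequences literal, so this is raw input.
$quoteOnly = 'say \"hi\"';
$special   = 'a\tb';
$hex       = '\x41';
$octal     = '\101';

// Backslash + quote: the backslash disappears, the quote itself is unchanged.
$r1 = stripcslashes($quoteOnly);

// Backslash + t, hex or octal: the sequence is replaced by a different byte.
$r2 = stripcslashes($special);   // contains a real tab (0x09)
$r3 = stripcslashes($hex);       // 0x41 is the letter A
$r4 = stripcslashes($octal);     // octal 101 is also the letter A

var_dump($r1, $r2 === "a\tb", $r3, $r4);
```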
I was just looking through the string functions related to this discussion (no database functions), and it is daunting.
- addcslashes
- addslashes
- html_entity_decode
- htmlentities
- htmlspecialchars_decode
- htmlspecialchars
- quoted_printable_decode
- quoted_printable_encode
- quotemeta
- strip_tags
- stripcslashes
- stripslashes
(#10850)
Re: Lugubriousness
arborint wrote: Just looking through the string functions related to this discussion (no database functions) and it is daunting.
[...]
+ preg_quote
+ escapeshellcmd/escapeshellarg
+ mb_decode_mimeheader / mb_encode_mimeheader
+ mb_decode_numericentity / mb_encode_numericentity
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
Thank you for making it more daunting! 
Now do we identify the common vectors and match the set of functions to the threat?
(#10850)
- kaisellgren
- DevNet Resident
- Posts: 1675
- Joined: Sat Jan 07, 2006 5:52 am
- Location: Lahti, Finland.
Re: Lugubriousness
arborint wrote:I think maybe the only definitive definition is that Escaping is an Encoding scheme that uses the \ character followed by a character sequence.
Really?
I think the way the process is done separates encoding and escaping. The parser simply keeps reading until it faces a meta character. If it faces an "n" character, it simply copies it to the buffer, but if it faces a "\" character, it sets the "escape char" flag to true, and if the next character is a meta character, it disarms it and sets the "escape char" flag to false. On the other hand, if the next character is one of the special characters (\n \r \t etc), it sets "escape char" to false and copies a 0x09, 0x0a, etc to the buffer. There's no transformation done; it's just analyzing a few special bytes (or characters) and acting in a specific way. Now, if you think about strip_tags(), it simply does a regular-expression-like search and replace (replace with empty). It's a direct transformation of data. The same thing applies to base64_encode/decode, which makes some very radical changes to the data. By contrast, addslashes() does the thing I explained earlier: the data is not modified, it's copied to a buffer based on a particular parsing flow.
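The flag-based reading described above could be sketched like this. The function name and the small table of special sequences are assumptions made for the example; it is not how any particular database or PHP function is implemented.

```php
<?php
// Hypothetical reader for the inside of a backslash-escaped string.
function read_escaped(string $input): string
{
    // Assumed table of "special" sequences that stand for another byte.
    $special    = ['n' => "\n", 'r' => "\r", 't' => "\t"];
    $buffer     = '';
    $escapeNext = false;   // the "escape char" flag

    for ($i = 0, $len = strlen($input); $i < $len; $i++) {
        $char = $input[$i];
        if ($escapeNext) {
            // Previous byte was a backslash: copy the special byte,
            // or the disarmed character itself.
            $buffer .= $special[$char] ?? $char;
            $escapeNext = false;
        } elseif ($char === '\\') {
            $escapeNext = true;   // do not copy the backslash, only set the flag
        } else {
            $buffer .= $char;     // ordinary character: copy as-is
        }
    }
    return $buffer;   // a trailing lone backslash is silently dropped in this sketch
}

var_dump(read_escaped('let\\\'s select'));   // the quote arrives disarmed, without its backslash
var_dump(read_escaped('a\tb') === "a\tb");   // \t becomes a real tab
```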
If we think about htmlentities(), it translates & to &amp;, so it would be a search-and-replace function, hence an encoder. However, if you have content similar to "test & test", and you are about to decode it, there will be an error in the decoding process. The W3C validator's parser will throw errors at you if you do that, but web browsers will happily solve that for you, which I think is what leads many people to confusion and to using the term Escaping when it comes to HTML entity encoding. Web browsers also fix malfunctioning Javascript, VBScript, HTML, etc.
So, based on all that:
Escaping
preg_quote
escapeshellcmd
escapeshellarg
addslashes
addcslashes
quotemeta
strip(c)slashes (these are basically reverse escaping)
Encoding/Decoding
html_entity_decode
htmlentities
htmlspecialchars_decode
htmlspecialchars
quoted_printable_decode
quoted_printable_encode
mb_decode_mimeheader
mb_encode_mimeheader
mb_decode_numericentity
mb_encode_numericentity
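A quick way to see the two groups side by side. This sketch only calls functions from the lists above, with inputs picked for the demonstration; escapeshellcmd/escapeshellarg are left out because their output depends on the platform.

```php
<?php
// Escaping: a marker character is put in front of the target's meta characters.
$e1 = preg_quote('1+1=2');     // regex meta characters + and = get a backslash
$e2 = addslashes("it's");      // the quote gets a backslash
$e3 = quotemeta('what?');      // the ? gets a backslash

// Encoding/decoding: characters are replaced, and a matching decoder reverts them.
$html    = htmlspecialchars('a < b', ENT_QUOTES, 'UTF-8');
$htmlOut = htmlspecialchars_decode($html, ENT_QUOTES);
$qp      = quoted_printable_encode('a=b');
$qpOut   = quoted_printable_decode($qp);

var_dump($e1, $e2, $e3, $html, $htmlOut, $qp, $qpOut);
```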
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Re: Lugubriousness
kaisellgren wrote:Really? Have a look at how SQL Server handles escaping - you need to type '' instead of '.
I think that is the exception to the 99% rule.
kaisellgren wrote:I think the way the process is done separates encoding and escaping. The parser simply keeps reading until it faces a meta character. If it faces a "n" character, it simply copies it to the buffer, but if it faces a "\" character, it will flag "escape char" to true, and if the next character is a meta character - disarm it and set "escape char" flag to false. On the other hand, if the character is one of the special characters (\n \r \t etc), then set "escape char" to false and copy a 0x09, 0x0a, etc to the buffer. There's no transformation done, it's just analyzing a few special bytes (or characters) and acting in a specific way.
I don't think you can say that \n \r \t \77 \xFF are not transformations. It is pretty clear that unlike \' or \" the output is transformed to something different than the input, even though it is called escaping. I think it is fine to call them escaping even though they technically transform the data.
kaisellgren wrote:So, based on all that:
Escaping
preg_quote
escapeshellcmd
escapeshellarg
addslashes
addcslashes
quotemeta
strip(c)slashes (these are basically reverse escaping)
Encoding/Decoding
html_entity_decode
htmlentities
htmlspecialchars_decode
htmlspecialchars
quoted_printable_decode
quoted_printable_encode
mb_decode_mimeheader
mb_encode_mimeheader
mb_decode_numericentity
mb_encode_numericentity
I agree with your division. I am not sure how much it matters (to programmers) in the discussion of dealing with the different attack vectors. Though if we enumerate those we may see a clearer pattern.
(#10850)