Page 1 of 1

What to do before data goes in or out?

Posted: Wed Nov 18, 2009 4:39 pm
by thinsoldier
What are the best functions to run on data before it goes in the database? What's best to do before showing it on a web page?

Below I show what I currently do. Some things I only do because someone told me to and I'm not really sure what its for.
So even if you could just tell me what needs commenting and what the comment should say I'd be grateful.

What do you people do before data insertion and before display? Care to share your code/comments/thought process?

Besides security, my other query is just what the heck to do about the strange characters people copy from Microsoft word (see FixMicrosoftWordPastes function below).


Before going into the database:

Code: Select all

$string = $_POST[$field];
$string = FixMicrosoftWordPastes($string);
 
if(mb_detect_encoding($string) == 'UTF-8')
{ $string =  mb_convert_encoding($string, "ISO-8859-1"); }
 
$string = html_entity_decode($string);
$string = strip_tags($string);
$string = mysql_real_escape_string($string); 

Before being displayed: ... nothing apparently ... occasionally I use markdown on some projects but that's about it.

Code: Select all

// Here's the FixMicrosoftWordPastes function
 
function FixMicrosoftWordPastes($body)
{   
$trans_tbl = array() ;
    
    $trans_tbl[chr(11)] = "\n"; // some sort of new line character
    
    //$trans_tbl[chr(34)] = '"' ;   //  quote
    //$trans_tbl[chr(38)] = '&' ;   //  ampersand
    // Don't search for ampersand by itself it this way.
    // It complicates things when the text contains • — &eacute etc....
    
    $trans_tbl[chr(128)] = '€' ;  //  euro
    $trans_tbl[chr(129)] = '€' ;  //  euro
    $trans_tbl[chr(130)] = '‚' ;  //  low quote
    $trans_tbl[chr(131)] = 'ƒ' ;   //  florin
    $trans_tbl[chr(132)] = '„' ;  //  double low quote
    $trans_tbl[chr(133)] = '…' ;  //  ellipsis
    $trans_tbl[chr(134)] = '†' ;  //  dagger
    $trans_tbl[chr(135)] = '‡' ;  //  double dagger
    $trans_tbl[chr(136)] = 'ˆ' ;   //  circumflex
    $trans_tbl[chr(137)] = '‰' ;  //  per thousand
    $trans_tbl[chr(138)] = 'Š' ;   //  S caron
    $trans_tbl[chr(139)] = '‹' ;  //  left angle quote
    $trans_tbl[chr(140)] = 'Œ' ;   //  OE ligature
    $trans_tbl[chr(142)] = 'Ž' ;   //  Z caron
    $trans_tbl[chr(145)] = '‘' ;  //  left single quote
    $trans_tbl[chr(146)] = '’' ;  //  right single quote
    $trans_tbl[chr(147)] = '“' ;  //  left double quote
    $trans_tbl[chr(148)] = '”' ;  //  right double quote
    $trans_tbl[chr(149)] = '•' ;  //  bullet
    $trans_tbl[chr(150)] = '–' ;  //  en dash
    $trans_tbl[chr(151)] = '—' ;  //  em dash
    $trans_tbl[chr(152)] = '˜' ;   //  small tilde
    $trans_tbl[chr(153)] = '™' ;  //  trademark
    $trans_tbl[chr(154)] = 'š' ;   //  small s caron
    $trans_tbl[chr(155)] = '›' ;  //  right angle quote
    $trans_tbl[chr(156)] = 'œ' ;   //  oe ligature
    $trans_tbl[chr(158)] = 'ž' ;   //  small z caron
    $trans_tbl[chr(159)] = 'Ÿ' ;   //  Y with diaeresis
    for ( $i=160; $i<=255; $i++ ) {
        $trans_tbl[chr($i)] = '&#' . $i . ';' ;
    }
    //$trans_tbl[chr(8216)] = "'" ;     //  single smart quote left
    //$trans_tbl[chr(8217)] = "'" ;     //  single smart quote right
    //$to_return = str_replace('?', "'", $to_return);
 
    $to_return = strtr ( $body , $trans_tbl );
    
    // now handle ampersands where they occur all by itself.
    $to_return = preg_replace('/ & /', ' & ', $to_return);
    
    return  $to_return;
  }

Re: What to do before data goes in or out?

Posted: Thu Nov 19, 2009 1:24 am
by kaisellgren
It depends a lot on where the data ends up in. Usually with databases, it goes to the DML (Data Manipulation Language) part of the SQL query, and escaping the data is the only thing you should do (in addition to placing the data within quotes - making it valid DML code). When you are outputting data to the client, you need to encode it properly.
thinsoldier wrote:

Code: Select all

$string = html_entity_decode($string);
$string = strip_tags($string);
$string = mysql_real_escape_string($string);
This isn't good. There's no general way of cleansing data, you must take case specific precautions. Here's an example:

Code: Select all

$name = mysql_real_escape_string($_GET['name']);
mysql_query("SELECT `id` FROM `users` WHERE `name`='$name';");
 
$message = htmlspecialchars($_GET['message'],ENT_QUOTES,'UTF-8'); // Encode the output to prevent XSS, use UTF-8.
header('Content-type: text/html; charset=UTF-8'); // Tell the client that the output is UTF-8 if you haven't already.
echo "<div>You can now place this in between HTML tags: $message</div>";
The first part escapes the data before inserting it into the database. The second part displays user supplied data. Does this make any sense to you?

You said you had some problems with text copied from Word, could you provide some examples?

Re: What to do before data goes in or out?

Posted: Thu Nov 19, 2009 9:30 am
by thinsoldier
So you're saying I don't need this part at all before putting data into the database:

Code: Select all

$string = $_POST[$field];
$string = FixMicrosoftWordPastes($string);
if(mb_detect_encoding($string) == 'UTF-8')
{ $string =  mb_convert_encoding($string, "ISO-8859-1"); }
And these parts are "bad":

Code: Select all

$string = html_entity_decode($string);
$string = strip_tags($string);
$string = mysql_real_escape_string($string);
bad how?

kaisellgren wrote:The first part escapes the data before inserting it into the database.
I always run mysql_real_escape_string.
kaisellgren wrote:The second part displays user supplied data.
Now, this part with the htmlspecialchars($_GET['message'],ENT_QUOTES,'UTF-8'); is new to me. But 99% of the stuff I can remember working on I never had to just take a users input and redisplay it on the next page. I probably always save their data in the database before showing it again.

Now I realize the rest of my questions are probably less about security and more about dealing with utf-8 and whether I should deal with it before or after it goes into the database.

My MSWord issues aren't really about security at all. It's about when people copy/paste their resumé from MS Word into a job application form and numerous characters wind up looking like questions marks, upside down questions marks, black squares, and áÉ or some other strange 2 character combination. I'm looking for a good example now.

MSWord Example

Posted: Thu Nov 19, 2009 10:17 am
by thinsoldier

Code: Select all

Original Text
Pursuing Bachelor of Commerce – Saint Mary’s University
   Major:  Global Business Management
   Minor:     International Development Studies
• Received a bursary grant from the university
• Financed 60% of college expenses through grants and full-time work
• Graduation – May 2002
• Studied under Lónîéçe Thomás
 
 
After htmlentities() only 
Pursuing Bachelor of Commerce – Saint Mary’s University
   Major:  Global Business Management
   Minor:     International Development Studies
• Received a bursary grant from the university
• Financed 60% of college expenses through grants and full-time work
• Graduation – May 2002
• Studied under L&oacute;n&icirc;&eacute;&ccedil;e Thom&aacute;s
 
 
After FixMicrosoftWordPastes() only
Pursuing Bachelor of Commerce &#8211; Saint Mary&#8217;s University
   Major:  Global Business Management
   Minor:     International Development Studies
&#8226; Received a bursary grant from the university
&#8226; Financed 60% of college expenses through grants and full-time work
&#8226; Graduation &#8211; May 2002
 
&#8226; Studied under L&#243;n&#238;&#233;&#231;e Thom&#225;s
htmlentities doesn't catch the bullets, dashes and lots of other stuff I've seen over the years. What's strange is sometimes it does and sometimes it doesn't. FixMicrosoftWordPastes only catches some characters sometimes.

For example, sometimes htmlentities will convert a bullet to &middot; sometimes it converts it to &bull; most of the time it fails to convert bullets to anything. FixMicrosoftWordPastes most of the time converts bullets to &#8226; and sometimes it fails. They both tend to fail on some other characters. I'm guessing between the MANY versions of MS Word it has at least 3 characters that it uses for just bullets and htmlentities and FixMicrosoftWordPastes only recognize 2 of them. Similar issue for some other characters.

And when bullets or other characters aren't regocnized by one of the functions and converted sometimes they show up as a bullet in the browser, other times they show up as strange characters. I'll see bullets in my firefox and the client sees black diamond question marks in their IE. I'll convert the page charset to utf-8 and see some bullets and some question marks in one browser and all question marks in another. Or a mix of question marks and 2 character pairs where the 1 character should have been.

Re: What to do before data goes in or out?

Posted: Fri Nov 20, 2009 1:41 am
by kaisellgren
Are you asking about security here?

Does it fail if you insert the original text into a UTF-8 column and display it directly from there and state the output is UTF-8?

Re: What to do before data goes in or out?

Posted: Fri Nov 20, 2009 9:35 am
by thinsoldier
kaisellgren wrote:Are you asking about security here?
If you have no other suggestions for me than making sure to escape input and Encode the output to prevent XSS then I guess not.

Most of the rest of what I'm asking about is really a visual issue and not security related. I should probably ask someone to more this topic to another part of the forum.
kaisellgren wrote:Does it fail if you insert the original text into a UTF-8 column and display it directly from there and state the output is UTF-8?

Here's the thing. On my live server at work our old phpmyadmin doesn't show the collation and I don't have direct access to the server to check in mysql itself. So I'll probably continue to have utf-8 issues until I can change every one of my database fields to utf8 collation then?

Oh, when testing out this code $message = htmlspecialchars($_GET['message'],ENT_QUOTES,'UTF-8'); It seems that some characters cause htmlspecialchars to not return anything. Same with htmlentities. To get around this I tried changing the last parameter to the result of running the message text through mb_detect_encoding. But it turns out sometimes mb_detect_encoding returns nothing as well.

Re: What to do before data goes in or out?

Posted: Fri Nov 20, 2009 11:21 am
by kaisellgren
The functions do return something. Turn on error reporting and use var_dump(). Try running mb_detect_encoding() against the text you have in Word, or look inside Word to see what encoding it uses and compare it to the one you are using in your database. At some point your character encodings mismatch.

Re: What to do before data goes in or out?

Posted: Sun Nov 29, 2009 11:59 pm
by josh
If you are using user-inputs as part of the schema name for your SQL, like this

$sql = 'SELECT * from `' . $table . '` WHERE 1';

It is important to use a regex to filter out anything but letters, underscores, and numbers.