How to store character entities in a database -best practice

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

How to store character entities in a database -best practice

Post by batfastad »

Hi guys

This is something I've struggled in understanding for a while.

I'm moving data from our filemaker database, into MySQL.

When I export data from filemaker, I do so using PHP and a FileMaker XML class (FX.php) and I basically output a text page (rather than HTML) with all the SQL statements built so I just paste them into PHPMyAdmin to import them into MySQL like so...

Code: Select all

INSERT INTO notes VALUES ('',"000004","","","","Decals & Tank & Yoke Protectors","0");
INSERT INTO notes VALUES ('',"000027","","","","KTM, Malaguti, Yamaha, Malaguti & KTM; Accessories, helmets.","0");
INSERT INTO notes VALUES ('',"000028","","","","Customized scooters Lambretta & Vespa in Mod style.","0");
INSERT INTO notes VALUES ('',"000036","","","","Lubricants, oils","0");
INSERT INTO notes VALUES ('',"000044","","","","Import","0");
INSERT INTO notes VALUES ('',"000053","","","","KTM & Betamotor, Suzuki","0");
INSERT INTO notes VALUES ('',"000057","","","","Cylinders, Engine Shafts, Connecting Rods & Connecting Rods system. Kits","0");
INSERT INTO notes VALUES ('',"000061","","","","Small Bikes","0");
INSERT INTO notes VALUES ('',"000065","","","","Small Bikes","0");
The question I have though relates to how you store funny characters - characters that usually require HTML entities.

Is it best to store the data in MySQL with the entities encoded for HTML already?
I would have thought that's not the best policy.

Here's my PHP code which runs through the records...

Code: Select all

<?php

require ('config.php');

function sql_safe($value)
{
   // Stripslashes
   if (get_magic_quotes_gpc()) {
       $value = stripslashes($value);
   }
   // Quote if not a number or a numeric string
//   if (!is_numeric($value)) {
//       $value = "'" . mysql_real_escape_string($value) . "'";
//   }

mysql_real_escape_string($value);
   return $value;
}


$skipvalue = $_GET['skipvalue'];

// DO FILEMAKER QUERY
$fmp_query = new FX($fm_ip, $fm_port, $fm_schema);
$fmp_query->SetDBData('NOTES', 'TEST', 10000);
$fmp_query->SetDBPassword($fm_pass, $fm_user);
$fmp_query->AddDBParam('COMPANY ID', $company_id);
$fmp_query->AddSortParam('CREATE STAMP', 'ascend');
$fmp_query->FMSkipRecords($skipvalue);
$fmp_result = $fmp_query->FMFindAll(true, full, false);

$found_count = $fmp_result['foundCount'];
$fmp_error_code = $fmp_result['errorCode'];

//echo $fmp_error_code;

// LOOP THROUGH FILEMAKER RECORDS
foreach ( $fmp_result['data'] as $fmp_row) {

	$company_id = trim($fmp_row['COMPANY ID']);
	$contact_id = trim($fmp_row['CONTACT ID']);
	$create_stamp = trim($fmp_row['CREATE STAMP']);
	$create_name = trim($fmp_row['CREATE NAME']);
	$note = trim($fmp_row['NOTE']);
	$flag = trim($fmp_row['FLAG']);

	if ($flag !== 1) {
		$flag = 0;
	}

	$company_id = str_replace('II#', '', $company_id);

        // SEE SQL SAFE FUNCTION ABOVE - THE QUOTE BIT IS COMMENTED OUT DELIBERATELY
	$note = sql_safe($note);

        // REMOVE NEWLINES FROM OUR FILEMAKER note VARIABLE, REPLACE WITH A SPACE THEN TRIM
	$note = trim(str_replace("\n", ' ', $note));

	$note = addslashes($note);

	// ADD SQL INSTRUCTION TO OUR MEGA VARIABLE
	$sql_text .= <<<HTML
INSERT INTO notes VALUES ('',"$company_id","$contact_id","$create_stamp","$create_name","$note","$flag");\r\n
HTML;

}

// GET RID OF ANY HTML TAGS FROM OUR MEGA VARIABLE
$sql_text = strip_tags($sql_text);

// DEAL ENTITIES - CONFUSION IS ALL HERE!
$sql_text = html_entity_decode($sql_text, ENT_QUOTES);
$sql_text = preg_replace('/&#(\d+);/me', "chr(\\1)", $sql_text);
$sql_text = preg_replace('/&#x([a-f0-9]+);/mei', "chr(0x\\1)", $sql_text);

// REMOVE DOUBLE SPACES
while ( substr_count($sql_text, '  ') > 0) {
	$sql_text = str_replace('  ', ' ', $sql_text);
}

header('Content-type: text/plain; Content-encoding: utf-8');
echo $sql_text;

?>
So each record gets added to a giant variable called $sql_text, then I run some functions to try and deal with the entities.

Basically I'm confused!
I thought this was all working ok until I came across the trademark TM in my filemaker database. Numeric entity &.#8482; (without the period!)

Once all the above is run and I have my output, SQL chokes on the import - because where the &.#8482; was happening becomes a double quote (speech mark) character "

I think this trademark character problem is the character encoding I'm outputting. I'm not sure if I need the utf-8 on the penultimate line.
The collations in MySQL seem to all be chosen automatically as latin1_swedish_ci (not sure what I need it set to).


So first up, what am I doing with my character encoding here that's causing my TM character to become a " character.

Secondly, what's best practice in terms of data storage?
Store the raw characters or the HTML entity versions?


Thanks guys

Ben
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

Use sql_safe() on all variables in the query, and don't do that: $note = addslashes($note);

After that, the data is okay to be inserted into the DB as it is, no need for strip_tags(), html_entity_decode(), preg_replace() etc.

HTML cleaning should be done when you output the HTML!
If you won't use that data in the DB for anything else than HTML output, you can keep it pre-cleaned of course, in that case you need only strip_tags.

In all cases, escaping/cleaning should be done on each variable alone, not on the query as a whole, duh!
User avatar
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

don't htmlspecialchars() or htmlentities() or striptags() input into the database, do this when you output it later.

SQL escape in => store => HTML escape out
not
SQL and HTML escape in => store => out

just realised mordred said all this and more, :)
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Hi Mordred

Thanks for the info!

I did think that the data should be stored in the DB as is, and not stored ready-formatted for HTML.


Because this script is cleaning data for me to import it into MySQL, I guess I should probably keep the strip_tags(), html_entity_decode() to convert the HTML entities output from FileMaker's XML into normal chars.


I've just tried commenting out the addslashes line, but the SQL command gets stuck because there are often quote marks " within the $note variable.
Should I just escape those manually using str_replace(), rather than using addslashes()?
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

ole wrote:don't htmlspecialchars() or htmlentities() or striptags() input into the database, do this when you output it later.
I was using html_entity_decode() to remove the HTML entities so it gets stored in MySQL as chars rather than entities.

Also, this note field in particular is not meant to have any HTML tags present in it, so do I not need to keep the striptags() function in there to remove any HTML before it gets stored in the database?
User avatar
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

All you need to know is the correct way to do it:

Code: Select all

$input = mysql_real_escape_string($_GET['input']);
// insert into database

// later
$output = mysql_fetch_assoc($result);
$output = htmlspecialchars($output, ENT_QUOTES);
echo $output;
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Thanks for all your help guys!!


I've just tried removing the following functions from my script...
strip_tags()
html_entity_decode()
and the 2 preg_replace() lines

And running mysql_real_escape_string() on the variables, but when I check phpMyAdmin the entities and tags have been stored in the database when I don't want them in there :(

So surely I need to run these functions on the input data if I don't want them stored in MySQL...
strip_tags - to remove any HTML tags
html_entity_decode - to convert HTML entity codes to characters (&copy; etc)
and the preg_replace lines to convert HTML numeric entities to their actual characters

I thought I'd need to run those functions on the FileMaker output, so it's clean for storing in MySQL.
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

Well, if you don't want them in there, you certainly can remove them, as long as you do it per variable, and after the cleanup you properly escape each variable with mysql_real_escape_string.

The choice on whether to do this or not lies in the purpose of the data. Is there a chance that you will some day need the html tags that were there (but got stripped). Are you going to use an output encoding which will not be able to show some html entity you decoded? Or maybe if the conversion is done on output it would take too much time for your need?

For a real-life example, I write a simple barebones forum which is pretty fast as it is now, but it still bugs me that scanning the posts when I get them from the database and checking them for unallowed tags and generally fixing their html (yes, html is allowed, and no, I don't want bbcode) takes about 10-20 msec. This is generally not much, and if you don't have many users simultaneously viewing the data being filtered on-the-fly, it shouldn't worry you. On the other hand, if I want to make this really really as fast as possible, I may decide to ditch the flexibility to define "good" html tags dinamically and just filter the posts as they are submitted, and later just display them without further filtering. Since the code executes in the range of 50-100 msec, 10-20 msec speedup is quite significant.

I would advice to go for flexibility and do it as in the scheme ole illustrated:
SQL escape in => store => HTML escape out
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Yeah I think you're right.

Definitely the way to go. I'll be outputting as HTML 70% of the time, but also to word, excel, images and PDFs. Plus the fields in question aren't meant to have HTML entered into them anyway.

Also I realise I need to run the cleaning on the individual vars rather than the whole concatenated massive variable.
I wasn't paying too much attention to bits like that because it's a script I only need to run the once to get 70,000 records out of filemaker and into MySQL.

My last question is actually a problem with my script...

In the loop of the script, where I'm looping through the filemaker data, I have one line...

Code: Select all

$note = addslashes($note);
Now since my $note variable gets run through the function sql_safe(), which runs the mysql_real_escape_string() on the variable, you would think that I could safely get rid of this addslashes line.

Unfortunately though when I get rid of that it breaks the MySQL import, because the value of $note about 30% of the time contains speech marks "" in it.

How can I fix this?
I thought the mysql_real_escape_string() would deal with speech marks in my variable?


Thanks

Ben
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

Aah, I see it now

Code: Select all

$value = mysql_real_escape_string($value);
instead of

Code: Select all

mysql_real_escape_string($value);
And yes - remove the addslashes call, mysql_real_escape_string() will escape exactly what needs to be escaped as not to mess up with the query.
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Hi

Unfortunately it's still not working.
The quotes coming through in $note are still not being escaped


Any ideas?
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

Paste your latest code.
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Code: Select all

<?php
// NOTES EXPORT FROM FILEMAKER
require ('config.php');

function sql_safe($value) {
	// Stripslashes
	if (get_magic_quotes_gpc()) {
		$value = stripslashes($value);
	}

	// Quote if not a number or a numeric string
//	if (!is_numeric($value) or substr($value, 0, 1) == 0) {
//		$value = "'" . mysql_real_escape_string($value) . "'";
//	}

	mysql_real_escape_string($value);
	return $value;
}

$skipvalue = $_GET['skipvalue'];

// START FMP QUERY
$fmp_query = new FX($fm_ip, $fm_port, $fm_schema);
$fmp_query->SetDBData('NOTES', 'TEST', 10000);
$fmp_query->SetDBPassword($fm_pass, $fm_user);
$fmp_query->AddSortParam('CREATE STAMP', 'ascend');
$fmp_query->FMSkipRecords($skipvalue);
$fmp_result = $fmp_query->FMFindAll(true, full, false);

$found_count = $fmp_result['foundCount'];
$fmp_error_code = $fmp_result['errorCode'];

// LOOP THROUGH NOTES
foreach ( $fmp_result['data'] as $fmp_row) {

	$company_id = trim($fmp_row['COMPANY ID']);
	$contact_id = trim($fmp_row['CONTACT ID']);
	$create_stamp = trim($fmp_row['CREATE STAMP']);
	$create_name = trim($fmp_row['CREATE NAME']);
	$note = trim($fmp_row['NOTE']);
	$flag = trim($fmp_row['FLAG']);

	if ($flag !== 1) {
		$flag = 0;
	}

	if ($create_stamp == '') {
		$create_stamp = 965309762;
	}

	$company_id = str_replace('II#', '', $company_id);

	$note = sql_safe($note);
	$note = strip_tags($note);
	//$note = html_entity_decode($note, ENT_QUOTES);
	//$note = numericentitieshtml($note);

	$note = str_replace("\n\n", "\n", $note);
	$note = str_replace("\r\n", "\n", $note);
	$note = str_replace("\n", "\r\n", $note);

	//$note = trim(addslashes($note));

	// ECHO SQL
	$sql_text .= <<<HTML
INSERT INTO notes VALUES ('',"$company_id","$contact_id",$create_stamp,"$create_name","$note",$flag);\r\n
HTML;

}

// CLEAR DOUBLE SPACES
while ( substr_count($sql_text, '  ') > 0) {
	$sql_text = str_replace('  ', ' ', $sql_text);
}

header('Content-type: text/plain; charset=utf-8');
echo $sql_text;

?>

Thanks

Ben
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

You don't seem to have followed any of the advices ... :(

Code: Select all

mysql_real_escape_string($value); 
        return $value;
should be

Code: Select all

return mysql_real_escape_string($value);
or

Code: Select all

$value = mysql_real_escape_string($value); 
        return $value;
Notice the difference? You were doing nothing with the return of mysql_real_escape string!

Code: Select all

$note = sql_safe($note); 
        $note = strip_tags($note); 
        //$note = html_entity_decode($note, ENT_QUOTES); 
        //$note = numericentitieshtml($note); 

        $note = str_replace("\n\n", "\n", $note); 
        $note = str_replace("\r\n", "\n", $note); 
        $note = str_replace("\n", "\r\n", $note);
Generally, escaping for database should be last:

Code: Select all

$note = strip_tags($note); 
        //$note = html_entity_decode($note, ENT_QUOTES); 
        //$note = numericentitieshtml($note); 

        $note = str_replace("\n\n", "\n", $note); 
        $note = str_replace("\r\n", "\n", $note); 
        $note = str_replace("\n", "\r\n", $note);
        $note = sql_safe($note);
It's not wrong in your code, but maybe some day someone would add another innocent-looking line of code in that snippet and suddenly the $note would become vulnerable again.

It is also neccessary to escape the other variables in your query, not just note.
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Post by batfastad »

Aha!!

Mordred, that works a treat.
Thanks so much for all your advice!! :lol: :lol:

The only variable that's ever been entered by a user is $note. All the others have been auto-entered by FileMaker. Which is why $note is the only one I'm checking.
I know that's bad practice but this is a script that will only be run once, to help me convert all our data into MySQL from FileMaker.


Now I just have one final problem...

Entities and charsets!
My code for converting HTML entities into their real characters (commented out under the strip_tags line above)

Code: Select all

// CONVERT HTML NUMERIC ENTITIES TO CHARS
function numericentitieshtml($str) {
        return utf8_encode(preg_replace('/&#(\d+);/e', 'chr(str_replace(";", "", str_replace("&#","","$0")))', $str));
}

...

$note = html_entity_decode($note, ENT_QUOTES); // CONVERT HTML ALPHA ENTITIES LIKE &.trade;
$note = numericentitieshtml($note); // CONVERT HTML NUMERIC ENTITIES
Then when I output the plain text I output it with this header

Code: Select all

header('Content-type: text/plain; charset=utf-8');
Unfortunately most of the numeric entities do not get properly converted to their proper characters.
Is that because those characters don't exist in the UTF-8 charset?

For example...

Code: Select all

&.#8482; becomes a " (should be a TM symbol)
&.#8364; becomes ¬ (should be euro symbol)
How do I get this text file output with those characters correctly converted?
What's the best charset to use?

I've read in the PHP manual about bugs in PHP with the UTF-8 charset.

Thanks

Ben
Post Reply