Page 1 of 1

[SOLVED]problem saving record in UTF-8

Posted: Tue Dec 05, 2006 9:05 am
by Jaxolotl
Hi everyone, I'm not shure if this is a PHP or mySQL problem.

I'm trying to make a small CMS experiment that supports non lating languages.
the front end uses UTF-8 as charset correctly
the database collation is utf8_general_ci
the table collation is utf8_general_ci
the field collation is utf8_general_ci
should I use utf8_unicode_ci instead of utf8_general_ci?

platform is
phpMyAdmin 2.6.1
MySQL 4.1.9
apache 1.3.33
php 4.3.10
Firefox 2 / IE 7.0.5730.11



but when I post the contents from my CMS it convert the characters with errors, of course when I do it from the phpMyAdmin it save the characters correctly.

a resume of the HTML front-end

Code: Select all

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
	<head>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	</head>
	<body>
		<form name="THENAME" action="" method="POST">
			<input type="hidden" name="rowid" value="MY VALUE">
			<input type="hidden" name="parent" value="MY VALUE">
			<textarea name="my_content" rows="20" style="width:98%;" title="USE HTML TAGS HERE"></textarea>
		<input name="edit_confirm" value="Confirm edition" type="submit">
		</form>
	</body>
</html>
the resume of the PHP back-end

Code: Select all

$edit_query = "UPDATE `my_table` SET

parent = '".$_POST['parent']."',

my_content = ".process_html_text($_POST['my_content'])."

WHERE rowid='".$_POST['rowid']."'";


if(sql_select($edit_query,$edit_results)){ // my abstraction sql_select() use mysql_query() to store data (it also check connection and handle errors)
	echo "SUCCESS MESSAGE";
}
else{
	echo "MY ERROR MESSAGE";
	// AND MY LOG ERROR SCRIPT
}



###############################################  PROCESS HTML text for db query
function process_html_text($string){
//process_textarea_text
// textarea tag was previously transformed when rendered on the textarea to 
//prevent nested textareas and now is stored correctly

	$patterns = array (
	"#'#",
	'#\[\s{0,}textarea#is', //OPEN TEXTAREA
	'#\[\s{0,}/\s{0,}textarea\s{0,}\]#is'//CLOSE
	);
	$substitutions = array(
	"\'",
	'<textarea',
	'</textarea>'
	);
	$output_string = preg_replace($patterns,$substitutions,trim(stripslashes($string)));
	return "$output_string";
}

of course, if I use latin charset on the frontend and the DB collation the characters are automatically transformed into /&#(\d+);/ decimal format but by doing this, the stored information will need to be rendered in HTML to make it undersundable

for example
* 网页
* 资讯
* 知识
* 音乐
* 图片
* 影视
* 酷帖
* 更多

is the rendered code of

Code: Select all

* 网页
* 资讯
* 知识
* 音乐
* 图片
* 影视
* 酷帖
* 更多
and no comments about the increasing amount of data needed to store large sequences of content in decimal format.

anyone can help me about it?
is there a query error? or what?

Posted: Tue Dec 05, 2006 9:38 am
by feyd
Do your pages send a header character encoding of UTF8 as well? header()

The following may be of interest/help: viewtopic.php?t=36549

headers

Posted: Tue Dec 05, 2006 10:19 am
by Jaxolotl
I'll check bout headers but if I remember right it's ok
the problem is that when I stor the data, some characters are stores correctly but some others NOT
and I don't know the reason why.

ex.
text stored correctly in decimal notation as shown in the front-end
------------------------------------------
* 网页
* 资讯
* 知识
* 音乐
* 图片
* 影视
* 酷帖
* 更多

text stored wrong in UTF-8 notation as shown in the front-end
----------------------------------------------

* 网页
* 资讯
* 知识
* 音�? //THIS IS THE MISSING CHARACTER
* 图片
* 影视
* 酷帖
* 更多


text stored correctly in decimal notation as shown in the back-end
------------------------------------------

Code: Select all

* 网页
* 资讯
* 知识
* 音乐
* 图片
* 影视
* 酷帖
* 更多
text stored wrong in UTF-8 notation as shown in the back-end
----------------------------------------------

Code: Select all

* 网页
* 资讯
* 知识
* 音�
* 图片
* 影视
* 酷帖
* 更多
what I'm doing wrong?
I try also with utf8_unicode_ci but with the same result

the funny thing :( :( :( is that when I use phpMyAdmin interface everything is ok. uffffffff how many time will it takes me to learn uffffffffff uffffffff I'm years long from being at that level

file encoding

Posted: Wed Dec 06, 2006 3:07 am
by Jaxolotl
I was reading the online manual and finde some different oppinion about encoding the file.
Some use utf-8 with no byte order marks and some other keep the source files in ASCII to avoid the Byte Order Mark and then send the header

Code: Select all

header('Content-Type: text/html; charset=utf-8');
the second one is the way I work, anyway I tried the UTF-8 not bomed file encoding but still have the same rendering/storing problem, some characters are stored wright some other doesn't.

I'm going crazy!!!!

Posted: Wed Dec 06, 2006 5:42 am
by dibyendrah
In your mysql connection :
Do this first after connecting to MySQL server.

Code: Select all

sql_query("SET NAMES 'utf8'", $conn);
mysql_select_db($DB);
Othe thing you have to do is create a database or table with character-set : utf8_general_ci or any other unicode in your language.

Posted: Wed Dec 06, 2006 6:07 am
by Jaxolotl
Hi brother dibyendrah, nice to hear from you again

This is the detailed steps I done, can you tell me where I'm on error?

Step 1 Create the database and tables

Code: Select all

-- FIRST QUERY
CREATE DATABASE `my_db` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;


-- SECOND QUERY
CREATE TABLE `my_table` (
`rowid` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`my_text` TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci,
PRIMARY KEY ( `rowid` )
);


-- OR USE A COMBINED QUERY
CREATE TABLE IF NOT EXISTS `my_table` (
  `rowid` int(10) unsigned NOT NULL auto_increment,
  `my_text` text collate utf8_unicode_ci,
  PRIMARY KEY  (`rowid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Then my PHP scripting

1 . Library file unde "libraries" dir (my_lib.php)

Code: Select all

<?php
//error_reporting(0);
session_start();

define('MY_IDENTIFIER',"JAXOLOTL");
define('DEFAULT_LANGUAGE',"eng");
define('DEFAULT_CHARSET',"utf-8");

session_register(MY_IDENTIFIER.'_LANGUAGE');

/* **************************************
*              DB SETTINGS              *
*************************************** */
// LOCAL TEST
$dbhost="localhost";
$dbdb="my_db";
$dbuser="root";
$dbpwd="";

/* **************************************
*          DATABASE CONNECTION          *
*************************************** */

function retrive_data($query, &$rv) {
	global $dbhost;
	global $dbdb;
	global $dbuser;
	global $dbpwd;

	if (!($db_id = mysql_pconnect($dbhost, $dbuser, $dbpwd, MYSQL_CLIENT_INTERACTIVE))) {
		error_log("Could not connect to $dbdb: " . mysql_error());
		return false;
	}
	if (!(mysql_select_db($dbdb, $db_id))) {
		error_log("Could not select database $dbdb: " . mysql_error());
		mysql_close($db_id);
		return false;
	}
	if (!($rv = mysql_query($query, $db_id))) {
		error_log("sql_select: Invalid query: $query: " . mysql_error());
		mysql_close($db_id);
		return false;
	}
	mysql_close($db_id);
	return true;
}


############################################### RETURNS AN HTML STRING FORMAT IF PLAIN ON INPUT
function translate_breaks($string){
	if(preg_match('#\<.*?\>#s',$string)){
		return $string;
	}
	else{
		return nl2br(trim($string));
	}
}


/* **************************************
*                 LANGUAGE              *
*************************************** */

$lang = $_REQUEST['language'];  // if register_globals is off

if (isset($lang)) {
	setcookie(MY_IDENTIFIER.'_LANGUAGE',$lang,time()+(60*60*24*7),"/");
}
elseif (isset($_COOKIE[MY_IDENTIFIER.'_LANGUAGE']))
{
	$lang = $_COOKIE[MY_IDENTIFIER.'_LANGUAGE'];
}
else {
	$lang = DEFAULT_LANGUAGE;	//default language
}

$_SESSION[MY_IDENTIFIER.'_LANGUAGE']=$lang;

header('Content-Type: text/html; charset='.DEFAULT_CHARSET);
?>

2 . my frontend file under the root (index.php)

Code: Select all

<?php include_once("libraries/my_lib.php");?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
	<head>
		<meta http-equiv="Content-Type" content="text/html; charset=<?php echo DEFAULT_CHARSET; ?>">
		<title>UTF-8 TEST</title>
	</head>
	<body>
		<?php
		if(strlen($_POST['my_text'])>=1){
			$create_query = "INSERT INTO `my_table` (my_text) VALUES ('".$_POST['my_text']."')";
			// on the complete version I check for backslashed and null vales ecc, this is only a fast resume.
			if(retrive_data($create_query,$create_results)){
				echo "<strong>Data stored OK</strong><br><br>";
			}
			else{
				echo "<strong>ERROR-Database error</strong><br> Check error log for more information<br><br>";
				// on the complete version I print the error message, or use the tail cmd.
			}
		}
		?>
		<form name="my_form" action="" method="POST">
			<textarea name="my_text" rows="10" style="width:200px;" ></textarea>
			<br />
			<input type="submit">
		</form>
		<h2> Show me the last record </h2>
		<?php
		$query = "SELECT `rowid`,`my_text` FROM `my_table` ORDER BY 'rowid' DESC limit 1";
		retrive_data($query,$results);
		while($my_row = mysql_fetch_array($results)){
			echo translate_breaks($my_row['my_text']);
		}
		?>
	</body>
</html>
the data stored on the db is always strange coded

use this to test

SAMPLE TEXT just copy and paste

雅虎搜索

* 网页
* 资讯
* 知识
* 音乐
* 图片
* 影视
* 酷帖

Posted: Wed Dec 06, 2006 7:05 am
by jmut

Code: Select all

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
can you give us output of those

Posted: Wed Dec 06, 2006 7:29 am
by Jaxolotl
ARGGGGGGG
this is the output uf the two queries you ask me


Code: Select all


SHOW VARIABLES LIKE 'character_set%'

-- OUTPUT
+-------------------------+------------------------------------------+
+     Variable_name       +                    Value                 +
+-------------------------+------------------------------------------+
+ character_set_client    +                     utf8                 +
+-------------------------+------------------------------------------+
+ character_set_connection+                     utf8                 +
+-------------------------+------------------------------------------+
+ character_set_database  +                     utf8                 +
+-------------------------+------------------------------------------+
+ character_set_results   +                     utf8                 +
+-------------------------+------------------------------------------+
+ character_set_server    +                 latin1                   +
+-------------------------+------------------------------------------+
+ character_set_system    +                     utf8                 +
+-------------------------+------------------------------------------+
+ character_sets_dir      +C:\PROGRA~1\EASYPH~1\mysql\share\charsets/+
+-------------------------+------------------------------------------+




SHOW VARIABLES LIKE 'collation%'; 

+-------------------------+------------------------------------------+
+     Variable_name       +                    Value                 +
+-------------------------+------------------------------------------+
+ collation_connection    +          utf8_unicode_ci                 +
+-------------------------+------------------------------------------+
+ collation_database      +          utf8_unicode_ci                 +
+-------------------------+------------------------------------------+
+ collation_server        +        latin1_swedish_ci                 +
+-------------------------+------------------------------------------+


both times the server is setted on latin!!!!!! ouch
is it this?
don't remember how to change it
if I change it into utf-8 ...does it comprimes the latin tables?

Posted: Wed Dec 06, 2006 8:07 am
by jmut
this doc page will be of interest

http://dev.mysql.com/doc/refman/4.1/en/ ... ction.html

Basically these settings are by default...So if you don't set it explicitly after creating the connection...the server will use config variables, which
in your case are different for client/server part....check on the link...and this is causing you problems.

As dibyendrah suggested executing this query after making the connection should fix this.

Code: Select all

SET NAMES 'utf8';
As for:
If I change it into utf-8 ...does it comprimes the latin tables?
Not sure...I believe not but not sure at all.

SET NAMES

Posted: Wed Dec 06, 2006 8:23 am
by Jaxolotl
Scuse my ignorance, but

Code: Select all

SET NAMES 'utf8';
should I use it on my connection abstraction or where?
I show the two files sources on this post, where and how do I use the SET NAMES command?
I'll apreciate really much your help, this "all about" is new for me and feel a little lost

Re: SET NAMES

Posted: Wed Dec 06, 2006 8:41 am
by jmut
Jaxolotl wrote:Scuse my ignorance, but

Code: Select all

SET NAMES 'utf8';
should I use it on my connection abstraction or where?
I show the two files sources on this post, where and how do I use the SET NAMES command?
I'll apreciate really much your help, this "all about" is new for me and feel a little lost
As soon as you create a connection

Code: Select all

if (!($db_id = mysql_pconnect($dbhost, $dbuser, $dbpwd, MYSQL_CLIENT_INTERACTIVE))) {
                error_log("Could not connect to $dbdb: " . mysql_error());
                return false;
        }
        mysql_query("SET NAMES 'utf8'", $db_id);
If you observe the mysql log during script execution...you should notice this query first ...after making connection.

SET NAMES

Posted: Wed Dec 06, 2006 8:52 am
by Jaxolotl
YESSSSSSSSSSSSS!!!!!!!!!!!!!

I really TNX YOU SO MUCH GUYS!!!!!

now everything goes right!!!!

Posted: Wed Dec 06, 2006 10:16 am
by dibyendrah
Wow Jaxolotl! Good to hear that

Code: Select all

SET NAMES 'utf8';
fixed your problem.

Actually, the above query does the three following things which I forgot to mention.

Code: Select all

SET character_set_client = 'utf8';
SET character_set_results = 'utf8';
SET character_set_connection = 'utf8';
OR

To make the character-set utf8 as default in your system, open the my.cnf or my.ini and edit.

Code: Select all

[mysql]
default-character-set=utf8
To test what's going on with character set, just do the following as jmut suggested :

Code: Select all

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
Cheers,
Dibyendra

Posted: Wed Dec 06, 2006 11:11 am
by Jaxolotl
Hi dibyendrah,
about ini change

Code: Select all

[mysql]
default-character-set=utf8
I prefer

Code: Select all

SET NAMES 'utf8';
because some times you're not the host owner or can't access the ini files, or maybe running different kind of db tables for different programmers so the SQL way let's you control just your own script set.

by the way I read on the mySQL manual that utf8_unicode_ci is more accurated than utf8_general_ci but is still being developed son some characters aren't contanined on it; so I decide to use the general set instead of the unicode one.
http://dev.mysql.com/doc/refman/5.0/en/ ... -sets.html
Any word on favor or against that desition?

Posted: Wed Dec 06, 2006 11:13 pm
by dibyendrah
Yes, you are right Jaxolotl. In the case of shared host, we are able to so only the sql query and can't change the ini files. Anyway, now your problem is now solved. So, enjoy! :wink: