Annoying Character

duxbox
Forum Newbie
Posts: 11
Joined: Thu Jul 23, 2009 5:38 pm

Post by duxbox »

Hi All,

I am working on a classified ads website that I converted from English to Farsi, so everything is now RTL. I keep seeing an annoying � character at the end of some ads' previews that I can't seem to get rid of. I am attaching a screenshot so you can see exactly what I mean.

I'd appreciate your help in advance.
Attachments
ch.JPG
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Post by requinix »

That means there is an invalid byte sequence: the bytes are supposed to represent a character, but they aren't valid in the character encoding you're using.

From context it looks like that character is supposed to be an ellipsis? How does that get added to the text?

Post by duxbox »

You're right. It's an ellipsis. Here is the code:

Code:

<?php
if ($ad_preview_chars)
{
	echo "<span class='adpreview'>";
	$row['addesc'] = preg_replace("/\[\/?URL\]/", "", $row['addesc']);
	echo substr($row['addesc'], 0, $ad_preview_chars);
	if (strlen($row['addesc']) > $ad_preview_chars) echo "...";
	echo "</span>";
}
?>

Post by requinix »

Ah, no, that's not it: now that I'm on a proper monitor I can see that the ellipsis is intact in the output and the <?> is actually just before it. However, that code does reveal the issue.

Code:

echo substr($row['addesc'],0,$ad_preview_chars);
Many of the normal string functions, including the ones you're using now, are not suitable for multi-byte character strings. They only operate on the byte level, and if you're not extremely careful you're liable to cut off a character in the middle of its byte sequence. That'll result in <?>s: the leftover bytes don't represent any known character, so the browser has to put something in their place.

Use mb_substr instead, which operates on logical characters instead of just their bytes:

Code:

echo mb_substr($row['addesc'],0,$ad_preview_chars,'whatever character encoding you are using');
Note that after you do this the preview content will become about 3 times longer than it is now, but if you count it out you'll see you're actually getting the right number of characters.
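
To make the difference concrete, here is a minimal sketch, assuming UTF-8 data and the mbstring extension. The helper name make_preview() and the sample string are made up for illustration and are not part of the ad script. It also compares lengths with mb_strlen(), because strlen() counts bytes and would append the ellipsis to short Farsi text that was never cut.

```php
<?php
// Sketch only: make_preview() is a hypothetical helper, not part of the ad script.
// Assumes UTF-8 text and the mbstring extension.
function make_preview(string $text, int $maxChars, string $encoding = 'UTF-8'): string
{
    // Cut on character boundaries instead of byte offsets.
    $preview = mb_substr($text, 0, $maxChars, $encoding);

    // Compare character counts as well; strlen() would count bytes here.
    if (mb_strlen($text, $encoding) > $maxChars) {
        $preview .= '...';
    }
    return $preview;
}

// "سلام دنیا" written with escapes so the byte layout is unambiguous.
$desc = "\u{0633}\u{0644}\u{0627}\u{0645} \u{062F}\u{0646}\u{06CC}\u{0627}";

echo strlen($desc), " bytes, ", mb_strlen($desc, 'UTF-8'), " characters\n"; // 17 bytes, 9 characters

$byteCut = substr($desc, 0, 3);             // one letter plus half of the next
$charCut = mb_substr($desc, 0, 3, 'UTF-8'); // three whole letters

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

echo make_preview($desc, 4), "\n"; // the first four letters followed by "..."
```

The byte-cut string is exactly the kind of value that shows up as a <?> in the browser, while the character-cut one stays valid.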

Post by duxbox »

Thank you so much for that :D I really appreciate it... it totally makes sense now! I changed the code to the following as you suggested and it worked like a charm.

Code:

<?php
if ($ad_preview_chars)
{
	echo "<span class='adpreview'>";
	$row['addesc'] = preg_replace("/\[\/?URL\]/", "", $row['addesc']);
	echo mb_substr($row['addesc'], 0, $ad_preview_chars, 'UTF-8');
	// Count characters, not bytes, so "..." only appears when the text was actually cut.
	if (mb_strlen($row['addesc'], 'UTF-8') > $ad_preview_chars) echo "...";
	echo "</span>";
}
?>
As you can see in the new screenshot, the ad preview is now showing properly and I am not getting any more of the �s in the ad summaries, which is great! I can still see a couple of them at the end of some ad titles and locations, though, and I don't think those are produced by the script above.

After digging a little deeper I see the following PHP script in another file related to displaying a single ad (not an ad listing page).

Here is what it looks like:

Code:

<?php
}
?>

<div class="adtitle">
<?php echo $ad['adtitle']; ?>
<?php 
$loc = "";
if($ad['area']) $loc = $ad['area'];
if($xcityid < 0) $loc .= ($loc ? ", " : "") . $ad['cityname'];
if($loc) echo " <span class=\"adarea\">($loc)</span>";
?>
</div>
Do you see anything here that could be causing the �s?
Attachments
ch2.JPG

Post by requinix »

Well, I'd ask where $ad is coming from and where it got its values. Maybe it came directly from a database? What does strlen() (and not mb_strlen()) say about the length of the problematic titles and locations?
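
A quick way to answer that question is to print both lengths side by side and check whether the bytes still form valid UTF-8. This is only a sketch: describe_field() is a hypothetical helper and the sample value is invented, so swap in the real $ad['adtitle'] or $ad['area']. It assumes UTF-8 data and the mbstring extension.

```php
<?php
// Sketch only: describe_field() is a hypothetical diagnostic helper.
// Assumes UTF-8 data and the mbstring extension.
function describe_field(string $value, string $encoding = 'UTF-8'): array
{
    return [
        'bytes' => strlen($value),                       // raw byte count
        'chars' => mb_strlen($value, $encoding),         // characters as a reader sees them
        'valid' => mb_check_encoding($value, $encoding), // false when a byte sequence is cut off
    ];
}

// Invented sample: "تهران", a space, then 30 more letters.
$area = "\u{062A}\u{0647}\u{0631}\u{0627}\u{0646} " . str_repeat("\u{0627}", 30);

print_r(describe_field($area));                // 71 bytes, 36 characters, valid
print_r(describe_field(substr($area, 0, 50))); // 50 bytes, no longer valid
```

If 'valid' comes back false, or the byte count sits right on a round limit, the value was probably cut somewhere before it reached this page.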

Post by duxbox »

$ad gets its value from the code below.

Code:

	// Get the ad
	$sql = "SELECT a.*, ct.cityname as cityname, UNIX_TIMESTAMP(a.timestamp) AS timestamp, UNIX_TIMESTAMP(a.createdon) AS createdon, UNIX_TIMESTAMP(a.expireson) AS expireson, UNIX_TIMESTAMP(feat.featuredtill) AS featuredtill $xfieldsql
			FROM $t_ads a
				INNER JOIN $t_subcats scat ON scat.subcatid = a.subcatid
                INNER JOIN $t_cities ct ON a.cityid = ct.cityid
				LEFT OUTER JOIN $t_adxfields axf ON a.adid = axf.adid
				LEFT OUTER JOIN $t_featured feat ON a.adid = feat.adid AND feat.adtype = 'A'
			WHERE a.adid = $xadid
				AND $visibility_condn_admin";
	$ad = mysql_fetch_array(mysql_query($sql));

	$isevent = 0;
	if($sef_urls) $thisurl = "{$vbasedir}$xcityid/posts/$xcatid/$xsubcatid/{$xadid}_".RemoveBadURLChars($ad['adtitle']).".html";
	else $thisurl = "?$qs";

}


if (!$ad) 
{
	header("Location: $script_url/?view=main&cityid=$xcityid&lang=$xlang");
	exit;
}


if ($_POST['email'] && $_POST['mail'] && $ad['showemail'] == EMAIL_USEFORM)
{
	if ($image_verification && !$captcha->verify($_POST['captcha']))
	{
		$err = $lang['ERROR_IMAGE_VERIFICATION_FAILED'];
	}
	else
	{

		unset($_GET['mailed'],$_GET['mailerr'],$_GET['reported']);
		$qs = "";
		foreach ($_GET as $k=>$v) $qs .= "$k=$v&";
		$qs = substr($qs, 0, -1);
		$thisurl = "$script_url/?$qs";

		$thismail_header = file_get_contents("mailtemplates/contact_header.txt");
		$thismail_header = str_replace("{@SITENAME}", $site_name, $thismail_header);
		$thismail_header = str_replace("{@ADTITLE}", $ad['adtitle'], $thismail_header);
		$thismail_header = str_replace("{@ADURL}", $thisurl, $thismail_header);
		$thismail_header = str_replace("{@FROM}", $_POST['email'], $thismail_header);

		$thismail_footer = file_get_contents("mailtemplates/contact_footer.txt");
		$thismail_footer = str_replace("{@SITENAME}", $site_name, $thismail_footer);
		$thismail_footer = str_replace("{@ADTITLE}", $ad['adtitle'], $thismail_footer);
		$thismail_footer = str_replace("{@ADURL}", $thisurl, $thismail_footer);
		$thismail_footer = str_replace("{@FROM}", $_POST['email'], $thismail_footer);

		$mail = $thismail_header . "\n" .
				stripslashes($_POST['mail']) . "\n" .
				$thismail_footer;



Ad Title character length:

Code:

			<b><?php echo $lang['POST_ADTITLE']; ?>:</b><br>
			<input name="adtitle" type="text" id="adtitle" size="100" maxlength="80" value="<?php echo $data['adtitle']; ?>">

Location character length:

Code:

			<?php echo $lang['OR_SPECIFY']; ?>

			<input name="area" type="text" size="65" maxlength="50" value="<?php echo $data['area']; ?>" onKeyUp="javascript:if(this.form.arealist.selectedIndex!=<?php echo $other_index; ?>) this.form.arealist.selectedIndex=<?php echo $other_index; ?>;" <?php if($area_inlist) echo "disabled"; ?>>

			<?php
			}
			else
			{
			?>

			<input name="area" type="text" size="65" maxlength="50" value="<?php echo $data['area']; ?>">

			<?php
			}
			?>

and strlen for title:

Code:

if ($_POST['do'] == "post")
{
	$data = $_POST;
	$data['area'] = $data['area']?$data['area']:$data['arealist'];
	foreach ($data as $k=>$v) if(!is_array($v)) $data[$k] = stripslashes($v);

	if(!$data['adtitle'])
	{
		$data['adtitle'] = substr($data['addesc'], 0, $generated_adtitle_length) . ((strlen($data['addesc']) > $generated_adtitle_length) ? $generated_adtitle_append : "");

		if(strpos($data['adtitle'], "\n") > 0) $data['adtitle'] = trim(substr($data['adtitle'], 0, strpos($data['adtitle'], "\n")));
	}

	if(!$data['addesc'] || (!$in_admin && !$data['email']))
		$err .= "&bull; $lang[ERROR_POST_FILL_ALL]<br>";
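
One thing that stands out in the last snippet: when no title is entered, the title is generated with substr() and strlen(), the same byte-level calls that broke the preview. Below is a hedged sketch of a character-aware version, assuming UTF-8 input and the mbstring extension. generate_adtitle() is a hypothetical name, and its parameters stand in for $generated_adtitle_length and $generated_adtitle_append.

```php
<?php
// Sketch only: generate_adtitle() is a hypothetical rewrite of the title-generation step.
// Assumes UTF-8 input and the mbstring extension.
function generate_adtitle(string $addesc, int $maxChars, string $append, string $encoding = 'UTF-8'): string
{
    // Take the first $maxChars characters, never a partial character.
    $title = mb_substr($addesc, 0, $maxChars, $encoding);
    if (mb_strlen($addesc, $encoding) > $maxChars) {
        $title .= $append;
    }

    // Keep only the first line, as the original code does.
    $newline = mb_strpos($title, "\n", 0, $encoding);
    if ($newline !== false && $newline > 0) {
        $title = trim(mb_substr($title, 0, $newline, $encoding));
    }
    return $title;
}

// "سلام دنیا" cut to five characters, with "..." appended.
$title = generate_adtitle("\u{0633}\u{0644}\u{0627}\u{0645} \u{062F}\u{0646}\u{06CC}\u{0627}", 5, '...');
var_dump(mb_check_encoding($title, 'UTF-8')); // bool(true)
```

A clean title can still be cut later if the column it is stored in is too small, so this only covers the PHP side.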

Post by requinix »

What do you get with "SHOW CREATE TABLE $t_ads"?

Post by duxbox »

My phpMyAdmin knowledge is very limited, but when I enter SHOW CREATE TABLE $t_ads as an SQL query I get the following:

#1146 - Table 'databasename.$t_ads' doesn't exist

It looks like it's more complicated than I thought... It could be a character encoding mismatch with the MySQL database or something, since the database was originally latin1 and now my hosting has changed it to utf8.

Post by requinix »

$t_ads is a PHP variable. If you execute the query in phpMyAdmin then you have to put the real table name in there.

Character encoding something, to be sure. Mismatch, possibly. What does the data for one of the problematic ads look like in phpMyAdmin?

Post by duxbox »

Thanks requinix!

As you can see, none of my ads appear correctly in my database when browsing the tables in phpMyAdmin. Character encoding is set to Unicode UTF-8 in my browser, but nothing appears right in the MySQL database (see screenshot).

I also tried running the show table command for $adtable and got this:

#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CREAT TABLE $adtable' at line 1
Attachments
ch3.JPG

Post by requinix »

There's definitely a character encoding issue, but I'm not so sure it's causing the immediate problem of <?> characters.
duxbox wrote:I also tried running the show table command for $adtable and got this:

The #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CREAT TABLE $adtable' at line 1
You misspelled "CREATE", and I think you missed the point of my earlier comment: $t_ads is a PHP variable. If you execute the query in phpMyAdmin then you have to put the real table name in there.

Post by duxbox »

You're right. Here is what I am getting for the ad table:

Code:

CREATE TABLE `adxfields` (
 `adid` int(10) unsigned NOT NULL DEFAULT '0',
 `f1` varchar(255) NOT NULL DEFAULT '',
 `f2` varchar(255) NOT NULL DEFAULT '',
 `f3` varchar(255) NOT NULL DEFAULT '',
 `f4` varchar(255) NOT NULL DEFAULT '',
 `f5` varchar(255) NOT NULL DEFAULT '',
 `f6` varchar(255) NOT NULL DEFAULT '',
 `f7` varchar(255) NOT NULL DEFAULT '',
 `f8` varchar(255) NOT NULL DEFAULT '',
 `f9` varchar(255) NOT NULL DEFAULT '',
 `f10` varchar(255) NOT NULL DEFAULT '',
 `f11` varchar(255) NOT NULL DEFAULT '',
 `f12` varchar(255) NOT NULL DEFAULT '',
 `f13` varchar(255) NOT NULL DEFAULT '',
 `f14` varchar(255) NOT NULL DEFAULT '',
 `f15` varchar(255) NOT NULL DEFAULT '',
 KEY `adid` (`adid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1


and this is for the 'ad list' table:

Code:

CREATE TABLE `ads` (
 `adid` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `adtitle` varchar(100) NOT NULL DEFAULT '',
 `addesc` longtext NOT NULL,
 `area` varchar(50) NOT NULL DEFAULT '',
 `email` varchar(50) NOT NULL DEFAULT '',
 `showemail` enum('0','1','2') NOT NULL DEFAULT '0',
 `password` varchar(50) NOT NULL DEFAULT '',
 `code` varchar(35) NOT NULL DEFAULT '',
 `cityid` smallint(5) unsigned NOT NULL DEFAULT '0',
 `subcatid` smallint(5) unsigned NOT NULL DEFAULT '0',
 `price` decimal(10,2) NOT NULL DEFAULT '0.00',
 `othercontactok` enum('0','1') NOT NULL DEFAULT '0',
 `hits` int(10) unsigned NOT NULL DEFAULT '0',
 `ip` varchar(15) NOT NULL DEFAULT '',
 `verified` enum('0','1') NOT NULL DEFAULT '0',
 `abused` int(10) unsigned NOT NULL DEFAULT '0',
 `enabled` enum('0','1') NOT NULL DEFAULT '0',
 `createdon` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
 `expireson` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
 `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (`adid`),
 KEY `subcatid` (`subcatid`),
 KEY `cityid` (`cityid`),
 KEY `verified` (`verified`),
 KEY `enabled` (`enabled`)
) ENGINE=MyISAM AUTO_INCREMENT=28308 DEFAULT CHARSET=latin1

Post by requinix »

So, as you may have noticed, the tables are using the latin1 charset. Probably other tables too. That's not good: they all need to be UTF-8. You'll also need to check the server variables to make sure the default database and table charsets, along with things like the connection charset, are all UTF-8 too. If any step along the way uses a different charset, your data can get messed up.

The actual problem, I think, is that ads.area (as an example) is a VARCHAR(50). In English that translates to "50 characters in the latin1 encoding", and since latin1 has one byte per character that also translates to "50 bytes". However, you're stuffing UTF-8 data in there, and your data probably has about 3 bytes per character: that means you can only store about 50 / 3 ≈ 16.7 characters. Anything more will be lost: trying to store 17 characters would get 16 and then only part of the 17th. That "only part of" means you get <?>s, because exactly the same thing is happening in MySQL as happened in your code: characters got cut off.

The good news is that your system works despite all the character encoding badness, so I think the minimum you need to do to get everything working again is extend the length of the affected columns (remembering that they need to be about 3x longer than you'd expect) and then re-enter the corrupted data (because what you've lost is not recoverable).
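
The arithmetic can be checked with a small sketch. It assumes MySQL silently keeps only the first 50 bytes, which is what a latin1 VARCHAR(50) would do outside strict SQL mode. Both function names are hypothetical and the sample area is invented. It also assumes UTF-8 input and the mbstring extension.

```php
<?php
// Sketch only: both functions are hypothetical and just imitate the storage step.
// Assumes UTF-8 input, the mbstring extension, and silent truncation by MySQL.

// What a 50-byte column effectively does: keep the first N bytes, whatever they are.
function store_in_byte_column(string $value, int $maxBytes): string
{
    return substr($value, 0, $maxBytes);
}

// Alternative: trim to the byte budget but stop at a character boundary.
function fit_to_byte_column(string $value, int $maxBytes, string $encoding = 'UTF-8'): string
{
    return mb_strcut($value, 0, $maxBytes, $encoding);
}

// Invented sample area: "تهران", a space, then 30 more letters (36 characters).
$area = "\u{062A}\u{0647}\u{0631}\u{0627}\u{0646} " . str_repeat("\u{0627}", 30);

$stored = store_in_byte_column($area, 50);
$fitted = fit_to_byte_column($area, 50);

echo strlen($area), " bytes needed\n";         // 71 bytes needed
var_dump(mb_check_encoding($stored, 'UTF-8')); // bool(false)
echo strlen($fitted), " bytes, ", mb_strlen($fitted, 'UTF-8'), " characters\n"; // 49 bytes, 25 characters
```

Letters in the Arabic script block take 2 bytes each in UTF-8, so for Persian text the real ratio is often closer to 2 than 3. Either way the conclusion is the same: while the columns are latin1, size them from strlen(), not from the character count.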

Post by duxbox »

You are a genius requinix!!! I very much appreciate your help figuring this out for me and also explaining the issue so clearly.

Thank You :D